Open lcolladotor opened 7 years ago
Hi again,
We ran more tests using the 20 worst samples by overall mapping rate with HISAT2
against hg38
. You can see them here (now the X and Y axis have the same range):
We checked that re-running our code on those 20 samples against hg19
produced the same results. This verified that the issue was not in our code for running HISAT2
. I had described this before using the first 20 samples instead of the new set of 20 samples (the worst by... see above).
We re-built the genome indexes for hg19
and hg38
by doing a fresh download of the genome fasta files and running:
MAINDIR=/dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/Annotation/HISAT2_index_test
### hg19
gunzip < ${MAINDIR}/GRCh37.primary_assembly.genome.fa.gz > ${MAINDIR}/GRCh37.primary_assembly.genome.fa
/dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/Software/hisat2-2.0.4/hisat2-build -p 6 \
${MAINDIR}/GRCh37.primary_assembly.genome.fa ${MAINDIR}/hisat2_GRCh37primary
### hg38
gunzip < ${MAINDIR}/GRCh38.primary_assembly.genome.fa.gz > ${MAINDIR}/GRCh38.primary_assembly.genome.fa
/dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/Software/hisat2-2.0.4/hisat2-build -p 6 \
${MAINDIR}/GRCh38.primary_assembly.genome.fa ${MAINDIR}/hisat2_GRCh38primary
The results are consistent, so the issue is not about the genome files or how we built our original genome indexes. In any case, now we have the exact code for building the indexes and associated log files.
We also ran STAR
(version 2.5.3a) with genome indexes that use the known annotation. Here are the commands for creating the STAR
indexes and running it:
## hg19 STAR index
/dcl01/lieber/ajaffe/lab/alt_recount/star_setup/STAR/bin/Linux_x86_64/STAR --runMode genomeGenerate --genomeDir /dcl01/lieber/ajaffe/lab/alt_recount/star_setup/genome_hg19 --genomeFastaFiles /dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/Annotation/GENCODE/GRCh37_hg19/GRCh37.primary_assembly.genome.fa --sjdbGTFfile /dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/Annotation/GENCODE/GRCh37_hg19/gencode.v25lift37.annotation.gtf --sjdbOverhang 71 --runThreadN 8 --outFileNamePrefix /dcl01/lieber/ajaffe/lab/alt_recount/star_setup/star_index_hg19_
## hg38 STAR index
/dcl01/lieber/ajaffe/lab/alt_recount/star_setup/STAR/bin/Linux_x86_64/STAR --runMode genomeGenerate --genomeDir /dcl01/lieber/ajaffe/lab/alt_recount/star_setup/genome --genomeFastaFiles /dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/Annotation/GENCODE/GRCh38_hg38/GRCh38.primary_assembly.genome.fa --sjdbGTFfile /dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/Annotation/GENCODE/GRCh38_hg38/gencode.v25.annotationGRCh38.gtf --sjdbOverhang 71 --runThreadN 8 --outFileNamePrefix /dcl01/lieber/ajaffe/lab/alt_recount/star_setup/star_index.txt
## Example call for STAR with hg19
/dcl01/lieber/ajaffe/lab/alt_recount/star_setup/STAR/bin/Linux_x86_64/STAR --runThreadN 6 --genomeDir /dcl01/lieber/ajaffe/lab/alt_recount/star_setup/genome_hg19 --readFilesIn ${FILE1} ${FILE2} --outFileNamePrefix /dcl01/lieber/ajaffe/lab/libd_alzheimers/hg19_newidx/STAR_out/${ID}_ --outSAMtype BAM SortedByCoordinate --readFilesCommand gunzip -c --limitBAMsortRAM 50000000000
You can see below that hg19 with HISAT2
vs hg38
with STAR
have similar overall mapping rates.
hg38
with HISAT2
versus hg38
with STAR
does present differences like before:
and hg19
vs hg38
with STAR
are fairly similar.
HISAT2
with hg19
vs STAR
with hg38
also looks decent:
while HISAT2
with hg38
vs STAR
with hg19
looks bad again (as expected by the previous plots).
In conclusion, this issue does seem HISAT2
specific and it looks like the hg19
genome fasta file is ok. Note that with STAR
we do see a greater number of reads mapped to chrs 1-22, X and Y with hg38
than hg19
. So maybe there's a parameter (or multiple) that are important to change when running HISAT2
vs hg38
(regions PRI
from http://www.gencodegenes.org/releases/25.html).
For completeness, here is a zip file with a HTML (R Markdown) file with all the comparison plots: compare_hg.html.zip.
We'd love to hear what you think about this!
Best, Leonardo
Hi Leonardo,
Just wondering if you had done any parameter tuning with HiSAT2 for better hg38 alignment. We see the same effect when aligning the same samples to hg38 (compared to hg19, previously), as well.
Best,
Justin
We've met a similar problem, and here it is. We had two versions of a plant genome with similar quality assembled by two different groups, and we mapped several sample RNA-Seq reads onto them using hisat2-2.1.0 with default parameters, but got very different mapping rate, one is extremlely low (30-40%), the other one is more reasonable(80-90%). However, when we then tried blastn search for the unmapped reads against the first one genome, large fraction of the reads could find a perfect match. So we don't know where is the problem.
Additional sequences introduced in hg38 increase the chance of reads being mapped to many more locations compared to when using hg37. HISAT2 might stop aligning a read when it could be possibly mapped to more than a certain number of locations. A temporary solution is to use an option -k with a higher value (perhaps -k 20). We'll look into the problem and hopefully find a good solution to it.
Hi,
We ran
HISAT2
(2.0.4) on a 422 sample dataset against hg19, then against hg38. Then usingfeatureCounts
we built some count matrices and analyzed them for differential expression and got very different results. This prompted us to look at the alignment metrics reported byHISAT2
.hg19 vs hg38 alignment metrics
The following PDF shows 2 plots for several variables. The first plot in each pair is like an MA-plot where the x axis is the mean and the y axis is the difference with a read line on y=0. The second plot in the pair is a scatter of hg19 (y-axis) vs hg38 metrics (x-axis) with a read line on y=x. The metrics are:
derfinder::getTotalMapped()
on the resulting BAM files (we convert theHISAT2
SAM file to BAM).compare_pheno.pdf
You can see on page 8 that there are very large differences in the overall mapping rate between hg19 and hg38. To highlight it, I'm including it again:
Metrics details
For more details on how we compute these metrics, I'm including an example HISAT2 alignment summary file and the R code used to extract the numbers.
Here's the HISAT2 output for hg38 for the same sample:
hg19 vs a re-run of hg19 (20 samples only)
There is about a 1-2 month difference between the hg19 and hg38 (most recent) alignments and I wanted to check that nothing in our scripts for running
HISAT2
had any effect. So I re-ranHISAT2
with hg19 for 20 samples. The resulting plots show that the metrics are nearly identical, so that was not the issue.compare_pheno_hg19subset.pdf
HISAT2 command:
hg19 (re-run):
hg38:
Indexes used
Talking with others, the question of how the indexes were created popped up. We created them using the default parameters using the primary fasta files (regions:
PRI
) from http://www.gencodegenes.org/releases/25.html and http://www.gencodegenes.org/releases/25lift37.html. More specifically, we used ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/GRCh38.primary_assembly.genome.fa.gz and ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz.The output from
hisat2-inspect -s
is included in the sheet below. It also includes the sum of the contigs lengths.hisat2-inspect.xlsx
Going forward
Something similar to this has been previously reported in https://biostar.usegalaxy.org/p/19910/ and http://seqanswers.com/forums/showthread.php?p=188200. It's also been partially discussed in https://github.com/alexdobin/STAR/issues/39 and https://github.com/infphilo/hisat2/issues/30. One possible option would be to build indexes using just chrs 1 to 22, X, Y and M and ignoring the other contigs (there's about 90 total in hg19 and about 190 in hg38 primary assemblies). We've also looked at the
HISAT2
options (and those for creating the index) and don't see anything obvious we should change.We could try running another RNA-seq aligner (say STAR) on the same 20 samples versus hg19 and hg38 and see if the alignment metrics change that much between the two reference genomes just like they did with
HISAT2
.Ultimately, it could be that simply hg19 is not good enough and that we should ignore the hg19 results. But I'm sure that our colleagues will be very confused about the wide differences in the differential expression results where the only change is the reference genome.
If there's other information that could be useful to you, please let me know.
Best, Leonardo