DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0
478 stars 119 forks source link

Large differences observed between hg19 and hg38 mapping results with HISAT2 (2.0.4) #101

Open lcolladotor opened 7 years ago

lcolladotor commented 7 years ago

Hi,

We ran HISAT2 (2.0.4) on a 422 sample dataset against hg19, then against hg38. Then using featureCounts we built some count matrices and analyzed them for differential expression and got very different results. This prompted us to look at the alignment metrics reported by HISAT2.

hg19 vs hg38 alignment metrics

The following PDF shows 2 plots for several variables. The first plot in each pair is like an MA-plot where the x axis is the mean and the y axis is the difference with a read line on y=0. The second plot in the pair is a scatter of hg19 (y-axis) vs hg38 metrics (x-axis) with a read line on y=x. The metrics are:

compare_pheno.pdf

You can see on page 8 that there are very large differences in the overall mapping rate between hg19 and hg38. To highlight it, I'm including it again:

compare_pheno_page_08

Metrics details

For more details on how we compute these metrics, I'm including an example HISAT2 alignment summary file and the R code used to extract the numbers.

$ more hg19_subset/HISAT2_out/align_summaries/R2835_C1ULMACXX_summary.txt
76295625 reads; of these:
  76295625 (100.00%) were paired; of these:
    12262929 (16.07%) aligned concordantly 0 times
    46114677 (60.44%) aligned concordantly exactly 1 time
    17918019 (23.48%) aligned concordantly >1 times
    ----
    12262929 pairs aligned concordantly 0 times; of these:
      1718559 (14.01%) aligned discordantly 1 time
    ----
    10544370 pairs aligned 0 times concordantly or discordantly; of these:
      21088740 mates make up the pairs; of these:
        11518338 (54.62%) aligned 0 times
        6038487 (28.63%) aligned exactly 1 time
        3531915 (16.75%) aligned >1 times
92.45% overall alignment rate
> library('jaffelab')
>
> hisatStats <- function(logFile) {
+ y = scan(logFile, what = "character", sep= "\n",
+     quiet = TRUE, strip=TRUE)
+ ## 100% of reads paired
+ reads = as.numeric(ss(y[1], " ")) * 2
+ unaligned = as.numeric(ss(y[12], " "))
+ o = data.frame(trimmed = FALSE,
+ numReads = reads,
+ numMapped = reads - unaligned,
+ numUnmapped = unaligned,
+ overallMapRate = as.numeric(ss(y[15], "\\%"))/100,
+         concordMapRate = (as.numeric(ss(ss(y[4], "\\(",2), "%"))+as.numeric(ss(ss(y[5], "\\(",2), "%")))/100,
+         stringsAsFactors = FALSE)
+     return(o)
+ }
>
> hisatStats('R2835_C1ULMACXX_summary.txt')
  trimmed  numReads numMapped numUnmapped overallMapRate concordMapRate
1   FALSE 152591250 141072912    11518338         0.9245         0.8392

Here's the HISAT2 output for hg38 for the same sample:

$ more hg38/HISAT2_out/align_summaries/R2835_C1ULMACXX_summary.txt
76295625 reads; of these:
  76295625 (100.00%) were paired; of these:
    15977551 (20.94%) aligned concordantly 0 times
    45965835 (60.25%) aligned concordantly exactly 1 time
    14352239 (18.81%) aligned concordantly >1 times
    ----
    15977551 pairs aligned concordantly 0 times; of these:
      1692144 (10.59%) aligned discordantly 1 time
    ----
    14285407 pairs aligned 0 times concordantly or discordantly; of these:
      28570814 mates make up the pairs; of these:
        19804899 (69.32%) aligned 0 times
        6035584 (21.12%) aligned exactly 1 time
        2730331 (9.56%) aligned >1 times
87.02% overall alignment rate

hg19 vs a re-run of hg19 (20 samples only)

There is about a 1-2 month difference between the hg19 and hg38 (most recent) alignments and I wanted to check that nothing in our scripts for running HISAT2 had any effect. So I re-ran HISAT2 with hg19 for 20 samples. The resulting plots show that the metrics are nearly identical, so that was not the issue.

compare_pheno_hg19subset.pdf

HISAT2 command:

hg19 (re-run):

/dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/Software/hisat2-2.0.4/hisat2 -p 8    -x /dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/Annotation/GENCODE/GRCh37_hg19/hisat2_GRCh37primary -1 ${FILE1} -2 ${FILE2}   -S /dcl01/lieber/ajaffe/lab/libd_alzheimers/hg19_subset/HISAT2_out/${ID}_hisat_out.sam --rna-strandness RF --phred33    2>/dcl01/lieber/ajaffe/lab/libd_alzheimers/hg19_subset/HISAT2_out/align_summaries/${ID}_summary.txt

hg38:

/dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/Software/hisat2-2.0.4/hisat2 -p 8    -x /dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/Annotation/GENCODE/GRCh38_hg38/hisat2_GRCh38primary -1 ${FILE1} -2 ${FILE2}   -S /dcl01/lieber/ajaffe/lab/libd_alzheimers/hg38/HISAT2_out/${ID}_hisat_out.sam --rna-strandness RF --phred33   2>/dcl01/lieber/ajaffe/lab/libd_alzheimers/hg38/HISAT2_out/align_summaries/${ID}_summary.txt

Indexes used

Talking with others, the question of how the indexes were created popped up. We created them using the default parameters using the primary fasta files (regions: PRI) from http://www.gencodegenes.org/releases/25.html and http://www.gencodegenes.org/releases/25lift37.html. More specifically, we used ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/GRCh38.primary_assembly.genome.fa.gz and ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz.

The output from hisat2-inspect -s is included in the sheet below. It also includes the sum of the contigs lengths.

hisat2-inspect.xlsx

Going forward

Something similar to this has been previously reported in https://biostar.usegalaxy.org/p/19910/ and http://seqanswers.com/forums/showthread.php?p=188200. It's also been partially discussed in https://github.com/alexdobin/STAR/issues/39 and https://github.com/infphilo/hisat2/issues/30. One possible option would be to build indexes using just chrs 1 to 22, X, Y and M and ignoring the other contigs (there's about 90 total in hg19 and about 190 in hg38 primary assemblies). We've also looked at the HISAT2 options (and those for creating the index) and don't see anything obvious we should change.

We could try running another RNA-seq aligner (say STAR) on the same 20 samples versus hg19 and hg38 and see if the alignment metrics change that much between the two reference genomes just like they did with HISAT2.

Ultimately, it could be that simply hg19 is not good enough and that we should ignore the hg19 results. But I'm sure that our colleagues will be very confused about the wide differences in the differential expression results where the only change is the reference genome.

If there's other information that could be useful to you, please let me know.

Best, Leonardo

lcolladotor commented 7 years ago

Hi again,

We ran more tests using the 20 worst samples by overall mapping rate with HISAT2 against hg38. You can see them here (now the X and Y axis have the same range):

unnamed-chunk-2-8

We checked that re-running our code on those 20 samples against hg19 produced the same results. This verified that the issue was not in our code for running HISAT2. I had described this before using the first 20 samples instead of the new set of 20 samples (the worst by... see above).

We re-built the genome indexes for hg19 and hg38 by doing a fresh download of the genome fasta files and running:

MAINDIR=/dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/Annotation/HISAT2_index_test

### hg19
gunzip < ${MAINDIR}/GRCh37.primary_assembly.genome.fa.gz > ${MAINDIR}/GRCh37.primary_assembly.genome.fa

/dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/Software/hisat2-2.0.4/hisat2-build -p 6 \
${MAINDIR}/GRCh37.primary_assembly.genome.fa ${MAINDIR}/hisat2_GRCh37primary

### hg38
gunzip < ${MAINDIR}/GRCh38.primary_assembly.genome.fa.gz > ${MAINDIR}/GRCh38.primary_assembly.genome.fa

/dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/Software/hisat2-2.0.4/hisat2-build -p 6 \
${MAINDIR}/GRCh38.primary_assembly.genome.fa ${MAINDIR}/hisat2_GRCh38primary

The results are consistent, so the issue is not about the genome files or how we built our original genome indexes. In any case, now we have the exact code for building the indexes and associated log files.

We also ran STAR (version 2.5.3a) with genome indexes that use the known annotation. Here are the commands for creating the STAR indexes and running it:

## hg19 STAR index
/dcl01/lieber/ajaffe/lab/alt_recount/star_setup/STAR/bin/Linux_x86_64/STAR --runMode genomeGenerate --genomeDir /dcl01/lieber/ajaffe/lab/alt_recount/star_setup/genome_hg19 --genomeFastaFiles /dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/Annotation/GENCODE/GRCh37_hg19/GRCh37.primary_assembly.genome.fa --sjdbGTFfile /dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/Annotation/GENCODE/GRCh37_hg19/gencode.v25lift37.annotation.gtf --sjdbOverhang 71 --runThreadN 8 --outFileNamePrefix /dcl01/lieber/ajaffe/lab/alt_recount/star_setup/star_index_hg19_

## hg38 STAR index
/dcl01/lieber/ajaffe/lab/alt_recount/star_setup/STAR/bin/Linux_x86_64/STAR --runMode genomeGenerate --genomeDir /dcl01/lieber/ajaffe/lab/alt_recount/star_setup/genome --genomeFastaFiles /dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/Annotation/GENCODE/GRCh38_hg38/GRCh38.primary_assembly.genome.fa --sjdbGTFfile /dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/Annotation/GENCODE/GRCh38_hg38/gencode.v25.annotationGRCh38.gtf --sjdbOverhang 71 --runThreadN 8 --outFileNamePrefix /dcl01/lieber/ajaffe/lab/alt_recount/star_setup/star_index.txt

## Example call for STAR with hg19
/dcl01/lieber/ajaffe/lab/alt_recount/star_setup/STAR/bin/Linux_x86_64/STAR --runThreadN 6   --genomeDir /dcl01/lieber/ajaffe/lab/alt_recount/star_setup/genome_hg19 --readFilesIn ${FILE1} ${FILE2}     --outFileNamePrefix /dcl01/lieber/ajaffe/lab/libd_alzheimers/hg19_newidx/STAR_out/${ID}_     --outSAMtype BAM SortedByCoordinate --readFilesCommand gunzip -c     --limitBAMsortRAM 50000000000

You can see below that hg19 with HISAT2 vs hg38 with STAR have similar overall mapping rates.

unnamed-chunk-7-8

hg38 with HISAT2 versus hg38 with STAR does present differences like before:

unnamed-chunk-10-8

and hg19 vs hg38 with STAR are fairly similar.

unnamed-chunk-11-8

HISAT2 with hg19 vs STAR with hg38 also looks decent:

unnamed-chunk-8-8

while HISAT2 with hg38 vs STAR with hg19 looks bad again (as expected by the previous plots).

unnamed-chunk-9-8

In conclusion, this issue does seem HISAT2 specific and it looks like the hg19 genome fasta file is ok. Note that with STAR we do see a greater number of reads mapped to chrs 1-22, X and Y with hg38 than hg19. So maybe there's a parameter (or multiple) that are important to change when running HISAT2 vs hg38 (regions PRI from http://www.gencodegenes.org/releases/25.html).

unnamed-chunk-11-10

For completeness, here is a zip file with a HTML (R Markdown) file with all the comparison plots: compare_hg.html.zip.

We'd love to hear what you think about this!

Best, Leonardo

jjlinscientist commented 6 years ago

Hi Leonardo,

Just wondering if you had done any parameter tuning with HiSAT2 for better hg38 alignment. We see the same effect when aligning the same samples to hg38 (compared to hg19, previously), as well.

Best,

Justin

cabnine commented 6 years ago

We've met a similar problem, and here it is. We had two versions of a plant genome with similar quality assembled by two different groups, and we mapped several sample RNA-Seq reads onto them using hisat2-2.1.0 with default parameters, but got very different mapping rate, one is extremlely low (30-40%), the other one is more reasonable(80-90%). However, when we then tried blastn search for the unmapped reads against the first one genome, large fraction of the reads could find a perfect match. So we don't know where is the problem.

infphilo commented 6 years ago

Additional sequences introduced in hg38 increase the chance of reads being mapped to many more locations compared to when using hg37. HISAT2 might stop aligning a read when it could be possibly mapped to more than a certain number of locations. A temporary solution is to use an option -k with a higher value (perhaps -k 20). We'll look into the problem and hopefully find a good solution to it.