DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0

Excessive memory use and extreme mapping times for subset of samples #297

Open rstewa03 opened 3 years ago

rstewa03 commented 3 years ago

Hi, I am having an interesting issue while using hisat2 to map reads from different larval tissues to a chromosome-level genome assembly. The genome index was made with hisat2-build, splice sites were identified with extract_splice_sites.py, and exons were identified with hisat2_extract_exons.py. Reads were mapped with:

$TOOLS/hisat2-2.2.1/hisat2 -p 20 --rg-id=$I -x $VC_REFS_DIR/GCA_905220365.1_ilVanCard2.1_genomic_mod -1 $VC_DATA_CLEAN/${I}_1.fastq.filt.fq.gz -2 $VC_DATA_CLEAN/${I}_2.fastq.filt.fq.gz -S $VC_ALIGN_DIR/${I}.sam --dta --no-mixed --no-discordant --summary-file $VC_ALIGN_DIR/${I}_summaryfile.txt 

The majority of the samples mapped in 7-10 minutes using max. 3-4% of our server's 1008G RAM. For one of the tissues, however, mapping is taking between 5 and 14 hours per sample and using up to 90% of the available RAM (... and rising for the current sample). Although these samples tend to have fewer reads, I previously used STAR to map all the reads and found that these samples tended to have considerably more unmapped reads (for reasons other than mismatches or read length, but otherwise unidentified). I would like to continue to use hisat2 so that I can use stringtie for novel transcript discovery (hence, --dta), but I am unsure how to mitigate the stress it is putting on our system. Any suggestions would be very much appreciated!

parkchanhee commented 3 years ago

@rstewa03 Thank you for reporting this.

Could you run hisat2 again with the '--no-temp-splicesite' option? Excessive memory usage during RNA read alignment has been reported in some cases. HISAT2 uses splice sites found during the alignment of earlier reads when aligning later reads. This option disables the use of splice sites found during alignment. The alignment rate may decrease slightly. We are currently working on this issue.
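For reference, here is a sketch of the original mapping command with '--no-temp-splicesite' added. The default values for the path variables and the sample name are placeholders (assumptions, not taken from the report), and the function name run_hisat2 is only illustrative. The command is printed rather than executed unless RUN=1 is set, so it can be checked first.

```shell
# Placeholder defaults (assumptions); override them in the environment.
: "${TOOLS:=/path/to/tools}"
: "${VC_REFS_DIR:=/path/to/refs}"
: "${VC_DATA_CLEAN:=/path/to/clean_reads}"
: "${VC_ALIGN_DIR:=/path/to/alignments}"

# Assemble the command from the original post for one sample and add
# --no-temp-splicesite. Prints the command unless RUN=1, then executes it.
run_hisat2() {
  sample=$1
  set -- "$TOOLS/hisat2-2.2.1/hisat2" -p 20 "--rg-id=$sample" \
    -x "$VC_REFS_DIR/GCA_905220365.1_ilVanCard2.1_genomic_mod" \
    -1 "$VC_DATA_CLEAN/${sample}_1.fastq.filt.fq.gz" \
    -2 "$VC_DATA_CLEAN/${sample}_2.fastq.filt.fq.gz" \
    -S "$VC_ALIGN_DIR/${sample}.sam" \
    --dta --no-mixed --no-discordant \
    --no-temp-splicesite \
    --summary-file "$VC_ALIGN_DIR/${sample}_summaryfile.txt"
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

run_hisat2 "${I:-sample1}"
```

Running it with RUN=1 and I set to a real sample name executes the alignment; everything else matches the command in the first post.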

rstewa03 commented 3 years ago

@parkchanhee

Thanks for the fast response. Yes, adding '--no-temp-splicesite' drastically decreases the mapping time (a sample that previously took 4h and 30% memory mapped in under 10 minutes and with negligible RAM).

I'm specifically interested in alternative splicing among samples. Do you have a sense of how this option affects the reliability of mapping across splice junctions?

parkchanhee commented 3 years ago

@rstewa03 It doesn't affect mapping to splice junctions if the index was built with splice sites. This option may affect the alignment of reads that span multiple exons with a small anchor (1~7 bp) at novel splice sites.
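For anyone unsure whether their index includes splice sites, the following is a sketch of one way to build such an index. It assumes a GTF annotation is available for the assembly; the file names (genes.gtf, genome.fa, splicesites.txt, exons.txt, genome_tran) and the function name index_build_script are hypothetical. The script names are those shipped with HISAT2 2.2.1, and --ss and --exon are the hisat2-build options that take their output. The function only prints the commands so they can be reviewed first.

```shell
# Hypothetical file names (assumptions); override them in the environment.
: "${GTF:=genes.gtf}"
: "${FASTA:=genome.fa}"
: "${SS:=splicesites.txt}"
: "${EXONS:=exons.txt}"
: "${INDEX:=genome_tran}"

# Print the three steps: extract splice sites, extract exons, and build
# the index with both supplied to hisat2-build.
index_build_script() {
  cat <<EOF
hisat2_extract_splice_sites.py "$GTF" > "$SS"
hisat2_extract_exons.py "$GTF" > "$EXONS"
hisat2-build -p 20 --ss "$SS" --exon "$EXONS" "$FASTA" "$INDEX"
EOF
}

index_build_script          # review the commands
# index_build_script | sh   # run them once the paths are correct
```

The resulting index prefix (here genome_tran) is what would be passed to hisat2 with -x.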