DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0

Hisat3n | Masked genome #325

Status: Closed (ezecalvo closed this issue 2 years ago)

ezecalvo commented 2 years ago

Hi, I'm trying to build a HISAT-3N index for a masked genome, but the build takes >500 GB of memory, which is my node limit. This happens even when setting options like --bmax or --dcv to the minimum.

Is there a way to break the genome into subsections and then merge them? I thought about building an index for each chromosome, but I'm not sure how to merge them once I have the ht2 files.

Thanks!

imzhangyun commented 2 years ago

Hello,

I believe the large memory problem may be caused by the graph index building process. Maybe the SNP or SS database is too large for hisat-3n. Could you decrease the number of SNPs (try common SNPs only) and try again?

Thanks, Leo

ezecalvo commented 2 years ago

Hi,

I'm not using any SNP file, just masking annotated SNPs. Is that what you were referring to with "common"?

Thanks

imzhangyun commented 2 years ago

Could you show me the script you used for index building? Also, could you share the database files you used for index building? Then we can try it on our side.

Thanks, Leo

imzhangyun commented 2 years ago

Also, can I have the reference (FASTA) file? 40 GB memory usage for chr1 is also too big for me.

ezecalvo commented 2 years ago

Sure: https://www.dropbox.com/sh/jaerase0es7ygr8/AADJDylN-AWB5nwvg6qZSgXDa?dl=0

My code: hisat-3n-build --base-change T,C --noauto --bmax 2 --dcv 512 -p 1 --ss mm10.ss --exon mm10.exon mm10_masked.fasta output/hisat3_genome

If it helps, this is the report from a job submission using -p 20. The job gets killed when it reaches 500 GB:

[Screenshot of the job report, 2021-10-04; image not included]

Also, when I tried a smaller genome (for example, just chr1), it worked fine and used ~40 GB of memory at most.

ezecalvo commented 2 years ago

> Also, can I have the reference (FASTA) file? 40 GB memory usage for chr1 is also too big for me.

Not sure what you mean by reference, but I just added the full FASTA file (non-masked) and a VCF with all the positions I masked. I'm not using the entire fasta for this!

imzhangyun commented 2 years ago

Hello,

I tried to build the graph index with the masked genome, and it also failed on my side. The masked genome makes the graph index very complicated, so HISAT2 (HISAT-3N) cannot handle it. However, there is an alternative method that still lets you use the splice site information:

  1. Build the linear hisat-3n index (without --ss or --exon).
  2. Align your reads against the linear index (from step 1) with the option --known-splicesite-infile <path> (please check the HISAT2 manual for more information). HISAT-3N can use the splice site information during the alignment process to increase alignment accuracy.
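
The two steps above could look roughly like the sketch below. It reuses mm10_masked.fasta, mm10.ss, and --base-change T,C from the command earlier in this thread. The index prefix, the reads file, and the output SAM name are hypothetical placeholders, and the hisat-3n flags (--index, -U, -S) are assumed from the HISAT-3N/HISAT2 manuals, so please check them against your installed version. By default the script only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
set -eu

# Inputs taken from the thread.
GENOME_FASTA="mm10_masked.fasta"
SPLICE_SITES="mm10.ss"

# Hypothetical names, chosen only for illustration.
INDEX_PREFIX="output/hisat3_linear"
READS="reads.fastq"
OUT_SAM="aligned.sam"

# Print the command in dry-run mode (the default); execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: build a linear index, i.e. leave out --ss and --exon.
build_index() {
    run mkdir -p "$(dirname "$INDEX_PREFIX")"
    run hisat-3n-build --base-change T,C -p 1 "$GENOME_FASTA" "$INDEX_PREFIX"
}

# Step 2: supply the known splice sites at alignment time instead.
align_reads() {
    run hisat-3n --index "$INDEX_PREFIX" --base-change T,C \
        -U "$READS" --known-splicesite-infile "$SPLICE_SITES" -S "$OUT_SAM"
}

build_index
align_reads
```

The point of the split is that the splice sites move from index construction, where they trigger the memory-hungry graph build, to the alignment step.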

Best, Leo

ezecalvo commented 2 years ago

That works like a charm.

Thanks!