DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0

massive memory use by hisat2-build when attempting to index the rat genome #123

Open drtjpemberton opened 7 years ago

drtjpemberton commented 7 years ago

I am trying to index the rat genome (Ensembl release 89) using the "make_rnor6_tran.sh" script in the hisat2 installation folder (this script includes known transcript structure in the index) on a workstation with 512 GB of RAM and 28 cores running CentOS 7. The program is consistently killed by the kernel after exhausting system memory, even though this machine has more than twice the amount you recommend for the human genome when including known SNPs, splice sites, and exons in the index. The rat genome is comparable in size to the human genome, but the number of SNPs and transcripts is much lower, so I am at a loss as to why this keeps happening.

One possibility is a bug in how hisat2-build assesses available system memory. On systems with 256 GB of RAM or less, it prints an "out of memory, trying more friendly settings" message as the program continues to search the parameter space (albeit ultimately unsuccessfully), while on systems with more than 256 GB of RAM it exhausts the memory without a second thought and is killed by the kernel.
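For anyone comparing machines, one way to see how much memory the kernel itself reports as available before starting a build is to read /proc/meminfo. The following is only a minimal sketch and says nothing about how hisat2-build measures memory internally; the function names are made up for this example, and it assumes a Linux system.

```python
def available_memory_gib(meminfo_text):
    """Return available memory in GiB, given the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            # Memory entries in /proc/meminfo are reported in kB.
            fields[key.strip()] = int(parts[0])
    # Prefer the kernel's MemAvailable estimate; fall back to MemFree
    # on kernels that do not report it.
    kib = fields.get("MemAvailable", fields.get("MemFree"))
    if kib is None:
        raise ValueError("no MemAvailable or MemFree entry found")
    return kib / (1024 ** 2)


def read_available_memory_gib(path="/proc/meminfo"):
    """Convenience wrapper (hypothetical name) that reads the file on Linux."""
    with open(path) as handle:
        return available_memory_gib(handle.read())
```

Calling `read_available_memory_gib()` just before launching the build gives a number to compare against the point at which the kernel kills the process.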

Are you able to provide the relevant settings you used when indexing the human genome? Or, since indexing is a relatively quick process, can you index Ensembl release 89 of the rat genome, including known SNPs and transcripts in the index, and post the "rnor6_snp_tran" index on your group's hisat2 web page?

Thanks in advance,

Trevor
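For reference, the kind of build command in question can be sketched as below. This is an illustration, not the authors' settings: it only assembles a hisat2-build argument list from the documented `-p`, `--ss`, `--exon`, `--snp`, `--haplotype`, `--noauto`, `--bmax` and `--dcv` options, and every file name and numeric value in the usage is a placeholder. Whether fixing `--bmax` and `--dcv` by hand helps with the memory use described in this thread is not established here; check `hisat2-build --help` for your version.

```python
import shlex


def build_command(fasta, index_base, threads=1, splice_sites=None,
                  exons=None, snps=None, haplotypes=None,
                  bmax=None, dcv=None):
    """Assemble (but do not run) a hisat2-build command line."""
    cmd = ["hisat2-build", "-p", str(threads)]
    if splice_sites:
        cmd += ["--ss", splice_sites]      # known splice sites
    if exons:
        cmd += ["--exon", exons]           # known exons
    if snps:
        cmd += ["--snp", snps]             # known SNPs
    if haplotypes:
        cmd += ["--haplotype", haplotypes]
    if bmax is not None or dcv is not None:
        # --noauto turns off automatic parameter selection so that the
        # values given here are used as-is.
        cmd.append("--noauto")
        if bmax is not None:
            cmd += ["--bmax", str(bmax)]
        if dcv is not None:
            cmd += ["--dcv", str(dcv)]
    cmd += [fasta, index_base]
    return cmd


def as_shell(cmd):
    """Render the argument list as a copy-pasteable shell line."""
    return " ".join(shlex.quote(arg) for arg in cmd)
```

For example, `as_shell(build_command("genome.fa", "genome_tran", threads=4, splice_sites="genome.ss", exons="genome.exon"))` gives a transcript-aware build line with hypothetical file names; the list returned by `build_command` can also be passed to `subprocess.run`.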

Kapeel commented 6 years ago

Hi, I am facing a similar issue indexing the human genome. Below is the error I get:

hisat2-indexing Homo_sapiens.GRCh38.dna.toplevel
Settings:
  Output files: "Homo_sapiens.GRCh38.dna.toplevel.*.ht2l"
  Line rate: 7 (line is 128 bytes)
  Lines per side: 1 (side is 128 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Local sequence length: 57344
  Local sequence overlap between two consecutive indexes: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  Homo_sapiens.GRCh38.dna.toplevel.fa
Reading reference sizes
  Time reading reference sizes: 00:02:49
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:06:15
  Time to read SNPs and splice sites: 00:00:00
Using parameters --bmax 576132141 --dcv 1024
  Doing ahead-of-time memory usage test
Executing: hisat2 -I 0 --min-intronlen 20 --max-intronlen 500000 --dta -X 500 --dta-cufflinks -x Homo_sapiens.GRCh38.dna.toplevel -U SRR849504.fastq -p 4 | samtools view -bS - > SRR849504.fastq.bam

Error reading _rstarts[] array: 36448, 42288
Error: Encountered internal HISAT2 exception (#1)
Command: /hisat2/hisat2-align-l --wrapper basic-0 -I 0 --min-intronlen 20 --max-intronlen 500000 --dta -X 500 --dta-cufflinks -x Homo_sapiens.GRCh38.dna.toplevel -p 4 -U SRR849504.fastq 
(ERR): hisat2-align exited with value 1
[samopen] no @SQ lines in the header.
[sam_read1] missing header? Abort!
[bam_header_read] EOF marker is absent. The input is probably truncated.

What settings are recommended? Thanks, Kapeel
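In the log above, the alignment command starts right after the build's "ahead-of-time memory usage test", and the `_rstarts[]` error comes from the aligner reading the index, so one thing worth ruling out is an incomplete index. The sketch below is a guess at a useful check, not a confirmed diagnosis: it assumes the index is split into eight numbered files (`.1.ht2l` to `.8.ht2l` for a large index), as in the prebuilt downloads, and the function name is hypothetical.

```python
from pathlib import Path


def missing_index_files(index_base, suffix="ht2l", parts=8):
    """List expected index files that are absent or empty.

    Assumes the index consists of `parts` files named
    <index_base>.<n>.<suffix>; adjust both if your version differs.
    """
    expected = [Path(f"{index_base}.{n}.{suffix}") for n in range(1, parts + 1)]
    return [p for p in expected if not p.is_file() or p.stat().st_size == 0]
```

An empty list from `missing_index_files("Homo_sapiens.GRCh38.dna.toplevel")` would suggest the files are at least present and non-empty; anything else points to a build that did not finish.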

drtjpemberton commented 6 years ago

Kapeel,

You can download an index for the GRCh38 release of the human reference sequence from the authors' website (look on the right-hand side as you scroll down). This will save you the frustration of trying to do this yourself!

Trevor

Lee211 commented 6 years ago

I am facing the same issue. I have downloaded the "R. norvegicus, UCSC rn6, genome index" from the hisat2 website, but it does not include splice sites and exons. I do not think my results are reliable, because some genes have reads mapped to the genome but an FPKM of 0.

snsansom commented 6 years ago

Also have this issue - can't build a genome_trans index (using version 2.1.0) for mm10 with Ensembl 91 annotations due to lack of memory on a node with 1TB of RAM.