DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0

Hisat-3N index not building #321

Closed ezecalvo closed 2 years ago

ezecalvo commented 2 years ago

Hi,

I'm using hisat-3n to build an hg38 index on an LSF cluster. The job has been running for 20 days without any change in the log file. Is this normal? What could be the problem? The same files work fine when building an index with hisat2.

My command, using 30 threads with 10 GB of memory each:

bsub -q long -n 30 -R rusage[mem=10000] -R span[hosts=1] -W 720:00 hisat-3n-build --base-change T,C -p 30 --ss hg38.ss --exon hg38.exon hg38.fa hg38_hisat3n/hisat3_genome

My log file:

Settings:
  Output files: "g38.fasta hg38_hisat3n/hisat3_genome.3n.*.ht2"
  Line rate: 7 (line is 128 bytes)
  Lines per side: 1 (side is 128 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Local sequence length: 57344
  Local sequence overlap between two consecutive indexes: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  hg38.fa
Reading reference sizes
  Time reading reference sizes: 00:02:02
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:25

Thanks!

imzhangyun commented 2 years ago

Hello,

Since we are not very familiar with bsub, we don't know why this happened. Here is one thing we believe could cause the problem: hisat-3n-build writes temporary files while building the hisat-3n index. If bsub runs multiple hisat-3n-build processes and they write to the same temporary file, this could cause an error. Could you run hisat-3n-build as a single task on a single host, with -p 30 for multithreading?

Best, Leo

ezecalvo commented 2 years ago

Hi,

Thanks for the fast response.

I'm already running with one host and one task. Do you have a command that you use with a queuing system? It could be sbatch or something similar; I just want to translate it into bsub.

I should also mention that it works just fine with a smaller genome, finishing in ~4 hrs. For example, I tried this using just chromosome 1 from the FASTA file.

imzhangyun commented 2 years ago

Hello,

I just ran hisat-3n-build on a cluster with 256 GB of memory, and it looks OK. Here is my script:

./hisat-3n/hisat-3n-build --base-change T,C -p 30 --ss ../data/reference/genome.ss --exon ../data/reference/genome.exon ../data/reference/genome.fa ../tmp/hisat-3n_genome

Could you pull and build the newest hisat-3n and try again? The graph index building process may use more than 100 GB of memory. You should see many temporary files with the .rf suffix in your output directory about 10 minutes after index building starts.

Thanks, Leo
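
One way to check for the .rf temporary files mentioned above is sketched below. This is only a minimal sketch: the function name `list_rf_files` is hypothetical, and the directory you pass in is assumed to be wherever your index prefix points (for example `hg38_hisat3n` in the original command).

```python
from pathlib import Path


def list_rf_files(output_dir):
    """Return the sorted names of files ending in .rf directly inside output_dir.

    Returns an empty list if the directory does not exist.
    """
    directory = Path(output_dir)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.glob("*.rf") if p.is_file())


if __name__ == "__main__":
    # Assumption: "hg38_hisat3n" is the output directory used earlier in this thread.
    found = list_rf_files("hg38_hisat3n")
    print(f"{len(found)} .rf file(s) found")
```

If the count stays at zero well after the build has started, that may indicate the build has stalled before the graph construction step, although this check alone cannot say why.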

ezecalvo commented 2 years ago

Hi,

That didn't work!

I made it work (with or without bsub) by not using --ss and --exon. I checked the obvious things, such as chromosome names being consistent between the FASTA file and the ss/exon files, and that looks fine!

This is what the ss and exon files look like:

hg38.ss
1 12056 12178 +
1 12226 12612 +
1 12696 12974 +
1 12720 13220 +
1 13051 13220 +
1 13373 13452 +
1 14500 15004 -
1 15037 15795 -
1 15946 16606 -
1 16764 16857 -

hg38.exon
1 11868 12226 +
1 12612 12720 +
1 12974 13051 +
1 13220 14500 +
1 15004 15037 -
1 15795 15946 -
1 16606 16764 -
1 16857 17054 -
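
The chromosome-name consistency check described above could be automated roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the sequence name is taken to be the text after `>` up to the first whitespace, and the annotation files are assumed to be whitespace-delimited with the chromosome name in the first column, as in the excerpts shown.

```python
def fasta_names(path):
    """Collect sequence names from FASTA header lines.

    Assumption: the name is the text after '>' up to the first whitespace.
    """
    names = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def annotation_names(path):
    """Collect the first-column value of every non-blank line."""
    names = set()
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if fields:
                names.add(fields[0])
    return names


def names_missing_from_fasta(fasta_path, annotation_path):
    """Return sorted first-column names that have no matching FASTA header."""
    return sorted(annotation_names(annotation_path) - fasta_names(fasta_path))
```

For example, `names_missing_from_fasta("hg38.fa", "hg38.ss")` returning an empty list would mean every first-column value in hg38.ss also appears as a FASTA sequence name. Note that this only compares the first column; it does not validate the coordinates or the strand.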

imzhangyun commented 2 years ago

Hello,

I tested hisat-3n-build with the --ss, --exon, and -p 30 options. The build took about 2 hours and finished without error. To help with troubleshooting, could you send the links from which you downloaded hg38.fa, hg38.ss, and hg38.exon? Then we can test on our side.

Best, Leo

ezecalvo commented 2 years ago

Hi,

Here are the files: https://www.dropbox.com/sh/fu901c8p79x5y15/AADNS3pSAHWYFJws4bdnBE85a?dl=0

I built hg38.ss and hg38.exon following the instructions in the hisat2 manual.

Thanks!

imzhangyun commented 2 years ago

Hello,

I just checked your files. Your hg38.ss and hg38.exon have some extraneous content at the end of the file. You can use tail -n 50 hg38.ss to see it. It looks like output from your job submission. Could you rebuild the hg38.ss and hg38.exon files and then build the hisat-3n index?

Thanks, Leo
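
Beyond inspecting the tail by hand, stray lines like these can be located anywhere in the file with a small script. The following is a sketch, not part of hisat-3n: the function name `find_stray_lines` is hypothetical, and the record pattern is an assumption based on the excerpts posted earlier in this thread (a name, two integers, and an optional strand).

```python
import re

# Assumption: a valid record is a sequence name, two non-negative integers,
# and an optional +/- strand, separated by whitespace.
RECORD = re.compile(r"^\S+\s+\d+\s+\d+(\s+[+-])?\s*$")


def find_stray_lines(path):
    """Return (line_number, text) for each non-blank line that is not a record."""
    stray = []
    with open(path) as handle:
        for number, line in enumerate(handle, start=1):
            if line.strip() and not RECORD.match(line):
                stray.append((number, line.rstrip("\n")))
    return stray


if __name__ == "__main__":
    import sys

    # Usage: python check_annotation.py hg38.ss hg38.exon
    for filename in sys.argv[1:]:
        for number, text in find_stray_lines(filename):
            print(f"{filename}:{number}: {text}")
```

Running it on both files before starting a long build prints each offending line with its line number; no output means every non-blank line matched the assumed pattern.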

ezecalvo commented 2 years ago

Ouch, I'm deeply sorry for such a silly mistake. Just removed it and the index is building normally. Thanks a lot for the patience and the help!