COMBINE-lab / pufferfish

An efficient index for the colored, compacted, de Bruijn graph
GNU General Public License v3.0

Building an index using `salmon index` seems stuck at twopaco step #12

Closed · dombraccia closed this issue 4 years ago

dombraccia commented 4 years ago

Hello COMBINE-lab, I am trying to use salmon to quantify genes from a gene catalogue (the Integrative Gene Catalogue of human gut microbial genes) using shotgun reads from HMP2, but my pipeline seems to be stuck at the index-building phase, specifically the TwoPaCo step. The only command I have run so far is:

salmon index -t data/IGC/IGC.fa -i processed/IGC_index --threads 16 -f 37

where IGC.fa is a FASTA file containing ~9,870,000 genes and taking up 7.7 GB (ntHll estimated 6,091,444,578 distinct k-mers). The recommended filter size was 2^37, so on a subsequent run I just included the -f 37 flag to save the time spent calculating it.
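As a rough sanity check on why -f 37 lands in a sensible range for ~6.1 billion distinct k-mers, here is a minimal sketch using the standard Bloom filter false-positive estimate. This is generic Bloom filter math, not salmon's or pufferfish's exact internal heuristic, and the hash-function count of 5 is taken from the TwoPaCo log further down:

import math

# Standard Bloom filter false-positive estimate: p ~ (1 - e^(-k*n/m))^k
n = 6_091_444_578   # distinct k-mers (ntHll estimate from the log)
k = 5               # hash functions (reported by TwoPaCo in the output log)
m = 2 ** 37         # -f 37 => 137,438,953,472 bits, i.e. 16 GiB of filter

fpr = (1 - math.exp(-k * n / m)) ** k
print(f"filter memory: {m / 8 / 2**30:.0f} GiB, estimated FPR: {fpr:.4%}")
# prints: filter memory: 16 GiB, estimated FPR: 0.0311%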

The only issue now is that it seems to be stuck at this building step (the program has been running for 24 hours), so I wanted to check whether that is normal or whether I should modify my parameters to somehow get past this point. I am running this on an HPC cluster and requested 500 GB of RAM, so I don't think memory is the problem.

Here is the current state of the error log:

Version Info: This is the most recent version of salmon.

index ["processed/IGC_index"] did not previously exist . . . creating it [2020-05-29 11:14:16.148] [jLog] [warning] The salmon index is being built without any decoy sequences. It is recommended that decoy sequence (either computed auxiliary decoy sequence or the genome of the organism) be provided during indexing. Further details can be found at https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode. [2020-05-29 11:14:16.148] [jLog] [info] building index out : processed/IGC_index [2020-05-29 11:14:16.149] [puff::index::jointLog] [info] Running fixFasta

[Step 1 of 4] : counting k-mers counted k-mers for 9870000 transcripts [2020-05-29 11:19:53.983] [puff::index::jointLog] [info] Replaced 85,915 non-ATCG nucleotides [2020-05-29 11:19:53.983] [puff::index::jointLog] [info] Clipped poly-A tails from 13 transcripts wrote 9879896 cleaned references

And here is the current state of the output log:

Threads = 16
Vertex length = 31
Hash functions = 5
Filter size = 137438953472
Capacity = 2
Files: processed/IGC_index/ref_k31_fixed.fa

Round 0, 0:137438953472
Pass    Filling    Filtering
1       6242       7686
2       2040       30
True junctions count = 51430096
False junctions count = 26751117
Hash table size = 78181213
Candidate marks count = 222470897

Reallocating bifurcations time: 80
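One thing worth noting in that TwoPaCo summary (the interpretation that these false junctions stem from Bloom filter false positives is an assumption on my part, but the arithmetic uses only the numbers above): the false junctions account for about a third of the hash table.

true_junctions = 51_430_096
false_junctions = 26_751_117
hash_table_size = 78_181_213  # equals true_junctions + false_junctions in this run

print(f"false-junction share: {false_junctions / hash_table_size:.1%}")
# prints: false-junction share: 34.2%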

Let me know if there is anything else I should provide.

fataltes commented 4 years ago

Thank you @dombraccia. Unfortunately, I think this is an issue that arises when there is a very high number of distinct k-mers, regardless of the size of the FASTA file. In such cases, even if we choose a large Bloom filter size based on the approximate k-mer distribution, the final hash table is still going to be so large that we have to deal with the slowness and memory growth it causes.

I think in those cases it might be worth trying to find the sweet spot between the size of the final hash table and the size of the Bloom filter.
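To make that sweet spot concrete, here is a hedged sketch (the same generic Bloom filter estimate as above, not pufferfish's internal logic) comparing a few candidate -f values for this catalogue. A larger filter costs more RAM up front but should lower the false-positive rate and, in principle, the number of false junction candidates that end up in TwoPaCo's hash table:

import math

n = 6_091_444_578   # ntHll-estimated distinct k-mers from the log above
k = 5               # hash functions reported by TwoPaCo

# Sweep candidate -f exponents: filter memory vs. estimated false-positive rate.
for f in (36, 37, 38, 39):
    m = 2 ** f      # filter size in bits
    fpr = (1 - math.exp(-k * n / m)) ** k
    print(f"-f {f}: {m / 8 / 2**30:5.0f} GiB filter, est. FPR {fpr:.4%}")

Under these assumptions, going from 2^37 to 2^39 quadruples the filter's footprint (16 GiB to 64 GiB) while dropping the estimated false-positive rate from roughly 0.03% to below 0.0001%; whether that shrinks the "False junctions count" and the resulting hash table enough to speed things up is exactly the trade-off in question.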

What do you think @rob-p ?

dombraccia commented 4 years ago

Update: pufferfish did eventually finish building the index! It took ~3.3 days in total.

I will close the issue, but I'll also run the indexing step again with a larger Bloom filter size of 2^38 or 2^39 bits and report the timing back here, to potentially help build a better recommendation for filter size relative to the reference material.