antonisdim / haystac

Code repository for the HAYSTAC pipeline
MIT License

Haystac Threads Sleeping #11

Closed Pkaps25 closed 2 years ago

Pkaps25 commented 3 years ago

Hello,

I am building a database with 800k entries and am running into the following issue: I pass --cores 30 to haystac-database along with a sequences-file argument, and top shows 35 threads in the haystac process, of which only 1 is running (the other 34 are sleeping). It is impractical to build a database this large with only 1 thread. Could you advise on how to debug this?

Building off of this, I see that there is a --batch option for large database runs. The step that is moving very slowly is entrez_custom_sequences; would batching help this rule? Finally, when I subset from 800k sequences down to 1k, the script moved much faster, in the sense that each individual job completed more quickly.

I will also add that I’ve built a database with the same references, but with the files concatenated together to give a total number of ~2500 files, and Haystac built the database in a few hours. Thank you for your help.

Haystac version is 0.3.2
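
For anyone wanting to check the running/sleeping split described above without watching top, here is a minimal sketch that tallies per-thread states on Linux by reading /proc/<pid>/task/<tid>/stat. It is not part of HAYSTAC; the function names are made up for illustration, and it assumes a Linux /proc filesystem and that you already know the PID of the haystac process.

```python
import collections
import os


def thread_state(stat_line):
    """Return the one-letter state from a /proc/<pid>/task/<tid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the LAST closing parenthesis.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def count_thread_states(pid):
    """Tally thread states (R = running, S = sleeping, D = disk wait, ...)."""
    counts = collections.Counter()
    task_dir = f"/proc/{pid}/task"  # Linux-only assumption
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as handle:
                counts[thread_state(handle.read())] += 1
        except OSError:
            # The thread exited between listdir() and open().
            continue
    return counts
```

Calling `count_thread_states(pid)` a few times, some seconds apart, gives a rough picture of whether more than one thread is ever in the R state, or whether most threads sit in S (sleeping) or D (waiting on disk).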

antonisdim commented 3 years ago

Hello Peter,

I hope you are doing great!

Indeed, entrez_custom_sequences can be quite slow if you have many records in a single fasta file. The reason is that this rule checks whether the input fasta file for a taxon is bgzip compressed and, if not, converts it to the right format, which can be time consuming for big files. Unfortunately, we have not yet come up with a way to multi-thread this compression process. The rule itself can run in parallel, though, so multiple sequence/taxon pairs specified in the --sequences-file can be converted concurrently. We have not yet tested whether batching that rule would help, but we could give it a shot in the future.
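
To illustrate the kind of check described here, below is a rough sketch of how one might test whether a file is already bgzip (BGZF) compressed, and run that test over many files concurrently. This is not HAYSTAC's actual implementation. It relies on the BGZF convention that each block is a gzip member with the FEXTRA flag set and a "BC" extra subfield of length 2. The function names and the worker count are assumptions for illustration, and the conversion step itself is deliberately left out.

```python
import concurrent.futures
import struct


def is_bgzf(path):
    """Return True if the file starts with a BGZF (bgzip) block header."""
    with open(path, "rb") as handle:
        header = handle.read(12)
        if len(header) < 12:
            return False
        # gzip magic bytes, deflate method, and the FEXTRA flag (0x04).
        if header[:3] != b"\x1f\x8b\x08" or not header[3] & 0x04:
            return False
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = handle.read(xlen)
    # Walk the extra subfields looking for SI1='B', SI2='C', SLEN=2.
    pos = 0
    while pos + 4 <= len(extra):
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if extra[pos:pos + 2] == b"BC" and slen == 2:
            return True
        pos += 4 + slen
    return False


def files_needing_bgzip(paths, workers=4):
    """Check many files concurrently; return those not yet in BGZF format."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        already_ok = list(pool.map(is_bgzf, paths))
    return [path for path, ok in zip(paths, already_ok) if not ok]
```

The check only reads a few bytes per file, so it is cheap; the expensive part is recompressing whichever files `files_needing_bgzip` returns, which is the step described above as hard to multi-thread within a single file.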

The other thing that could make it look like the threads are sleeping during the DB building process is bowtie2 building the DB-wide index, as it does not always use the maximum number of cores provided to it by the user.

Please let me know if that makes sense, and I'll definitely run a test to see how I could optimise this step.

Again, thank you for your comment and patience!

Best, Antony