Xinglab / espresso


Optimal number of threads to run ESPRESSO_C with samples 10-25M reads #61


pclavell commented 2 days ago

Hello, I've been struggling to run ESPRESSO_C for a while. I previously used 100 threads, but samples with 10-25M reads wouldn't finish within 30h. Then I saw in another issue that the threads actually compete with each other, so adding threads makes the run slower, and that 5 threads were faster than 20. So I decided to run ESPRESSO_C with 5 threads. However, after 16h it is still loading splice junction info. Surprisingly, a sample with 4M reads finished within 3h. What do you recommend doing?

I've also tried running the snakemake pipeline as shown in the documentation, but it doesn't find the Snakefile or the profiles unless I provide their full paths, and even then the pipeline doesn't start. Thanks

EricKutschera commented 2 days ago

This is the issue that mentions C step threads competing for the filesystem: https://github.com/Xinglab/espresso/issues/29#issuecomment-1699202865

I usually run 5 threads per C job and have many C jobs running at the same time on different nodes in a compute cluster (using the snakemake). That issue also mentions that some BLAST versions are slower than others: versions 2.10.1, 2.12.0, and 2.14.1 all had good performance, and I've recently been running with version 2.15.0, which also seems ok.
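
If you want to double-check which BLAST the C jobs would pick up (assuming blastn is the binary on the PATH those jobs see), a quick check is:

which blastn
blastn -version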

The code between Loading splice junction info and the next print is reading the input files: https://github.com/Xinglab/espresso/blob/v1.4.0/src/ESPRESSO_C.pl#L168

If the code is actually spending 16h loading those files then it seems like there is a filesystem issue. It could be that the code is past that part and the prints are just being buffered. You could check to see if anything is being written to the output directory for that job
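
For example (substituting the actual output directory for that C job; the path here is just a placeholder):

# newest files first; timestamps that keep updating mean the job is making progress
ls -lt /path/to/c_work_dir | head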

The snakemake needs to be run from the snakemake directory of the source code in order to find the Snakefile, snakemake_config.yaml, and also the scripts/ directory. Are you getting a specific error message?
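
As a rough sketch, assuming the repo is cloned at /path/to/espresso and snakemake_config.yaml has been filled in (the profile path below is only a placeholder for whatever cluster profile you set up):

cd /path/to/espresso/snakemake
# run locally with a core limit
snakemake --cores 8
# or point --profile at your cluster profile directory
snakemake --profile /path/to/your/profile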

The C step can use 1 thread per read group. The S step separates the alignments into groups of alignments that have overlapping coordinates to define the read groups. After running the S step there should be one C directory created per input file. Each default C directory would work on all the read groups for that input file. If using the snakemake then https://github.com/Xinglab/espresso/blob/v1.4.0/snakemake/scripts/split_espresso_s_output_for_c.py is used to create C directories according to read groups. Since the threads work on read groups, that split can be more efficient. The split also uses --target-reads-per-c, which limits the number of reads assigned to a C job even if a particular read group has a large number of reads.

With the default split, if there is a read group that has a very large number of reads then the thread assigned to that group ends up doing almost all the work. My guess is that the C job that is taking a long time has some large read group. The thread for that group essentially determines the total time and adding more threads wouldn't help

You could run the split and combine scripts manually instead of using the snakemake:

python snakemake/scripts/split_espresso_s_output_for_c.py --orig-work-dir out_dir_s --new-base-dir out_dir_c --target-reads-per-c 500000 --num-threads-per-c 5 --genome-fasta /path/to/genome.fasta

perl src/ESPRESSO_C.pl -I out_dir_c/0 -F out_dir_c/fastas/0.fa -X 0 -T 5
perl src/ESPRESSO_C.pl -I out_dir_c/1 -F out_dir_c/fastas/1.fa -X 0 -T 5
...
perl src/ESPRESSO_C.pl -I out_dir_c/n -F out_dir_c/fastas/n.fa -X 0 -T 5

python snakemake/scripts/combine_espresso_c_output_for_q.py --c-base-work-dir out_dir_c --new-base-dir out_dir_q

perl src/ESPRESSO_Q.pl -A /path/to/anno.gtf -L out_dir_q/samples.tsv.updated
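
The numbered ESPRESSO_C commands above are independent of each other, so if you have a cluster you could submit each split directory as its own job instead of running them one after another. A minimal sketch, assuming Slurm (replace sbatch with your scheduler) and that the numbered folders under out_dir_c are the split C directories:

for d in out_dir_c/[0-9]*; do
  i=$(basename "$d")
  # one 5-thread ESPRESSO_C job per split directory
  sbatch -c 5 --wrap="perl src/ESPRESSO_C.pl -I out_dir_c/$i -F out_dir_c/fastas/$i.fa -X 0 -T 5"
done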

I'm working on a new version of ESPRESSO that addresses some of these issues and hopefully that version will be released soon

pclavell commented 2 days ago

I'll try to implement your solution and let you know. Thanks a lot for the quick answer.