pclavell opened this issue 2 days ago
This is the issue that mentions C step threads competing for the filesystem: https://github.com/Xinglab/espresso/issues/29#issuecomment-1699202865
I usually run 5 threads per C job and have many C jobs running at the same time on different nodes in a compute cluster (using the snakemake). That issue also mentions some BLAST versions being slower than others. BLAST versions 2.10.1, 2.12.0, and 2.14.1 all had good performance. I've recently been running with version 2.15.0, which also seems OK.
The code between the `Loading splice junction info` print and the next print is reading the input files: https://github.com/Xinglab/espresso/blob/v1.4.0/src/ESPRESSO_C.pl#L168
If the code is actually spending 16h loading those files, then it seems like there is a filesystem issue. It could also be that the code is already past that part and the prints are just being buffered. You could check whether anything is being written to the output directory for that job.
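One way to check from the command line (the directory path here is just an example; point it at your actual C job output directory, and the `touch` line only simulates job output for this demo):

```shell
# Demo setup: simulate a C job output directory (replace with your real path).
c_dir="out_dir_c/0"
mkdir -p "$c_dir"
touch "$c_dir/example_output.tmp"

# List the most recently modified files:
ls -lt "$c_dir" | head -n 5

# Count files modified in the last 10 minutes; a nonzero count suggests the
# job is past the loading step and the prints are just being buffered.
find "$c_dir" -type f -mmin -10 | wc -l
```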
The snakemake needs to be run from the snakemake/ directory of the source code in order to find the Snakefile, snakemake_config.yaml, and the scripts/ directory. Are you getting a specific error message?
The C step can use 1 thread per read group. The S step separates the alignments into groups with overlapping coordinates to define the read groups. After running the S step, there should be one C directory created per input file, and each default C directory works on all the read groups for that input file. If using the snakemake, then https://github.com/Xinglab/espresso/blob/v1.4.0/snakemake/scripts/split_espresso_s_output_for_c.py is used to create C directories according to read groups. Since the threads work on read groups, that split can be more efficient. The split also uses `--target-reads-per-c`, which limits the number of reads assigned to a C job even if a particular read group has a large number of reads.
With the default split, if there is a read group with a very large number of reads, then the thread assigned to that group ends up doing almost all the work. My guess is that the C job that is taking a long time has some large read group. The thread for that group essentially determines the total time, and adding more threads wouldn't help.
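To illustrate why capping reads per C job helps, here is a simplified sketch (not the actual `split_espresso_s_output_for_c.py` logic; function name and numbers are made up for the example) of packing read groups into jobs with a target, where a huge group gets split across several jobs instead of dominating one:

```python
def split_read_groups(group_sizes, target_reads_per_c):
    """Pack read groups into C jobs, capping reads per job at the target.

    group_sizes: list of read counts, one per read group.
    Returns a list of jobs; each job is a list of (group_id, reads) chunks.
    A group larger than the target is split across multiple jobs.
    """
    jobs = []
    current, total = [], 0
    for group_id, reads in enumerate(group_sizes):
        remaining = reads
        while remaining > 0:
            take = min(remaining, target_reads_per_c - total)
            current.append((group_id, take))
            total += take
            remaining -= take
            if total >= target_reads_per_c:
                jobs.append(current)       # job is full, start a new one
                current, total = [], 0
    if current:
        jobs.append(current)
    return jobs

# One huge group (2M reads) plus several small ones, target 500k per job:
# the 2M group becomes 4 jobs, and the small groups share a 5th job,
# instead of a single thread handling all 2M reads.
jobs = split_read_groups([2_000_000, 50_000, 30_000, 40_000], 500_000)
```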
You could run the split and combine scripts manually instead of using the snakemake:
```
python snakemake/scripts/split_espresso_s_output_for_c.py \
  --orig-work-dir out_dir_s \
  --new-base-dir out_dir_c \
  --target-reads-per-c 500000 \
  --num-threads-per-c 5 \
  --genome-fasta /path/to/genome.fasta
perl src/ESPRESSO_C.pl -I out_dir_c/0 -F out_dir_c/fastas/0.fa -X 0 -T 5
perl src/ESPRESSO_C.pl -I out_dir_c/1 -F out_dir_c/fastas/1.fa -X 0 -T 5
...
perl src/ESPRESSO_C.pl -I out_dir_c/n -F out_dir_c/fastas/n.fa -X 0 -T 5
python snakemake/scripts/combine_espresso_c_output_for_q.py --c-base-work-dir out_dir_c --new-base-dir out_dir_q
perl src/ESPRESSO_Q.pl -A /path/to/anno.gtf -L out_dir_q/samples.tsv.updated
```
I'm working on a new version of ESPRESSO that addresses some of these issues and hopefully that version will be released soon
I'll try to implement your solution and let you know. Thanks a lot for the quick answer.
Hello, I've been struggling to run ESPRESSO_C for a while. Previously I used 100 threads, but samples with 10-25M reads wouldn't finish within 30h. Then I saw in another issue that the threads are actually competing, so more threads make it slower, and that 5 threads was faster than 20. So I decided to run ESPRESSO_C with 5 threads. However, after 16h it is still at `Loading splice junction info`. Surprisingly, a sample with 4M reads finished within 3h. What do you recommend doing?
I've tried running the snakemake pipeline as shown in the documentation, but it is not finding the Snakefile or the profiles unless I provide their full paths, and even then the pipeline doesn't start. Thanks