Xinglab / espresso

Error with ESPRESSO_C #53

Open Oliverfeudj opened 1 month ago

Oliverfeudj commented 1 month ago

Hello @EricKutschera

I have been getting the following errors with ESPRESSO_C and I can't figure out why. Can you please help me?

Failed to run "grep -A1 --no-group-separator -e 6:99401630:99402539:1/-1/1/ -e 6:99401630:99402539:1/-2/0/ -e 6:99404702:99406030:1/-1/1/ -e 6:99404702:99406030:1/-2/0/ Stress1_out/1/blast_135368//lost_end_group_1.fa | nhmmer --dna -T 3 --max --cpu 2 --tblout Stress1_out/1/blast_135368//lost_end_group_1_0.hmmer1 --qformat fasta --qsingle_seqs - Stress1_out/1/blast_135368//current_flankingSJ.fa > /dev/null". Exit code is -1 at /opt/conda/envs/nextflow_env/bin/ESPRESSO_C.pl line 1200.

Failed to run "blastn -task blastn -db Stress1_out/1/blast_92252//current_db -query Stress1_out/1/blast_92252//read_group_1.fa.blast.tmp -word_size 4 -reward 5 -penalty -4 -gapopen 8 -gapextend 6 -num_threads 1 -evalue 10 -dust no -soft_masking false -outfmt "6 std btop" >> Stress1_out/1/blast_92252//read_SJ_group_1.blast". Exit code is -1 at /opt/conda/envs/nextflow_env/bin/ESPRESSO_C.pl line 2015, <$blast_sj_in_handle> line 4.

Failed to run "grep -A1 --no-group-separator -e 1:32229208:32230926:0/-1/0/ -e 1:32229208:32230926:0/-2/1/ -e 1:32232184:32234951:1/-1/1/ -e 1:32232184:32234951:1/-2/0/ Stress1_out/1/blast_3329//lost_end_group_1.fa | nhmmer --dna -T 3 --max --cpu 2 --tblout Stress1_out/1/blast_3329//lost_end_group_1_0.hmmer1 --qformat fasta --qsingle_seqs - Stress1_out/1/blast_3329//current_flankingSJ.fa > /dev/null". Exit code is -1 at /opt/conda/envs/nextflow_env/bin/ESPRESSO_C.pl line 1200.

Also, the process takes a very long time and still ends with errors like these.

Thank you for your help

EricKutschera commented 1 month ago

These are the lines for those errors: https://github.com/Xinglab/espresso/blob/v1.4.0/src/ESPRESSO_C.pl#L1200 https://github.com/Xinglab/espresso/blob/v1.4.0/src/ESPRESSO_C.pl#L2015

In the first command the output is redirected to /dev/null, which could have discarded useful error output. You could try running the commands yourself from the command line to see if there are any other error messages. For the second command the output is appended to Stress1_out/1/blast_92252//read_SJ_group_1.blast. Are there any error messages in that file?
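As a rough illustration of rerunning a failing command by hand, here is a minimal Python sketch. The helper names missing_tools and run_and_report are hypothetical and are not part of ESPRESSO. It assumes one plausible cause is that grep, nhmmer, or blastn cannot be found on PATH in the environment where the job runs, so it checks for that first and then reruns a command while capturing stderr and the exit code.

```python
import shutil
import subprocess


def missing_tools(tools):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def run_and_report(command):
    """Run a shell command and return (exit_code, stderr_text).

    stderr is captured separately from stdout, so a "> /dev/null"
    inside the command does not hide the error messages.
    """
    result = subprocess.run(
        command, shell=True, stderr=subprocess.PIPE, text=True
    )
    return result.returncode, result.stderr
```

For example, missing_tools(["grep", "nhmmer", "blastn"]) lists whichever of those programs are not visible in the current environment, and run_and_report can be given one of the quoted commands from the error message, run from the same working directory as the original job.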

Are you able to run the example from the README without any errors? https://github.com/Xinglab/espresso/tree/v1.4.0?tab=readme-ov-file#example

ESPRESSO_C is known to take a long time. Ideally you could use the snakemake workflow in a cluster environment to speed things up: https://github.com/Xinglab/espresso/issues/5

Oliverfeudj commented 1 month ago

Hello @EricKutschera, and thank you for your reply. I am able to run the test data without any problem. I thought the ESPRESSO_C problem came from the machine I was running the scripts on, since it has only a few CPUs, so I tried running on a cluster. Now I get an error from ESPRESSO_Q:

No valid read_final.list can be found in Stress4_out/1.
[Fri May 10 15:39:25 2024] Loading annotation
[Fri May 10 15:40:00 2024] Summarizing annotated isoforms
[Fri May 10 15:40:05 2024] Loading corrected splice junctions and alignment information by ESPRESSO
Perl exited with active threads:
16 running and unjoined
0 finished and unjoined
0 running and detached

When I look into the output files I don't see any read_final.list. I only see temporary files such as 10.read_final.tmp, 11.read_final.tmp, and so on.

Regarding the speed of ESPRESSO_C, I am using Nextflow to run my scripts. Maybe there is a way to adapt the snakemake method to Nextflow?

Thank you again for your help!!

EricKutschera commented 1 month ago

Here's the line for that error: https://github.com/Xinglab/espresso/blob/v1.4.0/src/ESPRESSO_Q.pl#L389

It's looking for files like {chr}_read_final.txt in each of the subdirectories that a C step was run for. At this point in the code the Q step is collecting the results from all of the C steps so that it can aggregate them. The error says that one of the C steps has no reads in its output, which suggests that there was an error in that C step (maybe the same errors from the original post).
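To find which C step directories look incomplete, a small check along these lines might help. This is only a sketch under assumptions: it assumes the C step outputs sit in subdirectories of the work directory (such as Stress4_out/1) and that any file with "read_final" in its name is relevant. The function names are hypothetical and this is not how ESPRESSO_Q itself performs the check.

```python
from collections import Counter
from pathlib import Path


def summarize_read_final(work_dir):
    """Map each subdirectory name to a Counter of 'read_final' file suffixes."""
    summary = {}
    for sub in sorted(Path(work_dir).iterdir()):
        if not sub.is_dir():
            continue
        summary[sub.name] = Counter(
            path.suffix for path in sub.iterdir()
            if path.is_file() and "read_final" in path.name
        )
    return summary


def dirs_with_only_tmp(summary):
    """Return subdirectories with no 'read_final' file other than .tmp ones."""
    return [
        name for name, suffixes in summary.items()
        if not any(suffix != ".tmp" for suffix in suffixes)
    ]
```

Calling dirs_with_only_tmp(summarize_read_final("Stress4_out")) would list the subdirectories where the C step appears to have stopped before writing its final output, which are the ones worth rerunning by hand.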

I'm not very familiar with Nextflow, but the main thing the snakemake does to address the C step running time is to split the C step up into smaller jobs with these two scripts:

split_espresso_s_output_for_c.py: https://github.com/Xinglab/espresso/blob/v1.4.0/snakemake/Snakefile#L458
combine_espresso_c_output_for_q.py: https://github.com/Xinglab/espresso/blob/v1.4.0/snakemake/Snakefile#L552

After the split script is run, the snakemake checks to see how many C jobs it needs to run. It might not be easy to get that to work automatically with Nextflow.
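The general split, run, and combine pattern described above can be sketched in Python as follows. This is a generic illustration and not the ESPRESSO scripts: split_into_jobs, run_jobs, and combine are hypothetical names, and the real scripts take their own inputs and arguments. The point it shows is that the number of jobs is only known after the split has run.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_jobs(items, max_per_job):
    """Split a list of work items into chunks of at most max_per_job items."""
    if max_per_job < 1:
        raise ValueError("max_per_job must be at least 1")
    return [items[i:i + max_per_job] for i in range(0, len(items), max_per_job)]


def run_jobs(jobs, run_one, max_workers=4):
    """Run run_one on every job in parallel and return results in job order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, jobs))


def combine(results):
    """Flatten the per-job results back into one list."""
    return [row for result in results for row in result]
```

A workflow manager has to handle the same dependency: the list of C jobs cannot be declared up front, because it is produced by the split step.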