Xinglab / espresso

Other
48 stars 4 forks source link

Error in running espresso #55

Closed zhangkn3 closed 1 month ago

zhangkn3 commented 1 month ago

Hi, thanks for constructing such a wonderful software! I am running espresso to identify novel splicing transcripts. My script is as follows:

echo "Starting espresso" module load Anaconda3/5.0.1-gcb01 source activate /data/conda/env_dir/useespresso

perl /gpfs/fs1/home/kz116/espresso_src/ESPRESSO_S.pl \ -L /gpfs/fs1/data/Archived_bash/9_espresso_AS/sample_list.tsv \ -F /gpfs/fs1/data/data/hg38.fa \ -A /gpfs/fs1/data/data/hg38.knownGene.gtf \ -O /gpfs/fs1/data/Archived_result/9_espresso_AS/Test_n10 \ -Q 14 -N 10 \ -T 7

perl /gpfs/fs1/home/kz116/espresso_src/ESPRESSO_C.pl \ -I /gpfs/fs1/data/Archived_result/9_espresso_AS/Test_n10 \ -F /gpfs/fs1/data/data/hg38.fa \ -X 0 \ -T 12

perl /gpfs/fs1/home/kz116/espresso_src/ESPRESSO_Q.pl \ -L /gpfs/fs1/data/Archived_result/9_espresso_AS/Test_n10/sample_list.tsv.updated \ -A /gpfs/fs1/data/data/hg38.knownGene.gtf \ -O /gpfs/fs1/data/Archived_result/9_espresso_AS/Test_n10/ \ -V /gpfs/fs1/data/Archived_result/9_espresso_AS/Test_n10/forms_of_iso.tsv

echo "espresso test Completed"

This is the first time I added the parameter " -N 10" in espresso_S. And I got an error response as follows:

Starting espresso /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/4_Minimap2/minimaped_GBM011.sort.sam 0 sort: write failed: /tmp/sortlvCkux: No space left on device Thread 1 terminated abnormally: Failed to sort --buffer-size=2G -k 3,3 /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/9_espresso_AS/Test_n10/0/sam.list > /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/9_espresso_AS/Test_n10/0/sam.list2. Exit code is 512 at /gpfs/fs1/home/kz116/espresso_src/ESPRESSO_S.pl line 703. Exiting due to error in worker thread [Fri May 10 01:16:03 2024] Loading reference Worker 0 begins to scan: /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/4_Minimap2/minimaped_GBM011.sort.sam Worker 0 finished reporting. [Fri May 10 05:33:49 2024] Re-cluster all reads [Fri May 10 05:34:45 2024] Loading annotation [Fri May 10 05:35:00 2024] Summarizing annotated splice junctions for each read group /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/4_Minimap2/minimaped_GBM011.sort.sam(0) 0_25104(25336) 0_2609(2608) 0_2843(2842) 0_21281(21345) 0_19401(19451) Worker 0 begins to scan: /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/4_Minimap2/minimaped_GBM011.sort.sam Worker 0 terminated early. [Fri May 10 05:42:42 2024] Loading splice junction info sort: cannot read: /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/9_espresso_AS/Test_n10/0/sam.list3: No such file or directory Failed to run 'sort --buffer-size=2G -k1,1n -k3,3 /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/9_espresso_AS/Test_n10/0/sam.list3 --output /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/9_espresso_AS/Test_n10/0/sam.list3.tmp'. Exit code is 512 at /gpfs/fs1/home/kz116/espresso_src/ESPRESSO_C.pl line 178. No valid read_final.list can be found in /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/9_espresso_AS/Test_n10/0. [Fri May 10 05:42:42 2024] Loading annotation [Fri May 10 05:43:06 2024] Summarizing annotated isoforms [Fri May 10 05:43:13 2024] Loading corrected splice junctions and alignment information by ESPRESSO Perl exited with active threads: 5 running and unjoined 0 finished and unjoined 0 running and detached espresso test Completed

How do you think of the error? Was it about the script, or was there something wrong with our computer cluster node?

Looking forward to your kind response.

Best, Kenan

zhangkn3 commented 1 month ago

One more question, I previously used the fault parameter and got 90k novel transcripts in the result. How could I modify the script to get less novel transcripts with higher reliability?

Best, Kenan

EricKutschera commented 1 month ago

Here's the main error: sort: write failed: /tmp/sortlvCkux: No space left on device

Which is from this command:

sort --buffer-size=2G -k 3,3 /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/9_espresso_AS/Test_n10/0/sam.list > /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/9_espresso_AS/Test_n10/0/sam.list2

https://github.com/Xinglab/espresso/blob/v1.4.0/src/ESPRESSO_S.pl#L703

ESPRESSO uses the sort command which will use the TMPDIR environment variable as the directory for temporary files or if TMPDIR is not set then /tmp/ will be used. On your system it could be that /tmp/ doesn't have much space. Hopefully the directory where you are writing the espresso output should have enough space. You could try adding export TMPDIR='/gpfs/fs1/data/Archived_bash/9_espresso_AS' to your script

Setting -N (--read_num_cutoff) higher than the default of 2 reads could result in fewer but more reliable novel transcripts by requiring more supporting reads. You've already increased -Q (--mapq_cutoff) which also could help. You could also try -R (--read_ratio_cutoff) which is similar to --read_num_cutoff. The two read cutoff parameters are also arguments to the Q step so you could set them there as well (--read_num_cutoff, --read_ratio_cutoff)

zhangkn3 commented 1 month ago

Thanks, Eric! The script is running successfully this time!

I am adding these parameters and running the scripts parallelly now. Do you have any ideas about the difference between adding both -N and -R in espress_S or _Q? In my mind, the results should be the same when adding the parameter in espresso_S or _Q. Is it correct? Or, do you have any preferences or recommendations for the parameters?

Best, Kenan

EricKutschera commented 1 month ago

The parameters are used a bit differently in the S and Q steps

S step:

-N, --read_num_cutoff
min perfect read count for denovo detected candidate splice junctions

Q step:

-N, --read_num_cutoff
min perfect read count for all splice junctions of novel isoform

In the S step the parameters are used to decide which novel splice junctions are "high confidence". Only the high confidence novel splice junctions will be used in the later steps. In the Q step the parameters are checked again, but this time the check is considering the reads for a specific isoform and all the junctions in that isoform are checked

I would recommend setting the same values for those parameters in both S and Q

zhangkn3 commented 1 month ago

Got it, I will test the result with these parameters. Thanks again for the nice pipeline! Best, Kenan

zhangkn3 commented 1 month ago

Hi Eric, Thanks for your kind help. The results are really cool. The novel_isoform list got narrowed down. However, I found most of the transcripts are named with ENST IDs. Are they really novel unannotated isoforms? What are the differences between these transcripts and the other transcripts named by "ESPRESSO:"

image

Looking forward to hearing from you!

Best, Kenan

EricKutschera commented 1 month ago

From looking at the code it seems like all of the novel_isoform rows in the updated.gtf output file should have a transcript ID like ESPRESSO:.... This is the line where a novel_isoform is written to the gtf file: https://github.com/Xinglab/espresso/blob/v1.4.0/src/ESPRESSO_Q_Thread.pm#L1398 In that case the transcript ID should be from this line: https://github.com/Xinglab/espresso/blob/v1.4.0/src/ESPRESSO_Q_Thread.pm#L1367 my $isoform_ID = "ESPRESSO:$chr:$n:$nc_SJ_chain_ID";

What version of ESPRESSO did you use to get that output? Did you run any code to create the test.gtf based on the output directly from ESPRESSO?

zhangkn3 commented 1 month ago

Hi Eric, Thanks for your prompt response.

The version of ESPRESSO I am using is 1.4.0. And here is the code I used to generate the test.gtf file. grep -v "^#" /gpfs/fs1/data/Archived_result/9_espresso_AS/Test_n20/sample_list_N20_R0_updated.gtf |awk '$2 == "novel_isoform" {print}' > test.gtf

Please check if I made any mistake with this.

Best, Kenan

EricKutschera commented 1 month ago

I tried that command and it worked for me on some ESPRESSO output I have. But if you drop 1 = from the command it would make the second column always be novel_isoform

awk '$2 = "novel_isoform" {print}'

Can you check if the original sample_list_N20_R0_updated.gtf includes those same lines? If it does then I'm not sure what the reason is. Can you post the commands you ran and any log files?

zhangkn3 commented 1 month ago

Hi Eric, You found the real reason.

I rechecked the codes and found I used 1 = when extracting the transcripts.

Thanks a lot for helping me with all of these.

Best, Kenan