Closed zhangkn3 closed 5 months ago
One more question, I previously used the fault parameter and got 90k novel transcripts in the result. How could I modify the script to get less novel transcripts with higher reliability?
Best, Kenan
Here's the main error: sort: write failed: /tmp/sortlvCkux: No space left on device
Which is from this command:
sort --buffer-size=2G -k 3,3 /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/9_espresso_AS/Test_n10/0/sam.list > /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/9_espresso_AS/Test_n10/0/sam.list2
https://github.com/Xinglab/espresso/blob/v1.4.0/src/ESPRESSO_S.pl#L703
ESPRESSO uses the sort
command which will use the TMPDIR
environment variable as the directory for temporary files or if TMPDIR
is not set then /tmp/
will be used. On your system it could be that /tmp/
doesn't have much space. Hopefully the directory where you are writing the espresso output should have enough space. You could try adding export TMPDIR='/gpfs/fs1/data/Archived_bash/9_espresso_AS'
to your script
Setting -N
(--read_num_cutoff
) higher than the default of 2 reads could result in fewer but more reliable novel transcripts by requiring more supporting reads. You've already increased -Q
(--mapq_cutoff
) which also could help. You could also try -R
(--read_ratio_cutoff
) which is similar to --read_num_cutoff
. The two read cutoff parameters are also arguments to the Q step so you could set them there as well (--read_num_cutoff
, --read_ratio_cutoff
)
Thanks, Eric! The script is running successfully this time!
I am adding these parameters and running the scripts parallelly now. Do you have any ideas about the difference between adding both -N and -R in espress_S or _Q? In my mind, the results should be the same when adding the parameter in espresso_S or _Q. Is it correct? Or, do you have any preferences or recommendations for the parameters?
Best, Kenan
The parameters are used a bit differently in the S and Q steps
S step:
-N, --read_num_cutoff
min perfect read count for denovo detected candidate splice junctions
Q step:
-N, --read_num_cutoff
min perfect read count for all splice junctions of novel isoform
In the S step the parameters are used to decide which novel splice junctions are "high confidence". Only the high confidence novel splice junctions will be used in the later steps. In the Q step the parameters are checked again, but this time the check is considering the reads for a specific isoform and all the junctions in that isoform are checked
I would recommend setting the same values for those parameters in both S and Q
Got it, I will test the result with these parameters. Thanks again for the nice pipeline! Best, Kenan
Hi Eric, Thanks for your kind help. The results are really cool. The novel_isoform list got narrowed down. However, I found most of the transcripts are named with ENST IDs. Are they really novel unannotated isoforms? What are the differences between these transcripts and the other transcripts named by "ESPRESSO:"
Looking forward to hearing from you!
Best, Kenan
From looking at the code it seems like all of the novel_isoform
rows in the updated.gtf
output file should have a transcript ID like ESPRESSO:...
. This is the line where a novel_isoform
is written to the gtf file: https://github.com/Xinglab/espresso/blob/v1.4.0/src/ESPRESSO_Q_Thread.pm#L1398
In that case the transcript ID should be from this line: https://github.com/Xinglab/espresso/blob/v1.4.0/src/ESPRESSO_Q_Thread.pm#L1367
my $isoform_ID = "ESPRESSO:$chr:$n:$nc_SJ_chain_ID";
What version of ESPRESSO did you use to get that output? Did you run any code to create the test.gtf
based on the output directly from ESPRESSO?
Hi Eric, Thanks for your prompt response.
The version of ESPRESSO I am using is 1.4.0. And here is the code I used to generate the test.gtf file.
grep -v "^#" /gpfs/fs1/data/Archived_result/9_espresso_AS/Test_n20/sample_list_N20_R0_updated.gtf |awk '$2 == "novel_isoform" {print}' > test.gtf
Please check if I made any mistake with this.
Best, Kenan
I tried that command and it worked for me on some ESPRESSO output I have. But if you drop 1 =
from the command it would make the second column always be novel_isoform
awk '$2 = "novel_isoform" {print}'
Can you check if the original sample_list_N20_R0_updated.gtf
includes those same lines? If it does then I'm not sure what the reason is. Can you post the commands you ran and any log files?
Hi Eric, You found the real reason.
I rechecked the codes and found I used 1 = when extracting the transcripts.
Thanks a lot for helping me with all of these.
Best, Kenan
Hi, thanks for constructing such a wonderful software! I am running espresso to identify novel splicing transcripts. My script is as follows:
echo "Starting espresso" module load Anaconda3/5.0.1-gcb01 source activate /data/conda/env_dir/useespresso
perl /gpfs/fs1/home/kz116/espresso_src/ESPRESSO_S.pl \ -L /gpfs/fs1/data/Archived_bash/9_espresso_AS/sample_list.tsv \ -F /gpfs/fs1/data/data/hg38.fa \ -A /gpfs/fs1/data/data/hg38.knownGene.gtf \ -O /gpfs/fs1/data/Archived_result/9_espresso_AS/Test_n10 \ -Q 14 -N 10 \ -T 7
perl /gpfs/fs1/home/kz116/espresso_src/ESPRESSO_C.pl \ -I /gpfs/fs1/data/Archived_result/9_espresso_AS/Test_n10 \ -F /gpfs/fs1/data/data/hg38.fa \ -X 0 \ -T 12
perl /gpfs/fs1/home/kz116/espresso_src/ESPRESSO_Q.pl \ -L /gpfs/fs1/data/Archived_result/9_espresso_AS/Test_n10/sample_list.tsv.updated \ -A /gpfs/fs1/data/data/hg38.knownGene.gtf \ -O /gpfs/fs1/data/Archived_result/9_espresso_AS/Test_n10/ \ -V /gpfs/fs1/data/Archived_result/9_espresso_AS/Test_n10/forms_of_iso.tsv
echo "espresso test Completed"
This is the first time I added the parameter " -N 10" in espresso_S. And I got an error response as follows:
Starting espresso /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/4_Minimap2/minimaped_GBM011.sort.sam 0 sort: write failed: /tmp/sortlvCkux: No space left on device Thread 1 terminated abnormally: Failed to sort --buffer-size=2G -k 3,3 /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/9_espresso_AS/Test_n10/0/sam.list > /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/9_espresso_AS/Test_n10/0/sam.list2. Exit code is 512 at /gpfs/fs1/home/kz116/espresso_src/ESPRESSO_S.pl line 703. Exiting due to error in worker thread [Fri May 10 01:16:03 2024] Loading reference Worker 0 begins to scan: /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/4_Minimap2/minimaped_GBM011.sort.sam Worker 0 finished reporting. [Fri May 10 05:33:49 2024] Re-cluster all reads [Fri May 10 05:34:45 2024] Loading annotation [Fri May 10 05:35:00 2024] Summarizing annotated splice junctions for each read group /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/4_Minimap2/minimaped_GBM011.sort.sam(0) 0_25104(25336) 0_2609(2608) 0_2843(2842) 0_21281(21345) 0_19401(19451) Worker 0 begins to scan: /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/4_Minimap2/minimaped_GBM011.sort.sam Worker 0 terminated early. [Fri May 10 05:42:42 2024] Loading splice junction info sort: cannot read: /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/9_espresso_AS/Test_n10/0/sam.list3: No such file or directory Failed to run 'sort --buffer-size=2G -k1,1n -k3,3 /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/9_espresso_AS/Test_n10/0/sam.list3 --output /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/9_espresso_AS/Test_n10/0/sam.list3.tmp'. Exit code is 512 at /gpfs/fs1/home/kz116/espresso_src/ESPRESSO_C.pl line 178. No valid read_final.list can be found in /gpfs/fs1/data/khasrawlab/Kenan_Zhang/project_neoantigen_identification/Archived_result/9_espresso_AS/Test_n10/0. [Fri May 10 05:42:42 2024] Loading annotation [Fri May 10 05:43:06 2024] Summarizing annotated isoforms [Fri May 10 05:43:13 2024] Loading corrected splice junctions and alignment information by ESPRESSO Perl exited with active threads: 5 running and unjoined 0 finished and unjoined 0 running and detached espresso test Completed
How do you think of the error? Was it about the script, or was there something wrong with our computer cluster node?
Looking forward to your kind response.
Best, Kenan