Xinglab / espresso

Other
58 stars 4 forks source link

collapsed isoforms #71

Closed trista1115 closed 2 weeks ago

trista1115 commented 2 months ago

Hi Eric,

After running minimap2, would it be necessary to collapse the redundant isoforms before using ESPRESSO?

Thanks~

EricKutschera commented 2 months ago

I'm not sure what you mean. You can run minimap2, sort the sam files, and use them as input to ESPRESSO: https://github.com/Xinglab/espresso/tree/v1.5.0?tab=readme-ov-file#alignment

ESPRESSO will then use the alignments to assign reads to isoforms

trista1115 commented 2 months ago

PCR duplicates can artificially inflate read counts and lead to biased downstream analysis. So I am wondering if it is necessary to collapse identical sequences into a single read (remove PCR duplicates ) before using ESPRESSO.

EricKutschera commented 2 months ago

As you said PCR duplicates can inflate read counts and ideally the duplicates could be removed. Even with duplicates ESPRESSO can still give reasonable results. See Figure 3 of https://www.science.org/doi/10.1126/sciadv.abq5072

trista1115 commented 2 months ago

Thank you for your response! I have another question : my data is Pacbio hifi reads (QV20), should I use the splice mode or splice:hq mode in mimimap2 before using ESPRESSO?

EricKutschera commented 2 months ago

I haven't used splice:hq. From https://lh3.github.io/minimap2/minimap2.html

splice:hq Long-read splice alignment for PacBio CCS reads (-xsplice -C5 -O6,24 -B4)

And this issue suggests it could be better even for high quality ONT reads: https://github.com/lh3/minimap2/issues/1127#issuecomment-1993891353

trista1115 commented 2 months ago

Thank you for your response!!! but I had another question in cmake Parasail, the error is below: CMake Error: The source directory "/espresso_v_1_5_0/src" does not appear to contain CMakeLists.txt.

Best! trista

trista1115 commented 2 months ago

I'm not sure what you mean. You can run minimap2, sort the sam files, and use them as input to ESPRESSO: https://github.com/Xinglab/espresso/tree/v1.5.0?tab=readme-ov-file#alignment

ESPRESSO will then use the alignments to assign reads to isoforms

Could you please confirm if the input BAM file for ESPRESSO should be filtered by removing unmapped reads, reads with secondary or supplementary alignments, and reads with a mapping quality of 0? Or is this filtering not necessary for ESPRESSO?

Thanks

trista1115 commented 2 months ago

Thank you for your response!!! but I had another question in cmake Parasail, the error is below: CMake Error: The source directory "/espresso_v_1_5_0/src" does not appear to contain CMakeLists.txt.

Best! trista

I encountered an error while running ESPRESSO_C.pl. The error message is as follows: [could not load Parasail module needed for Smith-Waterman]. Could you please help me diagnose and resolve this issue?

Thanks!

EricKutschera commented 2 months ago

CMake Error: The source directory "/espresso_v_1_5_0/src" does not appear to contain CMakeLists.txt

The CMakeLists.txt file is not part of ESPRESSO itself, but in another github repo that the build script clones. You can compile the Parasail code with ./src/Parasail/build (https://github.com/Xinglab/espresso/tree/v1.5.0?tab=readme-ov-file#dependencies). If that gives an error can you post the full output of that command? You can use --blast with ESPRESSO_C.pl which does not use parasail but is slower

ESPRESSO will filter out secondary alignments and alignments with MAPQ less than mapq_cutoff which is 1 by default: https://github.com/Xinglab/espresso/blob/v1.5.0/src/ESPRESSO_S.pl#L908

ESPRESSO doesn't directly check for unmapped reads, but based on this post it looks like unmapped reads will be filtered out for MAPQ=0: https://github.com/Xinglab/espresso/issues/51. ESPRESSO can handle supplementary alignments

trista1115 commented 2 months ago

CMake Error: The source directory "/espresso_v_1_5_0/src" does not appear to contain CMakeLists.txt

The CMakeLists.txt file is not part of ESPRESSO itself, but in another github repo that the build script clones. You can compile the Parasail code with ./src/Parasail/build (https://github.com/Xinglab/espresso/tree/v1.5.0?tab=readme-ov-file#dependencies). If that gives an error can you post the full output of that command? You can use --blast with ESPRESSO_C.pl which does not use parasail but is slower

ESPRESSO will filter out secondary alignments and alignments with MAPQ less than mapq_cutoff which is 1 by default: https://github.com/Xinglab/espresso/blob/v1.5.0/src/ESPRESSO_S.pl#L908

ESPRESSO doesn't directly check for unmapped reads, but based on this post it looks like unmapped reads will be filtered out for MAPQ=0: #51. ESPRESSO can handle supplementary alignments

Thank you for your response. I had complied the Parasail:

[100%] Built target test_verify_cigars Checking if your kit is complete... Looks good Warning: -L. changed to -L /mnt/sata/scripts/espresso_v_1_5_0/src/Parasail/. Generating a Unix-style Makefile Writing Makefile for Parasail Writing MYMETA.yml and MYMETA.json cp lib/Parasail.pm blib/lib/Parasail.pm Running Mkbootstrap for Parasail () chmod 644 "Parasail.bs" "/usr/bin/perl" -MExtUtils::Command::MM -e 'cp_nonempty' -- Parasail.bs blib/arch/auto/Parasail/Parasail.bs 644 "/usr/bin/perl" "/usr/share/perl/5.34/ExtUtils/xsubpp" -typemap '/usr/share/perl/5.34/ExtUtils/typemap' Parasail.xs > Parasail.xsc mv Parasail.xsc Parasail.c x86_64-linux-gnu-gcc -c -I. -D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fwrapv -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -O2 -g -DVERSION=\"0.01\" -DXS_VERSION=\"0.01\" -fPIC "-I/usr/lib/x86_64-linux-gnu/perl/5.34/CORE" Parasail.c rm -f blib/arch/auto/Parasail/Parasail.so ........ chmod 755 blib/arch/auto/Parasail/Parasail.so "/usr/bin/perl" -MExtUtils::Command::MM -e 'cp_nonempty' -- Parasail.bs blib/arch/auto/Parasail/Parasail.bs 644 PERL_DL_NONLAZY=1 "/usr/bin/perl" "-MExtUtils::Command::MM" "-MTest::Harness" "-e" "undef Test::Harness::Switches; test_harness(0, 'blib/lib', 'blib/arch')" t/.t t/Parasail.t .. ok
All tests successful. Files=1, Tests=17, 0 wallclock secs ( 0.01 usr 0.01 sys + 0.05 cusr 0.01 csys = 0.08 CPU) Result: PASS

and ESPRESSO_C.pl works good~

Thank you so much~

trista1115 commented 2 months ago

Hi Eric,

After I finished running ESPRESSO_Q.pl, I checked the espresso_q_summary.txt as shown below:

number of isoforms in input annotation: 149443 number of splice junctions in input annotation: 289999 number of splice sites in input annotation: 501035 number of reads assigned: 23397636 number of reads not assigned: 9054023 number of full splice match reads: 13659840 number of FSM reads with 'perfect match' endpoints: 3194219 number of incomplete splice match reads: 5355743 number of novel in catalog reads: 1156121 number of novel not in catalog reads: 3588186 number of single exon reads: 8112977 number of reads with a failed splice junction: 647878 number of FSM splice junction chains: 29972 number of ISM splice junction chains: 281276 number of NIC splice junction chains: 82446 number of validated NIC chains: 36121 number of NNC splice junction chains: 365143 number of validated NNC chains: 181173 number of splice junction chains with a failed junction: 258409 total FSM abundance: 0 total novel ISM abundance: 0 total NIC abundance: 0 total NNC abundance: 0 total single exon abundance: 0 number of detected FSM isoforms: 0 number of detected novel ISM isoforms: 2016 number of detected NIC isoforms: 22868 number of detected NNC isoforms: 155277 number of detected single exon isoforms: 0 number of internal exon boundary check failures: 1098301 number of terminal exon boundary check failures: 4516372

is it normal for the abundance to be 0?

Thanks~

EricKutschera commented 2 months ago

Were there any error messages? Do you have non-zero values in the abundance.esp output file?

The total abundance should be close to number of reads assigned: 23397636. From a summary file I checked, the lines like total FSM abundance: are the only lines with non-integer numbers. Potentially this is due to the floating-point issues mentioned in https://perldoc.perl.org/Storable#BUGS . What version of perl and Storable did you use?

perl --version perl -e 'use Storable; print("$Storable::VERSION\n")'

trista1115 commented 2 months ago

Were there any error messages? Do you have non-zero values in the abundance.esp output file?

The total abundance should be close to number of reads assigned: 23397636. From a summary file I checked, the lines like total FSM abundance: are the only lines with non-integer numbers. Potentially this is due to the floating-point issues mentioned in https://perldoc.perl.org/Storable#BUGS . What version of perl and Storable did you use?

perl --version perl -e 'use Storable; print("$Storable::VERSION\n")'

firstly, I did not see any errors using ESPRESSO_Q.pl,

image

and then I checked the abundance.esp file, which only contains zero values, as shown below:

ESPRESSO:JH584304.1:46:22 NA ENSMUSG00000095041.8 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ESPRESSO:JH584304.1:46:6 NA NA 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ESPRESSO:JH584304.1:46:0 NA NA 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ESPRESSO:JH584304.1:46:2 NA NA 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ESPRESSO:JH584304.1:46:21 NA ENSMUSG00000095041.8 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ESPRESSO:JH584304.1:46:33 NA ENSMUSG00000095041.8 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ESPRESSO:JH584304.1:46:32 NA ENSMUSG00000095041.8 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

my perl version is v5.34.0, and Storable version is 3.23.

EricKutschera commented 1 month ago

Since the abundance file only has zeros I don't think the issue is related to Storable. I'm not sure how ESPRESSO could have assigned a large number of reads to transcripts, but then not output any abundance

Are you able to get output from the example that is close to the expected abundance? https://github.com/Xinglab/espresso/tree/v1.5.0?tab=readme-ov-file#example https://github.com/Xinglab/espresso/blob/v1.5.0/test_data/expected_sirv_abundance.esp

If the example works then there might be something about your data that's causing the issue. In that case could you post your input files? Or a small subset of your data if that still results in reads assigned but no abundance

trista1115 commented 1 month ago

this is my log file when running the test.

[Thu Sep 5 08:45:02 2024] ESPRESSO_S finished its work. [Thu Sep 5 08:45:02 2024] Loading splice junction info [Thu Sep 5 08:45:02 2024] Requesting system to split SAMLIST into 10 pieces First reads were recorded successfully for 1 files. [Thu Sep 5 08:45:03 2024] Loading references [Thu Sep 5 08:45:03 2024] Scanning SAMLIST by 5 workers 0 16 Worker 1 begins to scan sam.list3aa. Worker 1 finished reporting. [Thu Sep 5 08:45:23 2024] ESPRESSO_C finished its work. ........ [Thu Sep 5 08:45:23 2024] thread_id: 2 starting to process chr: SIRV2 ........ [Thu Sep 5 08:45:33 2024] Cleaning up threads [Thu Sep 5 08:45:34 2024] ESPRESSO finished quantification

This is the ESPRESSO_q.summary file, and it works!!!

number of isoforms in input annotation: 68 number of splice junctions in input annotation: 111 number of splice sites in input annotation: 179 number of reads assigned: 3037 number of reads not assigned: 8 number of full splice match reads: 1606 number of FSM reads with 'perfect match' endpoints: 1448 number of incomplete splice match reads: 1317 number of novel in catalog reads: 1 number of novel not in catalog reads: 0 number of single exon reads: 0 number of reads with a failed splice junction: 133 number of FSM splice junction chains: 3 number of ISM splice junction chains: 58 number of NIC splice junction chains: 1 number of validated NIC chains: 0 number of NNC splice junction chains: 0 number of validated NNC chains: 0 number of splice junction chains with a failed junction: 53 total FSM abundance: 3037.01 total novel ISM abundance: 0 total NIC abundance: 0 total NNC abundance: 0 total single exon abundance: 0 number of detected FSM isoforms: 3 number of detected novel ISM isoforms: 0 number of detected NIC isoforms: 0 number of detected NNC isoforms: 0 number of detected single exon isoforms: 0 number of internal exon boundary check failures: 14 number of terminal exon boundary check failures: 7

transcript_ID transcript_name gene_ID test_sample_name SIRV202 NA SIRV2 2400.42 SIRV201 NA SIRV2 299.48 SIRV203 NA SIRV2 337.11

When I ran the test, I found that ESPRESSO automatically generated a samples.tsv.updated file with 3 columns. However, when I ran it with my real data, the samples.tsv.updated file only contained 2 columns. why did this happen?

/path/to/1.bam 0 /path/to/2.bam 1

Here are my input files: sorted.bam and ./ESPRESSO_S_output/samples_N2_R0_compatible_isoform.tsv, as shown below:

m84069s2/118098569/ccs:-:BC4_GTGCTGTGTC:5 16 chr1 3393039 60 137M1D651M1D155M97941N57M1I144M65S 0 0 TGGGAGCTAAAAAATGTTTAATTGTTTGAATAAGAAAAATGTTTCTGATCAAATGTCTCATACAGCTCATATAAAATCCTATACATAACAGAATGTATTCATTAAGGATTACCATAGAACAAGAAAGCTTTGTGTGTGGGGGGGGGGGTGAATTGTGTGCTTGGTAAGCAGTTGCACTGAGGAATAATAGACTATATTGAAAATGATTTGCAGTAAGGTGCCTTGTTTGGGCCAGTACAACATGCACACACTTTTGCATGTTTCATCTTTAAGCAGAATTATATTTGTCAAGGATCACCTGGGAGACACCAGGAGTAGCTAACATGTCTCTAGTATATTCCTTAACAGACTGTGCAGTTAGGTGCCTGTCAGAACTAGCTTATTAGCAAAACCCATGAATGTTTGCCTCCAAGTTCCATTATTTTGTACTTTAGTGATGATGAGTCAGAAGCTTTATATGGTCTGTGCCCTGCCCTTTGGATATCTACAGTATCTGAGGAATATCCTCAGCTCAAAAATATGTCTCCTTCTCTGCAACATTGTTGTTTTTCTCTTCTGAACTCAGAAGGAGATCCTGGGTCATTACCTTAATCTATTAGCTCTACAGTTTGTCTAGATATTTGAACTCTGAGTACCTTACACCAACACCAGGACAGTTGAAGATTCTAGTGGATTAGTGGGAGCATATTTGGGGCAGCAAGGAATCTTGGTACCATTGTGACTGGGTAAGTCACAAAGGAAACCTGGACATTCAAAGTTATTTTAGCATGTTTAAGTGTATATAAAGACCCCAAACCAAAGCACCACTTCCTTCCTTTCTTCTCTCCTTCCTTTCTTTGTAGAGTTGTAACAGATTCCAGACTTTAGTAGGCTCCCAGACCTTAGCAGATACCATGATGGAAGAACTGTTCAGGCCATGGACCCAGCATTTAGGCATCAAGACCTTGGAGGGCCTGTAAGCTGTGAGTCTGTACAATAATGCAGAGCTGCAGGACCAATTGGTGGAGCACTTTCCAGAAAAGTGGCTAGCAGATGCAGCATGCTCACATCTGCATACTCGTACACCATCTTCCAGTAAAACCGCCACCTGCCGCTCTCCCCACTCTGCCGGCTCCGGATACCTAAGTATATTGTGTGCAAATACCAGTAAAGGTGGGAATGTGGCTAAAGAGCCGCAGTCAGGAGGAGACTTTACCAGGGAACTTGGCTC IIIIIIII+IIIIIII<IIIIDIDIIIIIIIIII7IIIIIIDIIIIIIII7IIIIIDIIIIIIIDIII<III<IIIDIIIIIIIIIIIIIIIIIIII<IDIDI7IIIIIIIIIIDIIIIIIIIDIIIIIIIIIIIIIIIIIIIIIIIIDIDIIIIIIIIIII2IIIIIIIIIIIIIIIII+IIIIIIIIIIIIIIIIIIDIIIIIIDIIIIIIIIIIIII<IIIIIII2IIDIII2IIIIIIIIIIII2ID7IIIIIIII7IIIIII2IIIIIIIIIIIIIIIIIIIIDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIDIIIIIIIIIIIIID7<IIIIIIIIIIIIIIDIIIDIIIIIIIII<IIIDIIIIII2DIIIIIII$IIIDIIIIIIIIIIIIII+II+IIIIIIIIDIII<IIIIIDIIII7IIII<IIDII7IIIIIIIIIIIIIIIIII<IIIIIIIIIIIIIII7IIIIIIIIIIIIIIIDIIIIIIIIII7IIIIDIIIIIIDII2II<IIID<D$I7IIII2III7IIIIIDIIIIIIIIIIIIIIIIIIIIIIIIDIIIIIIIIIII+IIIIIIIIIIIIII$IIIIIIDIIIII7I7IIIIIIIIIIIIIIIIIIIIIIIDIIII2IIIIIIIIDII+IIIIIIIIIIIID+I<IIIIIIIIIIIIIIII2IIIIIIIIIIIIIDIDII7II<IIIIIIIDIIIDIIDIIIIIIIIIDIIIIIIDIIIIIDIID<IDIIIIIIIIIIIIIIIIIIIII$I2IIIIIIIIDII<IIIIIDIIDIIIIIIII2DIIIDIIIIIIIIIIIIIIIIIIIIDIDI7IIIIIIIIIIIIIIIII7IIIIIDDIIIDIIIIDIIIIDIDIIIDIIDIIIIDIIIII+IIIIIIIIIDIDIIII7IIIIIIIIII$IIII<IIDIDIIIIIIDIIIIIDIIIIIIIIIIDIII7I2II<IIIIIDDII<III7III<IIIIIII<IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII<IIIDIIIII+7+IIIIIIIIIIIII<IIII7I7IIIIIIIII<I$IIIIIIIIII<IIIIIIIIIIIIIDDIIIIIII<IDIIIIIIII7IIIIIIIDIII7IIIIIIIIIIIIIIIII<ID7II+IIIIII<IIIIIIIIIII< NM:i:3 ms:i:1123 AS:i:1099 nn:i:0 ts:A:+ tp:A:P cm:i:352 s1:i:1119 s2:i:0 de:f:0.0026 MD:Z:137^G651^C356 rl:i:18 m84069s2/246088574/ccs:-:BC4_TGTGTGGTTT:6 16 chr1 3604447 60 29M1I52M3I17M1D191M66820N101M22445N249M563N1M1D29M1767N136M1D145M1D157M 0 0 TGTCTTGTGATAATACATTTTATCTTGAATTTTTCTTCTTCACAGAGAGAGGTAACTTTTAATATATTAGGATTTTTTGTTGGATGGAGTGGTAGTTTTTAGTTTTTTTTTTTAACCAAATAAAAAAGCCTGGGTTGTATCCCCAGAACTGCTAAAACAAAGCCTTTTCAAGTAATTGGGAAACGGGATTGTTGAGCTCCGGGAGACTCCTCGATCTCAAGGTGGGGGTTAAAGCTTCCTTGGCCGTGTAGACTGCACAGACTTCCAGGGCAGCCCGAATCACATACTGAGTTTTTTAAGTGGATATTAGTTGTAAAGTAAAGGATAATCATGCTATTCAGACAATCTAGGTTACAAAGAGAGCTCCAGGGTCTTGGGATAAATGAATCTCCATCTGTCTGCTCACATTCAAGCATCCGTTATTTCATTCCCTCTCCAGGGAATCCCTGTGGAATACACAGCTGAATGCAGATTCTCTGGAAGTAGGAAAGTGGGAAGAGATGGAAACCTCACAGACCAGTGTGTGCCTTTATATTCTATGTCTGACTTTTGGTCCTCCACTTCAAAACCTGATTTTCTTTGGACCTGTTGTTGTAGGTTCAGAGGGCTAAAATCAGTTTTCCTGGGGAAATATGTTTCTTGTCTTTTCTCATGAATCCGTAATTTCAAGAACCTCCCTGTCTCCACATGGATGTCCCCATCCTTCACCTCACCAGACCTCTAAACTCCCTGGGGCCTCCAGTCTCTTGAGGGTTAGGTGCATCATCTCTGAATGAATGTAGACCCAGAATTCATCTACTGTATGTGTGTGTAGAGGCTCATATCAGCTGGTGGATGTTGTCTGTTTGGTGGTCCAGTGTTTGAGAGATCTCAGGGTCCAGATTAATTGAGACTGCTTGTCCTCCTACAGGATCACCCTTCTCCTCAGCTTCTTTCAGCCTTCCCTAATTCAACAAAAGGGGTCAGCTGCTTCTGTCCATTGGTTGGGGGAAAATATCTGCATGTGACTCTTTCAGCTGCCTGTTGGGTCTTTTGGAGTGCCGTCATGATAGGTGACTTTTTGTGAATTCTCCATAACCTCAGTAATAGTGTCAGGACTTGGGACCTCCCC IIIIIIIIIID+IIIDI<IIIIIIDIDII2IIIDDDIIII2IIIIIIIII7IIII<DIDIIIIIIIIIIIII+IIIII2II<<I<2I+7IDD<I2IIIIIIIDIIIIIIIIIIIIII<III+IDIIIIIIIDII<IIIII+IIIIIIIIII7I2IIIIIII2DID2IIIIIII<D<+III<III7IIIIII2IDIIIIIIIIII+IDIDIIIIIIIIDII<II+IIIIII2IIIIIIIIIIIIIIIIIIIII+IIIIIIIIIIIIIIDIIIIIDIIIIIIIIIIIIII2IIIIIIIIIIIIIIIIIIIIIIIII<IIIIDII<IIIIIIIIIIIIIIIIIII<7IIIIIIIIIIIIIIDIIIIDIIII7IIIIII<IIIIIIIIIDIIIIIIIIIIIIIIIIIIIII2IIIIDIIIDIIIIIIIIIIIIIIIIIIIII$I7ID7IIIIIIDIIIIIIIIIII7I7IIIIIIIIIIIIIIIIIIIIDI2IIII<IIIIIIIIIII<IIIDIIIIIIIIIIIIDIIII2I7IIDDIIIIIII<III<IIIIII+IID77+IIIIII+IID+IIII<IIIIIIIII2<I7DIII<IDIIII7IIIII+IIII2IIIIIID2IIIDII7IIIIIIIIIIDIIIIIIIIIIII<IDDII2IIIIDIII+IIDII7IIDIIDIIIIIIIIIIIIIDIIIII+IIIIIIIIIIIIIIIIDIIIIIIIIIIIIII+IDI$IIDDIIIIIIIIIIIIII<II<II2III+D7DIIDDDIIIIIIIIIIIDIIDID7IIIIIIIIDIIIIIII+I+ID<<2I+2DDII<II+<IIIIIIDII7III7II<IDD2I2I$IDI72<IIDIIIIIIIIIIIIIIII7III+IDI+I+DIIIIIDIID7D<III2IIDIIIDIII+IIIDIIIIIIII<III<<I<DII7IIDI+IIIIIIIIID++I7++D$7IIIII7IIIIII<III<IIIIDIII+IIII<IDIIIIIIIIIIIIDIDDDII22II<DIIII2DI+IIIIIIIIIIIIIIII<IIDDI7DIDI<IIIIIIIIDIIIIDIIIDIII<I2IDDI$DIIIIIIIIII+IIIII7IID7II7III NM:i:9 ms:i:1059 AS:i:958 nn:i:0 ts:A:+ tp:A:P cm:i:292 s1:i:938 s2:i:169 de:f:0.0063 MD:Z:82T15^T542^T165^T145^A157 rl:i:286

cat samples_N2_R0_compatible_isoform.tsv | head -n 2 m84069s2/161810350/ccs:-:BC15_TGCGGGTCTT:29 NNC ESPRESSO:chr9:33431:0 m84069s2/136779308/ccs:+:BC15_TTGGGTCTTG:3 NNC ESPRESSO:chr9:33431:0

EricKutschera commented 1 month ago

ESPRESSO should have raised an error if your samples.tsv.updated only had 2 columns: https://github.com/Xinglab/espresso/blob/v1.5.0/src/ESPRESSO_Q.pl#L499

Maybe there were unexpected characters in your samples.tsv. I think that windows line endings like \r\n might cause an issue for ESPRESSO. It could keep or remove the \r inconsistently and then fail to match reads to sample names

trista1115 commented 1 month ago

Thank you for your help.

I am trying to split samples to speed up the process. If I have two samples, should I always use -X 0, or is it necessary to switch to -X 1 for the second sample?

perl ESPRESSO_C.pl -I test_split_sirv_c/0 -F test_split_sirv_c/fastas/0.fa -X 0 perl ESPRESSO_C.pl -I test_split_sirv_c/1 -F test_split_sirv_c/fastas/1.fa -X 0 perl ESPRESSO_C.pl -I test_split_sirv_c/2 -F test_split_sirv_c/fastas/2.fa -X 0 perl ESPRESSO_C.pl -I test_split_sirv_c/3 -F test_split_sirv_c/fastas/3.fa -X 0

Thanks!

EricKutschera commented 1 month ago

After using split_espresso_s_output_for_c.py, -X 0 is always used: https://github.com/Xinglab/espresso/issues/73#issuecomment-2326470834

trista1115 commented 1 month ago

Thank you for your response~

  1. In the ESPRESSO output file entry ESPRESSO:chr7:31230:24, what do the numbers 31230 and 24 represent?

  2. In the file samples_N2_R0_compatible_isoform.tsv, how should we interpret a single read matching multiple feature IDs, as shown below? and some reads do not have any feature IDs

    m84069 sample1 ISM ESPRESSO:chr12:8048:11,ESPRESSO:chr12:8048:13,ESPRESSO:chr12:8048:12,ESPRESSO:chr12:8048:10,ESPRESSO:chr12:8048:7,ESPRESSO:chr12:8048:6,ESPRESSO:chr12:8048:8,ESPRESSO:chr12:8048:5, m84068 sample1 single-exon NA

  3. When running visualize.py, an error occurred, as shown below:

    ['/usr/bin/python3', '/scripts/espresso_v_1_5_0/visualization/gtf_to_bed.py', '--updated-gtf', './ESPRESSO_S_output_q2/samples_N2_R0_updated.gtf', '--output-bed', './visualization/samples_N2_R0_updated.bed', '--descriptive-name', 'Padi6'] ['faToTwoBit', '-long', '/vM32/GRCm39.primary_assembly.genome.fa', './visualization/GRCm39.primary_assembly.genome.2bit'] Traceback (most recent call last): File "/scripts/espresso_v_1_5_0/visualization/visualize.py", line 186, in main() File "/scripts/espresso_v_1_5_0/visualization/visualize.py", line 177, in main chrom_sizes = run_fasta_to_chrom_sizes(args.genome_fasta, args.output_dir) File "/scripts/espresso_v_1_5_0/visualization/visualize.py", line 121, in run_fasta_to_chrom_sizes subprocess.run(fa_to_two_bit_cmd, check=True) File "/usr/lib/python3.10/subprocess.py", line 503, in run with Popen(*popenargs, **kwargs) as process: File "/usr/lib/python3.10/subprocess.py", line 971, in init self._execute_child(args, executable, preexec_fn, close_fds, File "/usr/lib/python3.10/subprocess.py", line 1863, in _execute_child raise child_exception_type(errno_num, err_msg, err_filename) FileNotFoundError: [Errno 2] No such file or directory: 'faToTwoBit'

EricKutschera commented 1 month ago

The novel isoform IDs like ESPRESSO:chr7:31230:24 are ESPRESSO:{chromosome}:{read_group}:{index_in_group}: https://github.com/Xinglab/espresso/blob/v1.5.0/src/ESPRESSO_Q_Thread.pm#L1621

The read_group is determined by ESPRESSO. In v1.5.0: https://github.com/Xinglab/espresso/tree/v1.5.0?tab=readme-ov-file#resources

By default ESPRESSO groups reads based on annotated gene coordinates. With --alignment_read_groups ESPRESSO groups reads if they have overlapping alignment coordinates

The index_in_group counts the number of novel isoforms detected in that group

The last column of the compatible isoforms file shows which transcripts ESPRESSO determined that the read could match to. The ISM read shown is compatible with many isoforms. That could happen if the read only has 1 or a few splice junctions and there are many detected isoforms that have that sequence of splice junctions. The m84068 sample1 single-exon NA line shows that no possible matching transcripts were found for that read. That particular read has no splice junctions (single-exon) and the aligned coordinates must not be contained in any annotated single-exon transcript

The error is saying that it can't find faToTwoBit which is a dependency for the visualization script: https://github.com/Xinglab/espresso/issues/57#issuecomment-2239302043

trista1115 commented 1 month ago
  1. I installed faToTwoBit and successfully ran the test data. However, when I ran it with my own data, no files were generated in the output folder target_genes. Below is the code I used:

test_gene="ENSMUSG00000040935" test_gene_name="GeneA" python3 ${espresso_vi}/visualize.py --genome-fasta ${fa_path} --updated-gtf ./ESPRESSO_S_output_q2/samples_N2_R0_updated.gtf --abundance-esp ./ESPRESSO_S_output_q2/samples_N2_R0_abundance.esp --target-gene ${test_gene}.* --minimum-count 1 --descriptive-name ${test_gene_name} --output-dir ./visualization

  1. In the file samples_N2_R0_abundance.esp, I noticed that some transcript_IDs with the same {read_group} are assigned to two different gene_IDs. Does this indicate that these transcript_IDs represent fusion transcripts?
EricKutschera commented 1 month ago

Your command had --target-gene ${test_gene}.*. The script should be run with the actual gene name (just ENSMUSG00000040935 not ENSMUSG00000040935.*). Then the script will check for matching gene names that potentially have a suffix like ENSMUSG00000040935.13

For a novel transcript, ESPRESSO will look at the splice junctions in that transcript and see if there are any annotated transcripts that use the same splice junction. Any gene IDs from those annotated transcripts will be assigned to the novel transcript: https://github.com/Xinglab/espresso/blob/v1.5.0/src/ESPRESSO_Q_Thread.pm#L2540

trista1115 commented 1 month ago

Hi @EricKutschera

image

What do the numbers in this figure represent? Were they manually added? In my own data, I observed that some boxes only have 5 or 8 counts. Could this be too low, considering that I noticed test_sirv has counts in the range of 300-2400?

EricKutschera commented 1 month ago

In the example for SIRV, 343.66 is the abundance for the SIRV201 isoform. The number is directly from the abundance.esp output file. (That image was generated with an old version of ESPRESSO that gave a slightly different abundance: https://github.com/Xinglab/espresso/blob/b0d7aef9ad95a26769d8f6e6cb851b591cec2308/test_data/expected_sirv_abundance.esp#L3 https://github.com/Xinglab/espresso/blob/v1.5.0/test_data/expected_sirv_abundance.esp#L3 )

The plot in IGV would show the same abundance value duplicated at each exon. The README explains that the image was edited after IGV: https://github.com/Xinglab/espresso/tree/v1.5.0?tab=readme-ov-file#igv

Move one expression value for each isoform to the left axis and remove the duplicate values

The test data happens to have high read counts for all isoforms, but in your data some isoforms may have only a few supporting reads