Xinglab / espresso

How to filter the novel isoform in samples_N2_R0_abundance.esp file? #31

Closed kir1to455 closed 8 months ago

kir1to455 commented 10 months ago

Hi, thank you for developing ESPRESSO! My data has successfully completed the ESPRESSO_Q.pl step. I found 27263 annotated isoforms and 8482 novel isoforms in samples_N2_R0_updated.gtf. However, I noticed that some novel isoforms have values like "0.0 0.0 1.0 0.0 0.0 1.0" (6 groups) in samples_N2_R0_abundance.esp, and some novel isoforms do not have a gene name.

  1. How should I filter out novel transcripts with low expression levels? The values here don't seem to be read counts. Do you have a recommended threshold for filtering?
  2. How should I annotate these novel transcripts (for example as NMD, retained intron, lncRNA, ...)?

Best wishes, Kirito

EricKutschera commented 10 months ago

For novel isoforms, ESPRESSO looks for annotated isoforms that use any of the splice junctions in the novel isoform. If no annotated isoform uses any of those splice junctions, ESPRESSO doesn't report a gene for that novel isoform.
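As a rough illustration of that rule (not ESPRESSO's actual code), the lookup can be sketched in Python. The function name `genes_for_novel_isoform`, the `(chrom, start, end)` junction tuples, and all IDs and coordinates are made up for this example.

```python
def genes_for_novel_isoform(novel_junctions, annotated_isoforms):
    """Return genes whose annotated isoforms share a junction with the novel isoform.

    novel_junctions: set of (chrom, start, end) tuples (hypothetical representation).
    annotated_isoforms: dict of isoform ID -> (gene ID, set of junction tuples).
    An empty result corresponds to a novel isoform reported without a gene.
    """
    genes = set()
    for gene_id, junctions in annotated_isoforms.values():
        # Any shared splice junction is enough to link the gene.
        if novel_junctions & junctions:
            genes.add(gene_id)
    return sorted(genes)


# Made-up annotation: two isoforms of one gene, one isoform of another.
annotated = {
    "tx1": ("geneA", {("chr1", 100, 200), ("chr1", 300, 400)}),
    "tx2": ("geneA", {("chr1", 100, 250)}),
    "tx3": ("geneB", {("chr2", 500, 600)}),
}

with_gene = genes_for_novel_isoform({("chr1", 300, 400), ("chr1", 450, 480)}, annotated)
without_gene = genes_for_novel_isoform({("chr3", 10, 20)}, annotated)
```

Here `with_gene` lists `geneA` because one junction matches `tx1`, while `without_gene` is empty, which is the case where no gene name appears in the output.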

The values are read counts, but a single read can be split among isoforms: https://github.com/Xinglab/espresso/tree/v1.3.2#output

The counts are assigned by expectation maximization. Each input read contributes at most 1 count, either to a single isoform or distributed as fractional counts to multiple isoforms
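To show how fractional counts can arise, here is a minimal, generic expectation-maximization sketch in Python. It is not ESPRESSO's implementation: the function `em_counts`, the uniform starting abundances, the fixed iteration count, and the toy reads are all assumptions made for illustration.

```python
def em_counts(read_compat, n_iter=50):
    """Toy EM: split each read among the isoforms it is compatible with.

    read_compat: list of tuples, one per read, naming the compatible isoforms.
    Returns a dict of isoform -> (possibly fractional) read count.
    """
    isoforms = sorted({iso for compat in read_compat for iso in compat})
    # Start from uniform relative abundances (an assumption of this sketch).
    theta = {iso: 1.0 / len(isoforms) for iso in isoforms}
    counts = {iso: 0.0 for iso in isoforms}
    for _ in range(n_iter):
        # E-step: each read contributes exactly 1 count in total,
        # divided in proportion to the current abundance estimates.
        counts = {iso: 0.0 for iso in isoforms}
        for compat in read_compat:
            total = sum(theta[iso] for iso in compat)
            for iso in compat:
                counts[iso] += theta[iso] / total
        # M-step: re-estimate relative abundances from the counts.
        n_reads = sum(counts.values())
        theta = {iso: c / n_reads for iso, c in counts.items()}
    return counts


# Two reads match only isoform A; one read is ambiguous between A and B.
toy_reads = [("A",), ("A",), ("A", "B")]
toy_counts = em_counts(toy_reads)
```

In this toy case the ambiguous read is pulled toward isoform A because A already has unambiguous support, and the counts over all isoforms still sum to the number of reads.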

I don't have a recommended threshold to filter out transcripts with low expression
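If you want to apply a cutoff of your own, a filter could look like the Python sketch below. It assumes the abundance file is tab-separated with `transcript_ID`, `transcript_name`, and `gene_ID` columns followed by one column per sample; please check that against your own file. The `min_count` and `min_samples` defaults are arbitrary placeholders, not recommended values, and the example rows are made up.

```python
import csv
import io

# Assumed identifier columns; every other column is treated as a sample.
ID_COLUMNS = ("transcript_ID", "transcript_name", "gene_ID")


def filter_abundance(handle, min_count=2.0, min_samples=2):
    """Keep isoforms with at least min_count reads in at least min_samples samples."""
    reader = csv.DictReader(handle, delimiter="\t")
    sample_cols = [c for c in reader.fieldnames if c not in ID_COLUMNS]
    kept = []
    for row in reader:
        n_passing = sum(float(row[c]) >= min_count for c in sample_cols)
        if n_passing >= min_samples:
            kept.append(row)
    return kept


# Made-up rows standing in for a real abundance file.
example = (
    "transcript_ID\ttranscript_name\tgene_ID\ts1\ts2\ts3\n"
    "NOVEL_EXAMPLE\tNA\tNA\t0.0\t1.0\t0.0\n"
    "ANNOTATED_EXAMPLE\ttxA\tgeneA\t5.5\t3.0\t0.0\n"
)
kept_rows = filter_abundance(io.StringIO(example))
kept_ids = [row["transcript_ID"] for row in kept_rows]
```

For a real file, pass an open file handle instead of the `io.StringIO` object, for example `open("samples_N2_R0_abundance.esp", newline="")`.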

This paper identifies possible NMD transcripts from ESPRESSO output: https://doi.org/10.1038/s41467-023-40083-6
The corresponding code: https://github.com/Xinglab/TEQUILA-seq#identification-of-nmd-targeted-transcript-isoforms

This code can identify splicing events like retained intron from ESPRESSO output: https://github.com/Xinglab/rMATS-long/#classify-isoform-differences