Xinglab / espresso

Other
58 stars 4 forks source link

Novel isoform not detected #62

Open stfacc opened 4 months ago

stfacc commented 4 months ago

In my RNAseq there's a novel isoform (extended exon) which is not detected by ESPRESSO. (See BAM file with ESPRESSO gtf below). Only two reads are visible supporting the isoform, but there are actually many more.

Is there any option I can try to identify it?

Many thanks

image

EricKutschera commented 4 months ago

You could add a transcript with that novel splice junction or the full transcript that you see in IGV to your .gtf file. Then ESPRESSO would treat the isoform as annotated and report its abundance

You could try reducing --read_num_cutoff to 1 or 0 (default is 2). That reduces how many "perfect" reads ESPRESSO needs to see for a junction before using it in a novel isoform

If you know the read IDs that you think are for that novel isoform then you could check in the output files to see what ESPRESSO did with those reads. If you ran the Q step with -V compatible_isoform.tsv then you could search for the read IDs in that file to see if ESPRESSO assigned them to another isoform. You could also search for the read IDs in the files like {chr}_read_final.txt. It might show that ESPRESSO marked certain junctions in those reads as "fail" or realigned some of the junctions. If the read IDs aren't in any read_final file then they were filtered out. You could try reducing --mapq_cutoff (default 1) or increasing --inserted_cont_cutoff (default 20) to avoid filtering out those reads

stfacc commented 4 months ago

Thank you Eric for your suggestions. I repeated the process with more samples, and including the standard gencode v45 annotation file. No custom SJ annotations.

This time the isoform in my first comment was detected.

There are however two other problems:

  1. An isoform with a 4-bases extended exon is not identified (see figure below)

I added the option --internal_boundary_limit 3, but didn't help.

In file chr12_read_final.txt I see this line for one of the extended reads:

b4a89131-ea7a-4490-b368-18535897f2a4 SJ 1 chr12:21437814:21440363:x 2892 3009 corrected chr12:21437814:21440367:0 2894.5 3006.5 0.000651527191140263 no yes

So the reads with the new SJ are not filtered out but corrected (even with --internal_boundary_limit 3).

image

  1. An intron retention is not detected.

It's weird that ESPRESSO could identify the same intron retention when processing only a subset of the samples. Not sure how to troubleshoot this issue.

The corresponding reads are marked "ISM" in compatible_isoform.tsv

image

Thanks

EricKutschera commented 4 months ago

For the 4 base different splice junction it looks like the issue is the sequence at the junction. The original alignment is chr12:21437814:21440363:x the :x means that ESPRESSO didn't have that junction annotated and it couldn't match the two bases on the two ends of the junction to what it expects for either the + or - strand. These are the bases ESPRESSO associates with each strand: https://github.com/Xinglab/espresso/blob/v1.4.0/src/ESPRESSO_S.pl#L208

The annotated junction looks like its GT AG. The other junction would be GT AA. ESPRESSO won't create a novel isoform unless all the junctions are either annotated or have the expected junction sequence

For the intron retention, it looks like the intron is between the last 2 exons of the transcript. In that case ESPRESSO would treat the read as ending at the 2nd to last exon but with a different transcript endpoint. When ESPRESSO classifies a read as ISM it's only looking at the splice junctions in the read. Since the intron is more than 1000 bases, even if those reads are ISM they should fail the endpoint check to the annotated isoform they share the junctions with. ESPRESSO could report it as a novel isoform, but only if there are no reads assigned to the annotated isoform that it is ISM in relation to