ablab / IsoQuant

Transcript discovery and quantification with long RNA reads (Nanopores and PacBio)
https://ablab.github.io/IsoQuant/

Reads mapping to known transcripts classified as novel #68

Closed tabeariepe closed 1 year ago

tabeariepe commented 1 year ago

Hi,

We tried to create a SQANTI-like file per transcript, using transcript_model_reads.tsv and SQANTI-like.tsv. We then noticed that some transcripts with an ENST ID are classified as novel. Here is an example from the two files:

m64167e_210819_012015/3867431/ccs ENST00000517808.1

m64167e_210819_012015/3867431/ccs chr1 - 2683 5 novel_in_catalog ENSG00000135749.19 ENST00000258229.14 7530 34 2751 4217 44 383 incomplete_intron_retention_left;alt_left_site_known;exon_elongation_right False 0 0 True 232984307 233295478 0.25 AAGATTATGTGAGGGTAGGG NA NA NA NA NA NA
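The cross-check described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of IsoQuant: the function name is hypothetical, both files are assumed to be tab-separated, and the column positions are read off the two example rows (read ID first in both files, assigned transcript second in transcript_model_reads.tsv, structural category sixth in the SQANTI-like file).

```python
import csv


def find_known_but_novel(model_reads_lines, sqanti_lines):
    """Return (read_id, transcript_id, category) for reads that are assigned
    to an ENST transcript but carry a 'novel*' SQANTI-like category."""
    # Assumption: column 0 = read ID, column 5 = structural category.
    category_by_read = {}
    for row in csv.reader(sqanti_lines, delimiter="\t"):
        if len(row) > 5 and not row[0].startswith("#"):
            category_by_read[row[0]] = row[5]

    hits = []
    # Assumption: column 0 = read ID, column 1 = assigned transcript ID.
    for row in csv.reader(model_reads_lines, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue
        read_id, transcript_id = row[0], row[1]
        category = category_by_read.get(read_id, "")
        if transcript_id.startswith("ENST") and category.startswith("novel"):
            hits.append((read_id, transcript_id, category))
    return hits


# Rows abbreviated from the example in this comment.
model_reads = ["m64167e_210819_012015/3867431/ccs\tENST00000517808.1"]
sqanti = ["m64167e_210819_012015/3867431/ccs\tchr1\t-\t2683\t5\tnovel_in_catalog"]
print(find_known_but_novel(model_reads, sqanti))
```

For real files, the two arguments can be file objects opened with `open(path, newline="")`, since `csv.reader` accepts any iterable of lines.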

andrewprzh commented 1 year ago

Hi @tabeariepe

Yes, transcript discovery is quite a complicated procedure with a lot of parameters, so some read assignments may not be straightforward. Nonetheless, this example is quite strange. Could you send me the information about this read from read_assignments.tsv?

Also, the best way to get SQANTI per-transcript classification is to actually run SQANTI on the output GTF :) That said, per-transcript SQANTI-like output is on the TODO list.

Best Andrey

tabeariepe commented 1 year ago

Hi Andrey,

This is from the read_assignments.tsv file:

m64167e_210819_012015/3867431/ccs chr1 - ENST00000258229.14 ENSG00000135749.19 inconsistent incomplete_intron_retention_3:233179176-233198938,alt_acceptor_site_known:233208610-233217897,exon_elongation_5:19 233196758-233199030,233200154-233200264,233208518-233208609,233217898-233217931,233218031-233218203 PolyA=False; Canonical=False; Classification=novel_in_catalog;

m64167e_210819_012015/3867431/ccs chr1 - ENST00000430153.5 ENSG00000135749.19 inconsistent incomplete_intron_retention_3:233179176-233198938,alt_acceptor_site_known:233208610-233217897,exon_elongation_5:19 233196758-233199030,233200154-233200264,233208518-233208609,233217898-233217931,233218031-233218203 PolyA=False; Canonical=False; Classification=novel_in_catalog;

m64167e_210819_012015/3867431/ccs chr1 - ENST00000475463.6 ENSG00000135749.19 inconsistent incomplete_intron_retention_3:233179176-233198938,alt_acceptor_site_known:233208610-233217897,exon_elongation_5:19 233196758-233199030,233200154-233200264,233208518-233208609,233217898-233217931,233218031-233218203 PolyA=False; Canonical=False; Classification=novel_in_catalog;
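Rows like these are easier to compare once they are split into their parts. The sketch below assumes a tab-separated, nine-column layout inferred from the rows above (read ID, chromosome, strand, isoform ID, gene ID, assignment type, events, exon blocks, additional info); the function and key names are hypothetical and not taken from IsoQuant.

```python
def parse_assignment_row(line):
    """Split one tab-separated row into typed pieces (assumed 9-column layout)."""
    fields = line.rstrip("\n").split("\t")
    read_id, chrom, strand, isoform_id, gene_id, kind, events, exons, info = fields[:9]

    # Events look like "name:detail", separated by commas.
    event_list = []
    for event in events.split(","):
        name, _, detail = event.partition(":")
        event_list.append((name, detail))

    # Exon blocks look like "start-end", separated by commas.
    exon_blocks = [tuple(int(x) for x in block.split("-")) for block in exons.split(",")]

    # Additional info looks like "Key=Value; Key=Value;".
    info_dict = {}
    for item in info.split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition("=")
            info_dict[key] = value

    return {"read_id": read_id, "chrom": chrom, "strand": strand,
            "isoform_id": isoform_id, "gene_id": gene_id, "type": kind,
            "events": event_list, "exons": exon_blocks, "info": info_dict}


# First row from this comment, rebuilt with tab separators.
example_row = "\t".join([
    "m64167e_210819_012015/3867431/ccs", "chr1", "-",
    "ENST00000258229.14", "ENSG00000135749.19", "inconsistent",
    "incomplete_intron_retention_3:233179176-233198938,"
    "alt_acceptor_site_known:233208610-233217897,exon_elongation_5:19",
    "233196758-233199030,233200154-233200264,233208518-233208609,"
    "233217898-233217931,233218031-233218203",
    "PolyA=False; Canonical=False; Classification=novel_in_catalog;",
])
record = parse_assignment_row(example_row)
print(record["isoform_id"], record["type"], record["info"]["Classification"])
```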

We actually had many transcripts with an ENST ID but a novel classification, so this is just one example.

We need some of the information from the SQANTI-like file to run a pipeline for long-read proteogenomics (https://github.com/sheynkman-lab/Long-Read-Proteogenomics), but we do not want to reclassify the transcripts with SQANTI or do additional filtering.

Best, Tabea

andrewprzh commented 1 year ago

Dear @tabeariepe

I see. I presume it is possible that some reads appear to be novel before clusterisation and intron correction, but then turn out to contribute to known isoforms during transcript discovery. Nonetheless, I'll try to figure out what's going on there. If you have a chance to send me the isoquant.log file, as well as the part of the BAM file covering this particular region, that would be very helpful!

I'm also in the process of developing features for the new release. It will likely include SQANTI-like output for novel transcripts, as well as additional information in the GTF, such as exon IDs and gene attributes from the original reference annotation.

Best Andrey

tabeariepe commented 1 year ago

Hi Andrey,

Here are the log file and the BAM file for the region:

isoquant.log m64167e_210819_012015.3867431.ccs.reads.bam.gz

The SQANTI-like output for novel transcripts would be really helpful for us.

Best, Tabea

rsalz commented 1 year ago

We are eagerly awaiting the SQANTI-like output for novel transcripts feature!!

andrewprzh commented 1 year ago

Dear @tabeariepe and @rsalz

SQANTI output for transcripts is now available in 3.2.0!

With respect to the reads you've sent: during read assignment, reads are assigned to the most similar isoforms. These reads have inconsistencies, e.g. elongated exons, and are thus classified as novel. However, during model construction only a subset of the reference isoforms is reported, and these inconsistent reads are assigned to other isoforms based on intron chain matching. Such behavior is not correct, and I will fix this at some point.
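To illustrate what matching on intron chains means in general, here is a toy sketch. It is not IsoQuant's implementation: the function names and the contiguous-sub-chain rule are assumptions for illustration, and the isoform coordinates in the second example are made up. The first example uses two exon blocks from the read above; the gap between them is the interval reported for its alt_acceptor_site_known event.

```python
def introns_from_exons(exons):
    """Introns are the gaps between consecutive exon blocks (1-based, inclusive)."""
    exons = sorted(exons)
    return [(left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(exons, exons[1:])]


def is_contiguous_subchain(read_introns, isoform_introns):
    """True if the read's introns occur consecutively within the isoform's chain."""
    n = len(read_introns)
    if n == 0:
        return False  # a read with no introns carries no chain to match
    return any(isoform_introns[i:i + n] == read_introns
               for i in range(len(isoform_introns) - n + 1))


# Two exon blocks of the read from this thread.
read_exons = [(233208518, 233208609), (233217898, 233217931)]
print(introns_from_exons(read_exons))

# Hypothetical isoform and reads with made-up coordinates.
isoform_introns = introns_from_exons([(100, 200), (300, 400), (500, 600)])
matching_read = introns_from_exons([(350, 400), (500, 580)])
shifted_read = introns_from_exons([(350, 410), (500, 580)])
print(is_contiguous_subchain(matching_read, isoform_introns))
print(is_contiguous_subchain(shifted_read, isoform_introns))
```

A check like this looks only at splice junctions, so a read whose exon ends differ from the reference (for example an elongated terminal exon) can still match, which is consistent with the discrepancy described above.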

Thanks for reporting!

Best Andrey

andrewprzh commented 1 year ago

@tabeariepe

Thanks a lot for sending the files previously. I managed to fix the issue with the assignment. I will release the new version soon.

Best Andrey

andrewprzh commented 1 year ago

IsoQuant 3.3 is released and should fix the issue.

Best Andrey