ablab / IsoQuant

Transcript discovery and quantification with long RNA reads (Nanopores and PacBio)
https://ablab.github.io/IsoQuant/

Ambiguous Novel Transcript Locations and Merging Transcript Models #233

Open BenneyMRArgue opened 3 weeks ago

BenneyMRArgue commented 3 weeks ago

Hi,

Thanks so much for developing this tool, I'm excited to be able to use this in my long-read isoform analysis!

A couple of questions have come up while looking through the output from some IsoQuant runs (v3.4.1). In the transcript model output (transcript_model_grouped_counts.tsv), I noticed that novel transcripts appear to be labeled "transcript####.chr##". I need to merge the outputs from individual IsoQuant runs (I have a large scRNA-seq dataset which I have had to split by sample because of resource limitations for each job), so I was concerned to realize that these transcripts appear to be numbered in the order in which IsoQuant processed them within each run. Is it possible to merge novel transcripts which have the same genomic coordinates but have been assigned different numbers in their respective runs?

I also noticed that some novel transcripts are listed multiple times in the tsv, connected to different chromosomes. For instance, I looked at one in IGV which was placed on both chromosome 7 and chromosome 10 (screenshot: IGV_novel_transcript_ambiguity). Do you have any insight into why this occurs and how to identify which location is correct?

Thanks, -Benney

andrewprzh commented 2 weeks ago

Dear @BenneyMRArgue

Thanks for the feedback!

I would recommend trying the latest IsoQuant version (3.5.2). It uses far less RAM than 3.4.1: a major problem was fixed in version 3.4.2, resulting in a ~10-30x decrease in RAM usage on the datasets tested. You would probably be able to process your whole dataset at once.

Regarding the duplicated transcripts: IsoQuant assigns transcript IDs sequentially, so independent runs will not give the same novel transcript the same ID. Unfortunately, this means it is impossible to track novel transcripts between different runs. Moreover, the chromosome name is part of the transcript ID, so it's OK to have transcript58.chr7.nic and transcript58.chr10.nic -- these are two completely different transcript IDs.
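To make the point about IDs concrete, here is a minimal Python sketch that splits an ID into its parts. It assumes IDs look like the two examples above (a number, a chromosome name, and a suffix, separated by dots); the helper name `split_novel_id` is hypothetical and not part of IsoQuant.

```python
def split_novel_id(transcript_id):
    """Split an ID such as 'transcript58.chr7.nic' into (number, chromosome, suffix).

    Assumes the layout 'transcript<N>.<chromosome>.<suffix>' seen in this thread.
    """
    prefix, rest = transcript_id.split(".", 1)   # 'transcript58', 'chr7.nic'
    chrom, suffix = rest.rsplit(".", 1)          # 'chr7', 'nic'
    number = int(prefix[len("transcript"):])
    return number, chrom, suffix


a = split_novel_id("transcript58.chr7.nic")
b = split_novel_id("transcript58.chr10.nic")
# Same running number, different chromosome: two distinct transcripts.
print(a, b, a == b)
```

The running number alone is not unique; only the full ID identifies a transcript within one run.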

If you still would like to merge different GTFs, I'd suggest using gffcompare tool.
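For illustration only, the idea of comparing transcripts by coordinates rather than by ID can be sketched in Python as below. This is not how IsoQuant or gffcompare works internally; it assumes standard tab-separated GTF with `exon` records and a `transcript_id` attribute, requires exact exon coordinates to match, and the function names are hypothetical.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def structure_keys(gtf_text):
    """Map each transcript_id to (chromosome, strand, sorted exon coordinates)."""
    exons = defaultdict(list)
    location = {}
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records define the structure
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        tid = match.group(1)
        exons[tid].append((int(fields[3]), int(fields[4])))
        location[tid] = (fields[0], fields[6])  # chromosome, strand
    return {
        tid: (location[tid][0], location[tid][1], tuple(sorted(coords)))
        for tid, coords in exons.items()
    }


def match_across_runs(keys_a, keys_b):
    """Return {id_in_run_a: [ids_in_run_b]} for transcripts with identical structure."""
    by_structure = defaultdict(list)
    for tid, key in keys_b.items():
        by_structure[key].append(tid)
    return {
        tid: by_structure[key]
        for tid, key in keys_a.items()
        if key in by_structure
    }
```

Exact matching like this is stricter than what a dedicated tool does: transcript start and end positions often differ slightly between runs, which is one reason to prefer gffcompare for real data.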

Best, Andrey

BenneyMRArgue commented 1 week ago

Hi Andrey,

Thanks for the input! It's good to know that the novel transcripts can't be compared between runs. I will first try processing everything together with the newest version.

Also thanks so much for clarifying that point about the transcript ids and chromosome assignments, it's a relief to find they are not supposed to be the same!

Best, -Benney