ConesaLab / SQANTI3

Tool for the Quality Control of Long-Read Defined Transcriptomes
GNU General Public License v3.0

understanding the dup isoforms in the classification file #216

Open alexyfyf opened 1 year ago

alexyfyf commented 1 year ago

Hi team,

I recently ran SQANTI3 with FASTA as input, and the result table contains some duplicate isoforms. I had a quick check and found that the duplication first appears in the corrected.gtf file, which seems to be generated after aligning with minimap2: the isoform is split because of supplementary alignments.

I did not find much detail about this; could you explain a bit? Does it make sense to consider the duplicates separately? Sometimes the duplicate is far away, say on a different chromosome, but sometimes it is quite close.

I can give examples here. On the same chromosome (430 kb apart): [Screen Shot 2023-08-23 at 13 57 59]. Across chromosomes: [Screen Shot 2023-08-23 at 13 59 16].
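For anyone wanting to list such cases quickly, here is a minimal sketch that reports isoform IDs occurring on more than one row of a tab-separated classification table. The column names `isoform` and `chrom`, the function name `find_split_isoforms`, and the toy IDs are assumptions for illustration; check them against the header of your own file.

```python
import csv
import io
from collections import defaultdict

def find_split_isoforms(tsv_text, id_col="isoform", chrom_col="chrom"):
    """Return {isoform_id: [chrom, ...]} for IDs that appear on more than one row."""
    chroms_by_id = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        chroms_by_id[row[id_col]].append(row[chrom_col])
    return {iso: chroms for iso, chroms in chroms_by_id.items() if len(chroms) > 1}

# Toy table with made-up IDs (assumed column names, not real output)
example = (
    "isoform\tchrom\tstrand\n"
    "PB.1.1\tchr1\t+\n"
    "PB.2.1\tchr3\t-\n"
    "PB.2.1\tchr7\t+\n"
    "PB.3.1\tchr5\t+\n"
    "PB.3.1\tchr5\t+\n"
)

dups = find_split_isoforms(example)
for iso, chroms in dups.items():
    kind = "cross-chromosome" if len(set(chroms)) > 1 else "same chromosome"
    print(iso, chroms, kind)
```

Separating the cross-chromosome cases from the same-chromosome ones makes it easier to decide which duplicates deserve a closer look.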

Cheers, Alex

aarzalluz commented 1 year ago

Hi @alexyfyf,

I am not sure why this is happening (SQANTI3 runs minimap2 with --secondary=no, but I guess there could still be some supplementary alignments). Have you tried any of the other implemented mappers?

If you want more control over the process, I would suggest mapping the isoforms outside of SQANTI3, using more stringent parameters or filtering out supplementary alignments, and then running the QC script using the recommended GTF input.
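As an illustration of the filtering step, here is a minimal sketch that drops secondary and supplementary records from SAM text by testing the FLAG field. The flag bits 0x100 (secondary) and 0x800 (supplementary) come from the SAM specification; the function name `keep_primary` and the toy records are made up for the example, and this is not how SQANTI3 itself handles alignments.

```python
SECONDARY = 0x100      # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM flag bit: supplementary (chimeric) alignment

def keep_primary(sam_lines):
    """Keep header lines and records that are neither secondary nor supplementary."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            kept.append(line)  # header lines pass through unchanged
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        kept.append(line)
    return kept

# Toy SAM records (hypothetical isoform names and coordinates)
toy_sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "iso1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "iso1\t2048\tchr2\t500\t60\t30M20H\t*\t0\t0\t*\t*",
    "iso2\t16\tchr1\t300\t60\t50M\t*\t0\t0\t*\t*",
    "iso2\t256\tchr1\t700\t0\t50M\t*\t0\t0\t*\t*",
]

filtered = keep_primary(toy_sam)
print(len(filtered))
```

On real BAM/SAM files the same exclusion can be done with `samtools view -F 0x900`. Note that this discards the supplementary part of a chimeric alignment, so only the primary segment of a split isoform is kept.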

Best,

Ángeles

alexyfyf commented 1 year ago

Thank you for your prompt reply and suggestion. I think setting --secondary=no does not prevent supplementary alignments. I will try to process them outside SQANTI3. Also, from my point of view, these should be considered fusion genes.

However, I do see fusion genes in the SQANTI3 classification output, but they are always two nearby genes that are chained together. I am also wondering whether that is the appropriate interpretation. To me, they are more likely to be either read-through transcripts or the result of overlapping gene annotations. BTW, I use GENCODE as suggested in your wiki.

[Screen Shot 2023-08-23 at 21 54 55]

I also ran the GENCODE GTF through SQANTI3, and it mistakenly classifies some annotated transcripts as fusion. The first one looks like a read-through; the second is because GLYATL1 (ENST00000534063) and GLYATL1P4 (ENST00000529326) share an exon. [Screen Shot 2023-08-23 at 21 59 46]
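To check whether a fusion call could simply reflect overlapping annotations, one option is to compare the exon coordinates of the two reference transcripts. Below is a minimal sketch, assuming exons are given as closed (start, end) intervals on the same chromosome and strand; the function name `overlapping_exons` and the coordinates are made up and are not the real GLYATL1 or GLYATL1P4 exons.

```python
def overlapping_exons(exons_a, exons_b):
    """Return pairs of exons (a, b) whose closed intervals overlap."""
    hits = []
    for a_start, a_end in exons_a:
        for b_start, b_end in exons_b:
            # Two closed intervals overlap when each starts before the other ends
            if a_start <= b_end and b_start <= a_end:
                hits.append(((a_start, a_end), (b_start, b_end)))
    return hits

# Hypothetical transcripts: the last exon of tx_a is identical to the first of tx_b
tx_a = [(100, 200), (300, 400), (500, 600)]
tx_b = [(500, 600), (700, 800)]

shared = overlapping_exons(tx_a, tx_b)
print(shared)
```

If two annotated genes share or overlap an exon like this, an isoform using that exon can match both genes, which may be enough to make it look like a fusion even when no chimeric transcript exists.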

I'd like some suggestions about how to understand this.

Cheers, Alex