Corrected CDS gff file: error gene association

nlapalu commented 3 years ago

Hi, I am using the latest version of SQANTI3 and I have a problem with the evidence transcript / gene association when the overlap is very limited or restricted to UTRs. In my genomes (small fungal genomes), it's very common for the UTRs of different genes to overlap. So when I get the *_corrected.gtf.cds.gff file, some transcripts are linked to the wrong gene ID, which could be a problem for the isoform usage report and for analysis with tappAS. See an example below:

My gene:

chr_5   ingenannot      transcript      248101  250601  .       -       .       transcript_id "G_06124.1"; gene_id "G_06124";
chr_5   ingenannot      exon    248101  248499  .       -       .       transcript_id "G_06124.1"; gene_id "G_06124";
chr_5   ingenannot      exon    248555  250601  .       -       .       transcript_id "G_06124.1"; gene_id "G_06124";
chr_5   ingenannot      CDS     248309  248499  .       -       2       transcript_id "G_06124.1"; gene_id "G_06124";
chr_5   ingenannot      CDS     248555  250178  .       -       0       transcript_id "G_06124.1"; gene_id "G_06124";

My PacBio Iso-Seq file:

chr_5   ingenannot-isoform-ranking      transcript      248101  250601  .       -       .       gene_id "PB.4392";transcript_id "PB.4392.3";rank "1";
chr_5   ingenannot-isoform-ranking      exon    248101  248499  .       -       .       gene_id "PB.4392"; transcript_id "PB.4392.3";
chr_5   ingenannot-isoform-ranking      exon    248555  250601  .       -       .       gene_id "PB.4392"; transcript_id "PB.4392.3";
chr_5   ingenannot-isoform-ranking      transcript      248058  250584  .       -       .       gene_id "PB.4392";transcript_id "PB.4392.2";rank "2";
chr_5   ingenannot-isoform-ranking      exon    248058  248499  .       -       .       gene_id "PB.4392"; transcript_id "PB.4392.2";
chr_5   ingenannot-isoform-ranking      exon    248555  250232  .       -       .       gene_id "PB.4392"; transcript_id "PB.4392.2";
chr_5   ingenannot-isoform-ranking      exon    250422  250584  .       -       .       gene_id "PB.4392"; transcript_id "PB.4392.2";
chr_5   ingenannot-isoform-ranking      transcript      248182  250584  .       -       .       gene_id "PB.4392";transcript_id "PB.4392.5";rank "3";
chr_5   ingenannot-isoform-ranking      exon    248182  248499  .       -       .       gene_id "PB.4392"; transcript_id "PB.4392.5";
chr_5   ingenannot-isoform-ranking      exon    248555  250238  .       -       .       gene_id "PB.4392"; transcript_id "PB.4392.5";
chr_5   ingenannot-isoform-ranking      exon    250422  250584  .       -       .       gene_id "PB.4392"; transcript_id "PB.4392.5";
chr_5   ingenannot-isoform-ranking      transcript      248111  250601  .       -       .       gene_id "PB.4392";transcript_id "PB.4392.4";rank "4";
chr_5   ingenannot-isoform-ranking      exon    248111  250238  .       -       .       gene_id "PB.4392"; transcript_id "PB.4392.4";
chr_5   ingenannot-isoform-ranking      exon    250422  250601  .       -       .       gene_id "PB.4392"; transcript_id "PB.4392.4";
chr_5   ingenannot-isoform-ranking      transcript      250299  254011  .       -       .       gene_id "PB.4393";transcript_id "PB.4393.1";rank "1";
chr_5   ingenannot-isoform-ranking      exon    250299  254011  .       -       .       gene_id "PB.4393"; transcript_id "PB.4393.1";

The corrected file produced by SQANTI3:

chr_5   PacBio  transcript      248058  250584  .       -       .       gene_id "G_06124"; transcript_id "PB.4392.2";
chr_5   PacBio  exon    248058  248499  .       -       .       gene_id "G_06124"; transcript_id "PB.4392.2";
chr_5   PacBio  exon    248555  250232  .       -       .       gene_id "G_06124"; transcript_id "PB.4392.2";
chr_5   PacBio  exon    250422  250584  .       -       .       gene_id "G_06124"; transcript_id "PB.4392.2";
chr_5   PacBio  CDS     248309  248499  .       -       .       gene_id "G_06124"; transcript_id "PB.4392.2";
chr_5   PacBio  CDS     248555  250178  .       -       .       gene_id "G_06124"; transcript_id "PB.4392.2";
chr_5   PacBio  transcript      248101  250601  .       -       .       gene_id "G_06124"; transcript_id "PB.4392.3";
chr_5   PacBio  exon    248101  248499  .       -       .       gene_id "G_06124"; transcript_id "PB.4392.3";
chr_5   PacBio  exon    248555  250601  .       -       .       gene_id "G_06124"; transcript_id "PB.4392.3";
chr_5   PacBio  CDS     248309  248499  .       -       .       gene_id "G_06124"; transcript_id "PB.4392.3";
chr_5   PacBio  CDS     248555  250178  .       -       .       gene_id "G_06124"; transcript_id "PB.4392.3";
chr_5   PacBio  transcript      248111  250601  .       -       .       gene_id "G_06124"; transcript_id "PB.4392.4";
chr_5   PacBio  exon    248111  250238  .       -       .       gene_id "G_06124"; transcript_id "PB.4392.4";
chr_5   PacBio  exon    250422  250601  .       -       .       gene_id "G_06124"; transcript_id "PB.4392.4";
chr_5   PacBio  CDS     248511  250178  .       -       .       gene_id "G_06124"; transcript_id "PB.4392.4";
chr_5   PacBio  transcript      248182  250584  .       -       .       gene_id "G_06124"; transcript_id "PB.4392.5";
chr_5   PacBio  exon    248182  248499  .       -       .       gene_id "G_06124"; transcript_id "PB.4392.5";
chr_5   PacBio  exon    248555  250238  .       -       .       gene_id "G_06124"; transcript_id "PB.4392.5";
chr_5   PacBio  exon    250422  250584  .       -       .       gene_id "G_06124"; transcript_id "PB.4392.5";
chr_5   PacBio  CDS     248309  248499  .       -       .       gene_id "G_06124"; transcript_id "PB.4392.5";
chr_5   PacBio  CDS     248555  250178  .       -       .       gene_id "G_06124"; transcript_id "PB.4392.5";
chr_5   PacBio  transcript      250299  254011  .       -       .       gene_id "G_06124"; transcript_id "PB.4393.1";
chr_5   PacBio  exon    250299  254011  .       -       .       gene_id "G_06124"; transcript_id "PB.4393.1";
chr_5   PacBio  CDS     253669  253995  .       -       .       gene_id "G_06124"; transcript_id "PB.4393.1";

PB.4393.1 is associated with the gene ID G_06124, but this is wrong: there is only a small overlap! I am very confident in my transcript clusters (I performed several steps of filtering and reclustering). How can I avoid this problem? How can I keep the transcript clusters defined in the evidence file (PB.4392.X != PB.4393.X)?

The screenshot of the context:

Thanks for your help.

FJPardoPalacios commented 3 years ago

Hi,

I get your point. The differences you observe between the collapsed IDs (PB.XX.Y) and the gene that SQANTI3 associates with each transcript come from how each tool decides whether several isoforms belong to a single gene/locus.

cDNA Cupcake cannot tell you that they come from the same gene because it only uses the reference genome, which is not annotated. It doesn't know where a gene starts or ends. It just groups isoforms based on their overlapping splice junctions (SJ). If two isoforms from the same gene do not overlap at all, cDNA Cupcake will name them differently.

In the case of SQANTI3, we do have information about the reference transcriptome. That's why we can associate a transcript with an annotated gene if they overlap (if a transcript overlaps two genes, we associate it with the gene with the highest overlap). In the case you are showing, the highest overlap of PB.4393.1 is with G_06124 (G_06125 is on the opposite strand, so it doesn't count). So even if it is a small overlap, it is the biggest.
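The rule described above (largest same-strand overlap wins) can be sketched as follows. This is only an illustration of the stated criterion, not SQANTI3's actual code: `Interval`, `overlap_bp` and `best_gene` are hypothetical names, the comparison uses whole transcript spans rather than exons, and the G_06125 coordinates are invented because the thread does not give them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Interval:
    name: str
    start: int   # 1-based, inclusive (GTF convention)
    end: int
    strand: str


def overlap_bp(a: Interval, b: Interval) -> int:
    """Number of bases shared by two inclusive intervals (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def best_gene(transcript: Interval, genes: List[Interval]) -> Optional[Interval]:
    """Return the same-strand gene with the largest overlap, or None."""
    scored = [
        (overlap_bp(transcript, g), g)
        for g in genes
        if g.strand == transcript.strand  # opposite-strand genes do not count
    ]
    scored = [(bp, g) for bp, g in scored if bp > 0]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


# Coordinates taken from the GTF records in this thread.
pb_4393_1 = Interval("PB.4393.1", 250299, 254011, "-")
g_06124 = Interval("G_06124", 248101, 250601, "-")
# ASSUMPTION: G_06125 coordinates are not shown in the thread; these are made up.
g_06125 = Interval("G_06125", 251000, 253900, "+")

print(overlap_bp(pb_4393_1, g_06124))                 # 303
print(best_gene(pb_4393_1, [g_06124, g_06125]).name)  # G_06124
```

With the coordinates from the thread, the 303 bp shared with G_06124 is the only same-strand overlap, so it wins however small it is.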

What SQANTI3 does not do is link isoforms within the analyzed transcriptome. It doesn't build clusters; it just reports the most likely gene for each isoform to be associated with. For SQANTI3, PB.4393.X and PB.4393.Y are totally independent.

I don't know how you could fix this very specific problem, since it comes down to the very definition of what a gene is. Right now, overlap is our criterion for linking detected transcripts to annotated genes. You can try to find out how many times cDNA Cupcake and SQANTI3 differ in their "diagnosis" and manually change the gene IDs of your transcripts to curate them.

I'm sorry, but I think that this is a case-by-case situation. I guess that's why we still manually curate genome annotations...
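One way to list the cases where the two tools disagree, as suggested above, is to group the transcripts of the corrected GTF by gene ID and flag genes that collect more than one Cupcake cluster. The sketch below assumes that transcript IDs follow the `PB.<locus>.<isoform>` pattern and that attributes are written as `key "value";`, as in the records in this thread. The function names are hypothetical, and flagged genes still need manual review, since non-overlapping isoforms of one gene legitimately get different cluster IDs.

```python
import re
from collections import defaultdict

ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def cluster_of(transcript_id):
    """'PB.4392.3' -> 'PB.4392' (assumed Cupcake locus prefix)."""
    return transcript_id.rsplit(".", 1)[0]


def conflicting_genes(gtf_lines):
    """Return {gene_id: sorted cluster list} for genes spanning more than one cluster."""
    clusters = defaultdict(set)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        # Split on any whitespace so tab- and space-separated records both work.
        fields = line.strip().split(None, 8)
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "gene_id" in attrs and "transcript_id" in attrs:
            clusters[attrs["gene_id"]].add(cluster_of(attrs["transcript_id"]))
    return {gene: sorted(ids) for gene, ids in clusters.items() if len(ids) > 1}


# Two transcript records from the corrected file shown in this thread.
example = [
    'chr_5\tPacBio\ttranscript\t248101\t250601\t.\t-\t.\tgene_id "G_06124"; transcript_id "PB.4392.3";',
    'chr_5\tPacBio\ttranscript\t250299\t254011\t.\t-\t.\tgene_id "G_06124"; transcript_id "PB.4393.1";',
]
print(conflicting_genes(example))  # {'G_06124': ['PB.4392', 'PB.4393']}
```

To run it on a real file, pass an open file handle instead of the list, for example `with open(path) as fh: conflicting_genes(fh)`, where `path` points to your own `*_corrected.gtf.cds.gff` file.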

nlapalu commented 3 years ago

OK, thanks for the reply. I have written a piece of code to extract all the problematic cases (18) and fixed them by manual editing where necessary. Nevertheless, in the case above, I don't understand why the transcript was not detected as "antisense" to the next gene instead of as a new isoform. I think that comparing the overlap ratios could fix this situation. Best,

FJPardoPalacios commented 3 years ago

To classify a detected mono-exonic transcript as "antisense" to an already known mono-exonic transcript, we need something that relates them. Isoforms are defined by their chromosome, strand, TSS, splice junction(s) and TTS. Your case is mono-exonic and antisense, meaning that we can use neither the strand nor the SJ to link it to G_06125. If either the TSS or the TTS of PB.4393.X had matched the annotated TSS or TTS of G_06125 (regardless of strand), it would have been classified as antisense, but that's not the case...
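Read literally, the condition above asks whether either end of the transcript coincides with either annotated end of the gene, ignoring strand. A minimal sketch of that reading is shown below. It is an assumption about the logic rather than SQANTI3's implementation: `shares_an_end` is a hypothetical name, the match is exact (any tolerance SQANTI3 may apply is not modelled), and the G_06125 coordinates are invented because the thread does not list them.

```python
def shares_an_end(transcript, gene):
    """True if any transcript boundary equals any gene boundary, ignoring strand.

    Both arguments are (start, end) tuples in 1-based inclusive coordinates.
    """
    return bool(set(transcript) & set(gene))


pb_4393_1 = (250299, 254011)   # from the thread
# ASSUMPTION: hypothetical coordinates for G_06125 (not given in the thread).
g_06125 = (251000, 253900)

print(shares_an_end(pb_4393_1, g_06125))            # False: no shared TSS/TTS
print(shares_an_end(pb_4393_1, (251000, 254011)))   # True: one end coincides
```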

Regarding comparing the overlap ratio, in this case it would also create some problematic situations... For instance, imagine that PB.4393.1 extends a bit further into G_06124 and spans ~70% of that annotated isoform. At the same time, it far exceeds the length of G_06125 (100% overlap plus an extra piece of transcript), and moreover it is on the opposite strand. Which should carry more weight: an (excessive) overlap on the opposite strand, or a partial overlap on the right strand? From my point of view, the partial overlap at least matches the strand. The excessive overlap matches neither the strand nor the TTS nor the TSS...
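The hypothetical scenario above can be put into numbers. In this sketch only the G_06124 coordinates come from the thread. The extended transcript and G_06125 are invented so that the transcript covers about 70% of G_06124 and all of G_06125, and `covered_fraction` is a hypothetical helper working on whole spans rather than exons.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Shared bases between two 1-based inclusive intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def covered_fraction(tx, gene):
    """Fraction of the gene span that is covered by the transcript span."""
    gene_len = gene[1] - gene[0] + 1
    return overlap_bp(tx[0], tx[1], gene[0], gene[1]) / gene_len


# ASSUMPTION: an imaginary PB.4393.1 extended further into G_06124.
tx = (248851, 254011, "-")
genes = {
    "G_06124": (248101, 250601, "-"),  # coordinates from the thread
    "G_06125": (251000, 253900, "+"),  # ASSUMPTION: made-up coordinates
}

for name, gene in genes.items():
    frac = covered_fraction(tx, gene)
    same_strand = gene[2] == tx[2]
    print(f"{name}: {frac:.0%} covered, same strand: {same_strand}")
# G_06124: 70% covered, same strand: True
# G_06125: 100% covered, same strand: False
```

Ranking by covered fraction alone would pick G_06125 here even though it lies on the opposite strand, which is the ambiguity the reply points out.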

ConesaLab / SQANTI3 #65