enormandeau / gawn

Genome Annotation Without Nightmares
41 stars 13 forks source link

Correct mapping but strange gene model #14

Closed soungalo closed 1 year ago

soungalo commented 4 years ago

Hello, I ran GAWN using A. thaliana reference transcripts on a genome assembly from the same species. I notice quite a few cases in which transcripts seem to map very well to the assembly, but the resulting gene model is very different from the reference gene model, especially in terms of CDS features. Here's an example. The transcript AT1G01190.2 has the following reference gene model:

1       araport11       mRNA    83045   84946   .       -       .       ID=transcript:AT1G01190.2;Parent=gene:AT1G01190;biotype=protein_coding;transcript_id=AT1G01190.2
1       araport11       exon    83045   83671   .       -       .       Parent=transcript:AT1G01190.2;Name=AT1G01190.2.exon2;constitutive=0;ensembl_end_phase=0;ensembl_phase=0;exon_id=AT1G01190.2.exon2;rank=2
1       araport11       CDS     83045   83671   .       -       0       ID=CDS:AT1G01190.2;Parent=transcript:AT1G01190.2;protein_id=AT1G01190.2
1       araport11       CDS     83884   84879   .       -       0       ID=CDS:AT1G01190.2;Parent=transcript:AT1G01190.2;protein_id=AT1G01190.2
1       araport11       exon    83884   84946   .       -       .       Parent=transcript:AT1G01190.2;Name=AT1G01190.2.exon1;constitutive=0;ensembl_end_phase=0;ensembl_phase=-1;exon_id=AT1G01190.2.exon1;rank=1
1       araport11       five_prime_UTR  84880   84946   .       -       .       Parent=transcript:AT1G01190.2

whereas in GAWN output gff3, it looks like this:

1_RaGOO indexed_genome  mRNA    77996   79895   .       -       .       ID=AT1G01190.2.mrna1;Name=AT1G01190.2;Parent=AT1G01190.2.path1;Dir=sensecoverage=100.0;identity=99.7;matches=1685;mismatches=3;indels=2;unknowns=0
1_RaGOO indexed_genome  exon    78835   79895   99      -       .       ID=AT1G01190.2.mrna1.exon1;Name=AT1G01190.2;Parent=AT1G01190.2.mrna1;Target=AT1G01190.2 1 1063 +
1_RaGOO indexed_genome  exon    77996   78622   99      -       .       ID=AT1G01190.2.mrna1.exon2;Name=AT1G01190.2;Parent=AT1G01190.2.mrna1;Target=AT1G01190.2 1064 1690 +
1_RaGOO indexed_genome  CDS     79830   79893   93      -       0       ID=AT1G01190.2.mrna1.cds1;Name=AT1G01190.2;Parent=AT1G01190.2.mrna1;Target=AT1G01190.2 3 68 +

I extracted the transcript and protein sequences based on the GFF3 output. The transcript is highly similar to the reference transcript, with only 2 gaps present in the annotated assembly (you can see the alignment here). However, the resulting protein is very short with virtually no similarity to the reference protein. This is not surprising since the annotated CDS is so short.

As far as I understand, the GFF3 result arises directly from the GMAP analysis, so I'm not sure this is in fact a GAWN issue. I'm also not particularly clear on how GMAP predicts CDS features based on transcript sequences only. Can you help me understand the reason for the problem or suggest ways to better explore it or get around it? Do you think that the two gaps in the assembly can cause such a difference in the gene model prediction, or is it more likely a more general issue with GMAP? Thanks!

soungalo commented 4 years ago

Another relevant observation - TransDecoder seems to do a better job at predicting the CDS sequence, given the transcripts extracted from the GAWN output, so that might be a lead or a way around the problem.

enormandeau commented 4 years ago

As you say, the GAWN gff3 file is the direct result of GMAP. This may be worthy of sharing with the authors of GMAP. I personally have no explanation as to why the CDS would be so different.

Can you tell me more about how you used TransDecoder to predict the CDS sequences from GAWN's output?

soungalo commented 4 years ago

I just extracted the exons sequences and concatenated them per mRNA (used a homemade script). Then I just provided these sequences to TransDecoder (LongOrfs and then Predict) and took results from the .pep output file. I'll try to contact the GMAP developers and show them this result.