Closed filonico closed 2 years ago
I extracted the gene and tried versions 0.8.0 and 0.9.2 of AGAT and did not experience any problem. I only get two "mRNA" features, which are not considered isoforms because they are not of the same type: one is a non-coding RNA and does not have a CDS.
Thank you very much for your quick response.
So now I'm wondering if I actually understood how agat_sp_keep_longest_isoform.pl works. Say we have the mRNA in the picture below and the corresponding 3 encoded isoforms, according to the gff file.
Now, would AGAT retain "isoform 1" (which is actually the longest), or would it create a "new longest isoform" composed of CDS1, 2, 3 and 4?
Short: AGAT would retain "isoform 1"
Long: From what I understand, the chimeric mRNA you describe at the top does not exist. You have 3 mRNA isoforms (second, third and fourth lines). What you drew at the top does not exist as a mature RNA but corresponds to what might be the pre-mRNA. So AGAT would keep the first one (it concatenates the CDS of each mRNA, and the mRNA with the longest concatenated CDS is the one kept).
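To make the selection rule above concrete, here is a minimal Python sketch. It is not AGAT's actual code (AGAT is written in Perl); it only illustrates the idea of summing the CDS lengths of each mRNA and keeping the mRNA with the largest total. The function name, the tuple layout and the toy coordinates are all hypothetical, and on a tie this sketch simply keeps the first mRNA it saw.

```python
from collections import defaultdict

def longest_isoform_per_gene(cds_features):
    """Pick, for each gene, the mRNA with the largest summed CDS length.

    cds_features: iterable of (gene_id, mrna_id, start, end) tuples,
    with 1-based inclusive coordinates as in GFF.
    """
    cds_length = defaultdict(int)  # mrna_id -> total CDS length
    gene_of = {}                   # mrna_id -> gene_id
    for gene_id, mrna_id, start, end in cds_features:
        cds_length[mrna_id] += end - start + 1
        gene_of[mrna_id] = gene_id

    best = {}  # gene_id -> mrna_id with the longest total CDS so far
    for mrna_id, length in cds_length.items():
        gene_id = gene_of[mrna_id]
        # Strict ">" means the first mRNA seen wins a tie (an assumption).
        if gene_id not in best or length > cds_length[best[gene_id]]:
            best[gene_id] = mrna_id
    return best

# Toy data with made-up coordinates: three isoforms of one gene.
toy = [
    ("gene1", "isoform1", 100, 199),  # CDS1
    ("gene1", "isoform1", 300, 399),  # CDS2
    ("gene1", "isoform1", 500, 599),  # CDS3
    ("gene1", "isoform2", 100, 199),  # CDS1
    ("gene1", "isoform2", 700, 749),  # CDS4
    ("gene1", "isoform3", 300, 399),  # CDS2
    ("gene1", "isoform3", 700, 749),  # CDS4
]
print(longest_isoform_per_gene(toy))  # {'gene1': 'isoform1'}
```

Note that no new record is built from the union of CDS1 to CDS4: an existing mRNA is selected as-is.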
Yeah sure, you are right! Clearly the mRNA should be a gene (or even better the pre-mRNA, as you said)... I should go back to basics of molecular biology, I guess 🥲
Thanks a lot!!
Could you modify the drawing and remove the "new longest isoform by agat"? I'm afraid it could be confusing for other users. They often don't read and just look at the pictures.
Dear Jacques,
I'm using the script agat_sp_keep_longest_isoform.pl (v.0.8.0 from a conda installation in Ubuntu) on the genome annotation of the mussel Mytilus galloprovincialis from NCBI (you can download the gff file from here). However, I'm facing some issues: AGAT seems to succeed in removing some isoforms (the raw count of predicted proteins drops from 78,735 to 74,575 after the filtering) but not all of them (the BUSCO score of duplicated genes is 32.8% before the filtering and 30.1% after).
For example, if you look at the gene MGAL_10B075785 in the gff file, you will find that it should encode 6 different isoforms (from VDI47525.1 to VDI47530.1), since they share both the Parent and the locus_tag. However, after running the aforementioned script, all six isoforms (or at least some exons of them) are retained. I suspect that similar situations are also occurring in other genes for this specific genome assembly (in other genomes AGAT works great).
Can this be a problem of the annotation file itself? Maybe some mistakes or a wrong format?
Thanks a lot!
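One way to check a question like this independently of AGAT is to count, in the filtered gff, how many mRNAs with at least one CDS each gene still has. The Python sketch below is a rough diagnostic, not part of AGAT. It assumes a GFF3 file where transcripts have the feature type "mRNA", coding parts have the type "CDS", and the hierarchy is expressed through ID and Parent attributes. The function names and the demo records are hypothetical.

```python
from collections import defaultdict

def parse_attributes(column9):
    """Turn a GFF3 attribute column ('ID=x;Parent=y') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

def coding_mrnas_per_gene(gff_lines):
    """Map each gene ID to the list of its mRNAs that have at least one CDS."""
    mrna_parent = {}  # mRNA ID -> parent gene ID
    coding = set()    # IDs of features that are the Parent of some CDS
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip malformed lines
        attrs = parse_attributes(cols[8])
        if cols[2] == "mRNA" and "ID" in attrs and "Parent" in attrs:
            mrna_parent[attrs["ID"]] = attrs["Parent"]
        elif cols[2] == "CDS" and "Parent" in attrs:
            # A CDS may list several parents separated by commas.
            coding.update(attrs["Parent"].split(","))

    genes = defaultdict(list)
    for mrna_id, gene_id in mrna_parent.items():
        if mrna_id in coding:
            genes[gene_id].append(mrna_id)
    return dict(genes)

# Made-up records: g1 has two coding mRNAs (m1, m2) and one without CDS (m3).
demo = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t1\t900\t.\t+\t.\tID=g1",
    "chr1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=m1;Parent=g1",
    "chr1\tsrc\tCDS\t1\t300\t.\t+\t0\tID=c1;Parent=m1",
    "chr1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=m2;Parent=g1",
    "chr1\tsrc\tCDS\t400\t600\t.\t+\t0\tID=c2;Parent=m2",
    "chr1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=m3;Parent=g1",
]
result = coding_mrnas_per_gene(demo)
multi = {gene: mrnas for gene, mrnas in result.items() if len(mrnas) > 1}
print(multi)  # {'g1': ['m1', 'm2']}
```

On a real file, the lines could be read with open() and passed to coding_mrnas_per_gene. Any gene still listed in multi after filtering would be worth inspecting by hand, for example to see whether its mRNAs really share the same Parent.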