NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0

Issues with agat_sp_keep_longest_isoform.pl #269

Closed filonico closed 2 years ago

filonico commented 2 years ago

Dear Jacques,

I'm using the script agat_sp_keep_longest_isoform.pl (v0.8.0, from a conda installation on Ubuntu) on the genome annotation of the mussel Mytilus galloprovincialis from NCBI (you can download the gff file from here). However, I'm facing some issues: AGAT succeeds in removing some isoforms (the raw count of predicted proteins drops from 78,735 to 74,575 after filtering) but apparently not all of them (the BUSCO score for duplicated genes is 32.8% before filtering and 30.1% after).

For example, if you look at the gene MGAL_10B075785 in the gff file, you will find that it should encode 6 different isoforms (from VDI47525.1 to VDI47530.1), since they share both the Parent and the locus_tag. However, after running the script, all six isoforms (or at least some of their exons) are retained. I suspect that similar situations also occur in other genes of this genome assembly (on other genomes AGAT works very well).

Could this be a problem with the annotation file itself? Maybe some mistakes or a wrong format?

Thanks a lot!

Juke34 commented 2 years ago

I extracted the gene and tried versions 0.8.0 and 0.9.2 of AGAT and did not experience any problem. I only get two "mRNA" features, which are not considered isoforms of each other because they are not of the same type: one is a non-coding RNA and does not have a CDS.
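To check a gene like this by hand, it can help to list its transcript-level features and see which of them actually own a CDS. Below is a minimal Python sketch of that check, assuming a standard 9-column GFF3 file with `ID` and `Parent` attributes. The function names and the sample records are hypothetical, and this is not AGAT's own code.

```python
def parse_attributes(field):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}."""
    pairs = (item.split("=", 1) for item in field.strip().split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def transcripts_by_gene(gff_text):
    """Map each gene ID to {transcript ID: True if the transcript has a CDS}."""
    records = []
    for line in gff_text.splitlines():
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue  # skip comments, directives and malformed lines
        records.append((cols[2], parse_attributes(cols[8])))

    # Every feature of type "gene" becomes a key.
    genes = {attrs["ID"]: {} for ftype, attrs in records
             if ftype == "gene" and "ID" in attrs}
    # IDs of all features that are the Parent of at least one CDS.
    cds_parents = {parent for ftype, attrs in records if ftype == "CDS"
                   for parent in attrs.get("Parent", "").split(",") if parent}
    # A feature whose Parent is a gene is treated as a transcript.
    for ftype, attrs in records:
        parent = attrs.get("Parent")
        if parent in genes and "ID" in attrs:
            genes[parent][attrs["ID"]] = attrs["ID"] in cds_parents
    return genes


# Hypothetical records: tx1 is coding, tx2 has an exon but no CDS.
SAMPLE = "\n".join([
    "chr1\tsrc\tgene\t1\t1000\t.\t+\t.\tID=gene1",
    "chr1\tsrc\tmRNA\t1\t1000\t.\t+\t.\tID=tx1;Parent=gene1",
    "chr1\tsrc\tCDS\t1\t300\t.\t+\t0\tID=cds1;Parent=tx1",
    "chr1\tsrc\tmRNA\t1\t1000\t.\t+\t.\tID=tx2;Parent=gene1",
    "chr1\tsrc\texon\t1\t1000\t.\t+\t.\tID=ex2;Parent=tx2",
])

print(transcripts_by_gene(SAMPLE))
# {'gene1': {'tx1': True, 'tx2': False}}
```

A transcript flagged `False` here has no CDS, so it would not be compared against the coding transcripts of the same gene.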

filonico commented 2 years ago

Thank you very much for your quick response.

So now I'm wondering whether I have understood correctly how agat_sp_keep_longest_isoform.pl actually works. Say we have the mRNA in the picture below and the corresponding 3 encoded isoforms, according to the gff file.

Now, would AGAT retain "isoform 1" (which is actually the longest), or would it create a "new longest isoform" composed of CDS1, 2, 3 and 4?

[image]

Juke34 commented 2 years ago

Short: AGAT would retain "isoform 1"
Long: From what I understand, the chimeric mRNA you describe at the top does not exist. You have 3 mRNA isoforms (second, third and fourth lines). What you drew at the top does not exist as a mature RNA but corresponds to what might be the pre-mRNA. So AGAT would keep the first one (it concatenates the CDS of each mRNA, and the mRNA with the longest concatenated CDS is the one kept).
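The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the idea (sum the CDS segments of each mRNA and keep the mRNA with the largest total), not AGAT's actual Perl implementation. The function name, the isoform names and the coordinates are hypothetical, and the tie-breaking (first one encountered wins) is an assumption of this sketch.

```python
def longest_isoform(cds_by_mrna):
    """Return the mRNA ID whose CDS segments sum to the greatest length.

    cds_by_mrna maps an mRNA ID to a list of (start, end) CDS intervals,
    1-based and inclusive, as in GFF.
    """
    def cds_length(intervals):
        return sum(end - start + 1 for start, end in intervals)

    # max() returns the first key reaching the maximum, so ties go to
    # the isoform listed first (an assumption of this sketch).
    return max(cds_by_mrna, key=lambda mrna_id: cds_length(cds_by_mrna[mrna_id]))


# Hypothetical isoforms of one gene; no isoform uses every CDS segment.
isoforms = {
    "isoform_1": [(1, 100), (201, 300), (401, 500)],  # 300 nt of CDS
    "isoform_2": [(1, 100), (401, 500)],              # 200 nt of CDS
    "isoform_3": [(201, 300), (601, 650)],            # 150 nt of CDS
}

print(longest_isoform(isoforms))  # isoform_1
```

Note that the result is always one of the existing isoforms. No new transcript is assembled from the union of all CDS segments.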

filonico commented 2 years ago

Yeah sure, you are right! Clearly the mRNA should be a gene (or even better the pre-mRNA, as you said)... I should go back to basics of molecular biology, I guess 🥲

Thanks a lot!!

Juke34 commented 2 years ago

Could you modify the drawing and remove the "new longest isoform by agat"? I'm afraid it can be confusing for other users. They often don't read and just look at the pictures.