NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0

Possible bug (v1.0 and 1.2) #389

Closed robertewhite closed 1 year ago

robertewhite commented 1 year ago

Discussed in https://github.com/NBISweden/AGAT/discussions/388

Originally posted by **robertewhite** August 4, 2023

Hi,

I have been using AGAT to assemble GFF3 files for Epstein-Barr virus, and have had some odd error messages in the log file. As far as I can tell, the output GFF3 looks fine, but it is a lot to analyse in forensic detail. The message is:

"Use of uninitialized value in lc at /Users/xxx/miniconda3/envs/agatenv3/lib/perl5/site_perl/AGAT/OmniscientI.pm line 1069, line 511." [and same on many other lines]

Since I have used AGAT a lot for a string of files (iteratively correcting my input) and initially this did not appear, but subsequently appeared more and more frequently in the output, I thought it might be a corruption in my installation, so I installed 1.2 in a new environment, and the error persisted.

Attaching the log and the input file so you can see if you can reproduce this issue. [the orphan features in the GFF3 file are deliberate]

Cheers Rob

[WTw_withWp.agat.log](https://github.com/NBISweden/AGAT/files/12261985/WTw_withWp.agat.log) [WTw_withWp.gff3 copy.txt](https://github.com/NBISweden/AGAT/files/12261988/WTw_withWp.gff3.copy.txt)

Juke34 commented 1 year ago

Thank you for your feedback. I guess it does not affect how AGAT works. I will add a fix so that the message does not appear again.

Juke34 commented 1 year ago

I am working on silencing the message, but it actually reflects a deeper problem in the file you are working with. It comes from this feature:

pHB9    Manual  exon    38748   38784   .   +   .   ID=Qp-exon;Name=Qp;locus_tag=EBNAs

This feature has a locus_tag that determines which mRNA it should be linked to (the priority is ID/gene_id relationship > locus_tag > sequential). But this locus_tag is used in many places: many of your genes share the same locus tag, which is awkward, so AGAT does not know which one to attach the exon to. Worse, you have locus_tag only on level1 (gene) and level3 (cds/exon) features, so because you have many mRNAs, AGAT does not know which one it must link the exon to.
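To make the ambiguity concrete, here is a minimal Python sketch that flags features which have no `Parent` and whose `locus_tag` is shared by more than one level1 (gene) feature. It is not AGAT's code and does not reproduce its linking logic. The function names and the two gene lines are hypothetical and invented for illustration; only the exon line comes from this thread. It assumes tab-separated GFF3 columns.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Turn 'ID=x;Name=y' into {'ID': 'x', 'Name': 'y'}."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def ambiguous_locus_tag_children(gff_lines, level1_types=("gene",)):
    """Return (feature ID, locus_tag, candidate gene IDs) for every
    non-level1 feature that has no Parent attribute and whose locus_tag
    is carried by more than one level1 feature."""
    level1_by_tag = defaultdict(list)
    orphans = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF3 feature line
        attrs = parse_attributes(cols[8])
        tag = attrs.get("locus_tag")
        if tag is None:
            continue
        if cols[2] in level1_types:
            level1_by_tag[tag].append(attrs.get("ID"))
        elif "Parent" not in attrs:
            orphans.append((attrs.get("ID"), tag))
    return [
        (feature_id, tag, level1_by_tag[tag])
        for feature_id, tag in orphans
        if len(level1_by_tag[tag]) > 1
    ]


SAMPLE_LINES = [
    # Hypothetical gene lines (IDs and coordinates invented for illustration).
    "pHB9\tManual\tgene\t38000\t39000\t.\t+\t.\tID=geneA;locus_tag=EBNAs",
    "pHB9\tManual\tgene\t38500\t40000\t.\t+\t.\tID=geneB;locus_tag=EBNAs",
    # The exon quoted in this thread.
    "pHB9\tManual\texon\t38748\t38784\t.\t+\t.\tID=Qp-exon;Name=Qp;locus_tag=EBNAs",
]

for feature_id, tag, genes in ambiguous_locus_tag_children(SAMPLE_LINES):
    print(f"{feature_id}: locus_tag={tag} matches {len(genes)} genes: {genes}")
```

Running a check like this on an input file before passing it to AGAT would list the features that can only be placed by guessing; giving each of them an explicit `Parent` removes the ambiguity.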

robertewhite commented 1 year ago

Hi Jacques

Thanks for the reply, and for looking into this.

I cannot tell if this issue is because I am misusing the annotation terms, or if the biology I am trying to represent is too complicated. Essentially, the locus I am working with makes 7 different proteins, with seven or eight different polyA sites and 3 or 8 different promoters (depending on how you count the one that is repeated in a repetitive genome region), and these proteins are separated by alternative splicing (and in some cases may be bicistronic). This is why I used locus tag rather than gene.

As it happens, this promoter is only used for one protein [EBNA1] (and one mRNA, with which the promoter should be contiguous), so I can perhaps alter this annotation.

Any recommendations on how we should annotate this locus (or the promoters) to avoid this sort of unanticipated glitch? For gene-level estimates, we really want to have an annotation that defines which protein is made from which transcript, despite the high degree of overlap.

Cheers Rob

Dr Rob White, Senior Lecturer in Virology, Imperial College London, Section of Virology, St Mary's Hospital Medical School Building, Norfolk Place, London W2 1PG

tel: 0207 594 1124, www.ebv.org.uk, www.imperial.ac.uk/people/robert.e.white/

Juke34 commented 1 year ago

Hi, it is up to you to define one gene or several. Overlaps between genes are allowed. In any case, each defined gene (e.g. locus) can have multiple mRNA/RNA features (which can be considered isoforms). They may differ by start/stop position and/or by splicing events.

Take care to define your features consistently. For example, some exon features have a `locus_tag` attribute, some have a `Parent` attribute, and some have both. This exon:

pHB9    Manual  exon    38748   38784   .   +   .   ID=Qp-exon;Name=Qp;locus_tag=EBNAs

has no `Parent` attribute but has a `locus_tag`, while another exon like this one:

pHB9    Manual  exon    38556   38784   .   +   .   ID=Fp-exon;Name=Fp;Parent=mRNA_Fp_EBNA1

has a `Parent` attribute but no `locus_tag`, and other exons like this one:

pHB9    Manual  exon    21605   21665   .   +   .   ID=W1pr6;label=Exon W1'.6;locus_tag=EBNAs;Parent=mRNA_Wp1E2_LP1W

have both `locus_tag` and `Parent` attributes.
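One way to audit a file for this kind of inconsistency is to count how each exon is linked. The snippet below is a rough Python sketch, not part of AGAT; the function names are hypothetical. It uses the three exons quoted above and assumes the columns are tab-separated, as the GFF3 format requires (the `label` attribute contains a space, so splitting on whitespace would not work).

```python
from collections import Counter


def linking_style(attributes_column):
    """Classify a feature by which linking attributes its 9th column carries."""
    keys = {
        pair.split("=", 1)[0].strip()
        for pair in attributes_column.split(";")
        if "=" in pair
    }
    has_parent = "Parent" in keys
    has_tag = "locus_tag" in keys
    if has_parent and has_tag:
        return "both"
    if has_parent:
        return "Parent only"
    if has_tag:
        return "locus_tag only"
    return "neither"


def summarise_linking(gff_lines, feature_type="exon"):
    """Count linking styles among features of the given type."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == feature_type:
            counts[linking_style(cols[8])] += 1
    return counts


# The three exons from this thread, written with explicit tab separators.
EXON_LINES = [
    "pHB9\tManual\texon\t38748\t38784\t.\t+\t.\tID=Qp-exon;Name=Qp;locus_tag=EBNAs",
    "pHB9\tManual\texon\t38556\t38784\t.\t+\t.\tID=Fp-exon;Name=Fp;Parent=mRNA_Fp_EBNA1",
    "pHB9\tManual\texon\t21605\t21665\t.\t+\t.\tID=W1pr6;label=Exon W1'.6;"
    "locus_tag=EBNAs;Parent=mRNA_Wp1E2_LP1W",
]

for style, count in sorted(summarise_linking(EXON_LINES).items()):
    print(f"{style}: {count}")
```

If more than one style shows up for the same feature type, the file mixes linking conventions; settling on `Parent` everywhere is the least ambiguous choice.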