NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0

GN tag and agat_sp_extract_attributes.pl #409

Closed DiegoSafian closed 11 months ago

DiegoSafian commented 1 year ago

Hi, I have a question/request. I need to create my own database to use with agat_sp_manage_functional_annotation.pl. As I understand it, agat_sp_manage_functional_annotation.pl uses the GN tag in the database to get the gene name and add it to the functionally annotated GFF3. Is there a fast way to create a database with the GN tag using agat_sp_extract_attributes.pl (assuming the GFF3 is already annotated with gene_name), so that I can then use agat_sp_manage_functional_annotation.pl?

Thanks, Diego

Juke34 commented 1 year ago

Right, you need to create a FASTA file with the information in the header. At a minimum, something like this: >UniqueIdentifier description of function of the protein or unknown GN=gen1

You can take any protein from the UniProt DB as an example, e.g.:

>sp|P06611|BTUD_ECOLI Vitamin B12 import ATP-binding protein BtuD OS=Escherichia coli (strain K12) OX=83333 GN=btuD PE=1 SV=1
MSIVMQLQDVAESTRLGPLSGEVRAGEILHLVGPNGAGKSTLLARMAGMTSGKGSIQFAG
QPLEAWSATKLALHRAYLSQQQTPPFATPVWHYLTLHQHDKTRTELLNDVAGALALDDKL
GRSTNQLSGGEWQRVRLAAVVLQITPQANPAGQLLLLDEPMNSLDVAQQSALDKILSALC
QQGLAIVMSSHDLNHTLRHAHRAWLLKGGKMLASGRREEVLTPPNLAQAYGMNFRRLDIE
GHRMLISTI

I guess what you want to achieve is to lift the functional information over from another GFF3 annotation. There is no easy way to do that with AGAT. You can indeed extract the information you want using agat_sp_extract_attributes.pl, but you will have to code something yourself (using bash, python, awk...) to reconcile the extracted information with the FASTA sequences based on their identifiers. I think the easiest solution would be to implement an option within agat_sp_extract_sequence.pl that creates a protein FASTA file with a specific attribute placed in the GN tag of the FASTA header.
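The reconciliation step described above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the extracted attributes are assumed to have been arranged into a two-column, tab-separated table (identifier, then gene name), which may not match the exact output layout of agat_sp_extract_attributes.pl, and the function names `load_gene_names` and `add_gn_tag` are hypothetical helpers, not part of AGAT.

```python
def load_gene_names(table_lines):
    """Read 'identifier<TAB>gene_name' lines into a dict.

    The two-column layout is an assumption; adapt the parsing to
    whatever table you actually produce from the GFF3 attributes.
    """
    names = {}
    for line in table_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) >= 2 and fields[1]:
            names[fields[0]] = fields[1]
    return names


def add_gn_tag(fasta_lines, names):
    """Yield FASTA lines, appending GN=<name> to matching headers.

    The identifier is taken as the first whitespace-delimited token
    after '>'. Headers that already carry a GN= tag are left alone.
    """
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">") and " GN=" not in line:
            parts = line[1:].split(None, 1)
            seq_id = parts[0] if parts else ""
            if seq_id in names:
                line = f"{line} GN={names[seq_id]}"
        yield line
```

To apply it to files, open the table and the FASTA, pass the file objects to the two functions, and write each yielded line followed by a newline to the output file. Headers whose identifier is not in the table are passed through unchanged, so it is worth counting how many were actually tagged.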

DiegoSafian commented 1 year ago

Thanks. I do not understand why agat_sp_manage_functional_annotation.pl is not taking the GN= from the headers of my FASTA. This is an example of one header:

>ENSDARG00000093596|ENSDARG00000063895 NADH dehydrogenase 1%2C mitochondrial Source:ZFIN%3BAcc:ZDB-GENE-011205-7 GN=mt-nd1
MLDILTSHLINPLAYAVPVLIAVAFLTLVERKVLGYIQLRKGPNVIGPRGLLQSVADGVK
LFIKEPIRPSMASPILFLTAPVLALILAIMLWAPIPMPYPVLDLNLGILFIIAISSLAVY
SILGSG*ASNSKYALIGALRAVAQTISYEVSLGLILLSAVIFSGGYTLQTFNTTQEDT*L
LLPL*PLAII*FISTLAETNRAPFDLTEGESELVSGFNVEYAAGPFALFFLAEYSNILLI
NTHSTVLFLGASFTPDAPELITISIATKTAILSILFL*IRASYPRFRYDQLMHLIWKNFL
PITLVLVL*HIALPIALAGLPPQT*

The error report is attached (error.txt). Is it because my headers lack PE=1?

Juke34 commented 1 year ago

Right, I forgot that we check for the protein existence (PE) attribute in the FASTA header: https://www.uniprot.org/help/protein_existence Add a fake value, but make sure it does not cause your entry to be skipped, because there is a filter on this value. To be on the safe side, use the value 1.

So add PE=1 in the fasta header...

DiegoSafian commented 1 year ago

Thanks. I had to add both PE=1 and SV=1, and it worked that way. Thanks again.
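For anyone with the same problem, the fix that worked in this thread (adding PE=1 and SV=1 to headers that lack them) can be scripted. The sketch below is a minimal Python example, not an AGAT feature: `ensure_tags` is a hypothetical helper name, and appending the tags at the end of the header, in the UniProt order PE then SV, is an assumption about what the parser accepts.

```python
import re


def ensure_tags(header, defaults=(("PE", "1"), ("SV", "1"))):
    """Append KEY=VALUE tags to a FASTA header if they are missing.

    Non-header lines (sequence lines) are returned unchanged.
    Tags already present keep their existing value.
    """
    if not header.startswith(">"):
        return header
    for key, value in defaults:
        # Match the tag only as a whole token, e.g. ' PE=' but not 'TYPE='.
        if not re.search(rf"(?:^|\s){key}=", header):
            header = f"{header} {key}={value}"
    return header
```

Calling `ensure_tags` on every line of the FASTA (after stripping the trailing newline) and writing the results to a new file should leave sequences and complete UniProt-style headers untouched while completing custom headers such as the one shown earlier in this thread.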