agat_sp_manage_functional_annotation errors thrown with uniprot DB

sarjopp commented 3 years ago

I tried using the uniprot-sprot proteins as my DB for blast but when I did agat_sp_manage_functional_annotation I got "ERROR P20825 not found among the db! You probably didn't give to me the same fasta file than the one used for the blast."

My makeblastdb command: makeblastdb -in uniprot_sprot_canonical_and_isoform_v2021_03.fasta -dbtype prot -parse_seqids -taxid_map uniprot-taxid-map.tab -out uniprot_sprot_canonical_and_isoform_v2021_03.DB

An example seq header from the fasta file:

sp|A9JR22|Z_TAMVU RING finger protein Z OS=Tamiami mammarenavirus (isolate Rat/United States/W 10777/1964) OX=45223 GN=Z PE=3 SV=1

My agat_sp_manage_functional_annotation command: agat_sp_manage_functional_annotation.pl --gff strict_filter_A_atri_tm2_pnm_braker_plus_EDTA.gff3 --blast strict_filter_A_atri_tm2_pnm_braker_plus_EDTA.proteins_2uniprot.tab --db ~/mendel-nas1/EvolAppl/DBs/uniprot_sprot_canonical_and_isoform_v2021_03.fasta

Can you see any obvious issues?

Thanks, Sara

sarjopp commented 3 years ago

In case the "parse_seqids" flag is the issue, I am re-running the searches against a non-parsed version of the DB.

sarjopp commented 3 years ago

Okay, that worked! For anyone having similar problems, do not use -parse_seqids when creating your blastdb if you want to use the results with agat_sp_manage_functional_annotation. I used it because I wanted to associate a taxid map with the blastdb, which requires parse_seqids.

sarjopp commented 3 years ago

OK, I spoke too soon as I am still having some issues. Although agat is now correctly reading much of the uniprot sequence info, I am still getting errors for ~30% of the uniprot proteins:

977 instances of "Header from the db fasta file doesn't match the regular expression" here is one example: sp|D8KY57|SSP_BOMIG Probable salivary secreted peptide OS=Bombus ignitus OX=130704 PE=2 SV=1
1732 instances of "No Protein Existence (PE) information in this header" here is one example: sp|O95718-2|ERR2_HUMAN Isoform 2 of Steroid hormone receptor ERR2 OS=Homo sapiens OX=9606 GN=ESRRB PE=2

As far as I can tell, these headers are fine and formatted exactly like the ones that succeeded. Can you spot any issues?

Thanks, Sara

sarjopp commented 3 years ago

To further update, I used the script's regex on my headers and all seem to fit the required pattern: grep -P "(^[^\s]+)\s(.+?(?= \w{2}=))(.+)" header_test

sp|P04146-2|COPIA_DROME Isoform Short of Copia protein OS=Drosophila melanogaster OX=7227 GN=GIP PE=2 sp|Q17058|MAL1_APIME Alpha-glucosidase OS=Apis mellifera OX=7460 PE=1 SV=1 sp|P41496|FABPM_SCHGR Fatty acid-binding protein, muscle OS=Schistocerca gregaria OX=7010 PE=1 SV=2 sp|P10978|POLX_TOBAC Retrovirus-related Pol polyprotein from transposon TNT 1-94 OS=Nicotiana tabacum OX=4097 PE=2 SV=1 sp|P42279|TRYU_DROME Trypsin eta OS=Drosophila melanogaster OX=7227 GN=etaTry PE=2 SV=2

However, of the examples above, only the last (sp|P42279|TRYU_DROME) was parsed successfully by agat, the others were flagged "Header from the db fasta file doesn't match the regular expression." I am very frustrated and I'm sure I'm missing something obvious!

Sara

Juke34 commented 3 years ago

Hi @sarjopp, thank you for using AGAT and providing your feedbacks.

977 instances of "Header from the db fasta file doesn't match the regular expression" here is one example: sp|D8KY57|SSP_BOMIG Probable salivary secreted peptide OS=Bombus ignitus OX=130704 PE=2 SV=1

This message is wrongly located in the code (should be one level up), the proper warning should be: Gene Name (GN=) is missing from the header. It does not change anything at the result but we should update that to be clearer.

1732 instances of "No Protein Existence (PE) information in this header" here is one example: sp|O95718-2|ERR2_HUMAN Isoform 2 of Steroid hormone receptor ERR2 OS=Homo sapiens OX=9606 GN=ESRRB PE=2

From the example you show the warning is wrong. You have the PE attribute in the header. I checked the REGEX which does not take into account the case where PE is the last element of the Header. So it has to be fixed. (/PE=([1-5])\s/ => /PE=([1-5])/)

sarjopp commented 3 years ago

I had edited by hand to add a space after PE=X in my headers, but thank you! The issue only arose for sequences from uniprot's isoforms database. Since those don't have PE values assigned, I manually added "PE=2" on the assumption that the existence of an alternate isoform implies that the protein has experimental evidence at transcript level.

Thank you for agat, I am finding it so useful!

Sara

sjannielefevre commented 2 months ago

I am having the same problem, that I get the error below for most genes:

ERROR sp|O70167|P3C2G_MOUSE not found among the db! You probably didn't give to me the same fasta file than the one used for the blast. (l2=ccar_ua06-g6085.t1)

In the .out file it says:

2 gene names have been retrieved in the blast file. 2 gene names have been successfully inferred.
Among them there are 0 names that are shared at least per two genes for a total of 0 genes.

56 entries in /swissprot/uniprot_sprot_release_2024_01.fasta have no GN

0 mRNA entries in gff output have no GN

But when I look for the entry in the same fasta file I just used as input for blast, it finds the header:

grep "sp|O70167|P3C2G_MOUSE" uniprot_sprot_release_2024_01.fasta

Returns: >sp|O70167|P3C2G_MOUSE Phosphatidylinositol 3-kinase C2 domain-containing subunit gamma OS=Mus musculus OX=10090 GN=Pik3c2g PE=1 SV=1

Command I am running: singularity run --cleanenv -B ${DATA}:/data,${BLAST}:/blast,${SWISSPROT}:/swissprot,${GFF}:/gff /cluster/projects/nn8014k/programs/AGAT_singularity/agat_1.4.0--pl5321hdfd78af_0.sif bash -c "cd /data && agat_sp_manage_functional_annotation.pl -f /gff/${NAMES[${SLURM_ARRAY_TASK_ID}]}.agatfix.fixframe.gff3 -b /blast/${NAMES[${SLURM_ARRAY_TASK_ID}]}.blastp.swissprot.outfmt6 -db /swissprot/uniprot_sprot_release_2024_01.fasta -o ${NAMES[${SLURM_ARRAY_TASK_ID}]}_annotated"

Any help would be appreciated.

NBISweden / AGAT

agat_sp_manage_functional_annotation errors thrown with uniprot DB #157