NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0
442 stars 54 forks source link

agat_sp_manage_functional_annotation errors thrown with uniprot DB #157

Closed sarjopp closed 3 years ago

sarjopp commented 3 years ago

I tried using the uniprot-sprot proteins as my DB for blast but when I did agat_sp_manage_functional_annotation I got "ERROR P20825 not found among the db! You probably didn't give to me the same fasta file than the one used for the blast."

My makeblastdb command: makeblastdb -in uniprot_sprot_canonical_and_isoform_v2021_03.fasta -dbtype prot -parse_seqids -taxid_map uniprot-taxid-map.tab -out uniprot_sprot_canonical_and_isoform_v2021_03.DB

An example seq header from the fasta file:

sp|A9JR22|Z_TAMVU RING finger protein Z OS=Tamiami mammarenavirus (isolate Rat/United States/W 10777/1964) OX=45223 GN=Z PE=3 SV=1

My agat_sp_manage_functional_annotation command: agat_sp_manage_functional_annotation.pl --gff strict_filter_A_atri_tm2_pnm_braker_plus_EDTA.gff3 --blast strict_filter_A_atri_tm2_pnm_braker_plus_EDTA.proteins_2uniprot.tab --db ~/mendel-nas1/EvolAppl/DBs/uniprot_sprot_canonical_and_isoform_v2021_03.fasta

Can you see any obvious issues?

Thanks, Sara

sarjopp commented 3 years ago

In case the "parse_seqids" flag is the issue, I am re-running the searches against a non-parsed version of the DB.

sarjopp commented 3 years ago

Okay, that worked! For anyone having similar problems, do not use -parse_seqids when creating your blastdb if you want to use the results with agat_sp_manage_functional_annotation. I used it because I wanted to associate a taxid map with the blastdb, which requires parse_seqids.

sarjopp commented 3 years ago

OK, I spoke too soon as I am still having some issues. Although agat is now correctly reading much of the uniprot sequence info, I am still getting errors for ~30% of the uniprot proteins:

  1. 977 instances of "Header from the db fasta file doesn't match the regular expression" here is one example: sp|D8KY57|SSP_BOMIG Probable salivary secreted peptide OS=Bombus ignitus OX=130704 PE=2 SV=1
  2. 1732 instances of "No Protein Existence (PE) information in this header" here is one example: sp|O95718-2|ERR2_HUMAN Isoform 2 of Steroid hormone receptor ERR2 OS=Homo sapiens OX=9606 GN=ESRRB PE=2

As far as I can tell, these headers are fine and formatted exactly like the ones that succeeded. Can you spot any issues?

Thanks, Sara

sarjopp commented 3 years ago

To further update, I used the script's regex on my headers and all seem to fit the required pattern: grep -P "(^[^\s]+)\s(.+?(?= \w{2}=))(.+)" header_test

sp|P04146-2|COPIA_DROME Isoform Short of Copia protein OS=Drosophila melanogaster OX=7227 GN=GIP PE=2 sp|Q17058|MAL1_APIME Alpha-glucosidase OS=Apis mellifera OX=7460 PE=1 SV=1 sp|P41496|FABPM_SCHGR Fatty acid-binding protein, muscle OS=Schistocerca gregaria OX=7010 PE=1 SV=2 sp|P10978|POLX_TOBAC Retrovirus-related Pol polyprotein from transposon TNT 1-94 OS=Nicotiana tabacum OX=4097 PE=2 SV=1 sp|P42279|TRYU_DROME Trypsin eta OS=Drosophila melanogaster OX=7227 GN=etaTry PE=2 SV=2

However, of the examples above, only the last (sp|P42279|TRYU_DROME) was parsed successfully by agat, the others were flagged "Header from the db fasta file doesn't match the regular expression." I am very frustrated and I'm sure I'm missing something obvious!

Sara

Juke34 commented 3 years ago

Hi @sarjopp, thank you for using AGAT and providing your feedbacks.

  • 977 instances of "Header from the db fasta file doesn't match the regular expression" here is one example: sp|D8KY57|SSP_BOMIG Probable salivary secreted peptide OS=Bombus ignitus OX=130704 PE=2 SV=1

This message is wrongly located in the code (should be one level up), the proper warning should be: Gene Name (GN=) is missing from the header. It does not change anything at the result but we should update that to be clearer.

  • 1732 instances of "No Protein Existence (PE) information in this header" here is one example: sp|O95718-2|ERR2_HUMAN Isoform 2 of Steroid hormone receptor ERR2 OS=Homo sapiens OX=9606 GN=ESRRB PE=2

From the example you show the warning is wrong. You have the PE attribute in the header. I checked the REGEX which does not take into account the case where PE is the last element of the Header. So it has to be fixed. (/PE=([1-5])\s/ => /PE=([1-5])/)

sarjopp commented 3 years ago

I had edited by hand to add a space after PE=X in my headers, but thank you! The issue only arose for sequences from uniprot's isoforms database. Since those don't have PE values assigned, I manually added "PE=2" on the assumption that the existence of an alternate isoform implies that the protein has experimental evidence at transcript level.

Thank you for agat, I am finding it so useful!

Sara

sjannielefevre commented 2 months ago

I am having the same problem, that I get the error below for most genes:

ERROR sp|O70167|P3C2G_MOUSE not found among the db! You probably didn't give to me the same fasta file than the one used for the blast. (l2=ccar_ua06-g6085.t1)

In the .out file it says:

2 gene names have been retrieved in the blast file. 2 gene names have been successfully inferred.
Among them there are 0 names that are shared at least per two genes for a total of 0 genes.

56 entries in /swissprot/uniprot_sprot_release_2024_01.fasta have no GN

0 mRNA entries in gff output have no GN

But when I look for the entry in the same fasta file I just used as input for blast, it finds the header:

grep "sp|O70167|P3C2G_MOUSE" uniprot_sprot_release_2024_01.fasta

Returns: >sp|O70167|P3C2G_MOUSE Phosphatidylinositol 3-kinase C2 domain-containing subunit gamma OS=Mus musculus OX=10090 GN=Pik3c2g PE=1 SV=1

Command I am running: singularity run --cleanenv -B ${DATA}:/data,${BLAST}:/blast,${SWISSPROT}:/swissprot,${GFF}:/gff /cluster/projects/nn8014k/programs/AGAT_singularity/agat_1.4.0--pl5321hdfd78af_0.sif bash -c "cd /data && agat_sp_manage_functional_annotation.pl -f /gff/${NAMES[${SLURM_ARRAY_TASK_ID}]}.agatfix.fixframe.gff3 -b /blast/${NAMES[${SLURM_ARRAY_TASK_ID}]}.blastp.swissprot.outfmt6 -db /swissprot/uniprot_sprot_release_2024_01.fasta -o ${NAMES[${SLURM_ARRAY_TASK_ID}]}_annotated"

Any help would be appreciated.