Genome names do not match annotation

AuReMe / emapper2gbk

Convert GFF, fastas, annotation table and species name into Genbank.

GNU Lesser General Public License v3.0


Genome names do not match annotation #13

Open Marlinski95 opened 2 years ago

Marlinski95 commented 2 years ago

Hi, I am trying to convert my emapper annotations into GenBank format using your tool. I have the following directories set up:

ANNOTATION/ FASTAPROT/ FASTNUCLEIC/ GENBANK/ GFF/ HITS/ ORTHOLOGS/

(emapper2gbk) [mjensen2$] ls FASTNUCLEIC/
BC-1_bin.100.fna BC-1_bin.116.fna BC-1_bin.14.fna etc.

(emapper2gbk) [mjensen2$] ls FASTAPROT/
BC-1_bin.100.emapper.genepred.faa BC-1_bin.116.emapper.genepred.faa BC-1_bin.14.emapper.genepred.faa etc.

(emapper2gbk) [mjensen2$] ls ANNOTATION/
BC-1_bin.100.emapper.annotations BC-1_bin.116.emapper.annotations BC-1_bin.14.emapper.annotations etc.

When I run the following command, however, I get an error saying that the genome names do not match the annotation names.

(emapper2gbk) [mjensen2$] emapper2gbk genes -fn ./FASTNUCLEIC/ -fp ./FASTAPROT/ -o ./GENBANK/ -a ./ANNOTATION/ -c 10 -n BC-1 -go gobasic -g ./GFF/

Since I assumed the filenames were not the problem, I checked the file contents and noticed that emapper added an extra number to each identifier when it predicted and annotated the genes, e.g.

Contig ID: >bin.1.fak127_1021
Prot ID: >bin.1.fak127_1021_1
Annotation ID: bin.1.fak127_1021_1

I believe this is the problem but I don't know how to work around this as this is something emapper added. Have you encountered this before? I might just be missing a flag of some sort but I am unsure and would appreciate your help!

Cheers, Marlene

ArnaudBelcour commented 2 years ago

Hi @Marlinski95,

When a directory is given as input to emapper2gbk, the tool expects the files for the same organism in the ANNOTATION/FASTAPROT/FASTNUCLEIC folders to have the same name (apart from the extension). It seems that this is not the case for your data.

Your input seems to be:

FASTNUCLEIC
    ├── BC-1_bin.100.fna
    ├── BC-1_bin.116.fna
    ├── ...
FASTAPROT
    ├── BC-1_bin.100.emapper.genepred.faa
    ├── BC-1_bin.116.emapper.genepred.faa
    ├── ...
ANNOTATION
    ├── BC-1_bin.100.emapper.annotations
    ├── BC-1_bin.116.emapper.annotations
    ├── ...

With these names, emapper2gbk cannot map the different files to the same organism and returns an error. The files should be named like this:

FASTNUCLEIC
    ├── BC-1_bin.100.fna
    ├── BC-1_bin.116.fna
    ├── ...
FASTAPROT
    ├── BC-1_bin.100.faa
    ├── BC-1_bin.116.faa
    ├── ...
ANNOTATION
    ├── BC-1_bin.100.tsv
    ├── BC-1_bin.116.tsv
    ├── ...

Renaming the files should fix this issue.
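A minimal sketch of such a renaming step in Python, for illustration only: it is not part of emapper2gbk. The suffix table covers only the two emapper suffixes seen in this thread, and the function names `normalized_name` and `rename_emapper_outputs` are hypothetical helpers. By default it only reports what it would rename.

```python
from pathlib import Path

# Suffixes seen in this thread, mapped to the plain extensions shown above.
# Assumption: other emapper runs may produce different suffixes.
SUFFIX_MAP = {
    ".emapper.genepred.faa": ".faa",
    ".emapper.annotations": ".tsv",
}


def normalized_name(filename: str) -> str:
    """Return the file name with a known emapper suffix replaced, if one matches."""
    for old_suffix, new_suffix in SUFFIX_MAP.items():
        if filename.endswith(old_suffix):
            return filename[: -len(old_suffix)] + new_suffix
    return filename


def rename_emapper_outputs(folder: str, dry_run: bool = True) -> list[tuple[str, str]]:
    """Rename matching files in `folder`; with dry_run=True only report the plan."""
    planned = []
    for path in sorted(Path(folder).iterdir()):
        new_name = normalized_name(path.name)
        if path.is_file() and new_name != path.name:
            planned.append((path.name, new_name))
            if not dry_run:
                path.rename(path.with_name(new_name))
    return planned
```

Calling `rename_emapper_outputs("FASTAPROT")` lists the planned (old, new) name pairs; passing `dry_run=False` performs the renaming. Working on a copy of the data first is the safer choice.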

Best Regards, Arnaud Belcour.

Marlinski95 commented 2 years ago

Oh I see! Thanks so much. So it is always necessary to rename the emapper output. Thanks for the quick help. I have fixed the file names now, but there is still an issue with the .gff files.

When I run the command above it now gives me the following error:

My gff files are renamed as well and now look like this:

(emapper2gbk) [mjensen2@kleinerserver BC-1]$ ls GFF
BC-1_bin.100.gff BC-1_bin.116.gff BC-1_bin.14.gff etc.

The instructions on the user page are not entirely clear "the GFF file corresponding to the genome or a folder containing multiple GFF files (must be the same name as the nucleotide folder).". Does this mean the gff directory has to be in the nucleotide directory (when I have anything but the .fna files in there it complains)? Could you clarify?

Thanks again!

ArnaudBelcour commented 2 years ago

Can you give me the complete error message and the command you used? It seems that the path ./gff/ has been associated with the -go option (option used to select/download the go-basic.obo file to process Gene Ontology Terms) instead of the -g option (to handle GFF).

> My gff files are renamed as well and now look like this:
>
> (emapper2gbk) [mjensen2@kleinerserver BC-1]$ ls GFF
> BC-1_bin.100.gff BC-1_bin.116.gff BC-1_bin.14.gff etc.

This GFF folder seems to be correct and should not produce an error.

> The instructions on the user page are not entirely clear "the GFF file corresponding to the genome or a folder containing multiple GFF files (must be the same name as the nucleotide folder).". Does this mean the gff directory has to be in the nucleotide directory (when I have anything but the .fna files in there it complains)? Could you clarify?

Sorry, there is a typo in it; I will fix it. The correct sentence is: the GFF file corresponding to the genome or a folder containing multiple GFF files (each GFF file must have the same name as the corresponding nucleotide file). In other words, as for the FASTAPROT and ANNOTATION folders, the names of the files in the GFF folder must be the same as the names of the files in the FASTNUCLEIC folder.

So something like this:

FASTNUCLEIC
    ├── BC-1_bin.100.fna
    ├── BC-1_bin.116.fna
    ├── ...
FASTAPROT
    ├── BC-1_bin.100.faa
    ├── BC-1_bin.116.faa
    ├── ...
ANNOTATION
    ├── BC-1_bin.100.tsv
    ├── BC-1_bin.116.tsv
    ├── ...
GFF
    ├── BC-1_bin.100.gff
    ├── BC-1_bin.116.gff
    ├── ...

The GFF folder is an independent folder (like FASTAPROT and ANNOTATION), so it must not be placed inside the nucleotide folder. The location of the GFF folder is given to emapper2gbk with the option -g when using emapper2gbk genomes.
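The naming rule can be checked before running the tool. The following is a rough Python sketch, not something emapper2gbk provides: `base_names` and `find_mismatches` are hypothetical helpers, and the extensions are the ones used in the layout above.

```python
from pathlib import Path


def base_names(folder: str, extension: str) -> set[str]:
    """Collect file names in `folder` ending with `extension`, minus the extension."""
    return {
        p.name[: -len(extension)]
        for p in Path(folder).iterdir()
        if p.is_file() and p.name.endswith(extension)
    }


def find_mismatches(folders: dict[str, str]) -> dict[str, set[str]]:
    """Map each folder to the base names found elsewhere but missing from it.

    `folders` maps a folder path to the extension expected inside it.
    An empty result means every folder holds the same set of base names.
    """
    names = {folder: base_names(folder, ext) for folder, ext in folders.items()}
    all_names = set().union(*names.values())
    return {
        folder: all_names - found
        for folder, found in names.items()
        if all_names - found
    }
```

For the layout above, the call would be `find_mismatches({"FASTNUCLEIC": ".fna", "FASTAPROT": ".faa", "ANNOTATION": ".tsv", "GFF": ".gff"})`. A file still named `BC-1_bin.100.emapper.genepred.faa` would show up as a mismatch because its base name differs from `BC-1_bin.100`.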

Marlinski95 commented 2 years ago

Hello, thanks for the clarification and extensive response! I truly appreciate it. I have now realized that the reason it wasn't working was that I ran the "genes" mode instead of the "genomes" mode. My apologies, rookie mistake. It ran now, but I received this error message for every single bin:

Creating GFF database (gffutils) for BC-1_bin.15
/!\ Error with BC-1 this taxa has not been found in https://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/scientific-name/
/!\ Check the name of the taxa and its presence in the EBI taxonomy database.
/!\ No genbank will be created for BC-1.
/!\ Only 0 on 127 genbanks have been created, check the logs for error.
--- Total runtime 32.14 seconds ---

Am I still missing something? I know that my taxonomic resolution isn't very high, since we suspect a lot of Candidate species in my samples, but I don't think I entirely understand how this is tied to reformatting the data.

Thanks a thousand for your help and time! Best,

ArnaudBelcour commented 2 years ago

Hi,

The issue here is that BC-1 is too precise a taxonomic resolution for the taxonomy database.

A search on https://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/scientific-name/BC-1 shows no results.

You should use a higher taxonomic rank (either species or genus). You can check whether a name is found by appending the taxon name to the address https://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/scientific-name/.

For example https://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/scientific-name/Escherichia%20coli (The %20 here is to replace a space between the genus and the species names.)
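Building that address can be sketched in Python with the standard library. This is only an illustration: `taxon_lookup_url` is a hypothetical helper, and the base address is the one quoted above.

```python
from urllib.parse import quote

# Base address quoted in this thread.
ENA_TAXON_URL = "https://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/scientific-name/"


def taxon_lookup_url(taxon_name: str) -> str:
    """Append the percent-encoded taxon name (spaces become %20) to the address."""
    return ENA_TAXON_URL + quote(taxon_name.strip())
```

`taxon_lookup_url("Escherichia coli")` gives the example address above; opening the resulting address in a browser shows whether the name is known to the database.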

Do you have only one taxon for all your data? If not, you can provide a taxonomy file with the option -nf. This option takes as input a .tsv file with two columns (the first is the name of the organism and the second is the name of the corresponding taxon). For example:


BC-1_bin.100	Genus species
BC-1_bin.116	Escherichia coli
...	...

In this way you can give the specific taxon associated with each genome.
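Such a file can be written by hand or generated. Below is a small Python sketch under stated assumptions: `write_taxon_file` is a hypothetical helper, the taxon assignments are placeholders to be replaced with the real ones, and no header line is written because the example above shows none (whether the tool accepts or expects a header is not stated in this thread).

```python
import csv


def write_taxon_file(path: str, taxon_by_organism: dict[str, str]) -> None:
    """Write one `organism<TAB>taxon` row per genome, without a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(taxon_by_organism.items())
```

For example, `write_taxon_file("taxa.tsv", {"BC-1_bin.116": "Escherichia coli"})` writes a single tab-separated row. The organism names are expected to match the file names without extension, as in the example above.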

Best regards, Arnaud Belcour.

Marlinski95 commented 2 years ago

Ahaa, I see! Thank you.

Hmm - those bins are from metagenomes so there is all kinds of stuff in there. I guess I could try to set it to -Bacteria- instead of BC-1 but not sure if that would fix the problem. I'll play around with it - thank you!
