EBI-Metagenomics / genome_uploader

Python script to upload bins and MAGs to ENA (European Nucleotide Archive)
Apache License 2.0

Error submitting genomes? #4

Closed SilasK closed 1 year ago

SilasK commented 1 year ago

I tried to upload some genomes and encountered the following error:

Request failed https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/ with error 400 Client Error:  for url: https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/
        Retrieving project and run info from ENA (this might take a while)...
        A backup file has been found.
        Writing genome registration XML...
        All files have been written in mMAG2_single_runs/MAG_upload
registered_MAGs_test.tsv
Registering genome samples XMLs...
        Registering sample xml in test mode.
        Genomes could not be submitted to ENA. Please, check the errors below.
        Failed to validate sample xml, error: Invalid decimal value: expected at least one digit

The error message doesn't tell me where the error lies. I thought it was due to the metagenome column, where I had: mouse gut metagenome

I replaced it with the number 410661, but then I got the following error:

ERROR: metagenomes associated with each genome need to belong to ENA's approved metagenomes list.

It might also be due to the fact that I have 0 in the genome_coverage column; see #2.

Input table

genome_name | run_accessions | assembly_software | binning_software | binning_parameters | stats_generation_software | completeness | contamination | rRNA_presence | NCBI_lineage | metagenome | co-assembly | genome_coverage | genome_path | broad_environment | local_environment | environmental_medium
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
MGG00002 | ERR1989816 | metaSpades v3.13 | metagenome-atlas v2.3 | default | checkM v1.1 | 84.91 | 1.32 | FALSE | d__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Porphyromonadaceae;g__;s__ | 410661 | FALSE | 0 | genomes/MGG00002.fasta.gz | Host-associated | Mouse digestive system | Cecum
MGG00003 | ERR1989816 | metaSpades v3.13 | metagenome-atlas v2.3 | default | checkM v1.1 | 95.7 | 1.08 | FALSE | d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__;f__;g__;s__ | 410661 | FALSE | 0 | genomes/MGG00003.fasta.gz | Host-associated | Mouse digestive system | Cecum
MGG00005 | ERR1989816 | metaSpades v3.13 | metagenome-atlas v2.3 | default | checkM v1.1 | 95.91 | 0 | FALSE | d__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Rikenellaceae;g__Alistipes;s__ | 410661 | FALSE | 0 | genomes/MGG00005.fasta.gz | Host-associated | Mouse digestive system | Cecum
MGG00007 | ERR1989816 | metaSpades v3.13 | metagenome-atlas v2.3 | default | checkM v1.1 | 95.45 | 1.72 | FALSE | d__Bacteria;p__Firmicutes;c__Clostridia;o__Eubacteriales;f__Lachnospiraceae;g__;s__ | 410661 | FALSE | 0 | genomes/MGG00007.fasta.gz | Host-associated | Mouse digestive system | Cecum
MGG00008 | ERR1989816 | metaSpades v3.13 | metagenome-atlas v2.3 | default | checkM v1.1 | 98.92 | 0 | FALSE | d__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Odoribacteraceae;g__Odoribacter;s__ | 410661 | FALSE | 0 | genomes/MGG00008.fasta.gz | Host-associated | Mouse digestive system | Cecum
MGG00009 | ERR1989816 | metaSpades v3.13 | metagenome-atlas v2.3 | default | checkM v1.1 | 95.09 | 0 | FALSE | d__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Muribaculaceae;g__Muribaculum;s__Muribaculum intestinale | 410661 | FALSE | 0 | genomes/MGG00009.fasta.gz | Host-associated | Mouse digestive system | Cecum
MGG00010 | ERR1989816 | metaSpades v3.13 | metagenome-atlas v2.3 | default | checkM v1.1 | 94.18 | 0.22 | FALSE | d__Bacteria;p__Firmicutes;c__Clostridia;o__Eubacteriales;f__Christensenellaceae;g__Christensenella;s__ | 410661 | FALSE | 0 | genomes/MGG00010.fasta.gz | Host-associated | Mouse digestive system | Cecum
MGG00011 | ERR1989816 | metaSpades v3.13 | metagenome-atlas v2.3 | default | checkM v1.1 | 97.28 | 0.57 | FALSE | d__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Muribaculaceae;g__;s__ | 410661 | FALSE | 0 | genomes/MGG00011.fasta.gz | Host-associated | Mouse digestive system | Cecum
MGG00012 | ERR1989816 | metaSpades v3.13 | metagenome-atlas v2.3 | default | checkM v1.1 | 96.86 | 0.38 | FALSE | d__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Muribaculaceae;g__;s__ | 410661 | FALSE | 0 | genomes/MGG00012.fasta.gz | Host-associated | Mouse digestive system | Cecum

SilasK commented 1 year ago

After some trial and error, it seems that putting mouse gut metagenome in the metagenome column is the right thing to do. Otherwise, I get the error "ERROR: metagenomes associated with each genome need to belong to ENA's approved metagenomes list." That brings me back to my initial error.

SilasK commented 1 year ago

If I follow the URL indicated in the warning, https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/, I get to a page with { "error": "Search value must be provided." }
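The response above suggests the request reached the taxonomy service with no name after the final slash. One plausible way this can happen, offered only as an assumption and not as a description of what genome_uploader does internally, is a lineage whose last ranks are empty (for example `...;g__;s__`). The sketch below picks the deepest rank that actually has a name before building the URL. The helper names `deepest_named_rank` and `taxonomy_url` are hypothetical, and no network request is made.

```python
from urllib.parse import quote

# Endpoint taken from the log message above.
BASE_URL = "https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/"


def deepest_named_rank(lineage: str) -> str:
    """Return the name of the deepest non-empty rank, or '' if none exists.

    Hypothetical helper: expects fields such as 'g__Alistipes', where an
    unclassified rank is written as a bare prefix such as 'g__'.
    """
    names = []
    for field in lineage.split(";"):
        # Keep only what follows the '__' separator.
        _, _, name = field.strip().partition("__")
        names.append(name.strip())
    for name in reversed(names):
        if name:
            return name
    return ""


def taxonomy_url(name: str) -> str:
    """Build the lookup URL, refusing to build one for an empty name."""
    if not name:
        raise ValueError("empty name: the service would answer with a 400 error")
    return BASE_URL + quote(name)


lineage = ("d__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;"
           "f__Porphyromonadaceae;g__;s__")
print(taxonomy_url(deepest_named_rank(lineage)))
```

For the first genome in the table, this falls back to the family name instead of sending an empty species name.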

SilasK commented 1 year ago

My test table looks like this now:

genome_name | run_accessions | assembly_software | binning_software | binning_parameters | stats_generation_software | completeness | contamination | rRNA_presence | NCBI_lineage | metagenome | co-assembly | genome_coverage | genome_path | broad_environment | local_environment | environmental_medium
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
MGG00002 | ERR1989816 | metaSpades_v3.13 | metagenome_atlas_v2.3 | default | checkM_v1.1 | 84.91 | 1.32 | FALSE | d__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Porphyromonadaceae;g__;s__ | mouse gut metagenome | FALSE | 100 | genomes/MGG00002.fasta.gz | Host-associated | Mouse digestive system | Cecum
MGG00003 | ERR1989816 | metaSpades_v3.13 | metagenome_atlas_v2.3 | default | checkM_v1.1 | 95.7 | 1.08 | FALSE | d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__;f__;g__;s__ | mouse gut metagenome | FALSE | 100 | genomes/MGG00003.fasta.gz | Host-associated | Mouse digestive system | Cecum
MGG00005 | ERR1989816 | metaSpades_v3.13 | metagenome_atlas_v2.3 | default | checkM_v1.1 | 95.91 | 0 | FALSE | d__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Rikenellaceae;g__Alistipes;s__ | mouse gut metagenome | FALSE | 100 | genomes/MGG00005.fasta.gz | Host-associated | Mouse digestive system | Cecum
MGG00007 | ERR1989816 | metaSpades_v3.13 | metagenome_atlas_v2.3 | default | checkM_v1.1 | 95.45 | 1.72 | FALSE | d__Bacteria;p__Firmicutes;c__Clostridia;o__Eubacteriales;f__Lachnospiraceae;g__;s__ | mouse gut metagenome | FALSE | 100 | genomes/MGG00007.fasta.gz | Host-associated | Mouse digestive system | Cecum
MGG00008 | ERR1989816 | metaSpades_v3.13 | metagenome_atlas_v2.3 | default | checkM_v1.1 | 98.92 | 0 | FALSE | d__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Odoribacteraceae;g__Odoribacter;s__ | mouse gut metagenome | FALSE | 100 | genomes/MGG00008.fasta.gz | Host-associated | Mouse digestive system | Cecum

Ge94 commented 1 year ago

Hi Silas, as you say, the "Failed to validate sample xml, error: Invalid decimal value: expected at least one digit" error is unfortunately quite uninformative, as it doesn't point to a specific field. This error is returned by ENA at registration time, so it is difficult to parse it and provide any deeper insight. I took a look at your tsv and noticed that the error is probably generated in the "contamination" column, where some contamination values are set to 0. I believe ENA would expect it to be 0.0. I am going to add a check to the script between today and tomorrow.
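Since the ENA message does not name the offending field, a local pre-check of the numeric columns can narrow things down before registration. The following is a minimal sketch under stated assumptions, not the check that was added to genome_uploader: the column names come from the table in this thread, while `find_bad_decimals` and `with_decimal_point` are hypothetical helpers. Whether ENA actually requires `0.0` instead of `0` is the hypothesis stated above, and the sketch does not verify it.

```python
import csv
import io
from decimal import Decimal, InvalidOperation

# Columns from the input table that are expected to hold decimal numbers.
NUMERIC_COLUMNS = ("completeness", "contamination", "genome_coverage")


def parse_decimal(raw):
    """Parse a finite decimal; raise InvalidOperation for anything else."""
    value = Decimal(raw.strip())  # '' or 'abc' raises InvalidOperation
    if not value.is_finite():     # reject NaN and Infinity as well
        raise InvalidOperation(raw)
    return value


def find_bad_decimals(tsv_text, columns=NUMERIC_COLUMNS):
    """Return (genome_name, column, raw_value) for every unparsable cell."""
    problems = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        for column in columns:
            raw = row.get(column) or ""
            try:
                parse_decimal(raw)
            except InvalidOperation:
                problems.append((row.get("genome_name"), column, raw))
    return problems


def with_decimal_point(raw):
    """Render a valid number with an explicit decimal point ('0' -> '0.0')."""
    text = format(parse_decimal(raw), "f")
    return text if "." in text else text + ".0"


table = (
    "genome_name\tcompleteness\tcontamination\tgenome_coverage\n"
    "MGG00002\t84.91\t1.32\t0\n"
    "MGG00005\t95.91\t\tabc\n"
)
print(find_bad_decimals(table))
print(with_decimal_point("0"))
```

A report that lists genome name, column, and raw value gives the kind of pointer that the registration error lacks.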

I would therefore suggest restoring the original mouse gut metagenome value in the metagenome column, as it is the most accurate for your data.

About error 400 - this is actually a logging error and I will take care of removing it. Thanks for pointing it out.

Finally, I will reply about coverage in the other issue.

Ge94 commented 1 year ago

I looked at your logs again and noticed you are probably using the --xmls and --manifests options together. As a heads-up, unless your xmls need to be rewritten, the two options can be used separately. This definitely improves performance for large numbers of genomes.

SilasK commented 1 year ago

So do I run it first with xmls, or with manifests, or with neither of those?

Ge94 commented 1 year ago

--xmls generates the initial xml files; it has to be used for xmls to be generated or updated. Manifest generation is the next step, and it needs the xmls to exist in order to work. You can therefore use these options as you prefer, as long as the xml step is run at some point before manifest generation. Mine was just a suggestion in terms of performance: once your xmls are generated, you can omit the --xmls option and just use the --manifests one.
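The two-step workflow described above can be sketched as a small helper that builds the command line for each stage. This is only an illustration: the flags are the ones that appear in this thread, `build_command` is a hypothetical helper, and keeping every other argument (including `--force`) identical between the two runs is an assumption, not documented behaviour. The sketch only prints the commands and does not run them.

```python
import shlex

# Arguments shared by both stages; values are the placeholders from this thread.
BASE_ARGS = [
    "-u", "PRJNA646353",
    "--genome_info", "genome_uplod_table_test.txt",
    "--mags",
    "--out", "~/s/CMMG/",
    "--centre_name", "University of Geneva",
    "--webin", "Webin-XXX",
    "--password", "XXXXX",
    "--force",
]

# Which stage flag(s) to append for each way of running the script.
STAGE_FLAGS = {
    "xmls": ["--xmls"],                    # step 1: generate or update the xmls
    "manifests": ["--manifests"],          # step 2: needs existing xmls
    "both": ["--xmls", "--manifests"],     # single run, as in the original command
}


def build_command(stage, base_args=BASE_ARGS, script="genome_upload.py"):
    """Return the argument list for one stage of the upload."""
    if stage not in STAGE_FLAGS:
        raise ValueError(f"unknown stage: {stage!r}")
    return ["python", script, *base_args, *STAGE_FLAGS[stage]]


# Print the two commands in the order they would need to be run.
for stage in ("xmls", "manifests"):
    print(shlex.join(build_command(stage)))
```

Splitting the run this way means the xml step is only repeated when the xmls actually need to be rewritten.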

SilasK commented 1 year ago

I put 0.00 in contamination and also 10.99 in genome coverage, but I get the same error.

SilasK commented 1 year ago

My command:

python ~/CMMG/genome_upload.py -u PRJNA646353 \
--genome_info genome_uplod_table_test.txt  \
--mags --xmls --manifests --out ~/s/CMMG/ \
--centre_name 'University of Geneva' \
--webin Webin-XXX --password XXXXX \
--force

genome_uplod_table_test.txt

Ge94 commented 1 year ago

Hi Silas, thanks for providing the tsv file. It allowed me to identify the issue, which lies in the parsing of the taxonomy field. May I ask: is the taxonomy you provided in NCBI or GTDB format?

SilasK commented 1 year ago

It's GTDB, but then converted to NCBI using the majority vote script from GTDB-tk. I would prefer putting in the GTDB taxonomy, but you require the NCBI one, don't you? Are the empty genera/species a problem?

Ge94 commented 1 year ago

Hi Silas, then it's all good, I was just double-checking! But yes, ENA requires NCBI annotations for the submission. I have deployed the new version of the code, which should take care of the issues mentioned above. Let me know if any other problem comes up! Hopefully that won't be the case.

SilasK commented 1 year ago

It seems that I can run the script now.