[BUG] database installation error #103

mcahn commented 4 years ago

After installing metaphlan 3.0, and activating its conda environment, I ran:

metaphlan --install

This produces the following error:

File /tigress/MOLBIO/local/pythonenv/metaphlan3/lib/python3.6/site-packages/metaphlan/metaphlan_databases/file_list.txt already present!
Traceback (most recent call last):
  File "/tigress/MOLBIO/local/pythonenv/metaphlan3/bin/metaphlan", line 10, in <module>
  File "/tigress/MOLBIO/local/pythonenv/metaphlan3/lib/python3.6/site-packages/metaphlan/", line 1187, in main
    pars['index'] = check_and_install_database(pars['index'], pars['bowtie2db'], pars['bowtie2_build'], pars['nproc'], pars['force_download'])
  File "/tigress/MOLBIO/local/pythonenv/metaphlan3/lib/python3.6/site-packages/metaphlan/", line 610, in check_and_install_database
    download_unpack_tar(FILE_LIST, index, bowtie2_db, bowtie2_build, nproc)
  File "/tigress/MOLBIO/local/pythonenv/metaphlan3/lib/python3.6/site-packages/metaphlan/", line 463, in download_unpack_tar
    url_tar_file = ls_f["mpa_" + download_file_name + ".tar"]
KeyError: 'mpa_mpa_v30_CHOCOPhlAn_201901.tar'

Metaphlan was installed like this:

conda create -p /path/to/our/conda/envs/metaphlan3 -c bioconda metaphlan

The problem appears to be that, in download_unpack_tar, "mpa_" is getting prepended to the database names when they are used as keys in the ls_f dictionary. Removing these extra "mpa_" strings seems to solve the problem, like so:

<     tar_file = os.path.join(folder, "mpa_" + download_file_name + ".tar")
<     url_tar_file = ls_f["mpa_" + download_file_name + ".tar"]
>     tar_file = os.path.join(folder, download_file_name + ".tar")
>     url_tar_file = ls_f[download_file_name + ".tar"]
<     md5_file = os.path.join(folder, "mpa_" + download_file_name + ".md5")
<     url_md5_file = ls_f["mpa_" + download_file_name + ".md5"]
>     md5_file = os.path.join(folder, download_file_name + ".md5")
>     url_md5_file = ls_f[download_file_name + ".md5"]

The "file_list.txt already present" message appears not to be the real problem.
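
The mismatch can be reproduced in isolation. The sketch below is not MetaPhlAn's code: `ls_f` is a hand-made stand-in for the dictionary built from file_list.txt, the URLs are placeholders, and `lookup_key` is a hypothetical helper that adds the prefix only when it is missing.

```python
# Stand-in for the dictionary parsed from file_list.txt (contents assumed).
ls_f = {
    "mpa_v30_CHOCOPhlAn_201901.tar": "https://example.org/mpa_v30_CHOCOPhlAn_201901.tar",
    "mpa_v30_CHOCOPhlAn_201901.md5": "https://example.org/mpa_v30_CHOCOPhlAn_201901.md5",
}

# The index name already carries the "mpa_" prefix.
download_file_name = "mpa_v30_CHOCOPhlAn_201901"


def lookup_key(name, extension, prefix="mpa_"):
    """Hypothetical helper: prepend the prefix only if the name lacks it."""
    base = name if name.startswith(prefix) else prefix + name
    return base + extension


# The lookup from the traceback doubles the prefix and fails.
try:
    ls_f["mpa_" + download_file_name + ".tar"]
    doubled_prefix_found = True
except KeyError:
    doubled_prefix_found = False

# With the guarded key, both lookups succeed.
url_tar_file = ls_f[lookup_key(download_file_name, ".tar")]
url_md5_file = ls_f[lookup_key(download_file_name, ".md5")]
```

Guarding the prefix instead of dropping it keeps the lookup working whether or not the index name arrives with "mpa_" already attached.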

Best, Matthew Cahn

mellertd commented 4 years ago

I am getting the same error when attempting a Singularity build with miniconda3 (this makes patching the problem a bit tricky).

fbeghini commented 4 years ago

@mcahn @mellertd This issue was present in older conda builds (pyh5ca1d4c_2); the latest one is pyh5ca1d4c_4. I see that the environment uses Python 3.6, but MetaPhlAn requires Python 3.7. You should consider deleting the current environment, creating a new one with only Python 3.7, and then installing metaphlan, checking that the installed build is pyh5ca1d4c_4.

mellertd commented 4 years ago

I was using Python 3.7. I think the problem is that your installation instructions are incomplete. The particular build you refer to has trouble installing because of dependency conflicts (at least on whatever version of Linux the miniconda3 Docker container uses), and conda will silently fall back to a previous build that has the error.

Here is how conda is complaining at me:

Singularity> conda create -n mpa -c bioconda metaphlan=3.0=pyh5ca1d4c_4
Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: /
Found conflicts! Looking for incompatible packages.
failed

UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:


  - biom-format -> python[version='2.7.*|3.5.*|3.6.*|>=2.7,<2.8.0a0|>=3.5,<3.6.0a0|>=3.6,<3.7.0a0|3.4.*']
  - dendropy -> python[version='2.7.*|3.5.*|3.6.*|3.4.*|>=2.7,<3']
  - matplotlib-base -> python[version='>=2.7,<2.8.0a0|>=3.5,<3.6.0a0']

Your python: python[version='>=3.7']

UPDATE: I fixed this by adding conda-forge to the package search path. I'll respond with the build recipe if it works

mellertd commented 4 years ago

Yep, on second pass, @fbeghini, I have to say that your instructions were indeed complete! It was my unfamiliarity with bioconda that was the problem.

The build recipe that worked was:

From: continuumio/miniconda3


    export PATH="/opt/conda/bin:$PATH"
    conda update conda
    conda update --all
    conda config --add channels defaults
    conda config --add channels bioconda
    conda config --add channels conda-forge
    conda install -c bioconda metaphlan=3.0=pyh5ca1d4c_4
    metaphlan --install

Updated with a cleaner build recipe

mcahn commented 4 years ago

Thanks for the reply. I had not added the channels as instructed, because I thought I already had those channels. I added them (in the order listed), deleted the previous environment, made a new one, and ran the same installation again. This time it installed Python 3.7 and metaphlan build pyh5ca1d4c_4, and the database download works.

Best, Matthew

nick-youngblut commented 4 years ago

This seems to still be a bug for metaphlan3 packaged with humann3. I'm getting the same error with the conda env:

I just installed humann3 today via the biobakery channel.

fbeghini commented 4 years ago

Have you configured anaconda (as stated here) before installing humann? Is conda update metaphlan updating to the latest version?

With the following channels setting

           channel URLs :

the latest version and build is correctly fetched and installed

nick-youngblut commented 4 years ago

@fbeghini I just recreated the humann3 env and still have metaphlan 3.0 pyh5ca1d4c_1 bioconda in the env.

I'm installing the conda env via snakemake --use-conda with the following yaml:

channels:
- conda-forge
- bioconda
- biobakery
dependencies:
- pigz
- bioconda::seqkit
- biobakery::humann

nick-youngblut commented 4 years ago

By the way, it might be best to change the default for the metaphlan bowtie2 database install location, given that the default will install a very large database (~3-4 GB) into a conda env if metaphlan is installed via conda. conda wasn't made for holding large files within envs. Also, it takes a ton of time to re-create the bowtie2 database each time a metaphlan conda env is created. I know that metaphlan --install --bowtie2db <PATH> can be used, but this is not well-documented, and most users will just go with the default.
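
As a rough illustration of the workaround, the sketch below builds the metaphlan --install --bowtie2db <PATH> command for a location outside the environment and checks that the target is not under the active conda prefix. The directory name and both helper functions are hypothetical; only the two command-line options come from this thread.

```python
import os
import shlex


def install_command(db_dir):
    """Build the `metaphlan --install` command for a database directory."""
    return ["metaphlan", "--install", "--bowtie2db", db_dir]


def inside_conda_env(path, conda_prefix=None):
    """Return True if `path` lies inside the given (or active) conda env."""
    prefix = conda_prefix or os.environ.get("CONDA_PREFIX")
    if not prefix:
        return False
    path = os.path.abspath(path)
    prefix = os.path.abspath(prefix)
    return os.path.commonpath([path, prefix]) == prefix


db_dir = "/data/shared/metaphlan_databases"  # hypothetical shared location
if inside_conda_env(db_dir):
    print("Warning: the database would be stored inside the conda env")

cmd = install_command(db_dir)
# Only prints the command; pass `cmd` to subprocess.run(cmd, check=True) to run it.
print(shlex.join(cmd))
```

Keeping the database in one shared directory also means it does not have to be rebuilt every time the environment is re-created.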

fbeghini commented 4 years ago

Have you tried including bioconda::metaphlan as a dependency just before humann? It's weird that it is not correctly picking the latest version. The humann recipe does not require a specific version, so the latest should be used.

Thank you for the suggestion about the database location; it makes sense to me too. I'll update the documentation accordingly.

nick-youngblut commented 4 years ago

Why not change the humann recipe to require >=3.0.1, given that 3.0 has a bug that makes it unusable?

nick-youngblut commented 4 years ago

btw, continuous integration could help you spot major bugs such as what happened for 3.0. I tried to add that to phylophlan3.

nick-youngblut commented 4 years ago

Just to be clear, I just needed to add: - bioconda::metaphlan>=3.0.1 to my yaml in order to get the right version of metaphlan, but the bigger issue is that the humann bioconda recipe allows for the install of metaphlan 3.0

fbeghini commented 4 years ago

I'll cc @ljmciver for the humann recipe

fbeghini commented 4 years ago

I'm working on CI for MetaPhlAn that also tests whether the database is OK; it will be ready in a couple of weeks.

nick-youngblut commented 4 years ago

It would also be great to have code for creating custom metaphlan marker databases with the same methodology that was used to create the metaphlan3 database. Right now, there doesn't seem to be much info on the detailed steps that were done to create the metaphlan3 (or v2) marker database (besides the paper, which doesn't provide all of the details needed for reproduction).

fbeghini commented 4 years ago

The new MetaPhlAn 3 database was built starting from reference genomes annotated with UniRef90. The new ChocoPhlAn pipeline is not public at the moment; a paper that includes the detailed procedure is on the way.

nick-youngblut commented 4 years ago

MetaPhlAn 3 database was built starting from reference genomes annotated with UniRef90

Thanks! How was the annotation done (eg., if diamond, what e-value and sensitivity?) Any other pre- or post-annotation filtering?

fbeghini commented 4 years ago

I relied completely on UniProt for the annotations: you get the reference genomes from the Proteome portal, and each entry is composed of UniProtKB accessions, each of which can be resolved to a UniRef90 cluster. The information about which species share the same UniRef90 cluster can be used to identify unique genes.

Of course, this works for genomes included in UniProt. In the case of MAGs, annotation with DIAMOND/mmseqs2 is an alternative. For annotating MAGs, I use DIAMOND on the proteins obtained with prokka, using evalue 1, coverage 0.8 and identity percentage 90%, the same thresholds that define UniRef90 clusters.
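
The thresholds quoted above could be applied to parsed alignment hits along these lines. This is only a sketch: the dictionary keys (`qseqid`, `pident`, `length`, `slen`, `evalue`) are assumed field names, the hits are made-up values, and computing coverage as alignment length divided by subject length is an assumption, since the comment does not say which sequence the coverage refers to.

```python
def passes_uniref90_thresholds(hit, max_evalue=1.0, min_coverage=0.8,
                               min_identity=90.0):
    """Check one alignment hit against the thresholds quoted in the thread.

    Coverage is taken as alignment length / subject length (an assumption).
    """
    coverage = hit["length"] / hit["slen"]
    return (
        hit["evalue"] <= max_evalue
        and coverage >= min_coverage
        and hit["pident"] >= min_identity
    )


# Made-up hits for illustration only.
hits = [
    {"qseqid": "prot_1", "pident": 95.0, "length": 450, "slen": 500, "evalue": 1e-50},
    {"qseqid": "prot_2", "pident": 85.0, "length": 480, "slen": 500, "evalue": 1e-40},
    {"qseqid": "prot_3", "pident": 97.0, "length": 300, "slen": 500, "evalue": 1e-30},
]

# prot_2 fails on identity, prot_3 fails on coverage.
kept = [h["qseqid"] for h in hits if passes_uniref90_thresholds(h)]
```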

nick-youngblut commented 4 years ago

Thanks for the details! I'm considering creating a metaphlan3 marker database based on GTDB-r90 (v90 to be released by next week).

Maryamtarazkar commented 4 years ago

I am trying to run a sample input on metaphlan, but it looks like the database is not installed on my system. I apply: --input_type fastq name.fastq -o name_metaphlan and I get the following error:


Warning: Unable to download
Traceback (most recent call last):
  File "/home/sbomman/anaconda2/envs/metaphlan2/bin/", line 1442, in <module>
    metaphlan2()
  File "/home/sbomman/anaconda2/envs/metaphlan2/bin/", line 1164, in metaphlan2
    pars['index'] = check_and_install_database(pars['index'], pars['bowtie2db'], pars['bowtie2_build'], pars['nproc'], pars['force_download'])
  File "/home/sbomman/anaconda2/envs/metaphlan2/bin/", line 570, in check_and_install_database
    index = resolve_latest_database(bowtie2_db, force_redownload_latest)
  File "/home/sbomman/anaconda2/envs/metaphlan2/bin/", line 549, in resolve_latest_database
    with open(os.path.join(bowtie2_db,'mpa_latest')) as mpa_latest:
FileNotFoundError: [Errno 2] No such file or directory: '/home/sbomman/anaconda2/envs/metaphlan2/bin/metaphlan_databases/mpa_latest'

Would you please tell me how I can install the database? Thank you

nick-youngblut commented 4 years ago

It would be great if you could provide a bit more info on how to create the custom marker database, particularly on the marker sequence format and how to update the pkl file.

marker sequence data

Running bowtie2-inspect on mpa_v30_CHOCOPhlAn_201901 produces a fasta in which the sequence headers look like:

# just showing the sequence headers
>100053__V6HZB2__LEP1GSC062_3504 UniRef90_V6HZB2;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
>100053__V6HUW0__LEP1GSC062_1341 UniRef90_V6HUW0;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
>100225__K6UNG7__SAMN05421595_0182 UniRef90_K6UNG7;k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Dermatophilaceae|g__Austwickia|s__Austwickia_chelonae;GCA_900111385

What is the required format of the sequence headers? What does each part of 100225__K6UNG7__SAMN05421595_018 mean? Does each taxonomic level from kingdom to species need to be provided? What's going on with the sequences formatted as >1000373__GeneID:11569613?

Updating the mpa_v30_CHOCOPhlAn_201901.pkl file

The docs state:

import bz2
import pickle

# Load the existing database (the .pkl file is bz2-compressed)
with bz2.BZ2File('metaphlan_databases/mpa_v30_CHOCOPhlAn_201901.pkl', 'r') as ifile:
    db = pickle.load(ifile)

# Add the taxonomy of the new genomes
# (the right-hand sides here and below are placeholders to fill in, not literal Python)
db['taxonomy']['taxonomy of genome1'] = ('NCBI taxonomy id of genome1', length of genome1)
db['taxonomy']['taxonomy of genome2'] = ('NCBI taxonomy id of genome2', length of genome2)

# Add the information of the new marker as the other markers
db['markers'][new_marker_name] = {
                                   'clade': the clade that the marker belongs to,
                                   'ext': {the GCA of the first external genome where the marker appears,
                                           the GCA of the second external genome where the marker appears},
                                   'len': length of the marker,
                                   'taxon': the taxon of the marker
                                 }

# To see an example, try to print the first marker information:
# print(list(db['markers'].items())[0])

# Save the new mpa_pkl file
with bz2.BZ2File('metaphlan_databases/mpa_v30_CHOCOPhlAn_NEW.pkl', 'w') as ofile:
    pickle.dump(db, ofile, pickle.HIGHEST_PROTOCOL)

...but what is the "new_marker_name" format? Would that be the 100225__K6UNG7__SAMN05421595_018 part of the sequence header? How should "clade" be formatted? For "ext", is that all of the genomes where the marker appears? For "len", is that the mean length of all sequences matching the marker, or just the uniref90 rep? If it's just using the rep length, what about markers that vary considerably in length? How is "taxon" different from "clade"?

Thanks for your help with this!

fbeghini commented 4 years ago

@Maryamtarazkar Have you tried the procedure described in #109 ?

fbeghini commented 4 years ago

What is the required format of the sequence headers? What does each part of 100225__K6UNG7__SAMN05421595_018 mean? Does each taxonomic level from kingdom to species need to be provided? What's going on with the sequences formatted as >1000373__GeneID:11569613?

The names assigned to sequence headers are arbitrary; they only need to match the keys in the pickle file (['markers']). For ease of searching, I've named each marker as (NCBI_taxid)__(UniRef90_cluster)__(CDS_name). Taxonomy is not required in the header; it was included to have a common ChocoPhlAn header (HUMAnN sequence headers include the taxonomy).

1000373__GeneID:11569613, or in general headers with GeneID in their names, are viral markers coming from the previous MetaPhlAn database; the current ChocoPhlAn pipeline is not suitable for finding viral markers. As with the others, the first field is the NCBI taxid and the second one is the GeneID of the viral gene.
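
The naming convention described above can be unpacked with a small parser. This is an illustrative sketch, not part of MetaPhlAn: `MarkerName` and `parse_marker_name` are hypothetical names, and since marker names are arbitrary, the parser only applies to names that follow this particular convention.

```python
from typing import NamedTuple, Optional


class MarkerName(NamedTuple):
    taxid: str                # NCBI taxid
    uniref90: Optional[str]   # UniRef90 cluster; None for legacy viral markers
    gene: str                 # CDS name, or the GeneID for viral markers


def parse_marker_name(name):
    """Split a marker name of the form taxid__UniRef90__CDS or taxid__GeneID:N."""
    fields = name.split("__", 2)
    if len(fields) == 3:
        return MarkerName(fields[0], fields[1], fields[2])
    if len(fields) == 2 and fields[1].startswith("GeneID:"):
        return MarkerName(fields[0], None, fields[1][len("GeneID:"):])
    raise ValueError("Unrecognized marker name: " + repr(name))


bacterial = parse_marker_name("100053__V6HZB2__LEP1GSC062_3504")
viral = parse_marker_name("1000373__GeneID:11569613")
```

Splitting on the double underscore with a limit of two keeps the single underscore inside the CDS name intact.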

...but what is the "new_marker_name" format? Would that be the 100225__K6UNG7__SAMN05421595_018 part of the sequence header?

Yes, exactly, it's the name of the new marker that should match the one in the FASTA.

How should "clade" be formatted as? For "ext", is that all of the genomes where the marker appears? For "len" is that the mean length of all sequences matching the marker, or just the uniref90 rep? If it's just using the rep length, what about markers that vary considerably in length? How is "taxon" different than "clade"?

nick-youngblut commented 4 years ago

Thanks for all of the clarifications! That really helps. Just a couple of things to make sure I fully understand:

  1. For ext, when you say "less is better", did you not include all of the genomes that share each marker? If so, how would you select a subset of genomes for all that share a marker?
  2. So len is the uniref90 representative sequence? What if the marker length actually varies quite a bit across strains/species? Did you include a filter to remove such length-variable markers?
  3. Sorry, but I don't understand "latest leaf on the taxonomy". Generally, a leaf means a tree tip, so I'm guessing that you mean the finest taxonomic level (eg., species), but what does "latest" mean?

Also, in regards to the taxonomy, that should be specified as NCBI taxID. I'm guessing then that metaphlan3 uses taxdump files to deal with the taxonomic hierarchy. How would one provide an alternative taxdump (eg., a taxdump for the GTDB)? Maybe I'm not understanding this. What is required for the ['taxonomy of genome1'] field?

fbeghini commented 4 years ago

  1. Sorry, I may not have been clear enough: from the species' core genome, you should identify unique or almost unique genes. If a gene is unique to the species, the marker has no ext values. Sometimes you cannot find unique genes, so the gene can be shared between n species; in this case, only genes shared with the fewest number of species should be selected. In ext, for a species, you can list one or all of the genomes that share the marker; in any case, MetaPhlAn will use the "sharing" information not at the genome level, but at the species level.

  2. Markers are species-specific, so their length should not vary much within the species. Also, there are no big differences between the lengths of UniProtKB entries sharing the same UniRef90. What I did in this case was to take the representative UniProtKB entry if it was taxonomically assigned to the species of interest; otherwise, I used the best sequence assigned to that taxonomy (UniProtKB SPROT --> UniProtKB TrEMBL --> UniParc).

  3. Yes, sorry, that's what I meant.

Inside MetaPhlAn, a taxonomy tree is built using each entry of pkl['taxonomy']. From a quick glance, it seems that it should be easy to use GTDB instead of NCBI; in this case ['taxonomy of genome1'] should be 'd__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus;RS_GCF_900040965.1', but it's still missing the numeric tax ID from GTDB.

nick-youngblut commented 4 years ago


Just one last thing:

genes shared with the fewest number of species should be selected

Any rules of thumb to use for this? It seems very subjective.

fbeghini commented 4 years ago

It's just a trade-off between core value and #external. In the case of non-unique core genes, I try to maximize the core value and minimize the #external, including no more than 10 species, but it would be rare to have so many species.
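
One way to turn this rule of thumb into code is sketched below. It is not the ChocoPhlAn procedure, which is not public: the candidate records, the `core` and `external` keys, and the choice to sort first by number of external species and then by core value are all assumptions made for illustration. Only the cap of 10 external species comes from the comment above.

```python
def rank_candidate_markers(candidates, max_external=10):
    """Rank candidate genes: fewest external species first, then highest core value.

    Each candidate is a dict with hypothetical keys:
      name     - gene identifier
      core     - core value of the gene within the target species
      external - set of other species that share the gene
    """
    eligible = [c for c in candidates if len(c["external"]) <= max_external]
    return sorted(eligible, key=lambda c: (len(c["external"]), -c["core"]))


# Made-up candidates for illustration only.
candidates = [
    {"name": "gene_a", "core": 0.95, "external": set()},
    {"name": "gene_b", "core": 0.99, "external": {"species_1"}},
    {"name": "gene_c", "core": 0.90, "external": {"species_1"}},
    {"name": "gene_d", "core": 0.99, "external": {"species_%d" % i for i in range(11)}},
]

ranked_names = [c["name"] for c in rank_candidate_markers(candidates)]
```

A different weighting of the two criteria would be just as defensible; the sort key is the place to change it.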

nick-youngblut commented 4 years ago

I was just looking at the metaphlan3 pkl database file, and I noticed a couple of things that seem to be missing from the wiki docs:


The taxonomy is formatted as such:

taxonomy: 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Lachnospiraceae_unclassified|s__Eubacterium_rectale|t__GCA_003438925'
taxid: '2|1239|186801|186802|186803||39491'
length of genome: 3429456

While the wiki docs state:

db['taxonomy']['taxonomy of genome1'] = ('NCBI taxonomy id of genome1', length of genome1)

Why are there so many taxIDs? Why some gaps between taxIDs (eg., 186803||39491)?


Each entry contains a score value, but score is not in the wiki docs. Is the score just ignored?

Just to clarify:


Also just to check: does metaphlan3 use the entire taxonomy when determining markers that are within-species versus among-species, given that species names can sometimes be the same across multiple genera?

fbeghini commented 4 years ago

Why are there so many taxIDs? Why some gaps between taxIDs (eg., 186803||39491)?

The taxonomy should reflect the 7-level hierarchy, so each clade has its taxID, e.g. Bacteria has 2 and Firmicutes has 1239. Levels without a taxid are unclassified taxa named after the latest known clade + unclassified. This has also been done to be compliant with the taxonomy required by CAMI.
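
To make the correspondence concrete, the two pipe-delimited strings from the example entry above can be paired level by level. The helper name `pair_clades_with_taxids` is hypothetical, and this is only a sketch of how to read the entry, not MetaPhlAn's own parsing code.

```python
taxonomy = ("k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|"
            "f__Lachnospiraceae|g__Lachnospiraceae_unclassified|"
            "s__Eubacterium_rectale|t__GCA_003438925")
taxids = "2|1239|186801|186802|186803||39491"


def pair_clades_with_taxids(taxonomy, taxids):
    """Pair each clade with its taxID; an empty field becomes None."""
    clades = taxonomy.split("|")
    ids = taxids.split("|")
    # zip stops at the shorter list, so the strain level (t__), which has
    # no taxID field in this entry, is left out of the result.
    return [(clade, taxid or None) for clade, taxid in zip(clades, ids)]


pairs = pair_clades_with_taxids(taxonomy, taxids)
```

The empty field between 186803 and 39491 lines up with g__Lachnospiraceae_unclassified, which is the unclassified genus level.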

Each entry contains a score value, but score is not in the wiki docs. Is the score just ignored?

Yes, it's a legacy of the past. It was just len(pkl['ext'])

Just to clarify: [...] ...correct?

Yes, totally correct.

does metaphlan3 use the entire taxonomy when determining markers that are within-species versus among-species, given that species names can sometimes be the same across multiple genera?

No, right now it uses only the 'clade' field, but I get what you mean; I've encountered this problem when updating the database. It should be easy to use the entire taxonomy instead.

nick-youngblut commented 4 years ago

Thanks for all of the details! Are the taxIDs for each taxonomy level necessary for metaphlan3 or just for compliance with CAMI? For instance, can I just provide the taxID at the species level?

fbeghini commented 4 years ago

Yes, but you have to put the six pipes before it, e.g. ||||||39491, since the tree object will split the full taxonomy string on the pipe character.
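
A minimal sketch of building such a string, assuming the 7-level layout discussed above; `species_only_taxid_string` is a hypothetical helper, and the tuple at the end simply reuses the example taxID and genome length quoted earlier in the thread.

```python
def species_only_taxid_string(species_taxid, n_levels=7):
    """Return a taxID string with only the last (species) level filled in."""
    return "|".join([""] * (n_levels - 1) + [str(species_taxid)])


taxid_string = species_only_taxid_string(39491)

# Value in the form used for db['taxonomy'] entries: (taxID string, genome length).
taxonomy_value = (taxid_string, 3429456)
```

Splitting the result on the pipe character still yields seven fields, which is what keeps the species taxID aligned with the species level.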

fconstancias commented 4 years ago

Hi @nick-youngblut, Were you able to generate a GTDB-r90 metaphlan3 marker database?

Thanks for the details! I'm considering creating a metaphlan3 marker database based on GTDB-r90 (v90 to be released by next week).

nick-youngblut commented 4 years ago

@fconstancias I might be able to include it as part of Struo v2. Sorry, but no promises as of now.