biobakery / MetaPhlAn

MetaPhlAn is a computational tool for profiling the composition of microbial communities from metagenomic shotgun sequencing data
http://segatalab.cibio.unitn.it/tools/metaphlan/index.html
MIT License

[BUG] database installation error #103

Closed mcahn closed 3 years ago

mcahn commented 4 years ago

After installing metaphlan 3.0, and activating its conda environment, I ran:

metaphlan --install

This produces the following error:

File /tigress/MOLBIO/local/pythonenv/metaphlan3/lib/python3.6/site-packages/metaphlan/metaphlan_databases/file_list.txt already present!
Traceback (most recent call last):
  File "/tigress/MOLBIO/local/pythonenv/metaphlan3/bin/metaphlan", line 10, in <module>
    sys.exit(main())
  File "/tigress/MOLBIO/local/pythonenv/metaphlan3/lib/python3.6/site-packages/metaphlan/metaphlan.py", line 1187, in main
    pars['index'] = check_and_install_database(pars['index'], pars['bowtie2db'], pars['bowtie2_build'], pars['nproc'], pars['force_download'])
  File "/tigress/MOLBIO/local/pythonenv/metaphlan3/lib/python3.6/site-packages/metaphlan/metaphlan.py", line 610, in check_and_install_database
    download_unpack_tar(FILE_LIST, index, bowtie2_db, bowtie2_build, nproc)
  File "/tigress/MOLBIO/local/pythonenv/metaphlan3/lib/python3.6/site-packages/metaphlan/metaphlan.py", line 463, in download_unpack_tar
    url_tar_file = ls_f["mpa_" + download_file_name + ".tar"]
KeyError: 'mpa_mpa_v30_CHOCOPhlAn_201901.tar'

Metaphlan was installed like this:

conda create -p /path/to/our/conda/envs/metaphlan3 -c bioconda metaphlan

The problem appears to be that metaphlan.py prepends "mpa_" to the database names when they are used as keys in the ls_f dictionary, even though the names already carry that prefix. Removing the extra "mpa_" strings seems to solve the problem, like so:

diff metaphlan.py-orig metaphlan.py
462,463c462,463
<     tar_file = os.path.join(folder, "mpa_" + download_file_name + ".tar")
<     url_tar_file = ls_f["mpa_" + download_file_name + ".tar"]
---
>     tar_file = os.path.join(folder, download_file_name + ".tar")
>     url_tar_file = ls_f[download_file_name + ".tar"]
467,468c467,468
<     md5_file = os.path.join(folder, "mpa_" + download_file_name + ".md5")
<     url_md5_file = ls_f["mpa_" + download_file_name + ".md5"]
---
>     md5_file = os.path.join(folder, download_file_name + ".md5")
>     url_md5_file = ls_f[download_file_name + ".md5"]

The "file_list.txt already present" message appears not to be the real problem.
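
The failure mode described above can be reproduced in isolation. The sketch below is not MetaPhlAn's code: `file_list` is a made-up stand-in for the `ls_f` dictionary with a placeholder URL, and `tar_key` is a hypothetical helper. It only illustrates why prepending "mpa_" to an index name that already has the prefix produces the reported key.

```python
# Stand-in for the ls_f dictionary; the URL is a placeholder, not the real download location.
file_list = {
    "mpa_v30_CHOCOPhlAn_201901.tar": "https://example.org/mpa_v30_CHOCOPhlAn_201901.tar",
}

index = "mpa_v30_CHOCOPhlAn_201901"  # index name as it appears in the traceback


def tar_key(name):
    """Hypothetical helper: add the 'mpa_' prefix only if it is missing."""
    if not name.startswith("mpa_"):
        name = "mpa_" + name
    return name + ".tar"


# Unconditionally prepending the prefix reproduces the key from the KeyError.
doubled = "mpa_" + index + ".tar"
print(doubled in file_list)         # False: a lookup with this key raises KeyError
print(tar_key(index) in file_list)  # True
```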

Best, Matthew Cahn

mellertd commented 4 years ago

I am getting the same error when attempting a Singularity build with miniconda3 (this makes patching the problem a bit tricky).

fbeghini commented 4 years ago

@mcahn @mellertd This issue was present in older conda builds (pyh5ca1d4c_2); the latest one is pyh5ca1d4c_4. I also see that the environment uses Python 3.6, whereas MetaPhlAn requires Python 3.7. Consider deleting the current environment, creating a new one with only Python 3.7, and then installing metaphlan, checking that the installed build is pyh5ca1d4c_4.
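
One way to confirm which build ended up in an environment is to read the `metaphlan` row of `conda list`. A minimal sketch, assuming the four-column layout (name, version, build, channel) that `conda list` prints; `installed_build` is a hypothetical helper, and the sample text is hard-coded for illustration rather than read from a real environment.

```python
def installed_build(conda_list_text, package):
    """Return (version, build) for `package` from `conda list` output, or None."""
    for line in conda_list_text.splitlines():
        fields = line.split()
        # Skip blank lines and the "# Name ..." header comments.
        if not fields or fields[0].startswith("#"):
            continue
        if fields[0] == package and len(fields) >= 3:
            return fields[1], fields[2]
    return None


# Hard-coded sample rows in the `conda list` layout.
sample = """\
# Name                    Version                   Build  Channel
metaphlan                 3.0                pyh5ca1d4c_4    bioconda
python                    3.7.6           cpython_h8356626_6    conda-forge
"""

print(installed_build(sample, "metaphlan"))
```

In practice the text would come from running `conda list` inside the activated environment.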

mellertd commented 4 years ago

I was using Python 3.7. I think the problem is that your installation instructions are incomplete. The particular build you refer to has trouble installing because of dependency conflicts (at least on whatever version of Linux the miniconda3 Docker container uses), and conda will silently fall back to a previous build that has the error.

Here is how conda is complaining at me:

Singularity> conda create -n mpa -c bioconda metaphlan=3.0=pyh5ca1d4c_4
Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: /
Found conflicts! Looking for incompatible packages. failed

UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:

Specifications:

  - biom-format -> python[version='2.7.*|3.5.*|3.6.*|>=2.7,<2.8.0a0|>=3.5,<3.6.0a0|>=3.6,<3.7.0a0|3.4.*']
  - dendropy -> python[version='2.7.*|3.5.*|3.6.*|3.4.*|>=2.7,<3']
  - matplotlib-base -> python[version='>=2.7,<2.8.0a0|>=3.5,<3.6.0a0']

Your python: python[version='>=3.7']
...

UPDATE: I fixed this by adding conda-forge to the package search path. I'll respond with the build recipe if it works

mellertd commented 4 years ago

Yep, on second pass, @fbeghini, I have to say that your instructions were indeed complete! It was my unfamiliarity with bioconda that was the problem.

The build recipe that worked was:

Bootstrap:docker
From: continuumio/miniconda3

%environment
    PATH=/opt/conda/bin:/bin:/usr/bin

%post
    export PATH="/opt/conda/bin:$PATH"
    conda update conda
    conda update --all
    conda config --add channels defaults
    conda config --add channels bioconda
    conda config --add channels conda-forge
    conda install -c bioconda metaphlan=3.0=pyh5ca1d4c_4
    metaphlan --install

Updated with a cleaner build recipe
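
The order of the `conda config --add channels` lines in the recipe above matters because each `--add` places the new channel at the top of the list, so the channel added last gets the highest priority. The sketch below only mimics that list behavior in plain Python; `add_channel` is a hypothetical helper, not part of conda.

```python
def add_channel(channels, name):
    """Mimic `conda config --add channels NAME`: put NAME at the top of the list."""
    return [name] + [c for c in channels if c != name]


channels = []
for name in ["defaults", "bioconda", "conda-forge"]:  # order used in the recipe above
    channels = add_channel(channels, name)

# Highest-priority channel first.
print(channels)
```

With the recipe's order, conda-forge ends up first, followed by bioconda and then defaults.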

mcahn commented 4 years ago

Thanks for the reply. I had not added the channels as instructed, because I thought I already had those channels. I added them (in the order listed), deleted the previous environment, made a new one, and ran the same installation again. This time it installed Python 3.7 and metaphlan build pyh5ca1d4c_4, and the database download works.

Best, Matthew

nick-youngblut commented 4 years ago

This seems to still be a bug for metaphlan3 packaged with humann3. I'm getting the same error with the conda env:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                      1_llvm    conda-forge
biom-format               2.1.8            py37hc1659b7_0    conda-forge
biopython                 1.77             py37h8f50634_0    conda-forge
blast                     2.9.0           pl526he19e7b1_5    bioconda
boost-cpp                 1.70.0               h8e57a91_2    conda-forge
bowtie2                   2.4.1            py37h4ef193e_2    bioconda
brotlipy                  0.7.0           py37h8f50634_1000    conda-forge
bzip2                     1.0.8                h516909a_2    conda-forge
ca-certificates           2020.6.20            hecda079_0    conda-forge
certifi                   2020.6.20        py37hc8dfbb8_0    conda-forge
cffi                      1.14.0           py37hd463f26_0    conda-forge
chardet                   3.0.4           py37hc8dfbb8_1006    conda-forge
click                     7.1.2              pyh9f0ad1d_0    conda-forge
cryptography              2.9.2            py37hb09aad4_0    conda-forge
curl                      7.71.0               he644dc0_0    conda-forge
cycler                    0.10.0                     py_2    conda-forge
dendropy                  4.4.0                      py_1    bioconda
diamond                   0.9.36               h56fc30b_0    bioconda
entrez-direct             13.3            pl526h375a9b1_0    bioconda
expat                     2.2.9                he1b5a44_2    conda-forge
freetype                  2.10.2               he06d7ca_0    conda-forge
future                    0.18.2           py37hc8dfbb8_1    conda-forge
glpk                      4.65              he80fd80_1002    conda-forge
gmp                       6.2.0                he1b5a44_2    conda-forge
h5py                      2.10.0          nompi_py37h90cd8ad_103    conda-forge
hdf5                      1.10.6          nompi_h3c11f04_100    conda-forge
humann                    3.0.0.alpha.3    py37h83b1523_0    biobakery
icu                       64.2                 he1b5a44_1    conda-forge
idna                      2.10               pyh9f0ad1d_0    conda-forge
kiwisolver                1.2.0            py37h99015e2_0    conda-forge
krb5                      1.17.1               hfafb76e_1    conda-forge
ld_impl_linux-64          2.34                 h53a641e_5    conda-forge
libblas                   3.8.0               17_openblas    conda-forge
libcblas                  3.8.0               17_openblas    conda-forge
libcurl                   7.71.0               hcdd3856_0    conda-forge
libdeflate                1.6                  h516909a_0    conda-forge
libedit                   3.1.20191231         h46ee950_0    conda-forge
libffi                    3.2.1             he1b5a44_1007    conda-forge
libgcc-ng                 9.2.0                h24d8f2e_2    conda-forge
libgfortran-ng            7.5.0                hdf63c60_6    conda-forge
liblapack                 3.8.0               17_openblas    conda-forge
libopenblas               0.3.10               h5ec1e0e_0    conda-forge
libpng                    1.6.37               hed695b0_1    conda-forge
libssh2                   1.9.0                hab1572f_2    conda-forge
libstdcxx-ng              9.2.0                hdf63c60_2    conda-forge
llvm-openmp               10.0.0               hc9558a2_0    conda-forge
matplotlib-base           3.2.2            py37h30547a4_0    conda-forge
metaphlan                 3.0                pyh5ca1d4c_1    bioconda
msgpack-python            1.0.0            py37h99015e2_1    conda-forge
muscle                    3.8.1551             hc9558a2_5    bioconda
ncurses                   6.1               hf484d3e_1002    conda-forge
numpy                     1.18.5           py37h8960a57_0    conda-forge
openssl                   1.1.1g               h516909a_0    conda-forge
pandas                    1.0.5            py37h0da4684_0    conda-forge
pcre                      8.44                 he1b5a44_0    conda-forge
perl                      5.26.2            h516909a_1006    conda-forge
perl-app-cpanminus        1.7044                  pl526_1    bioconda
perl-archive-tar          2.32                    pl526_0    bioconda
perl-base                 2.23                    pl526_1    bioconda
perl-business-isbn        3.004                   pl526_0    bioconda
perl-business-isbn-data   20140910.003            pl526_0    bioconda
perl-carp                 1.38                    pl526_3    bioconda
perl-common-sense         3.74                    pl526_2    bioconda
perl-compress-raw-bzip2   2.087           pl526he1b5a44_0    bioconda
perl-compress-raw-zlib    2.087           pl526hc9558a2_0    bioconda
perl-constant             1.33                    pl526_1    bioconda
perl-data-dumper          2.173                   pl526_0    bioconda
perl-digest-hmac          1.03                    pl526_3    bioconda
perl-digest-md5           2.55                    pl526_0    bioconda
perl-encode               2.88                    pl526_1    bioconda
perl-encode-locale        1.05                    pl526_6    bioconda
perl-exporter             5.72                    pl526_1    bioconda
perl-exporter-tiny        1.002001                pl526_0    bioconda
perl-extutils-makemaker   7.36                    pl526_1    bioconda
perl-file-listing         6.04                    pl526_1    bioconda
perl-file-path            2.16                    pl526_0    bioconda
perl-file-temp            0.2304                  pl526_2    bioconda
perl-html-parser          3.72            pl526h6bb024c_5    bioconda
perl-html-tagset          3.20                    pl526_3    bioconda
perl-html-tree            5.07                    pl526_1    bioconda
perl-http-cookies         6.04                    pl526_0    bioconda
perl-http-daemon          6.01                    pl526_1    bioconda
perl-http-date            6.02                    pl526_3    bioconda
perl-http-message         6.18                    pl526_0    bioconda
perl-http-negotiate       6.01                    pl526_3    bioconda
perl-io-compress          2.087           pl526he1b5a44_0    bioconda
perl-io-html              1.001                   pl526_2    bioconda
perl-io-socket-ssl        2.066                   pl526_0    bioconda
perl-io-zlib              1.10                    pl526_2    bioconda
perl-json                 4.02                    pl526_0    bioconda
perl-json-xs              2.34            pl526h6bb024c_3    bioconda
perl-libwww-perl          6.39                    pl526_0    bioconda
perl-list-moreutils       0.428                   pl526_1    bioconda
perl-list-moreutils-xs    0.428                   pl526_0    bioconda
perl-lwp-mediatypes       6.04                    pl526_0    bioconda
perl-lwp-protocol-https   6.07                    pl526_4    bioconda
perl-mime-base64          3.15                    pl526_1    bioconda
perl-mozilla-ca           20180117                pl526_1    bioconda
perl-net-http             6.19                    pl526_0    bioconda
perl-net-ssleay           1.88            pl526h90d6eec_0    bioconda
perl-ntlm                 1.09                    pl526_4    bioconda
perl-parent               0.236                   pl526_1    bioconda
perl-pathtools            3.75            pl526h14c3975_1    bioconda
perl-scalar-list-utils    1.52            pl526h516909a_0    bioconda
perl-socket               2.027                   pl526_1    bioconda
perl-storable             3.15            pl526h14c3975_0    bioconda
perl-test-requiresinternet 0.05                    pl526_0    bioconda
perl-time-local           1.28                    pl526_1    bioconda
perl-try-tiny             0.30                    pl526_1    bioconda
perl-types-serialiser     1.0                     pl526_2    bioconda
perl-uri                  1.76                    pl526_0    bioconda
perl-www-robotrules       6.02                    pl526_3    bioconda
perl-xml-namespacesupport 1.12                    pl526_0    bioconda
perl-xml-parser           2.44_01         pl526ha1d75be_1002    conda-forge
perl-xml-sax              1.02                    pl526_0    bioconda
perl-xml-sax-base         1.09                    pl526_0    bioconda
perl-xml-sax-expat        0.51                    pl526_3    bioconda
perl-xml-simple           2.25                    pl526_1    bioconda
perl-xsloader             0.24                    pl526_0    bioconda
pigz                      2.3.4                hed695b0_1    conda-forge
pip                       20.1.1                     py_1    conda-forge
pycparser                 2.20               pyh9f0ad1d_2    conda-forge
pyopenssl                 19.1.0                     py_1    conda-forge
pyparsing                 2.4.7              pyh9f0ad1d_0    conda-forge
pysam                     0.16.0.1         py37hc501bad_0    bioconda
pysocks                   1.7.1            py37hc8dfbb8_1    conda-forge
python                    3.7.6           cpython_h8356626_6    conda-forge
python-dateutil           2.8.1                      py_0    conda-forge
python_abi                3.7                     1_cp37m    conda-forge
pytz                      2020.1             pyh9f0ad1d_0    conda-forge
raxml                     8.2.12               h14c3975_1    bioconda
readline                  8.0                  hf8c457e_0    conda-forge
requests                  2.24.0             pyh9f0ad1d_0    conda-forge
samtools                  0.1.19               h94a8ba4_6    bioconda
scipy                     1.5.0            py37ha3d9a3c_0    conda-forge
seqkit                    0.12.1                        0    bioconda
setuptools                47.3.1           py37hc8dfbb8_0    conda-forge
six                       1.15.0             pyh9f0ad1d_0    conda-forge
sqlite                    3.32.3               hcee41ef_0    conda-forge
tbb                       2020.1               hc9558a2_0    conda-forge
tk                        8.6.10               hed695b0_0    conda-forge
tornado                   6.0.4            py37h8f50634_1    conda-forge
urllib3                   1.25.9                     py_0    conda-forge
wheel                     0.34.2                     py_1    conda-forge
xz                        5.2.5                h516909a_0    conda-forge
zlib                      1.2.11            h516909a_1006    conda-forge

I just installed humann3 today via the biobakery channel.

fbeghini commented 4 years ago

Have you configured anaconda (as stated here) before installing humann? Is conda update metaphlan updating to the latest version?

With the following channels setting

           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://conda.anaconda.org/bioconda/linux-64
                          https://conda.anaconda.org/bioconda/noarch
                          https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch

the latest version and build is correctly fetched and installed

  _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
  _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-1_llvm
  bcbio-gff          bioconda/noarch::bcbio-gff-0.6.6-py_0
  biom-format        conda-forge/linux-64::biom-format-2.1.8-py37hc1659b7_0
  biopython          conda-forge/linux-64::biopython-1.77-py37h8f50634_0
  blast              bioconda/linux-64::blast-2.9.0-h20b68b9_1
  boost              conda-forge/linux-64::boost-1.68.0-py37h8619c78_1001
  boost-cpp          conda-forge/linux-64::boost-cpp-1.68.0-h11c811c_1000
  bowtie2            bioconda/linux-64::bowtie2-2.4.1-py37h4ef193e_2
  brotlipy           conda-forge/linux-64::brotlipy-0.7.0-py37h8f50634_1000
  bx-python          bioconda/linux-64::bx-python-0.8.9-py37h5266303_0
  bzip2              conda-forge/linux-64::bzip2-1.0.8-h516909a_2
  ca-certificates    pkgs/main/linux-64::ca-certificates-2020.6.24-0
  certifi            conda-forge/linux-64::certifi-2020.6.20-py37hc8dfbb8_0
  cffi               conda-forge/linux-64::cffi-1.14.0-py37hd463f26_0
  chardet            conda-forge/linux-64::chardet-3.0.4-py37hc8dfbb8_1006
  click              conda-forge/noarch::click-7.1.2-pyh9f0ad1d_0
  cmseq              bioconda/noarch::cmseq-1.0-pyh5ca1d4c_0
  cryptography       conda-forge/linux-64::cryptography-2.9.2-py37hb09aad4_0
  curl               pkgs/main/linux-64::curl-7.71.0-hbc83047_0
  cycler             conda-forge/noarch::cycler-0.10.0-py_2
  dendropy           bioconda/noarch::dendropy-4.4.0-py_1
  diamond            bioconda/linux-64::diamond-0.9.24-ha888412_1
  fasttree           bioconda/linux-64::fasttree-2.1.10-h14c3975_3
  freetype           conda-forge/linux-64::freetype-2.10.2-he06d7ca_0
  future             conda-forge/linux-64::future-0.18.2-py37hc8dfbb8_1
  glpk               conda-forge/linux-64::glpk-4.65-he80fd80_1002
  gmp                conda-forge/linux-64::gmp-6.2.0-he1b5a44_2
  gnutls             conda-forge/linux-64::gnutls-3.6.13-h79a8f9a_0
  h5py               conda-forge/linux-64::h5py-2.10.0-nompi_py37h90cd8ad_103
  hdf5               conda-forge/linux-64::hdf5-1.10.6-nompi_h3c11f04_100
  htslib             bioconda/linux-64::htslib-1.9-h4da6232_3
  humann             biobakery/linux-64::humann-3.0.0.alpha.3-py37h83b1523_0
  icu                conda-forge/linux-64::icu-58.2-hf484d3e_1000
  idna               conda-forge/noarch::idna-2.10-pyh9f0ad1d_0
  iqtree             bioconda/linux-64::iqtree-2.0.3-h176a8bc_0
  kiwisolver         conda-forge/linux-64::kiwisolver-1.2.0-py37h99015e2_0
  krb5               pkgs/main/linux-64::krb5-1.18.2-h173b8e3_0
  ld_impl_linux-64   conda-forge/linux-64::ld_impl_linux-64-2.34-h53a641e_5
  libblas            conda-forge/linux-64::libblas-3.8.0-17_openblas
  libcblas           conda-forge/linux-64::libcblas-3.8.0-17_openblas
  libcurl            pkgs/main/linux-64::libcurl-7.71.0-h20c2e04_0
  libdeflate         conda-forge/linux-64::libdeflate-1.6-h516909a_0
  libedit            conda-forge/linux-64::libedit-3.1.20191231-h46ee950_0
  libffi             conda-forge/linux-64::libffi-3.2.1-he1b5a44_1007
  libgcc-ng          conda-forge/linux-64::libgcc-ng-9.2.0-h24d8f2e_2
  libgfortran-ng     conda-forge/linux-64::libgfortran-ng-7.5.0-hdf63c60_6
  liblapack          conda-forge/linux-64::liblapack-3.8.0-17_openblas
  libopenblas        conda-forge/linux-64::libopenblas-0.3.10-h5ec1e0e_0
  libpng             conda-forge/linux-64::libpng-1.6.37-hed695b0_1
  libssh2            conda-forge/linux-64::libssh2-1.9.0-hab1572f_2
  libstdcxx-ng       conda-forge/linux-64::libstdcxx-ng-9.2.0-hdf63c60_2
  llvm-openmp        conda-forge/linux-64::llvm-openmp-10.0.0-hc9558a2_0
  lzo                conda-forge/linux-64::lzo-2.10-h14c3975_1000
  mafft              bioconda/linux-64::mafft-7.470-h516909a_0
  matplotlib-base    pkgs/main/linux-64::matplotlib-base-3.2.2-py37hef1b27d_0
  metaphlan          bioconda/noarch::metaphlan-3.0.1-pyh5ca1d4c_0
  muscle             bioconda/linux-64::muscle-3.8.1551-hc9558a2_5
  ncurses            conda-forge/linux-64::ncurses-6.1-hf484d3e_1002
  nettle             conda-forge/linux-64::nettle-3.4.1-h1bed415_1002
  numpy              conda-forge/linux-64::numpy-1.18.5-py37h8960a57_0
  openssl            conda-forge/linux-64::openssl-1.1.1g-h516909a_0
  pandas             conda-forge/linux-64::pandas-1.0.5-py37h0da4684_0
  patsy              conda-forge/noarch::patsy-0.5.1-py_0
  pcre               conda-forge/linux-64::pcre-8.44-he1b5a44_0
  perl               conda-forge/linux-64::perl-5.26.2-h516909a_1006
  perl-archive-tar   bioconda/linux-64::perl-archive-tar-2.32-pl526_0
  perl-carp          bioconda/linux-64::perl-carp-1.38-pl526_3
  perl-common-sense  bioconda/linux-64::perl-common-sense-3.74-pl526_2
  perl-compress-raw~ bioconda/linux-64::perl-compress-raw-bzip2-2.087-pl526he1b5a44_0
  perl-compress-raw~ bioconda/linux-64::perl-compress-raw-zlib-2.087-pl526hc9558a2_0
  perl-exporter      bioconda/linux-64::perl-exporter-5.72-pl526_1
  perl-exporter-tiny bioconda/linux-64::perl-exporter-tiny-1.002001-pl526_0
  perl-extutils-mak~ bioconda/linux-64::perl-extutils-makemaker-7.36-pl526_1
  perl-io-compress   bioconda/linux-64::perl-io-compress-2.087-pl526he1b5a44_0
  perl-io-zlib       bioconda/linux-64::perl-io-zlib-1.10-pl526_2
  perl-json          bioconda/linux-64::perl-json-4.02-pl526_0
  perl-json-xs       bioconda/linux-64::perl-json-xs-2.34-pl526h6bb024c_3
  perl-list-moreuti~ bioconda/linux-64::perl-list-moreutils-0.428-pl526_1
  perl-list-moreuti~ bioconda/linux-64::perl-list-moreutils-xs-0.428-pl526_0
  perl-pathtools     bioconda/linux-64::perl-pathtools-3.75-pl526h14c3975_1
  perl-scalar-list-~ bioconda/linux-64::perl-scalar-list-utils-1.52-pl526h516909a_0
  perl-types-serial~ bioconda/linux-64::perl-types-serialiser-1.0-pl526_2
  perl-xsloader      bioconda/linux-64::perl-xsloader-0.24-pl526_0
  phylophlan         bioconda/noarch::phylophlan-3.0-py_5
  pip                conda-forge/noarch::pip-20.1.1-py_1
  pycparser          conda-forge/noarch::pycparser-2.20-pyh9f0ad1d_2
  pyopenssl          conda-forge/noarch::pyopenssl-19.1.0-py_1
  pyparsing          conda-forge/noarch::pyparsing-2.4.7-pyh9f0ad1d_0
  pysam              bioconda/linux-64::pysam-0.16.0.1-py37hc501bad_0
  pysocks            conda-forge/linux-64::pysocks-1.7.1-py37hc8dfbb8_1
  python             conda-forge/linux-64::python-3.7.6-cpython_h8356626_6
  python-dateutil    conda-forge/noarch::python-dateutil-2.8.1-py_0
  python-lzo         conda-forge/linux-64::python-lzo-1.12-py37h81344f2_1001
  python_abi         conda-forge/linux-64::python_abi-3.7-1_cp37m
  pytz               conda-forge/noarch::pytz-2020.1-pyh9f0ad1d_0
  raxml              bioconda/linux-64::raxml-8.2.12-h14c3975_1
  readline           conda-forge/linux-64::readline-8.0-hf8c457e_0
  requests           conda-forge/noarch::requests-2.24.0-pyh9f0ad1d_0
  samtools           bioconda/linux-64::samtools-1.9-h10a08f8_12
  scipy              conda-forge/linux-64::scipy-1.5.0-py37ha3d9a3c_0
  seaborn            conda-forge/linux-64::seaborn-0.10.1-1
  seaborn-base       conda-forge/noarch::seaborn-base-0.10.1-py_1
  setuptools         conda-forge/linux-64::setuptools-47.3.1-py37hc8dfbb8_0
  six                conda-forge/noarch::six-1.15.0-pyh9f0ad1d_0
  sqlite             conda-forge/linux-64::sqlite-3.32.3-hcee41ef_0
  statsmodels        conda-forge/linux-64::statsmodels-0.11.1-py37h8f50634_2
  tbb                conda-forge/linux-64::tbb-2020.1-hc9558a2_0
  tk                 conda-forge/linux-64::tk-8.6.10-hed695b0_0
  tornado            conda-forge/linux-64::tornado-6.0.4-py37h8f50634_1
  trimal             bioconda/linux-64::trimal-1.4.1-h6bb024c_3
  urllib3            conda-forge/noarch::urllib3-1.25.9-py_0
  wheel              conda-forge/noarch::wheel-0.34.2-py_1
  xz                 conda-forge/linux-64::xz-5.2.5-h516909a_0
  zlib               conda-forge/linux-64::zlib-1.2.11-h516909a_1006

nick-youngblut commented 4 years ago

@fbeghini I just recreated the humann3 env and still have metaphlan 3.0 pyh5ca1d4c_1 bioconda in the env.

I'm installing the conda env via snakemake --use-conda with the following yaml:

channels:
- conda-forge
- bioconda
- biobakery
dependencies:
- pigz
- bioconda::seqkit
- biobakery::humann

nick-youngblut commented 4 years ago

By the way, it might be best to change the default install location for the metaphlan bowtie2 database, given that the default will install a very large database (~3-4 GB) into a conda env if metaphlan is installed via conda. conda wasn't made for holding large files within envs. Also, it takes a lot of time to re-create the bowtie2 database each time a metaphlan conda env is created. I know that metaphlan --install --bowtie2db <PATH> can be used, but this is not well documented, and most users will just go with the default.
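
The `--bowtie2db` option mentioned above can be wrapped in a small helper so that the database lands outside the conda environment. This is a sketch only: `install_command` is a hypothetical helper and the target directory is an arbitrary example. The command is built but not executed here.

```python
import os


def install_command(db_dir):
    """Build the argv for installing the database into db_dir (outside the env)."""
    db_dir = os.path.abspath(os.path.expanduser(db_dir))
    return ["metaphlan", "--install", "--bowtie2db", db_dir]


cmd = install_command("/data/shared/metaphlan_databases")  # example path, adjust as needed
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Keeping the database in a shared directory also means it does not have to be rebuilt each time the environment is recreated.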

fbeghini commented 4 years ago

Have you tried including bioconda::metaphlan as a dependency just before humann? It's weird that it is not picking up the latest version. The humann recipe does not require a specific version, so the latest should be used.

Thank you for the suggestion about the database location; it makes sense to me too. I'll update the documentation accordingly.

nick-youngblut commented 4 years ago

Why not change the humann recipe to require >=3.0.1, given that 3.0 has a bug that makes it unusable?

nick-youngblut commented 4 years ago

btw, continuous integration could help you spot major bugs such as what happened with 3.0. I tried to add that to phylophlan3.

nick-youngblut commented 4 years ago

Just to be clear, I just needed to add: - bioconda::metaphlan>=3.0.1 to my yaml in order to get the right version of metaphlan, but the bigger issue is that the humann bioconda recipe allows for the install of metaphlan 3.0
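
The effect of the `>=3.0.1` pin can be illustrated with a plain version comparison. This is a simplified sketch: `parse_version` and `satisfies_minimum` are hypothetical helpers that handle only dotted numeric versions, not conda's full version-matching rules.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '3.0.1' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def satisfies_minimum(version, minimum="3.0.1"):
    """True if `version` is at least `minimum` (numeric components only)."""
    v, m = parse_version(version), parse_version(minimum)
    # Pad the shorter tuple with zeros so '3.0' compares as '3.0.0'.
    width = max(len(v), len(m))
    v += (0,) * (width - len(v))
    m += (0,) * (width - len(m))
    return v >= m


print(satisfies_minimum("3.0"))    # False: excluded by the pin
print(satisfies_minimum("3.0.1"))  # True
```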

fbeghini commented 4 years ago

I'll cc @ljmciver for the humann recipe

fbeghini commented 4 years ago

I'm working on CI for MetaPhlAn that will also test whether the database is OK. It will be ready in a couple of weeks.

nick-youngblut commented 4 years ago

It would also be great to have code for creating custom metaphlan marker databases with the same methodology that was used to create the metaphlan3 database. Right now, there doesn't seem to be much info on the detailed steps that were done to create the metaphlan3 (or v2) marker database (besides the paper, which doesn't provide all of the details needed for reproduction).

fbeghini commented 4 years ago

The new MetaPhlAn 3 database was built starting from reference genomes annotated with UniRef90. The new ChocoPhlAn pipeline is not public at the moment; a paper that includes the detailed procedure is on the way.

nick-youngblut commented 4 years ago

MetaPhlAn 3 database was built starting from reference genomes annotated with UniRef90

Thanks! How was the annotation done (eg., if diamond, what e-value and sensitivity?) Any other pre- or post-annotation filtering?

fbeghini commented 4 years ago

I relied completely on UniProt for the annotations. That is, you get the reference genomes from the Proteome portal, and each entry has a UniProtKB accession, which can be resolved to a UniRef90 cluster. The information about which species share the same UniRef90 cluster can be used to identify unique genes.

Of course, this works only for genomes included in UniProt. For MAGs, annotation with DIAMOND/mmseqs2 is an alternative. To annotate MAGs, I use DIAMOND on the proteins obtained with prokka, using evalue 1, coverage 0.8, and identity percentage 90%, the same thresholds that define UniRef90 clusters.
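
The thresholds above can be expressed as a simple filter over alignment hits. A sketch under assumptions: each hit is a dict with hypothetical keys `evalue`, `coverage` (a fraction), and `pident` (a percentage), which would have to be mapped from whatever columns the aligner was asked to report. The comparisons are assumed to be inclusive, and the hit values are invented for illustration.

```python
MAX_EVALUE = 1.0     # "evalue 1" from the comment above
MIN_COVERAGE = 0.8   # coverage as a fraction
MIN_IDENTITY = 90.0  # identity percentage


def passes_thresholds(hit):
    """Keep a hit only if it meets all three thresholds."""
    return (
        hit["evalue"] <= MAX_EVALUE
        and hit["coverage"] >= MIN_COVERAGE
        and hit["pident"] >= MIN_IDENTITY
    )


# Invented example hits; the identifiers are placeholders.
hits = [
    {"query": "protein_a", "evalue": 1e-50, "coverage": 0.95, "pident": 97.2},
    {"query": "protein_b", "evalue": 1e-10, "coverage": 0.60, "pident": 99.0},
    {"query": "protein_c", "evalue": 1e-30, "coverage": 0.90, "pident": 85.0},
]

kept = [h["query"] for h in hits if passes_thresholds(h)]
print(kept)
```

Only the first hit survives: the second fails on coverage and the third on identity.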


nick-youngblut commented 4 years ago

Thanks for the details! I'm considering creating a metaphlan3 marker database based on GTDB-r90 (v90 to be released by next week).

Maryamtarazkar commented 4 years ago

I am trying to run a sample input on metaphlan, but it looks like the database is not installed on my system. I run: metaphlan2.py --input_type fastq name.fastq -o name_metaphlan and I get the following error:

Downloading https://bitbucket.org/biobakery/metaphlan2/downloads/mpa_latest

Warning: Unable to download https://bitbucket.org/biobakery/metaphlan2/downloads/mpa_latest
Traceback (most recent call last):
  File "/home/sbomman/anaconda2/envs/metaphlan2/bin/metaphlan2.py", line 1442, in <module>
    metaphlan2()
  File "/home/sbomman/anaconda2/envs/metaphlan2/bin/metaphlan2.py", line 1164, in metaphlan2
    pars['index'] = check_and_install_database(pars['index'], pars['bowtie2db'], pars['bowtie2_build'], pars['nproc'], pars['force_download'])
  File "/home/sbomman/anaconda2/envs/metaphlan2/bin/metaphlan2.py", line 570, in check_and_install_database
    index = resolve_latest_database(bowtie2_db, force_redownload_latest)
  File "/home/sbomman/anaconda2/envs/metaphlan2/bin/metaphlan2.py", line 549, in resolve_latest_database
    with open(os.path.join(bowtie2_db,'mpa_latest')) as mpa_latest:
FileNotFoundError: [Errno 2] No such file or directory: '/home/sbomman/anaconda2/envs/metaphlan2/bin/metaphlan_databases/mpa_latest'

Would you please tell me how I can install the database? Thank you

nick-youngblut commented 4 years ago

It would be great if you could provide a bit more info on how to create the custom marker database, particularly on the marker sequence format and how to update the pkl file.

marker sequence data

Running bowtie2-inspect on mpa_v30_CHOCOPhlAn_201901 produces a FASTA file in which the sequence headers look like:

# just showing the sequence headers
>1000373__GeneID:11569613
>100053__V6HZB2__LEP1GSC062_3504 UniRef90_V6HZB2;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
>100053__V6HUW0__LEP1GSC062_1341 UniRef90_V6HUW0;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
>100225__K6UNG7__SAMN05421595_0182 UniRef90_K6UNG7;k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Dermatophilaceae|g__Austwickia|s__Austwickia_chelonae;GCA_900111385

What is the required format of the sequence headers? What does each part of 100225__K6UNG7__SAMN05421595_018 mean? Does each taxonomic level from kingdom to species need to be provided? What's going on with the sequences formatted as >1000373__GeneID:11569613?

Updating the mpa_v30_CHOCOPhlAn_201901.pkl file

The docs state:

db = pickle.load(bz2.open('metaphlan_databases/mpa_v30_CHOCOPhlAn_201901.pkl', 'r'))

# Add the taxonomy of the new genomes
db['taxonomy']['taxonomy of genome1'] = ('NCBI taxonomy id of genome1', length of genome1)
db['taxonomy']['taxonomy of genome2'] = ('NCBI taxonomy id of genome2', length of genome2)

# Add the information of the new marker as the other markers
db['markers'][new_marker_name] = {
                                   'clade': the clade that the marker belongs to,
                                   'ext': {the GCA of the first external genome where the marker appears,
                                           the GCA of the second external genome where the marker appears,
                                          },
                                   'len': length of the marker,
                                   'taxon': the taxon of the marker
                                }

# To see an example, try to print the first marker information (Python 3 syntax):
# print(list(db['markers'].items())[0])

# Save the new mpa_pkl file
with bz2.BZ2File('metaphlan_databases/mpa_v30_CHOCOPhlAn_NEW.pkl', 'w') as ofile:
    pickle.dump(db, ofile, pickle.HIGHEST_PROTOCOL)

...but what is the "new_marker_name" format? Would that be the 100225__K6UNG7__SAMN05421595_018 part of the sequence header? How should "clade" be formatted as? For "ext", is that all of the genomes where the marker appears? For "len" is that the mean length of all sequences matching the marker, or just the uniref90 rep? If it's just using the rep length, what about markers that vary considerably in length? How is "taxon" different than "clade"?

Thanks for your help with this!

fbeghini commented 4 years ago

@Maryamtarazkar Have you tried the procedure described in #109 ?

fbeghini commented 4 years ago

What is the required format of the sequence headers? What does each part of 100225__K6UNG7__SAMN05421595_018 mean? Does each taxonomic level from kingdom to species need to be provided? What's going on with the sequences formatted as >1000373__GeneID:11569613?

The names assigned to the sequence headers are arbitrary; the only requirement is that they match the keys in the pickle file (db['markers']). For ease of searching, I named each marker (NCBI_taxid)__(UniRef90_cluster)__(CDS_name). Taxonomy is not required in the header; it was included so that ChocoPhlAn headers are consistent (HUMAnN sequence headers also include the taxonomy).

Headers such as 1000373__GeneID:11569613, or in general any header with GeneID in its name, are viral markers carried over from the previous MetaPhlAn database; the current ChocoPhlAn pipeline is not suitable for finding viral markers. As with the others, the first field is the NCBI taxid, and the second is the GeneID of the viral gene.
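For anyone scripting against these names, here is a minimal sketch of how the two header layouts described above could be split into fields. The `MarkerName` type and the `parse_marker_name` helper are hypothetical names made up for illustration; they are not part of MetaPhlAn, and since marker names are arbitrary, custom databases may not follow this layout at all.

```python
from typing import NamedTuple, Optional


class MarkerName(NamedTuple):
    """Fields of a ChocoPhlAn-style marker name (illustrative only)."""
    taxid: str               # NCBI taxid (first field)
    uniref90: Optional[str]  # UniRef90 cluster, None for legacy viral markers
    feature: str             # CDS name, or "GeneID:<n>" for viral markers


def parse_marker_name(header: str) -> MarkerName:
    """Split a FASTA header or bare marker name on the '__' separator."""
    # Keep only the identifier: drop a leading '>' and anything after the first space.
    name = header.lstrip(">").split(" ", 1)[0]
    parts = name.split("__", 2)
    if len(parts) == 3:
        return MarkerName(parts[0], parts[1], parts[2])
    if len(parts) == 2 and parts[1].startswith("GeneID:"):
        return MarkerName(parts[0], None, parts[1])
    raise ValueError(f"unrecognised marker name: {name!r}")


if __name__ == "__main__":
    print(parse_marker_name(">100053__V6HZB2__LEP1GSC062_3504 UniRef90_V6HZB2"))
    print(parse_marker_name(">1000373__GeneID:11569613"))
```

The parser raises on anything else rather than guessing, so a marker named with a different scheme is reported instead of being silently mis-split.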

...but what is the "new_marker_name" format? Would that be the 100225__K6UNG7__SAMN05421595_018 part of the sequence header?

Yes, exactly: it's the name of the new marker, and it should match the one in the FASTA.
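Since the only hard requirement is that FASTA names and db['markers'] keys agree, a quick consistency check can be sketched as follows. The function names are hypothetical, and the sketch assumes the marker name is the first whitespace-delimited token of each header, as in the bowtie2-inspect output shown earlier in the thread.

```python
import bz2
import pickle


def marker_names_from_fasta(lines):
    """Collect marker names from an iterable of FASTA lines (e.g. an open file)."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:  # skip a bare '>' with no name
                names.add(tokens[0])
    return names


def compare_markers(fasta_names, pkl_markers):
    """Return (names only in the FASTA, names only in the pickle)."""
    pkl_names = set(pkl_markers)
    return fasta_names - pkl_names, pkl_names - fasta_names


def load_markers(pkl_path):
    """Load the 'markers' dictionary from a bz2-compressed MetaPhlAn pickle."""
    with bz2.open(pkl_path, "rb") as handle:
        return pickle.load(handle)["markers"]
```

In practice one would pass an open handle on the FASTA written by bowtie2-inspect to `marker_names_from_fasta`, and the result of `load_markers` on the matching .pkl file to `compare_markers`; both returned sets should be empty for a consistent database.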

How should "clade" be formatted as? For "ext", is that all of the genomes where the marker appears? For "len" is that the mean length of all sequences matching the marker, or just the uniref90 rep? If it's just using the rep length, what about markers that vary considerably in length? How is "taxon" different than "clade"?

nick-youngblut commented 4 years ago

Thanks for all of the clarifications! That really helps. Just a couple of things to make sure I fully understand:

  1. For ext, when you say "less is better", did you not include all of the genomes that share each marker? If so, how would you select a subset of genomes for all that share a marker?
  2. So len is the uniref90 representative sequence? What if the marker length actually varies quite a bit across strains/species? Did you include a filter to remove such length-variable markers?
  3. Sorry, but I don't understand "latest leaf on the taxonomy". Generally, a leaf means a tree tip, so I'm guessing that you mean the finest taxonomic level (eg., species), but what does "latest" mean?

Also, in regards to the taxonomy, that should be specified as NCBI taxID. I'm guessing then that metaphlan3 uses taxdump files to deal with the taxonomic hierarchy. How would one provide an alternative taxdump (eg., a taxdump for the GTDB)? Maybe I'm not understanding this. What is required for the ['taxonomy of genome1'] field?

fbeghini commented 4 years ago
  1. Sorry, I may not have been clear enough. From the species' core genome, you should identify unique or almost unique genes. If a gene is unique to the species, the marker has no ext values. Sometimes you cannot find unique genes, so the gene is shared between n species; in that case, only genes shared with the fewest species should be selected. In ext, for each species you can list one or all of the genomes that share the marker; either way, MetaPhlAn uses the "sharing" information at the species level, not the genome level.

  2. Markers are species-specific, so their length should not vary much within the species. Also, there are no big differences between the lengths of the UniProtKB entries sharing the same UniRef90 cluster. What I did was take the representative UniProtKB entry if it was taxonomically assigned to the species of interest, and otherwise use the best sequence assigned to that taxonomy (UniProtKB SPROT --> UniProtKB TrEMBL --> UniParc).

  3. Yes, sorry, that's what I meant.

Inside MetaPhlAn, a taxonomy tree is built from each entry of pkl['taxonomy'] (https://github.com/biobakery/MetaPhlAn/blob/3.0/metaphlan/metaphlan.py#L627). From a quick glance, it seems it should be easy to use GTDB instead of NCBI. In this case ['taxonomy of genome1'] should be 'd__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus;RS_GCF_900040965.1', but it's still missing the numeric tax ID from GTDB.
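As a rough illustration, a GTDB lineage could be rewritten into the pipe-delimited layout used by the existing NCBI-based keys. This sketch makes several assumptions that are not confirmed here: that the key should use `|` separators, that `d__` should become `k__`, that spaces in names become underscores, and that the genome accession goes in a trailing `t__` level. The function name is hypothetical.

```python
def gtdb_to_mpa_taxonomy(gtdb_lineage: str, genome_id: str) -> str:
    """Convert 'd__X;p__Y;...;s__Genus species' into 'k__X|p__Y|...|s__Genus_species|t__ID'."""
    converted = []
    for rank in gtdb_lineage.split(";"):
        rank = rank.strip()
        if not rank:
            continue
        prefix, _, name = rank.partition("__")
        if prefix == "d":  # assumption: GTDB 'domain' maps to the 'k__' level
            prefix = "k"
        converted.append(f"{prefix}__{name.replace(' ', '_')}")
    converted.append(f"t__{genome_id}")  # assumption: genome goes in a 't__' level
    return "|".join(converted)


if __name__ == "__main__":
    lineage = ("d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;"
               "f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus")
    print(gtdb_to_mpa_taxonomy(lineage, "RS_GCF_900040965.1"))
```

The numeric taxid string that accompanies each taxonomy key is not produced here, because GTDB does not supply numeric IDs.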

nick-youngblut commented 4 years ago

Awesome!

Just one last thing:

genes shared with the fewest number of species should be selected

Any rules of thumb to use for this? It seems very subjective.

fbeghini commented 4 years ago

It's just a trade-off between the core value and #external. In the case of non-unique core genes, I try to maximize the core value and minimize #external, including no more than 10 species, though it would be rare to have that many.
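One possible way to encode that rule of thumb is sketched below. It is not the ChocoPhlAn implementation: the input layout (gene name mapped to a core value and a set of external species), the function name, and the choice to rank by fewest external species first and core value second are all assumptions made for illustration.

```python
MAX_EXTERNAL_SPECIES = 10  # upper bound mentioned in the comment above


def rank_candidates(candidates, max_external=MAX_EXTERNAL_SPECIES):
    """Rank candidate marker genes.

    candidates: dict mapping gene name -> (core_value, set_of_external_species),
    where a higher core_value is assumed to be better.
    Returns gene names, best first, after dropping genes shared too widely.
    """
    kept = [
        (gene, core, len(external))
        for gene, (core, external) in candidates.items()
        if len(external) <= max_external
    ]
    # Fewest external species first; ties broken by highest core value.
    kept.sort(key=lambda item: (item[2], -item[1]))
    return [gene for gene, _, _ in kept]


if __name__ == "__main__":
    example = {
        "geneA": (0.90, set()),
        "geneB": (1.00, {"species_1"}),
        "geneC": (1.00, set()),
    }
    print(rank_candidates(example))
```

A weighted score combining both quantities would be an equally reasonable reading of "trade-off"; the strict two-key sort is simply the easiest variant to reason about.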

nick-youngblut commented 4 years ago

I was just looking at the metaphlan3 pkl database file, and I noticed a couple of things that seem to be missing from the wiki docs:

taxonomy

The taxonomy is formatted as such:

taxonomy: 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Lachnospiraceae_unclassified|s__Eubacterium_rectale|t__GCA_003438925'
taxid: '2|1239|186801|186802|186803||39491'
length of genome: 3429456

While the wiki docs state:

db['taxonomy']['taxonomy of genome1'] = ('NCBI taxonomy id of genome1', length of genome1)

Why are there so many taxIDs? Why some gaps between taxIDs (eg., 186803||39491)?

markers

Each entry contains a score value, but score is not in the wiki docs. Is the score just ignored?

Just to clarify:

...correct?

Also just to check: does metaphlan3 use the entire taxonomy when determining markers that are within-species versus among-species, given that species names can sometimes be the same across multiple genera?

fbeghini commented 4 years ago

Why are there so many taxIDs? Why some gaps between taxIDs (eg., 186803||39491)?

The taxonomy should reflect the 7 levels, so each clade has its own taxID, e.g. Bacteria has 2 and Firmicutes has 1239. Levels without a taxID are unclassified taxa, named after the latest known clade + unclassified. This was also done to comply with the taxonomy required by CAMI.
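To make the level-by-level correspondence concrete, here is a small sketch that pairs each taxonomy level with its taxID, using the Eubacterium rectale entry quoted above. The helper name is hypothetical; empty taxID fields and the trailing t__ level (which has no taxID field) are reported as None.

```python
from itertools import zip_longest


def pair_levels(taxonomy: str, taxids: str):
    """Pair each '|'-separated taxonomy level with its taxID (None if absent)."""
    names = taxonomy.split("|")
    ids = taxids.split("|")
    if len(ids) > len(names):
        raise ValueError("more taxIDs than taxonomy levels")
    return [(name, taxid or None)
            for name, taxid in zip_longest(names, ids, fillvalue="")]


if __name__ == "__main__":
    taxonomy = ("k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|"
                "f__Lachnospiraceae|g__Lachnospiraceae_unclassified|"
                "s__Eubacterium_rectale|t__GCA_003438925")
    taxids = "2|1239|186801|186802|186803||39491"
    for name, taxid in pair_levels(taxonomy, taxids):
        print(name, taxid)
```

The empty field between 186803 and 39491 lines up with g__Lachnospiraceae_unclassified, which is the "gap" asked about.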

Each entry contains a score value, but score is not in the wiki docs. Is the score just ignored?

Yes, it's a legacy of the past. It was just len(pkl['ext'])

Just to clarify: [...] ...correct?

Yes, totally correct.

does metaphlan3 use the entire taxonomy when determining markers that are within-species versus among-species, given that species names can sometimes be the same across multiple genera?

No, right now it uses only the 'clade' field, but I get what you mean; I encountered this problem when updating the database. It should be easy to use the entire taxonomy instead.

nick-youngblut commented 4 years ago

Thanks for all of the details! Are the taxIDs for each taxonomy level necessary for metaphlan3 or just for compliance with CAMI? For instance, can I just provide the taxID at the species level?

fbeghini commented 4 years ago

Yes, but you have to put the six pipes before it, e.g. ||||||39491, since the tree object will split the full taxonomy string on the pipe character.
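A minimal sketch of building such a padded taxID string, and the (taxid string, genome length) tuple that the existing db['taxonomy'] entries appear to hold, might look like this. The helper names are hypothetical; the tuple layout is inferred from the entry shown earlier in the thread rather than from the MetaPhlAn source.

```python
N_LEVELS = 7  # kingdom, phylum, class, order, family, genus, species


def species_only_taxid(species_taxid) -> str:
    """Return a 7-field taxID string with only the species field filled in."""
    return "|" * (N_LEVELS - 1) + str(species_taxid)


def taxonomy_entry(species_taxid, genome_length: int):
    """Build the value stored under a taxonomy key: (taxID string, genome length)."""
    return (species_only_taxid(species_taxid), int(genome_length))


if __name__ == "__main__":
    print(species_only_taxid(39491))
    print(taxonomy_entry(39491, 3429456))
```

Splitting the result on "|" still yields seven fields, which keeps the species taxID in the position the tree expects.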

fconstancias commented 4 years ago

Hi @nick-youngblut, Were you able to generate a GTDB-r90 metaphlan3 marker database?

Thanks for the details! I'm considering creating a metaphlan3 marker database based on GTDB-r90 (v90 to be released by next week).

nick-youngblut commented 4 years ago

@fconstancias I might be able to include it as part of Struo v2. Sorry, but no promises as of now.