mcahn closed this issue 3 years ago.
I am getting the same error when attempting a Singularity build with miniconda3 (this makes patching the problem a bit tricky).
@mcahn @mellertd This issue was present in older conda builds (pyh5ca1d4c_2); the latest one is pyh5ca1d4c_4. I see that the environment uses Python 3.6, while MetaPhlAn requires Python 3.7. You should consider deleting the current environment, creating a new one with only Python 3.7, and then installing metaphlan, checking that the installed build is pyh5ca1d4c_4.
I was using Python 3.7. I think the problem is that your installation instructions are incomplete. The particular build you refer to has trouble installing because of dependency conflicts (at least on whatever version of Linux the miniconda3 Docker container uses), and conda will silently fall back to a previous build that has the error.
Here is how conda is complaining at me:
Singularity> conda create -n mpa -c bioconda metaphlan=3.0=pyh5ca1d4c_4
Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed
Found conflicts! Looking for incompatible packages.
UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:
Specifications:
- biom-format -> python[version='2.7.*|3.5.*|3.6.*|>=2.7,<2.8.0a0|>=3.5,<3.6.0a0|>=3.6,<3.7.0a0|3.4.*']
- dendropy -> python[version='2.7.*|3.5.*|3.6.*|3.4.*|>=2.7,<3']
- matplotlib-base -> python[version='>=2.7,<2.8.0a0|>=3.5,<3.6.0a0']
Your python: python[version='>=3.7']
...
UPDATE: I fixed this by adding conda-forge to the package search path. I'll respond with the build recipe if it works
Yep, on second pass @fbeghini, I have to say that your instructions were indeed complete! It was my unfamiliarity with bioconda that was the problem.
The build recipe that worked was:
Bootstrap: docker
From: continuumio/miniconda3
%environment
PATH=/opt/conda/bin:/bin:/usr/bin
%post
export PATH="/opt/conda/bin:$PATH"
conda update conda
conda update --all
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda install -c bioconda metaphlan=3.0=pyh5ca1d4c_4
metaphlan --install
Updated with a cleaner build recipe
Thanks for the reply. I had not added the channels as instructed, because I thought I already had those channels. I added them (in the order listed), deleted the previous environment, made a new one, and ran the same installation again. This time it installed Python 3.7 and metaphlan build pyh5ca1d4c_4, and the database download works.
Best, Matthew
This seems to still be a bug for metaphlan3 packaged with humann3. I'm getting the same error with the conda env:
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 1_llvm conda-forge
biom-format 2.1.8 py37hc1659b7_0 conda-forge
biopython 1.77 py37h8f50634_0 conda-forge
blast 2.9.0 pl526he19e7b1_5 bioconda
boost-cpp 1.70.0 h8e57a91_2 conda-forge
bowtie2 2.4.1 py37h4ef193e_2 bioconda
brotlipy 0.7.0 py37h8f50634_1000 conda-forge
bzip2 1.0.8 h516909a_2 conda-forge
ca-certificates 2020.6.20 hecda079_0 conda-forge
certifi 2020.6.20 py37hc8dfbb8_0 conda-forge
cffi 1.14.0 py37hd463f26_0 conda-forge
chardet 3.0.4 py37hc8dfbb8_1006 conda-forge
click 7.1.2 pyh9f0ad1d_0 conda-forge
cryptography 2.9.2 py37hb09aad4_0 conda-forge
curl 7.71.0 he644dc0_0 conda-forge
cycler 0.10.0 py_2 conda-forge
dendropy 4.4.0 py_1 bioconda
diamond 0.9.36 h56fc30b_0 bioconda
entrez-direct 13.3 pl526h375a9b1_0 bioconda
expat 2.2.9 he1b5a44_2 conda-forge
freetype 2.10.2 he06d7ca_0 conda-forge
future 0.18.2 py37hc8dfbb8_1 conda-forge
glpk 4.65 he80fd80_1002 conda-forge
gmp 6.2.0 he1b5a44_2 conda-forge
h5py 2.10.0 nompi_py37h90cd8ad_103 conda-forge
hdf5 1.10.6 nompi_h3c11f04_100 conda-forge
humann 3.0.0.alpha.3 py37h83b1523_0 biobakery
icu 64.2 he1b5a44_1 conda-forge
idna 2.10 pyh9f0ad1d_0 conda-forge
kiwisolver 1.2.0 py37h99015e2_0 conda-forge
krb5 1.17.1 hfafb76e_1 conda-forge
ld_impl_linux-64 2.34 h53a641e_5 conda-forge
libblas 3.8.0 17_openblas conda-forge
libcblas 3.8.0 17_openblas conda-forge
libcurl 7.71.0 hcdd3856_0 conda-forge
libdeflate 1.6 h516909a_0 conda-forge
libedit 3.1.20191231 h46ee950_0 conda-forge
libffi 3.2.1 he1b5a44_1007 conda-forge
libgcc-ng 9.2.0 h24d8f2e_2 conda-forge
libgfortran-ng 7.5.0 hdf63c60_6 conda-forge
liblapack 3.8.0 17_openblas conda-forge
libopenblas 0.3.10 h5ec1e0e_0 conda-forge
libpng 1.6.37 hed695b0_1 conda-forge
libssh2 1.9.0 hab1572f_2 conda-forge
libstdcxx-ng 9.2.0 hdf63c60_2 conda-forge
llvm-openmp 10.0.0 hc9558a2_0 conda-forge
matplotlib-base 3.2.2 py37h30547a4_0 conda-forge
metaphlan 3.0 pyh5ca1d4c_1 bioconda
msgpack-python 1.0.0 py37h99015e2_1 conda-forge
muscle 3.8.1551 hc9558a2_5 bioconda
ncurses 6.1 hf484d3e_1002 conda-forge
numpy 1.18.5 py37h8960a57_0 conda-forge
openssl 1.1.1g h516909a_0 conda-forge
pandas 1.0.5 py37h0da4684_0 conda-forge
pcre 8.44 he1b5a44_0 conda-forge
perl 5.26.2 h516909a_1006 conda-forge
perl-app-cpanminus 1.7044 pl526_1 bioconda
perl-archive-tar 2.32 pl526_0 bioconda
perl-base 2.23 pl526_1 bioconda
perl-business-isbn 3.004 pl526_0 bioconda
perl-business-isbn-data 20140910.003 pl526_0 bioconda
perl-carp 1.38 pl526_3 bioconda
perl-common-sense 3.74 pl526_2 bioconda
perl-compress-raw-bzip2 2.087 pl526he1b5a44_0 bioconda
perl-compress-raw-zlib 2.087 pl526hc9558a2_0 bioconda
perl-constant 1.33 pl526_1 bioconda
perl-data-dumper 2.173 pl526_0 bioconda
perl-digest-hmac 1.03 pl526_3 bioconda
perl-digest-md5 2.55 pl526_0 bioconda
perl-encode 2.88 pl526_1 bioconda
perl-encode-locale 1.05 pl526_6 bioconda
perl-exporter 5.72 pl526_1 bioconda
perl-exporter-tiny 1.002001 pl526_0 bioconda
perl-extutils-makemaker 7.36 pl526_1 bioconda
perl-file-listing 6.04 pl526_1 bioconda
perl-file-path 2.16 pl526_0 bioconda
perl-file-temp 0.2304 pl526_2 bioconda
perl-html-parser 3.72 pl526h6bb024c_5 bioconda
perl-html-tagset 3.20 pl526_3 bioconda
perl-html-tree 5.07 pl526_1 bioconda
perl-http-cookies 6.04 pl526_0 bioconda
perl-http-daemon 6.01 pl526_1 bioconda
perl-http-date 6.02 pl526_3 bioconda
perl-http-message 6.18 pl526_0 bioconda
perl-http-negotiate 6.01 pl526_3 bioconda
perl-io-compress 2.087 pl526he1b5a44_0 bioconda
perl-io-html 1.001 pl526_2 bioconda
perl-io-socket-ssl 2.066 pl526_0 bioconda
perl-io-zlib 1.10 pl526_2 bioconda
perl-json 4.02 pl526_0 bioconda
perl-json-xs 2.34 pl526h6bb024c_3 bioconda
perl-libwww-perl 6.39 pl526_0 bioconda
perl-list-moreutils 0.428 pl526_1 bioconda
perl-list-moreutils-xs 0.428 pl526_0 bioconda
perl-lwp-mediatypes 6.04 pl526_0 bioconda
perl-lwp-protocol-https 6.07 pl526_4 bioconda
perl-mime-base64 3.15 pl526_1 bioconda
perl-mozilla-ca 20180117 pl526_1 bioconda
perl-net-http 6.19 pl526_0 bioconda
perl-net-ssleay 1.88 pl526h90d6eec_0 bioconda
perl-ntlm 1.09 pl526_4 bioconda
perl-parent 0.236 pl526_1 bioconda
perl-pathtools 3.75 pl526h14c3975_1 bioconda
perl-scalar-list-utils 1.52 pl526h516909a_0 bioconda
perl-socket 2.027 pl526_1 bioconda
perl-storable 3.15 pl526h14c3975_0 bioconda
perl-test-requiresinternet 0.05 pl526_0 bioconda
perl-time-local 1.28 pl526_1 bioconda
perl-try-tiny 0.30 pl526_1 bioconda
perl-types-serialiser 1.0 pl526_2 bioconda
perl-uri 1.76 pl526_0 bioconda
perl-www-robotrules 6.02 pl526_3 bioconda
perl-xml-namespacesupport 1.12 pl526_0 bioconda
perl-xml-parser 2.44_01 pl526ha1d75be_1002 conda-forge
perl-xml-sax 1.02 pl526_0 bioconda
perl-xml-sax-base 1.09 pl526_0 bioconda
perl-xml-sax-expat 0.51 pl526_3 bioconda
perl-xml-simple 2.25 pl526_1 bioconda
perl-xsloader 0.24 pl526_0 bioconda
pigz 2.3.4 hed695b0_1 conda-forge
pip 20.1.1 py_1 conda-forge
pycparser 2.20 pyh9f0ad1d_2 conda-forge
pyopenssl 19.1.0 py_1 conda-forge
pyparsing 2.4.7 pyh9f0ad1d_0 conda-forge
pysam 0.16.0.1 py37hc501bad_0 bioconda
pysocks 1.7.1 py37hc8dfbb8_1 conda-forge
python 3.7.6 cpython_h8356626_6 conda-forge
python-dateutil 2.8.1 py_0 conda-forge
python_abi 3.7 1_cp37m conda-forge
pytz 2020.1 pyh9f0ad1d_0 conda-forge
raxml 8.2.12 h14c3975_1 bioconda
readline 8.0 hf8c457e_0 conda-forge
requests 2.24.0 pyh9f0ad1d_0 conda-forge
samtools 0.1.19 h94a8ba4_6 bioconda
scipy 1.5.0 py37ha3d9a3c_0 conda-forge
seqkit 0.12.1 0 bioconda
setuptools 47.3.1 py37hc8dfbb8_0 conda-forge
six 1.15.0 pyh9f0ad1d_0 conda-forge
sqlite 3.32.3 hcee41ef_0 conda-forge
tbb 2020.1 hc9558a2_0 conda-forge
tk 8.6.10 hed695b0_0 conda-forge
tornado 6.0.4 py37h8f50634_1 conda-forge
urllib3 1.25.9 py_0 conda-forge
wheel 0.34.2 py_1 conda-forge
xz 5.2.5 h516909a_0 conda-forge
zlib 1.2.11 h516909a_1006 conda-forge
I just installed humann3 today via the biobakery channel.
Have you configured anaconda (as stated here) before installing humann?
Does conda update metaphlan update to the latest version?
With the following channel settings:
channel URLs : https://conda.anaconda.org/conda-forge/linux-64
https://conda.anaconda.org/conda-forge/noarch
https://conda.anaconda.org/bioconda/linux-64
https://conda.anaconda.org/bioconda/noarch
https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
the latest version and build is correctly fetched and installed
_libgcc_mutex conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
_openmp_mutex conda-forge/linux-64::_openmp_mutex-4.5-1_llvm
bcbio-gff bioconda/noarch::bcbio-gff-0.6.6-py_0
biom-format conda-forge/linux-64::biom-format-2.1.8-py37hc1659b7_0
biopython conda-forge/linux-64::biopython-1.77-py37h8f50634_0
blast bioconda/linux-64::blast-2.9.0-h20b68b9_1
boost conda-forge/linux-64::boost-1.68.0-py37h8619c78_1001
boost-cpp conda-forge/linux-64::boost-cpp-1.68.0-h11c811c_1000
bowtie2 bioconda/linux-64::bowtie2-2.4.1-py37h4ef193e_2
brotlipy conda-forge/linux-64::brotlipy-0.7.0-py37h8f50634_1000
bx-python bioconda/linux-64::bx-python-0.8.9-py37h5266303_0
bzip2 conda-forge/linux-64::bzip2-1.0.8-h516909a_2
ca-certificates pkgs/main/linux-64::ca-certificates-2020.6.24-0
certifi conda-forge/linux-64::certifi-2020.6.20-py37hc8dfbb8_0
cffi conda-forge/linux-64::cffi-1.14.0-py37hd463f26_0
chardet conda-forge/linux-64::chardet-3.0.4-py37hc8dfbb8_1006
click conda-forge/noarch::click-7.1.2-pyh9f0ad1d_0
cmseq bioconda/noarch::cmseq-1.0-pyh5ca1d4c_0
cryptography conda-forge/linux-64::cryptography-2.9.2-py37hb09aad4_0
curl pkgs/main/linux-64::curl-7.71.0-hbc83047_0
cycler conda-forge/noarch::cycler-0.10.0-py_2
dendropy bioconda/noarch::dendropy-4.4.0-py_1
diamond bioconda/linux-64::diamond-0.9.24-ha888412_1
fasttree bioconda/linux-64::fasttree-2.1.10-h14c3975_3
freetype conda-forge/linux-64::freetype-2.10.2-he06d7ca_0
future conda-forge/linux-64::future-0.18.2-py37hc8dfbb8_1
glpk conda-forge/linux-64::glpk-4.65-he80fd80_1002
gmp conda-forge/linux-64::gmp-6.2.0-he1b5a44_2
gnutls conda-forge/linux-64::gnutls-3.6.13-h79a8f9a_0
h5py conda-forge/linux-64::h5py-2.10.0-nompi_py37h90cd8ad_103
hdf5 conda-forge/linux-64::hdf5-1.10.6-nompi_h3c11f04_100
htslib bioconda/linux-64::htslib-1.9-h4da6232_3
humann biobakery/linux-64::humann-3.0.0.alpha.3-py37h83b1523_0
icu conda-forge/linux-64::icu-58.2-hf484d3e_1000
idna conda-forge/noarch::idna-2.10-pyh9f0ad1d_0
iqtree bioconda/linux-64::iqtree-2.0.3-h176a8bc_0
kiwisolver conda-forge/linux-64::kiwisolver-1.2.0-py37h99015e2_0
krb5 pkgs/main/linux-64::krb5-1.18.2-h173b8e3_0
ld_impl_linux-64 conda-forge/linux-64::ld_impl_linux-64-2.34-h53a641e_5
libblas conda-forge/linux-64::libblas-3.8.0-17_openblas
libcblas conda-forge/linux-64::libcblas-3.8.0-17_openblas
libcurl pkgs/main/linux-64::libcurl-7.71.0-h20c2e04_0
libdeflate conda-forge/linux-64::libdeflate-1.6-h516909a_0
libedit conda-forge/linux-64::libedit-3.1.20191231-h46ee950_0
libffi conda-forge/linux-64::libffi-3.2.1-he1b5a44_1007
libgcc-ng conda-forge/linux-64::libgcc-ng-9.2.0-h24d8f2e_2
libgfortran-ng conda-forge/linux-64::libgfortran-ng-7.5.0-hdf63c60_6
liblapack conda-forge/linux-64::liblapack-3.8.0-17_openblas
libopenblas conda-forge/linux-64::libopenblas-0.3.10-h5ec1e0e_0
libpng conda-forge/linux-64::libpng-1.6.37-hed695b0_1
libssh2 conda-forge/linux-64::libssh2-1.9.0-hab1572f_2
libstdcxx-ng conda-forge/linux-64::libstdcxx-ng-9.2.0-hdf63c60_2
llvm-openmp conda-forge/linux-64::llvm-openmp-10.0.0-hc9558a2_0
lzo conda-forge/linux-64::lzo-2.10-h14c3975_1000
mafft bioconda/linux-64::mafft-7.470-h516909a_0
matplotlib-base pkgs/main/linux-64::matplotlib-base-3.2.2-py37hef1b27d_0
metaphlan bioconda/noarch::metaphlan-3.0.1-pyh5ca1d4c_0
muscle bioconda/linux-64::muscle-3.8.1551-hc9558a2_5
ncurses conda-forge/linux-64::ncurses-6.1-hf484d3e_1002
nettle conda-forge/linux-64::nettle-3.4.1-h1bed415_1002
numpy conda-forge/linux-64::numpy-1.18.5-py37h8960a57_0
openssl conda-forge/linux-64::openssl-1.1.1g-h516909a_0
pandas conda-forge/linux-64::pandas-1.0.5-py37h0da4684_0
patsy conda-forge/noarch::patsy-0.5.1-py_0
pcre conda-forge/linux-64::pcre-8.44-he1b5a44_0
perl conda-forge/linux-64::perl-5.26.2-h516909a_1006
perl-archive-tar bioconda/linux-64::perl-archive-tar-2.32-pl526_0
perl-carp bioconda/linux-64::perl-carp-1.38-pl526_3
perl-common-sense bioconda/linux-64::perl-common-sense-3.74-pl526_2
perl-compress-raw~ bioconda/linux-64::perl-compress-raw-bzip2-2.087-pl526he1b5a44_0
perl-compress-raw~ bioconda/linux-64::perl-compress-raw-zlib-2.087-pl526hc9558a2_0
perl-exporter bioconda/linux-64::perl-exporter-5.72-pl526_1
perl-exporter-tiny bioconda/linux-64::perl-exporter-tiny-1.002001-pl526_0
perl-extutils-mak~ bioconda/linux-64::perl-extutils-makemaker-7.36-pl526_1
perl-io-compress bioconda/linux-64::perl-io-compress-2.087-pl526he1b5a44_0
perl-io-zlib bioconda/linux-64::perl-io-zlib-1.10-pl526_2
perl-json bioconda/linux-64::perl-json-4.02-pl526_0
perl-json-xs bioconda/linux-64::perl-json-xs-2.34-pl526h6bb024c_3
perl-list-moreuti~ bioconda/linux-64::perl-list-moreutils-0.428-pl526_1
perl-list-moreuti~ bioconda/linux-64::perl-list-moreutils-xs-0.428-pl526_0
perl-pathtools bioconda/linux-64::perl-pathtools-3.75-pl526h14c3975_1
perl-scalar-list-~ bioconda/linux-64::perl-scalar-list-utils-1.52-pl526h516909a_0
perl-types-serial~ bioconda/linux-64::perl-types-serialiser-1.0-pl526_2
perl-xsloader bioconda/linux-64::perl-xsloader-0.24-pl526_0
phylophlan bioconda/noarch::phylophlan-3.0-py_5
pip conda-forge/noarch::pip-20.1.1-py_1
pycparser conda-forge/noarch::pycparser-2.20-pyh9f0ad1d_2
pyopenssl conda-forge/noarch::pyopenssl-19.1.0-py_1
pyparsing conda-forge/noarch::pyparsing-2.4.7-pyh9f0ad1d_0
pysam bioconda/linux-64::pysam-0.16.0.1-py37hc501bad_0
pysocks conda-forge/linux-64::pysocks-1.7.1-py37hc8dfbb8_1
python conda-forge/linux-64::python-3.7.6-cpython_h8356626_6
python-dateutil conda-forge/noarch::python-dateutil-2.8.1-py_0
python-lzo conda-forge/linux-64::python-lzo-1.12-py37h81344f2_1001
python_abi conda-forge/linux-64::python_abi-3.7-1_cp37m
pytz conda-forge/noarch::pytz-2020.1-pyh9f0ad1d_0
raxml bioconda/linux-64::raxml-8.2.12-h14c3975_1
readline conda-forge/linux-64::readline-8.0-hf8c457e_0
requests conda-forge/noarch::requests-2.24.0-pyh9f0ad1d_0
samtools bioconda/linux-64::samtools-1.9-h10a08f8_12
scipy conda-forge/linux-64::scipy-1.5.0-py37ha3d9a3c_0
seaborn conda-forge/linux-64::seaborn-0.10.1-1
seaborn-base conda-forge/noarch::seaborn-base-0.10.1-py_1
setuptools conda-forge/linux-64::setuptools-47.3.1-py37hc8dfbb8_0
six conda-forge/noarch::six-1.15.0-pyh9f0ad1d_0
sqlite conda-forge/linux-64::sqlite-3.32.3-hcee41ef_0
statsmodels conda-forge/linux-64::statsmodels-0.11.1-py37h8f50634_2
tbb conda-forge/linux-64::tbb-2020.1-hc9558a2_0
tk conda-forge/linux-64::tk-8.6.10-hed695b0_0
tornado conda-forge/linux-64::tornado-6.0.4-py37h8f50634_1
trimal bioconda/linux-64::trimal-1.4.1-h6bb024c_3
urllib3 conda-forge/noarch::urllib3-1.25.9-py_0
wheel conda-forge/noarch::wheel-0.34.2-py_1
xz conda-forge/linux-64::xz-5.2.5-h516909a_0
zlib conda-forge/linux-64::zlib-1.2.11-h516909a_1006
@fbeghini I just recreated the humann3 env and still have metaphlan 3.0 pyh5ca1d4c_1 (bioconda) in the env.
I'm installing the conda env via snakemake --use-conda with the following yaml:
channels:
- conda-forge
- bioconda
- biobakery
dependencies:
- pigz
- bioconda::seqkit
- biobakery::humann
By the way, it might be best to change the default metaphlan bowtie2 database install location, given that the default will install a very large database (~3-4 GB) into a conda env if metaphlan is installed via conda. conda wasn't made for holding large files within envs. Also, it takes a lot of time to re-create the bowtie2 database each time a metaphlan conda env is created. I know that metaphlan --install --bowtie2db <PATH>
can be used, but this is not well documented, and most users will just go with the default.
Have you tried to include bioconda::metaphlan as a dependency just before humann? It's weird that it's not correctly picking the latest version. The humann recipe does not require a specific version, so the latest should be used.
Thank you for the suggestion about the database location, it makes sense also for me. I'll update the documentation accordingly.
Why not change the humann recipe to require >=3.0.1, given that 3.0 has a bug that makes it unusable?
btw, continuous integration could help you spot major bugs such as what happened with 3.0. I tried to add that to phylophlan3.
Just to be clear, I just needed to add - bioconda::metaphlan>=3.0.1 to my yaml in order to get the right version of metaphlan, but the bigger issue is that the humann bioconda recipe allows for the install of metaphlan 3.0.
I'll cc @ljmciver for the humann recipe
I'm working on the CI for MetaPhlAn, which will also test whether the database is OK; it will be ready in a couple of weeks.
It would also be great to have code for creating custom metaphlan marker databases with the same methodology that was used to create the metaphlan3 database. Right now, there doesn't seem to be much info on the detailed steps taken to create the metaphlan3 (or v2) marker database (besides the paper, which doesn't provide all of the details needed for reproduction).
The new MetaPhlAn 3 database was built starting from reference genomes annotated with UniRef90. The new ChocoPhlAn pipeline is not public at the moment; a paper that includes the detailed procedure is on the way.
MetaPhlAn 3 database was built starting from reference genomes annotated with UniRef90
Thanks! How was the annotation done (eg., if diamond, what e-value and sensitivity?) Any other pre- or post-annotation filtering?
I completely relied on UniProt for the annotations, meaning you get the reference genomes from the Proteomes portal, and each entry is composed of UniProtKB accessions, each of which can be resolved to a UniRef90 cluster. The information about which species share the same UniRef90 can be used to identify unique genes.
Of course this works only for genomes included in UniProt. In the case of MAGs, annotation with DIAMOND/MMseqs2 is an alternative. For annotating MAGs, I use DIAMOND on the proteins obtained with prokka, using e-value 1, coverage 0.8 and identity percentage 90%, the same thresholds that define UniRef90 clusters.
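Those thresholds (e-value 1, coverage 0.8, identity 90%, mirroring the UniRef90 clustering criteria) amount to a simple post-filter on the hit table. As a minimal, hypothetical sketch (the function and data here are mine, not MetaPhlAn's pipeline):

```python
# Hypothetical post-filter on annotation hits using the UniRef90-style
# thresholds mentioned above: identity >= 90%, query coverage >= 0.8,
# e-value <= 1.

def passes_uniref90_thresholds(pident, aln_len, qlen, evalue,
                               min_ident=90.0, min_cov=0.8, max_evalue=1.0):
    """Return True if a hit meets the identity, coverage and e-value cutoffs."""
    coverage = aln_len / qlen  # fraction of the query covered by the alignment
    return pident >= min_ident and coverage >= min_cov and evalue <= max_evalue

# Example hits: (query, subject, pident, aln_len, qlen, evalue)
hits = [
    ("gene1", "UniRef90_V6HZB2", 95.2, 240, 250, 1e-50),  # passes
    ("gene2", "UniRef90_K6UNG7", 85.0, 240, 250, 1e-50),  # identity too low
    ("gene3", "UniRef90_ABC123", 92.0, 120, 250, 1e-50),  # coverage too low
]

kept = [h for h in hits if passes_uniref90_thresholds(h[2], h[3], h[4], h[5])]
print([h[0] for h in kept])  # -> ['gene1']
```

In practice the same cutoffs can be passed directly to DIAMOND on the command line; the sketch only shows the filtering logic itself.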
Thanks for the details! I'm considering creating a metaphlan3 marker database based on GTDB-r90 (v90 to be released by next week).
I am trying to run a sample input through metaphlan, but it looks like the database is not installed on my system. I run metaphlan2.py --input_type fastq name.fastq -o name_metaphlan and I get the following error:
Downloading https://bitbucket.org/biobakery/metaphlan2/downloads/mpa_latest
Warning: Unable to download https://bitbucket.org/biobakery/metaphlan2/downloads/mpa_latest
Traceback (most recent call last):
File "/home/sbomman/anaconda2/envs/metaphlan2/bin/metaphlan2.py", line 1442, in
Would you please tell me how I can install the database? Thank you
It would be great if you could provide a bit more info on how to create the custom marker database, particularly on the marker sequence format and how to update the pkl file.
Running bowtie2-inspect on mpa_v30_CHOCOPhlAn_201901 produces a fasta in which the sequence headers look like:
# just showing the sequence headers
>1000373__GeneID:11569613
>100053__V6HZB2__LEP1GSC062_3504 UniRef90_V6HZB2;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
>100053__V6HUW0__LEP1GSC062_1341 UniRef90_V6HUW0;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
>100225__K6UNG7__SAMN05421595_0182 UniRef90_K6UNG7;k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Dermatophilaceae|g__Austwickia|s__Austwickia_chelonae;GCA_900111385
What is the required format of the sequence headers? What does each part of 100225__K6UNG7__SAMN05421595_018 mean? Does each taxonomic level from kingdom to species need to be provided? What's going on with the sequences formatted as >1000373__GeneID:11569613?
The docs state:
import bz2
import pickle

db = pickle.load(bz2.open('metaphlan_databases/mpa_v30_CHOCOPhlAn_201901.pkl', 'r'))
# Add the taxonomy of the new genomes
db['taxonomy']['taxonomy of genome1'] = ('NCBI taxonomy id of genome1', length of genome1)
db['taxonomy']['taxonomy of genome2'] = ('NCBI taxonomy id of genome2', length of genome2)
# Add the information of the new marker as the other markers
db['markers'][new_marker_name] = {
    'clade': the clade that the marker belongs to,
    'ext': {the GCA of the first external genome where the marker appears,
            the GCA of the second external genome where the marker appears,
           },
    'len': length of the marker,
    'taxon': the taxon of the marker
}
# To see an example, print the first marker's information:
# print(list(db['markers'].items())[0])
# Save the new mpa_pkl file
with bz2.BZ2File('metaphlan_databases/mpa_v30_CHOCOPhlAn_NEW.pkl', 'w') as ofile:
    pickle.dump(db, ofile, pickle.HIGHEST_PROTOCOL)
...but what is the "new_marker_name" format? Would that be the 100225__K6UNG7__SAMN05421595_018 part of the sequence header? How should "clade" be formatted? For "ext", is that all of the genomes where the marker appears? For "len", is that the mean length of all sequences matching the marker, or just the UniRef90 representative? If it's just the representative length, what about markers that vary considerably in length? How is "taxon" different from "clade"?
Thanks for your help with this!
@Maryamtarazkar Have you tried the procedure described in #109 ?
What is the required format of the sequence headers? What does each part of 100225__K6UNG7__SAMN05421595_018 mean? Does each taxonomic level from kingdom to species need to be provided? What's going on with the sequences formatted as >1000373__GeneID:11569613?
The names assigned to sequence headers are arbitrary; it's only required that they match the keys in the pickle file (['markers']). For ease of searching, I named each marker as (NCBI_taxid)__(UniRef90_cluster)__(CDS_name). Taxonomy is not required in the header; it was included to have a common ChocoPhlAn header format (HUMAnN sequence headers include the taxonomy).
1000373__GeneID:11569613, and in general headers with GeneID in their name, are viral markers coming from the previous MetaPhlAn database; the current ChocoPhlAn pipeline is not suitable for finding viral markers. As with the others, the first field is the NCBI taxid and the second one is the GeneID of the viral gene.
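Given that naming scheme, a marker name can be split back into its fields with a plain string split; a minimal sketch (the function name is mine):

```python
def split_marker_name(name):
    """Split a marker name into its '__'-separated fields.

    Regular markers: (NCBI_taxid)__(UniRef90_cluster)__(CDS_name)
    Viral markers:   (NCBI_taxid)__GeneID:(gene_id)

    maxsplit=2 keeps any further '__' inside the CDS name intact.
    """
    return tuple(name.split("__", 2))

print(split_marker_name("100225__K6UNG7__SAMN05421595_0182"))
# -> ('100225', 'K6UNG7', 'SAMN05421595_0182')

print(split_marker_name("1000373__GeneID:11569613"))
# -> ('1000373', 'GeneID:11569613')
```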
...but what is the "new_marker_name" format? Would that be the 100225__K6UNG7__SAMN05421595_018 part of the sequence header?
Yes, exactly, it's the name of the new marker, and it should match the one in the FASTA.
How should "clade" be formatted? For "ext", is that all of the genomes where the marker appears? For "len", is that the mean length of all sequences matching the marker, or just the UniRef90 representative? How is "taxon" different from "clade"?
ext contains the list of genomes (present in the ['taxonomy'] section) that share the same marker; fewer is better, and a zero-length ext means the marker is unique to the clade. This field is used when marker presence is calculated: if too many markers are shared by a species that is found, the species will be flagged as misidentified. len is the length of the species-specific nucleotide sequence found from the UniRef90 cluster. clade contains only the last leaf of the taxonomy field; taxon was kept only for compatibility and will be removed in the future, since clade is the only field used to map markers to clades.
Thanks for all of the clarifications! That really helps. Just a couple of things to make sure I fully understand:
- ext: when you say "less is better", did you not include all of the genomes that share each marker? If so, how would you select a subset of the genomes that share a marker?
- len: is that the length of the UniRef90 representative sequence? What if the marker length actually varies quite a bit across strains/species? Did you include a filter to remove such length-variable markers?
Also, in regards to the taxonomy, that should be specified as an NCBI taxID. I'm guessing then that metaphlan3 uses taxdump files to deal with the taxonomic hierarchy. How would one provide an alternative taxdump (eg., a taxdump for the GTDB)? Maybe I'm not understanding this. What is required for the ['taxonomy of genome1'] field?
Sorry, I may not have been clear enough: from the species' core genome, you should identify unique or almost-unique genes. If a gene is unique to the species, the marker has no ext values; sometimes you cannot find unique genes, so a gene can be shared between n species. In this case, only genes shared with the fewest number of species should be selected.
In ext, for a species, you can list one or all of the genomes that share the marker; in any case MetaPhlAn will use the "sharing" information not at the genome level, but at the species level.
Markers are species-specific, so they should not vary much within the species. Also, there are no big differences between the lengths of UniProtKB entries sharing the same UniRef90. What I did in this case was to take the representative UniProtKB if it was taxonomically assigned to the species of interest, and otherwise use the best sequence assigned to the taxonomy (UniProtKB Swiss-Prot --> UniProtKB TrEMBL --> UniParc).
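The selection rule described here (prefer genes unique to the species, otherwise those shared with the fewest other species, capped at roughly 10) could be sketched as follows; the data structure and function name are hypothetical, not MetaPhlAn's:

```python
# Hypothetical sketch of the marker-selection rule described above: from a
# species' core genes, prefer genes unique to the species and otherwise those
# shared with the fewest external species (here capped at 10).

def select_markers(core_genes, max_external=10):
    """core_genes maps gene name -> set of OTHER species that also carry it."""
    candidates = [(len(ext), gene) for gene, ext in core_genes.items()
                  if len(ext) <= max_external]
    # Sort so that unique genes (0 external species) come first.
    return [gene for n_ext, gene in sorted(candidates)]

core_genes = {
    "geneA": set(),                              # unique -> best marker
    "geneB": {"s__Other_species1"},              # shared with 1 species
    "geneC": {f"s__sp{i}" for i in range(15)},   # too widely shared, dropped
}
print(select_markers(core_genes))  # -> ['geneA', 'geneB']
```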
Yes, sorry, that's what I meant.
Inside MetaPhlAn, a taxonomy tree is built using each entry of pkl['taxonomy'] (https://github.com/biobakery/MetaPhlAn/blob/3.0/metaphlan/metaphlan.py#L627). From a quick glance, it seems it should be easy to use GTDB instead of NCBI; in this case ['taxonomy of genome1'] should be 'd__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus;RS_GCF_900040965.1', but it is still missing the numeric tax ID from GTDB.
Awesome!
Just one last thing:
genes shared with the fewest number of species should be selected
Any rules of thumb to use for this? It seems very subjective.
It's just a trade-off between the core value and the number of external species: in the case of non-unique core genes, I try to maximize the core value and minimize the number of external species, including no more than 10 species, but it would be rare to have so many species.
I was just looking at the metaphlan3 pkl database file, and I noticed a couple of things that seem to be missing from the wiki docs:
The taxonomy is formatted as such:
taxonomy: 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Lachnospiraceae_unclassified|s__Eubacterium_rectale|t__GCA_003438925'
taxid: '2|1239|186801|186802|186803||39491'
length of genome: 3429456
While the wiki docs state:
db['taxonomy']['taxonomy of genome1'] = ('NCBI taxonomy id of genome1', length of genome1)
Why are there so many taxIDs? Why are there gaps between taxIDs (eg., 186803||39491)?
Each entry contains a score value, but score is not in the wiki docs. Is the score just ignored?
Just to clarify:
- clade should be the species classification (eg., 's__Eubacterium_rectale')
- ext should be the list of genome IDs that fall outside of clade (eg., ['GCA_001405235', 'GCA_001406495', 'GCA_000210035'])
- taxon should be the taxonomy from kingdom down to species (eg., 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Lachnospiraceae_unclassified|s__Eubacterium_rectale')
...correct?
Also just to check: does metaphlan3 use the entire taxonomy when determining markers that are within-species versus among-species, given that species names can sometimes be the same across multiple genera?
Why are there so many taxIDs? Why some gaps between taxIDs (eg., 186803||39491)?
The taxonomy should reflect the 7 levels, so each clade has its own taxID, e.g. Bacteria has 2 and Firmicutes has 1239. Levels without a taxID are unclassified taxa, named after the last known clade plus _unclassified. This was also done to be compliant with the taxonomy required by CAMI.
Each entry contains a score value, but score is not in the wiki docs. Is the score just ignored?
Yes, it's a legacy of the past. It was just len(pkl['ext'])
Just to clarify: [...] ...correct?
Yes, totally correct.
does metaphlan3 use the entire taxonomy when determining markers that are within-species versus among-species, given that species names can sometimes be the same across multiple genera?
No, right now it uses only the 'clade' field, but I get what you mean; I've encountered this problem when updating the database. It should be easy to use the entire taxonomy instead.
Thanks for all of the details! Are the taxIDs for each taxonomy level necessary for metaphlan3 or just for compliance with CAMI? For instance, can I just provide the taxID at the species level?
Yes, but you have to put the six pipes before it, e.g. ||||||39491, since the tree object will split the full taxonomy string on the pipe character.
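Putting the two answers together, the taxID string must always contain seven pipe-separated slots, one per rank from kingdom to species, with empty slots for unclassified levels. A minimal sketch (the helper name is mine) of building such a string:

```python
# Sketch of building the 7-level taxID string described above: one slot per
# rank (k|p|c|o|f|g|s); unknown/unclassified levels stay empty, so a
# species-only entry becomes '||||||39491'.

def build_taxid_string(taxids_by_rank):
    """taxids_by_rank maps rank index 0..6 (kingdom..species) -> taxID."""
    return "|".join(str(taxids_by_rank.get(i, "")) for i in range(7))

# E. rectale-style entry from the example above: the genus slot has no taxID.
print(build_taxid_string({0: 2, 1: 1239, 2: 186801, 3: 186802,
                          4: 186803, 6: 39491}))
# -> '2|1239|186801|186802|186803||39491'

# Species-level taxID only: six leading pipes.
print(build_taxid_string({6: 39491}))
# -> '||||||39491'
```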
Hi @nick-youngblut, Were you able to generate a GTDB-r90 metaphlan3 marker database?
Thanks for the details! I'm considering creating a metaphlan3 marker database based on GTDB-r90 (v90 to be released by next week).
@fconstancias I might be able to include it as part of Struo v2. Sorry, but no promises as of now.
After installing metaphlan 3.0, and activating its conda environment, I ran:
metaphlan --install
This produces the following error:
Metaphlan was installed like this:
The problem appears to be that in metaphlan.py, "mpa_" is getting prepended to the database names when they are used as keys in the lsf dictionary. Removing these extra "mpa" strings seems to solve the problem.
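The actual patch isn't reproduced in this thread, but the general shape of the bug — a lookup key carrying an extra "mpa_" prefix that the dictionary keys don't — can be illustrated with a hypothetical sketch (none of these names are the actual metaphlan.py variables):

```python
# Hypothetical illustration of the key mismatch described above: one side of
# the lookup carries an extra 'mpa_' prefix that the other doesn't, so the
# key is never found and the database download silently misbehaves.

lsf = {"mpa_v30_CHOCOPhlAn_201901": ["mpa_v30_CHOCOPhlAn_201901.tar"]}

db_name = "mpa_v30_CHOCOPhlAn_201901"

buggy_key = "mpa_" + db_name   # 'mpa_mpa_v30_CHOCOPhlAn_201901': doubled prefix
print(buggy_key in lsf)        # -> False: lookup fails

fixed_key = db_name            # drop the extra prefix
print(fixed_key in lsf)        # -> True
```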
The "file_list.txt already present" message appears not to be the real problem.
Best, Matthew Cahn