bxlab / metaWRAP

MetaWRAP - a flexible pipeline for genome-resolved metagenomic data analysis
MIT License

CONCOCT command is not following best practice #76

Open alneberg opened 5 years ago

alneberg commented 5 years ago

Hello,

first I must thank you for what seems to be an amazing resource! I can see that the performance of CONCOCT is not that good in the figure shown in the README and likewise in the figure in the paper. I had a look at the command used to run CONCOCT:

https://github.com/bxlab/metaWRAP/blob/e7f740d683a11d10141a8c0c3897cba4444bc6c2/bin/metawrap-modules/binning.sh#L394-L417

And from what I can see it is not using contigs cut-up into smaller pieces. Is this correct?

CONCOCT will not perform well on this, and the general instruction is to use the cut-up script to chop up the longest contigs: https://github.com/BinPro/CONCOCT/blob/develop/scripts/cut_up_fasta.py
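To make the cut-up step concrete, here is a minimal Python sketch of the idea. It is not the CONCOCT script itself: it assumes a 10 kb chunk size, no overlap between chunks, and that a short remainder is merged into the last chunk. The function name `cut_up` and the `.part_N` suffix are made up for illustration.

```python
def cut_up(name, seq, chunk_size=10000):
    """Split one contig into consecutive, non-overlapping chunks.

    A remainder shorter than chunk_size is merged into the last chunk,
    so contigs shorter than 2 * chunk_size are returned whole.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    n_chunks = max(1, len(seq) // chunk_size)
    pieces = []
    for i in range(n_chunks):
        start = i * chunk_size
        # The last chunk runs to the end of the contig and absorbs the remainder.
        end = len(seq) if i == n_chunks - 1 else start + chunk_size
        # ".part_N" is a hypothetical naming scheme used only in this sketch.
        pieces.append(("%s.part_%d" % (name, i), seq[start:end]))
    return pieces
```

With this rule a 26 kb contig becomes one 10 kb piece and one 16 kb piece, so long contigs contribute several data points to the clustering instead of one.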

There is then a script to create a clustering of the original contigs from the cut-up clustering here: https://github.com/EnvGen/toolbox/blob/master/scripts/concoct/merge_cutup_clustering.py
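As a rough illustration of what such a merge step has to do, the sketch below maps sub-contig cluster labels back to their original contigs. The majority vote and the `.part_N` suffix are assumptions made for this example, and the function is not taken from the linked script.

```python
from collections import Counter, defaultdict


def merge_cutup_clustering(sub_assignments, sep=".part_"):
    """Map sub-contig cluster labels back to original contigs by majority vote.

    sub_assignments: dict of sub-contig id -> cluster label.
    sep: hypothetical separator between the contig id and the chunk index.
    """
    votes = defaultdict(Counter)
    for sub_id, cluster in sub_assignments.items():
        # Strip the chunk suffix to recover the original contig id;
        # ids without the suffix are kept unchanged.
        original_id = sub_id.rsplit(sep, 1)[0]
        votes[original_id][cluster] += 1
    # most_common(1) yields the label with the most votes for each contig.
    return {contig: counter.most_common(1)[0][0] for contig, counter in votes.items()}
```

A contig whose pieces land in clusters 3, 3 and 7 would be assigned to cluster 3. How ties are broken is left unspecified here and would need an explicit rule in real use.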

I know this is a big hassle and I am indeed hoping to fix this in a future release of CONCOCT (Yes, it's true, it is still alive!!!), but I am afraid this is the current procedure.

Do you think it would be possible to implement this in metaWRAP? It's a bit sad to see the poor performance displayed on the readme... :(

Let me know if you need any assistance!

ursky commented 5 years ago

Hey Johannes!

Thanks for messaging me about this. I actually wondered why CONCOCT wasn't doing so hot on larger data sets. That makes sense now. It's unfortunate this information was not immediately apparent to users. I definitely want to fix this in metaWRAP, as CONCOCT offers a very fresh binning approach compared to the other binners I tested, so it could offer many good bins when incorporated with the bin_refinement module of metaWRAP. It already does pretty well on smaller data, so I imagine it will do even better once we fix this.

I see the scripts you mention do not come with the standard Bioconda distribution of CONCOCT. Looks like I will need to include them in the metawrap-scripts folder. Also, you probably noticed that the read alignment stage is done before the binning stages, so the contig depth files are already ready to go before concoct starts. Would it be ok to split the larger contigs and assume that they all have the same read depths? In other words, could you fix the pipeline by changing things only after line 403, having only the original fasta file and the ${out}/work_files/concoct_depth.txt and ${home}/${out}/work_files/assembly.fa files?

I will not have much time to work on this in the coming months, so I would greatly appreciate it if you could help me, especially since you know exactly how this works. Do you mind filling in the binning.sh script file to incorporate your suggested steps? I can work with you to properly incorporate the changes and make sure they work, and would be happy to re-run my benchmarks with the fixed pipeline.

Thanks, and I look forward to fixing this with you!

alneberg commented 5 years ago

Hi Gherman,

yes, I am aware that the origin of this mess is concoct. I'm also hoping the performance will be improved. We also have a new version of concoct coming (already available in the develop branch) which should be several times faster and use threads in a better way.

I am afraid that simply assuming the same coverage would at least not be optimal. I'm not sure how bad it would be as I haven't tried it, but at least the number of data points per bin would then be better distributed. The best alternative would be to split the contigs before mapping the reads, but I totally agree with you that this is a major hassle. The second best alternative would in my view be to set coverage for each sub-contig based on some probability distribution using the main contig mean coverage and coverage variance as given by the script you are using to get the coverage values. But that would still need splitting up the fasta file with the current version of concoct. Also, a script to perform this generation remains to be written. But maybe this is the easiest way forward for concoct within metaWRAP?

I will try to include those missing scripts into the bioconda package (which is by the way kindly contributed by other users and not from any of the developers).

I'm not sure either how much time I can spend on this problem, but I am indeed keen on getting it fixed. 👍

Thanks!

ursky commented 5 years ago

Great! I'm really looking forward to the concoct that can take advantage of multi-threading effectively! Right now it's just not scalable to most larger sequencing projects.

I do not really like the idea of faking the coverage variance through modeling... The data is right there, so why not use it? Why do you think that assuming the coverage from the overall average coverage is bad? If we have a long contig from a genome of a given abundance that we then cut up into short fragments, wouldn't the abundance of each of its pieces be expected to be identical? In practice, the read coverage will deviate some due to random chance and coverage biases, but the true value - the abundance of that genome in that sample - does not change from us cutting up the contig. Isn't coverage just a proxy for the abundance of each fragment in that sample? The longer the fragment, the more accurate that estimation. Sorry in advance if I am misunderstanding something here.

If we have to re-estimate the abundance of each fragment, I would prefer to be efficient and not have to repeat the alignment step. Instead, the .bam files from the alignments are right there. In theory, we could just split the fasta file, and then hack into the .bam file to make the read mappings consistent with the fragmented contig naming. Then run the coverage estimation tool on the new "split" .bam files.
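The core of that re-mapping idea is a coordinate translation from the original contig to its fragments. The sketch below shows only that translation on plain integers and does not read or write .bam files. The chunking rule (fixed-size chunks, remainder merged into the last one) and the `.part_N` names are assumptions for illustration.

```python
from bisect import bisect_right


def chunk_starts(contig_length, chunk_size=10000):
    """Start offsets of the chunks, with the remainder merged into the last one."""
    n_chunks = max(1, contig_length // chunk_size)
    return [i * chunk_size for i in range(n_chunks)]


def to_fragment_coords(contig, pos, starts):
    """Translate a 0-based position on the original contig
    into (fragment name, position within that fragment)."""
    if pos < 0:
        raise ValueError("pos must be non-negative")
    # Find the last chunk whose start offset is <= pos.
    index = bisect_right(starts, pos) - 1
    # ".part_N" is a hypothetical naming scheme used only in this sketch.
    return "%s.part_%d" % (contig, index), pos - starts[index]
```

A real implementation would also have to decide what to do with reads that span a fragment boundary, and with mate positions that end up on a different fragment, which is part of why this route is more work than it first appears.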

The downside is that this would be more work to make a program to parse out the bam files, so unless you want to help me integrate this, the easiest way would be to just brute-force and re-run the alignment stage on the cut-up contigs. This will be inefficient (doing the alignment twice), but one could argue that if the users cannot afford to re-align the reads then they should not be attempting to bin with CONCOCT, given how inefficient it is on large data. Let me know which way you want to try!

alneberg commented 5 years ago

Yes, it really needs a speed up, which I am confident in saying we will be able to achieve.

Yes, I am afraid I wasn't clear enough. I am not aiming at faking the variance. Instead I want to use the variance and the coverage value from the long contig to build a probability distribution for each long contig. This distribution will then be used to produce slightly varying coverage values for the sub-contigs produced. Does that make sense? I think this would result in more realistic values for the short contigs than just using the same coverage value for all sub-contigs.
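A minimal sketch of this proposal, assuming a normal distribution parameterised by the long contig's mean coverage and coverage variance. The thread does not name a distribution, so the normal is a guess, and the function name is hypothetical.

```python
import math
import random


def sample_subcontig_coverage(mean, variance, n_subcontigs, rng=None):
    """Draw one coverage value per sub-contig around the parent contig's mean.

    The normal distribution is an assumption of this sketch. Negative draws
    are clipped to zero because coverage cannot be negative.
    """
    if rng is None:
        rng = random.Random()
    sd = math.sqrt(variance)
    return [max(0.0, rng.gauss(mean, sd)) for _ in range(n_subcontigs)]
```

Passing a seeded generator, for example `rng=random.Random(1)`, makes the draws reproducible. Setting `variance` to zero reduces this to the simpler alternative of giving every sub-contig the parent's coverage.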

However, maybe it's worth trying using the same coverage values for all sub-contigs first, since I haven't tried that.

The idea of creating the script you describe has crossed my mind a few times, but I'm not sure how efficient I would be able to make such a script. Do you have a good idea of how it would be done? Creating a few 100k random distributions and subsampling ~10-100 values from each sounds like an easier programming task to me. But yes, one would then not get the true coverage for the sub part.

ursky commented 5 years ago

Can you provide the code that you would use to cut up the contigs, align the reads, bin, and then merge them? I am trying to work out the easiest way to modify the code for now. If you work with binning.sh directly that would help. Assume all scripts will be in ${SOFT}/. I want to see what the simplest solution would look like. Once that's working I can benchmark how it performs if we assume coverage (both your way and mine) to see what works best.

ursky commented 5 years ago

Hey @alneberg I saw you released CONCOCT 1.0. Congrats! I can't wait to plug it into metawrap so the CONCOCT performance is more on par with the others (especially speed and scalability). Are there any modifications I need to make to the command calling concoct, or is just updating the version enough?

concoct --coverage_file $COVERAGE --composition_file $ASSEMBLY -l $LENGTH

I also saw that the cut-up contig coverage does not have to be re-estimated now. Is there any clear-cut guide on how to cut up the contigs, use coverage from the main contigs, perform the binning, and then get final bins with the original contigs? I feel like given just two variables/files - a coverage file $COVERAGE and an assembly $ASSEMBLY - this should be possible, but I cannot find a definitive resource. This is not my priority at the moment, but most metaWRAP users are not using CONCOCT at the moment, instead going with metaBAT1, metaBAT2, and MaxBin2. This is due to speed (which will be fixed as soon as I add the new version), but also due to the binning performance, which I need your help to fix. Thanks in advance.

alneberg commented 5 years ago

Hi @ursky, sorry for not getting back to you about this! The speed should be improved in the new version but the scalability for increasing data size is probably the same (except for improved parallelism).

The commands given in the readme should be enough, but for you that would still mean you have to additionally run the cutup command and the concoct script to generate the coverage table. The big change is that the mapping doesn't have to be redone against the cutup contigs. However, the cutup contigs are still needed for concoct as input (--composition_file).
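One way to see why the mapping need not be redone: once the position of each sub-contig on its original contig is known, per-sub-contig coverage can be derived from the depth already measured on the original contig. The sketch below illustrates this with plain Python lists. The data structures and function name are hypothetical, and it is not a description of how the concoct coverage script is implemented.

```python
def subcontig_mean_coverage(per_base_depth, intervals):
    """Mean depth of each sub-contig, computed from depth on the original contig.

    per_base_depth: dict of original contig id -> list of per-base depths.
    intervals: iterable of (contig, start, end, sub_id) tuples with
               0-based, half-open coordinates on the original contig.
    """
    coverage = {}
    for contig, start, end, sub_id in intervals:
        depths = per_base_depth[contig][start:end]
        # Guard against empty intervals to avoid dividing by zero.
        coverage[sub_id] = sum(depths) / len(depths) if depths else 0.0
    return coverage
```

Unlike copying the parent's average to every piece, this keeps the real variation in depth along the contig.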

You should also add a value for -t which indicates how many threads concoct has access to. Give it as many as possible, as the parallelism performance should be quite good now. By default it only uses a single thread, so this is a necessary tweak.

The clustering results on the cutup contigs need to be merged to correspond to the original contigs. This command is also given in the readme. The scripts used in the readme are added to the path when concoct is installed.

Please let me know if you need any other information from me, it would be very nice to have the concoct installation work as well as it can.

Cheers, Johannes

ursky commented 5 years ago

Excellent, thank you. I was able to integrate the contig splitting relatively easily, and it seems to work very well!

However, I am unable to use the -t option without encountering an error https://github.com/BinPro/CONCOCT/issues/232, so I can't use the multi-threading yet. Do you want to try to fix it now, or should I just push it to metaWRAP main with -t 1 for now?

alneberg commented 5 years ago

Ok, concoct should now be able to use the -t flag again. At least in a fresh environment. Might not be acting the same in the combined env of metaWRAP...

ursky commented 5 years ago

After some testing, all features of CONCOCT appear to be functional under the metaWRAP wrapper. I am happy to report that CONCOCT v1.0 is now deployed as part of metaWRAP v1.1.4, including the contig splitting step and multithreading. I will be redoing some of the benchmarks in the coming weeks to do CONCOCT performance justice.

alneberg commented 5 years ago

Fantastic! I browsed through the code and it looks great!

A small side note (only for your convenience), the extra cd:ing can be avoided with the -b argument. If you use a non-existing directory and use a trailing /, concoct should create and put all the output files within that directory. For example:

concoct --composition_file contigs_10K.fa --coverage_file coverage_table.tsv -b concoct_output/

will create the directory concoct_output and use it to store all the output in.
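For a wrapper that assembles the call programmatically, the -b and -t advice from this thread could be combined roughly as follows. This is a sketch that uses only the flags quoted in the thread; the helper name and its defaults are hypothetical.

```python
import os


def build_concoct_command(composition_file, coverage_file, outdir, threads=None):
    """Assemble a concoct argument list using only flags mentioned in this thread."""
    if threads is None:
        # Fall back to one thread if the CPU count cannot be determined.
        threads = os.cpu_count() or 1
    # A trailing slash makes concoct treat -b as an output directory.
    basename = outdir if outdir.endswith("/") else outdir + "/"
    return [
        "concoct",
        "--composition_file", composition_file,
        "--coverage_file", coverage_file,
        "-t", str(threads),
        "-b", basename,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building a list rather than a shell string avoids quoting problems with unusual file names.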

Feel free to close this issue as you like.

Great work!

ursky commented 5 years ago

Thanks, I incorporated the changes. Unfortunately there is another issue now:

Up and running. Check /scratch/gu/TEACHING/binning_concoct_test4/work_files/concoct_out/log.txt for progress
Traceback (most recent call last):
  File "/home/guritsk1/miniconda2/envs/metawrap-env/bin/concoct", line 88, in <module>
    results = main(args)
  File "/home/guritsk1/miniconda2/envs/metawrap-env/bin/concoct", line 40, in main
    args.seed
  File "/home/guritsk1/miniconda2/envs/metawrap-env/lib/python2.7/site-packages/concoct/transform.py", line 5, in perform_pca
    pca_object = PCA(n_components=nc, random_state=seed).fit(d)
TypeError: __init__() got an unexpected keyword argument 'random_state'

It's something to do with the scikit-learn/scipy/numpy versions, but I am not sure which exactly. My environment:

# packages in environment at /home/guritsk1/miniconda2/envs/metawrap-env:
#
# Name                    Version                   Build  Channel
_r-mutex                  1.0.0                     mro_2
alabaster                 0.7.12                     py_0    conda-forge
aragorn                   1.2.38               h470a237_2    bioconda
asn1crypto                0.24.0                py27_1003    conda-forge
babel                     2.6.0                      py_1    conda-forge
backports                 1.0                        py_2    conda-forge
backports.functools_lru_cache 1.5                        py_1    conda-forge
backports_abc             0.5                        py_1    conda-forge
barrnap                   0.9                           2    bioconda
bcftools                  1.6                           0    bioconda
bedtools                  2.27.1               he860b03_3    bioconda
binutils_impl_linux-64    2.28.1               had2808c_3
binutils_linux-64         7.2.0               had2808c_27
biopython                 1.69                np110py27_0    bioconda
blas                      1.0                         mkl
blast                     2.6.0               boost1.64_2    bioconda
bmfilter                  3.101                hfc679d8_2    bioconda
bmtagger                  3.101                h470a237_4    bioconda
bmtool                    3.101                hfc679d8_2    bioconda
boost                     1.64.0                   py27_4    conda-forge
boost-cpp                 1.64.0                        1    conda-forge
bowtie2                   2.3.4.3          py27h2d50403_0    bioconda
brewer2mpl                1.4.1                      py_3    conda-forge
bwa                       0.7.15                        1    bioconda
bwidget                   1.9.11                        1
bz2file                   0.98                       py_0    conda-forge
bzip2                     1.0.6             h14c3975_1002    conda-forge
ca-certificates           2018.11.29           ha4d7672_0    conda-forge
cairo                     1.14.12              h8948797_3
certifi                   2018.11.29            py27_1000    conda-forge
cffi                      1.12.1           py27h9745a5d_0    conda-forge
chardet                   3.0.4                 py27_1003    conda-forge
checkm-genome             1.0.13                   py27_0    bioconda
concoct                   1.0.0            py27h7724fef_3    bioconda
cryptography              2.5              py27h1ba5d50_0
curl                      7.62.0               hbc83047_0
cutadapt                  1.18             py27h14c3975_1    bioconda
cycler                    0.10.0                     py_1    conda-forge
cython                    0.29.5           py27hf484d3e_0    conda-forge
dbus                      1.13.2               h714fa37_1
dendropy                  4.4.0                      py_0    bioconda
docutils                  0.14                  py27_1001    conda-forge
enum34                    1.1.6                 py27_1001    conda-forge
expat                     2.2.5             hf484d3e_1002    conda-forge
extract_fullseq           3.101                         3    bioconda
fastqc                    0.11.5                        1    bioconda
fontconfig                2.13.0               h9420a91_0
fraggenescan              1.31                 h470a237_0    bioconda
freetype                  2.9.1             h94bbf69_1005    conda-forge
fribidi                   1.0.5             h14c3975_1000    conda-forge
functools32               3.2.3.2                    py_3    conda-forge
futures                   3.2.0                 py27_1000    conda-forge
gcc_impl_linux-64         7.2.0                habb00fd_3
gcc_linux-64              7.2.0               h550dcbe_27
gfortran_impl_linux-64    7.2.0                hdf63c60_3
gfortran_linux-64         7.2.0               h550dcbe_27
glib                      2.56.2               hd408876_0
gmp                       6.1.2             hf484d3e_1000    conda-forge
gnutls                    3.5.19               h2a4e5f8_1    conda-forge
graphite2                 1.3.13            hf484d3e_1000    conda-forge
gsl                       2.4                  h14c3975_4
gst-plugins-base          1.14.0               hbbd80ab_1
gstreamer                 1.14.0               hb453b48_1
gxx_impl_linux-64         7.2.0                hdf63c60_3
gxx_linux-64              7.2.0               h550dcbe_27
harfbuzz                  1.9.0             he243708_1001    conda-forge
hmmer                     3.1b2                         3    bioconda
htslib                    1.6                           0    bioconda
icu                       58.2              hf484d3e_1000    conda-forge
idba                      1.1.3                         1    bioconda
idna                      2.8                   py27_1000    conda-forge
imagesize                 1.1.0                      py_0    conda-forge
infernal                  1.1.2                h14c3975_2    bioconda
ipaddress                 1.0.22                     py_1    conda-forge
java-jdk                  8.0.92                        1    bioconda
jellyfish                 1.1.12               h2d50403_0    bioconda
jemalloc                  4.5.0                         0    bioconda
jinja2                    2.10                       py_1    conda-forge
jpeg                      9c                h14c3975_1001    conda-forge
kiwisolver                1.0.1           py27h6bb024c_1002    conda-forge
kraken                    1.1                  h470a237_2    bioconda
krb5                      1.14.6                        0    conda-forge
krona                     2.7                     pl526_2    bioconda
libcurl                   7.62.0               h20c2e04_0
libedit                   3.1.20170329      hf8c457e_1001    conda-forge
libffi                    3.2.1             hf484d3e_1005    conda-forge
libgcc                    7.2.0                h69d50b8_2    conda-forge
libgcc-ng                 7.3.0                hdf63c60_0    conda-forge
libgfortran               3.0.0                         1    conda-forge
libgfortran-ng            7.2.0                hdf63c60_3    conda-forge
libiconv                  1.14                          4    conda-forge
libidn11                  1.33                          0    conda-forge
libopenblas               0.2.20               h9ac9557_7
libpng                    1.6.36            h84994c4_1000    conda-forge
libssh2                   1.8.0                         1    conda-forge
libstdcxx-ng              7.3.0                hdf63c60_0    conda-forge
libtiff                   4.0.10            h648cc4a_1001    conda-forge
libuuid                   1.0.3                         1    conda-forge
libxcb                    1.13              h14c3975_1002    conda-forge
libxml2                   2.9.9                he19cac6_0
llvm-meta                 7.0.0                         0    conda-forge
markupsafe                1.1.1            py27h14c3975_0    conda-forge
matplotlib                2.2.3            py27h8a2030e_1    conda-forge
matplotlib-base           2.2.3            py27h60b886d_1    conda-forge
maxbin2                   2.2.5                         0    ursky
megahit                   1.1.2                    py27_1    bioconda
metabat2                  2.12.1                        0    ursky
minced                    0.3.2                         0    bioconda
mkl                       11.3.3                        0
mmtf-python               1.0.2                    py27_0    bioconda
mro-base                  3.4.3                h1c2f66e_1
mro-basics                3.4.3                         0
msgpack-python            0.6.1            py27h6bb024c_0    conda-forge
ncurses                   6.1               hf484d3e_1002    conda-forge
nettle                    3.3                           0    conda-forge
nose                      1.3.7                 py27_1002    conda-forge
numpy                     1.10.4                   py27_2
olefile                   0.46                       py_0    conda-forge
openblas                  0.3.5             h9ac9557_1000    conda-forge
openjdk                   11.0.1            h14c3975_1014    conda-forge
openmp                    7.0.0                h2d50403_0    conda-forge
openssl                   1.1.1a            h14c3975_1000    conda-forge
packaging                 19.0                       py_0    conda-forge
pandas                    0.23.4          py27h637b7d7_1000    conda-forge
pango                     1.42.4               h049681c_0
parallel                  20160622                      1    bioconda
patsy                     0.5.1                      py_0    conda-forge
pcre                      8.42                 h439df22_0
perl                      5.26.2            h14c3975_1002    conda-forge
perl-app-cpanminus        1.7044                  pl526_1    bioconda
perl-bioperl              1.6.924                       4    bioconda
perl-carp                 1.38                    pl526_1    bioconda
perl-constant             1.33                    pl526_1    bioconda
perl-encode               2.88                    pl526_1    bioconda
perl-encode-locale        1.05                    pl526_6    bioconda
perl-exporter             5.72                    pl526_1    bioconda
perl-extutils-makemaker   7.34                    pl526_3    bioconda
perl-file-path            2.15                    pl526_0    bioconda
perl-file-temp            0.2304                  pl526_2    bioconda
perl-lwp-simple           6.15            pl526h470a237_4    bioconda
perl-parent               0.236                   pl526_1    bioconda
perl-threaded             5.22.0                       13    bioconda
perl-xml-namespacesupport 1.12                    pl526_0    bioconda
perl-xml-parser           2.44            pl526h3a4f0e9_6    bioconda
perl-xml-sax              1.00                    pl526_0    bioconda
perl-xml-sax-base         1.09                    pl526_0    bioconda
perl-xml-sax-expat        0.51                    pl526_2    bioconda
perl-xml-simple           2.25                    pl526_0    bioconda
perl-yaml                 1.27                    pl526_0    bioconda
pigz                      2.3.4                         0    conda-forge
pillow                    5.4.1           py27h00a061d_1000    conda-forge
pip                       19.0.3                   py27_0    conda-forge
pixman                    0.34.0            h14c3975_1003    conda-forge
pplacer                   1.1.alpha17                   0    bioconda
prodigal                  2.6.3                         1    bioconda
prokka                    1.13                          0    bioconda
pthread-stubs             0.4               h14c3975_1001    conda-forge
pycairo                   1.16.3                   py27_0    conda-forge
pycparser                 2.19                       py_0    conda-forge
pygments                  2.3.1                      py_0    conda-forge
pyopenssl                 19.0.0                   py27_0    conda-forge
pyparsing                 2.3.1                      py_0    conda-forge
pyqt                      5.6.0           py27h13b7fb3_1008    conda-forge
pysam                     0.13.0          py27_htslib1.6_0    bioconda
pysocks                   1.6.8                 py27_1002    conda-forge
python                    2.7.15               h9bab390_6
python-dateutil           2.8.0                      py_0    conda-forge
pytz                      2018.9                     py_0    conda-forge
qt                        5.6.3                h8bf5577_3
quast                     4.1                      py27_0    bioconda
r-assertthat              0.2.0           mro343h889e2dd_0
r-boot                    1.3_20                 mro343_0
r-checkpoint              0.4.3                  mro343_0
r-class                   7.3_14                 mro343_0
r-cli                     1.0.0           mro343h889e2dd_0
r-cluster                 2.0.6                  mro343_0
r-codetools               0.2_15                 mro343_0
r-colorspace              1.3_2           mro343h086d26f_0
r-crayon                  1.3.4           mro343h889e2dd_0
r-curl                    3.1                    mro343_0
r-deployrrserve           9.0.0                  mro343_0
r-dichromat               2.0_0           mro343h889e2dd_0
r-digest                  0.6.13          mro343h086d26f_0
r-doparallel              1.0.12                 mro343_0
r-foreach                 1.4.5                  mro343_0
r-foreign                 0.8_69                 mro343_0
r-ggplot2                 2.2.1           mro343h889e2dd_0
r-glue                    1.2.0           mro343h086d26f_0
r-gtable                  0.2.0           mro343h889e2dd_0
r-iterators               1.0.9                  mro343_0
r-jsonlite                1.5                    mro343_0
r-kernsmooth              2.23_15                mro343_0
r-labeling                0.3             mro343h889e2dd_0
r-lattice                 0.20_35                mro343_0
r-lazyeval                0.2.1           mro343h086d26f_0
r-magrittr                1.5             mro343h889e2dd_0
r-mass                    7.3_47                 mro343_0
r-matrix                  1.2_12                 mro343_0
r-mgcv                    1.8_22                 mro343_0
r-microsoftr              3.4.3.0097             mro343_0
r-munsell                 0.4.3           mro343h889e2dd_0
r-nlme                    3.1_131                mro343_0
r-nnet                    7.3_12                 mro343_0
r-pillar                  1.0.1           mro343h889e2dd_0
r-plyr                    1.8.4           mro343h599a50d_0
r-png                     0.1_7                  mro343_0
r-r6                      2.2.2                  mro343_0
r-rcolorbrewer            1.1_2           mro343h889e2dd_0
r-rcpp                    0.12.14         mro343h599a50d_0
r-recommended             3.4.3                  mro343_0
r-reshape2                1.4.3           mro343h599a50d_0
r-revoioq                 8.0.9                  mro343_0
r-revomods                11.0.0                 mro343_0
r-revoutilsmath           10.0.1                 mro343_0
r-rlang                   0.1.6           mro343h086d26f_0
r-rpart                   4.1_11                 mro343_0
r-runit                   0.4.26                 mro343_0
r-scales                  0.5.0           mro343h599a50d_0
r-spatial                 7.3_11                 mro343_0
r-stringi                 1.1.6           mro343h599a50d_0
r-stringr                 1.2.0           mro343h889e2dd_0
r-survival                2.41_3                 mro343_0
r-tibble                  1.4.1           mro343h086d26f_0
r-utf8                    1.1.2           mro343h086d26f_0
r-viridislite             0.2.0           mro343h889e2dd_0
readline                  7.0               hf8c457e_1001    conda-forge
reportlab                 3.4.0                    py27_0
requests                  2.21.0                py27_1000    conda-forge
salmon                    0.9.1                         1    bioconda
samtools                  1.6                  h02bfda8_2    bioconda
scikit-learn              0.16.1              np110py27_0
scipy                     0.17.1              np110py27_1
seaborn                   0.8.1                      py_1    conda-forge
setuptools                40.8.0                   py27_0    conda-forge
singledispatch            3.4.0.3               py27_1000    conda-forge
sip                       4.18                     py27_1    conda-forge
six                       1.12.0                py27_1000    conda-forge
snowballstemmer           1.2.1                      py_1    conda-forge
spades                    3.13.0                        0    bioconda
sphinx                    1.8.4                    py27_0    conda-forge
sphinx_rtd_theme          0.4.3                      py_0    conda-forge
sphinxcontrib-websupport  1.1.0                      py_1    conda-forge
sqlite                    3.26.0               h7b6447c_0
srprism                   2.4.24                        2    bioconda
statsmodels               0.9.0           py27h3010b51_1000    conda-forge
subprocess32              3.2.7                    py27_0    conda-forge
system                    5.8                           2
taxator-tk                1.3.3e                        0    ursky
tbb                       2019.3            h6bb024c_1000    conda-forge
tbl2asn                   25.6                          3    bioconda
tk                        8.6.9             h84994c4_1000    conda-forge
tktable                   2.10                 h14c3975_0
tornado                   5.1.1           py27h14c3975_1000    conda-forge
trim-galore               0.4.5                         2    bioconda
typing                    3.5.2.2                  py27_0    bioconda
urllib3                   1.24.1                py27_1000    conda-forge
wheel                     0.33.1                   py27_0    conda-forge
xopen                     0.5.0                      py_0    bioconda
xorg-kbproto              1.0.7             h14c3975_1002    conda-forge
xorg-libice               1.0.9             h14c3975_1004    conda-forge
xorg-libsm                1.2.2                h470a237_5    conda-forge
xorg-libx11               1.6.7             h14c3975_1000    conda-forge
xorg-libxau               1.0.9                h14c3975_0    conda-forge
xorg-libxdmcp             1.1.2             h14c3975_1007    conda-forge
xorg-libxext              1.3.3             h14c3975_1004    conda-forge
xorg-libxrender           0.9.10            h14c3975_1002    conda-forge
xorg-renderproto          0.11.1            h14c3975_1002    conda-forge
xorg-xextproto            7.3.0             h14c3975_1002    conda-forge
xorg-xproto               7.0.31            h14c3975_1007    conda-forge
xz                        5.2.4             h14c3975_1001    conda-forge
zlib                      1.2.11            h14c3975_1004    conda-forge

alneberg commented 5 years ago

Ah, that's bad. I think it's explained by the version of scikit-learn, which should be >= 0.18.0 to have the random_state parameter. I will have to update this requirement in the bioconda recipe.
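A wrapper could catch this kind of mismatch before launching a long run by comparing the installed version against a minimum. The sketch below assumes 0.18.0 is the first scikit-learn release whose PCA accepts `random_state`; the helper names are hypothetical, and the parsing only handles simple dotted version strings.

```python
from itertools import takewhile


def version_tuple(version):
    """Turn a version string such as '0.16.1' into a 3-tuple of ints."""
    parts = []
    for token in version.split(".")[:3]:
        # Keep only the leading digits of each component.
        digits = "".join(takewhile(str.isdigit, token))
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as '0.18' so tuples compare as expected.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def supports_pca_random_state(sklearn_version, minimum="0.18.0"):
    """True if the given scikit-learn version meets the assumed minimum."""
    return version_tuple(sklearn_version) >= version_tuple(minimum)
```

In a live environment the argument would be `sklearn.__version__`. For the environment listed above, which has scikit-learn 0.16.1, the check returns False.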

ursky commented 5 years ago

Ok, that makes sense. Let me know when you update the recipe so I can test it.

alneberg commented 5 years ago

The recipe was updated just now: https://github.com/bioconda/bioconda-recipes/pull/13820 Might have been built already, otherwise anytime soon.

ursky commented 5 years ago

Thanks! I just tested and got a new error:

Intel MKL FATAL ERROR: Cannot load libmkl_avx2.so or libmkl_def.so.

Looks like you need to enforce some more dependencies: https://stackoverflow.com/questions/36659453/intel-mkl-fatal-error-cannot-load-libmkl-avx2-so-or-libmkl-def-so

ursky commented 5 years ago
# packages in environment at /home/guritsk1/miniconda2/envs/metawrap-bare:
#
# Name                    Version                   Build  Channel
_r-mutex                  1.0.0               anacondar_1
alabaster                 0.7.12                     py_0    conda-forge
anaconda                  custom           py27h4a00acb_0
aragorn                   1.2.38               h470a237_2    bioconda
asn1crypto                0.24.0                py27_1003    conda-forge
babel                     2.6.0                      py_1    conda-forge
backports                 1.0                        py_2    conda-forge
backports.functools_lru_cache 1.5                        py_1    conda-forge
backports_abc             0.5                        py_1    conda-forge
barrnap                   0.9                           2    bioconda
bcftools                  1.6                           0    bioconda
beautifulsoup4            4.7.1                 py27_1001    conda-forge
bedtools                  2.27.1               he860b03_3    bioconda
binutils_impl_linux-64    2.28.1               had2808c_3
binutils_linux-64         7.2.0               had2808c_27
biopython                 1.68                     py27_0    bioconda
blas                      1.0                         mkl
blast                     2.6.0               boost1.64_2    bioconda
bmfilter                  3.101                hfc679d8_2    bioconda
bmtagger                  3.101                h470a237_4    bioconda
bmtool                    3.101                hfc679d8_2    bioconda
boost                     1.64.0                   py27_4    conda-forge
boost-cpp                 1.64.0                        1    conda-forge
bowtie2                   2.3.4.3          py27h2d50403_0    bioconda
brewer2mpl                1.4.1                      py_3    conda-forge
bwa                       0.7.15                        1    bioconda
bwidget                   1.9.11                        1
bz2file                   0.98                       py_0    conda-forge
bzip2                     1.0.6             h14c3975_1002    conda-forge
ca-certificates           2018.11.29           ha4d7672_0    conda-forge
cairo                     1.14.12           h80bd089_1005    conda-forge
certifi                   2018.11.29            py27_1000    conda-forge
cffi                      1.12.1           py27h9745a5d_0    conda-forge
chardet                   3.0.4                 py27_1003    conda-forge
checkm-genome             1.0.13                   py27_0    bioconda
concoct                   1.0.0            py27h7724fef_4    bioconda
conda                     4.6.7                    py27_0    conda-forge
conda-build               3.17.8                   py27_0    conda-forge
contextlib2               0.5.5                      py_2    conda-forge
cryptography              2.5              py27h1ba5d50_0
curl                      7.62.0               hbc83047_0
cutadapt                  1.18             py27h14c3975_1    bioconda
cycler                    0.10.0                     py_1    conda-forge
cython                    0.29.5           py27hf484d3e_0    conda-forge
dbus                      1.13.2               h714fa37_1
dendropy                  4.4.0                      py_0    bioconda
docutils                  0.14                  py27_1001    conda-forge
enum34                    1.1.6                 py27_1001    conda-forge
expat                     2.2.5             hf484d3e_1002    conda-forge
extract_fullseq           3.101                         3    bioconda
fastqc                    0.11.5                        1    bioconda
filelock                  3.0.10                     py_0    conda-forge
fontconfig                2.13.1            h2176d3f_1000    conda-forge
fraggenescan              1.31                 h470a237_0    bioconda
freetype                  2.9.1             h94bbf69_1005    conda-forge
functools32               3.2.3.2                    py_3    conda-forge
futures                   3.2.0                 py27_1000    conda-forge
gcc_impl_linux-64         7.2.0                habb00fd_3
gcc_linux-64              7.2.0               h550dcbe_27
gfortran_impl_linux-64    7.2.0                hdf63c60_3
gfortran_linux-64         7.2.0               h550dcbe_27
glib                      2.56.2               hd408876_0
glob2                     0.6                        py_0    conda-forge
gmp                       6.1.2             hf484d3e_1000    conda-forge
gnutls                    3.5.19               h2a4e5f8_1    conda-forge
graphite2                 1.3.13            hf484d3e_1000    conda-forge
gsl                       2.4                  h14c3975_4
gst-plugins-base          1.14.0               hbbd80ab_1
gstreamer                 1.14.0               hb453b48_1
gxx_impl_linux-64         7.2.0                hdf63c60_3
gxx_linux-64              7.2.0               h550dcbe_27
harfbuzz                  1.9.0             he243708_1001    conda-forge
hmmer                     3.1b2                         3    bioconda
htslib                    1.6                           0    bioconda
icu                       58.2              hf484d3e_1000    conda-forge
idba                      1.1.3                         1    bioconda
idna                      2.8                   py27_1000    conda-forge
imagesize                 1.1.0                      py_0    conda-forge
infernal                  1.1.2                h14c3975_2    bioconda
ipaddress                 1.0.22                     py_1    conda-forge
java-jdk                  8.0.92                        1    bioconda
jellyfish                 1.1.12               h2d50403_0    bioconda
jemalloc                  4.5.0                         0    bioconda
jinja2                    2.10                       py_1    conda-forge
jpeg                      9c                h14c3975_1001    conda-forge
kiwisolver                1.0.1           py27h6bb024c_1002    conda-forge
kraken                    1.1                  h470a237_2    bioconda
krb5                      1.14.6                        0    conda-forge
krona                     2.7                     pl526_2    bioconda
libarchive                3.3.3                h5d8350f_2
libcurl                   7.62.0               h20c2e04_0
libffi                    3.2.1             hf484d3e_1005    conda-forge
libgcc                    7.2.0                h69d50b8_2    conda-forge
libgcc-ng                 7.3.0                hdf63c60_0    conda-forge
libgfortran               3.0.0                         1    conda-forge
libgfortran-ng            7.2.0                hdf63c60_3    conda-forge
libiconv                  1.14                          4    conda-forge
libidn11                  1.33                          0    conda-forge
liblief                   0.9.0                h7725739_2
libopenblas               0.2.20               h9ac9557_7
libpng                    1.6.36            h84994c4_1000    conda-forge
libssh2                   1.8.0                         1    conda-forge
libstdcxx-ng              7.3.0                hdf63c60_0    conda-forge
libtiff                   4.0.10            h648cc4a_1001    conda-forge
libuuid                   2.32.1            h14c3975_1000    conda-forge
libxcb                    1.13              h14c3975_1002    conda-forge
libxml2                   2.9.9                he19cac6_0
llvm-meta                 7.0.0                         0    conda-forge
lz4-c                     1.8.3             hf484d3e_1001    conda-forge
lzo                       2.10              h14c3975_1000    conda-forge
markupsafe                1.1.0           py27h14c3975_1000    conda-forge
matplotlib                2.2.3            py27h8a2030e_1    conda-forge
matplotlib-base           2.2.3            py27h60b886d_1    conda-forge
maxbin2                   2.2.5                         0    ursky
megahit                   1.1.3                    py27_0    bioconda
metabat2                  2.12.1                        0    ursky
minced                    0.3.2                         0    bioconda
mkl                       11.3.3                        0
mmtf-python               1.0.2                    py27_0    bioconda
msgpack-python            0.6.1            py27h6bb024c_0    conda-forge
ncurses                   6.1               hf484d3e_1002    conda-forge
nettle                    3.3                           0    conda-forge
nose                      1.3.7                 py27_1002    conda-forge
numpy                     1.11.3           py27h3dfced4_4
olefile                   0.46                       py_0    conda-forge
openblas                  0.3.5             h9ac9557_1000    conda-forge
openjdk                   11.0.1            h14c3975_1014    conda-forge
openmp                    7.0.0                h2d50403_0    conda-forge
openssl                   1.1.1a            h14c3975_1000    conda-forge
packaging                 19.0                       py_0    conda-forge
pandas                    0.23.4          py27h637b7d7_1000    conda-forge
pango                     1.40.14           hf0c64fd_1003    conda-forge
parallel                  20160622                      1    bioconda
patchelf                  0.9               hf484d3e_1002    conda-forge
patsy                     0.5.1                      py_0    conda-forge
pcre                      8.42                 h439df22_0
perl                      5.26.2            h14c3975_1002    conda-forge
perl-app-cpanminus        1.7044                  pl526_1    bioconda
perl-bioperl              1.6.924                       4    bioconda
perl-carp                 1.38                    pl526_1    bioconda
perl-constant             1.33                    pl526_1    bioconda
perl-encode               2.88                    pl526_1    bioconda
perl-encode-locale        1.05                    pl526_6    bioconda
perl-exporter             5.72                    pl526_1    bioconda
perl-extutils-makemaker   7.34                    pl526_3    bioconda
perl-file-path            2.15                    pl526_0    bioconda
perl-file-temp            0.2304                  pl526_2    bioconda
perl-lwp-simple           6.15            pl526h470a237_4    bioconda
perl-parent               0.236                   pl526_1    bioconda
perl-threaded             5.22.0                       13    bioconda
perl-xml-namespacesupport 1.12                    pl526_0    bioconda
perl-xml-parser           2.44            pl526h3a4f0e9_6    bioconda
perl-xml-sax              1.00                    pl526_0    bioconda
perl-xml-sax-base         1.09                    pl526_0    bioconda
perl-xml-sax-expat        0.51                    pl526_2    bioconda
perl-xml-simple           2.25                    pl526_0    bioconda
perl-yaml                 1.27                    pl526_0    bioconda
pigz                      2.3.4                         0    conda-forge
pillow                    5.4.1           py27h00a061d_1000    conda-forge
pip                       19.0.3                   py27_0    conda-forge
pixman                    0.34.0            h14c3975_1003    conda-forge
pkginfo                   1.5.0.1                    py_0    conda-forge
pplacer                   1.1.alpha17                   0    bioconda
prodigal                  2.6.3                         1    bioconda
prokka                    1.13                          0    bioconda
psutil                    5.5.1            py27h14c3975_0    conda-forge
pthread-stubs             0.4               h14c3975_1001    conda-forge
py-lief                   0.9.0            py27h7725739_2
pycairo                   1.16.3                   py27_0    conda-forge
pycosat                   0.6.3           py27h14c3975_1001    conda-forge
pycparser                 2.19                       py_0    conda-forge
pygments                  2.3.1                      py_0    conda-forge
pyopenssl                 19.0.0                   py27_0    conda-forge
pyparsing                 2.3.1                      py_0    conda-forge
pyqt                      5.6.0           py27h13b7fb3_1008    conda-forge
pysam                     0.13.0          py27_htslib1.6_0    bioconda
pysocks                   1.6.8                 py27_1002    conda-forge
python                    2.7.15               h9bab390_6
python-dateutil           2.8.0                      py_0    conda-forge
python-libarchive-c       2.8                   py27_1004    conda-forge
pytz                      2018.9                     py_0    conda-forge
pyyaml                    3.13            py27h14c3975_1001    conda-forge
qt                        5.6.3                h8bf5577_3
quast                     4.1                      py27_0    bioconda
r-assertthat              0.2.0            r343h889e2dd_0
r-base                    3.4.3                h9bb98a2_5
r-boot                    1.3_20           r343h889e2dd_0
r-class                   7.3_14           r343h086d26f_4
r-cli                     1.0.0            r343h6115d3f_1
r-cluster                 2.0.6            r343h4829c52_0
r-codetools               0.2_15           r343h889e2dd_0
r-colorspace              1.3_2            r343h086d26f_0
r-crayon                  1.3.4            r343h889e2dd_0
r-dichromat               2.0_0            r343h889e2dd_4
r-digest                  0.6.13           r343h086d26f_0
r-foreign                 0.8_69           r343h086d26f_0
r-ggplot2                 2.2.1            r343h889e2dd_0
r-glue                    1.2.0            r343h086d26f_0
r-gtable                  0.2.0            r343h889e2dd_0
r-kernsmooth              2.23_15          r343h4829c52_4
r-labeling                0.3              r343h889e2dd_4
r-lattice                 0.20_35          r343h086d26f_0
r-lazyeval                0.2.1            r343h086d26f_0
r-magrittr                1.5              r343h889e2dd_4
r-mass                    7.3_48           r343h086d26f_0
r-matrix                  1.2_12           r343h086d26f_0
r-mgcv                    1.8_22           r343h086d26f_0
r-munsell                 0.4.3            r343h889e2dd_0
r-nlme                    3.1_131          r343h4829c52_0
r-nnet                    7.3_12           r343h086d26f_0
r-pillar                  1.0.1            r343h889e2dd_0
r-plyr                    1.8.4            r343h599a50d_0
r-r6                      2.2.2            r343h889e2dd_0
r-rcolorbrewer            1.1_2            r343h889e2dd_0
r-rcpp                    0.12.14          r343h599a50d_0
r-recommended             3.4.3                    r343_0
r-reshape2                1.4.3            r343h599a50d_0
r-rlang                   0.1.6            r343h086d26f_0
r-rpart                   4.1_11           r343h086d26f_0
r-scales                  0.5.0            r343h599a50d_0
r-spatial                 7.3_11           r343h086d26f_4
r-stringi                 1.1.6            r343h599a50d_0
r-stringr                 1.2.0            r343h889e2dd_0
r-survival                2.41_3           r343h086d26f_0
r-tibble                  1.4.1            r343h086d26f_0
r-utf8                    1.1.2            r343h086d26f_0
r-viridislite             0.2.0            r343h889e2dd_0
readline                  7.0               hf8c457e_1001    conda-forge
reportlab                 3.4.0                    py27_0
requests                  2.21.0                py27_1000    conda-forge
ruamel_yaml               0.15.71         py27h14c3975_1000    conda-forge
salmon                    0.10.1                        1    bioconda
samtools                  1.6                  h02bfda8_2    bioconda
scandir                   1.9.0           py27h14c3975_1000    conda-forge
scikit-learn              0.18.1              np111py27_0
scipy                     0.18.1              np111py27_0
seaborn                   0.8.1                      py_1    conda-forge
setuptools                40.8.0                   py27_0    conda-forge
singledispatch            3.4.0.3               py27_1000    conda-forge
sip                       4.18                     py27_1    conda-forge
six                       1.12.0                py27_1000    conda-forge
snowballstemmer           1.2.1                      py_1    conda-forge
soupsieve                 1.8                      py27_0    conda-forge
spades                    3.13.0                        0    bioconda
sphinx                    1.8.4                    py27_0    conda-forge
sphinx_rtd_theme          0.4.3                      py_0    conda-forge
sphinxcontrib-websupport  1.1.0                      py_1    conda-forge
sqlite                    3.26.0            h67949de_1000    conda-forge
srprism                   2.4.24                        2    bioconda
statsmodels               0.9.0           py27h3010b51_1000    conda-forge
subprocess32              3.2.7                    py27_0    conda-forge
taxator-tk                1.3.3e                        0    ursky
tbb                       2019.3            h6bb024c_1000    conda-forge
tbl2asn                   25.6                          3    bioconda
tk                        8.6.9             h84994c4_1000    conda-forge
tktable                   2.10                 h14c3975_0
tornado                   5.1.1           py27h14c3975_1000    conda-forge
tqdm                      4.31.1                     py_0    conda-forge
trim-galore               0.4.5                         2    bioconda
typing                    3.5.2.2                  py27_0    bioconda
urllib3                   1.24.1                py27_1000    conda-forge
wheel                     0.33.1                   py27_0    conda-forge
xopen                     0.5.0                      py_0    bioconda
xorg-kbproto              1.0.7             h14c3975_1002    conda-forge
xorg-libice               1.0.9             h14c3975_1004    conda-forge
xorg-libsm                1.2.3             h4937e3b_1000    conda-forge
xorg-libx11               1.6.7             h14c3975_1000    conda-forge
xorg-libxau               1.0.9                h14c3975_0    conda-forge
xorg-libxdmcp             1.1.2             h14c3975_1007    conda-forge
xorg-libxext              1.3.3             h14c3975_1004    conda-forge
xorg-libxrender           0.9.10            h14c3975_1002    conda-forge
xorg-renderproto          0.11.1            h14c3975_1002    conda-forge
xorg-xextproto            7.3.0             h14c3975_1002    conda-forge
xorg-xproto               7.0.31            h14c3975_1007    conda-forge
xz                        5.2.4             h14c3975_1001    conda-forge
yaml                      0.1.7             h14c3975_1001    conda-forge
zlib                      1.2.11            h14c3975_1004    conda-forge
zstd                      1.3.3                         1    conda-forge
ursky commented 5 years ago

I just installed nomkl-3.0-0 and it works perfectly. I will update the metawrap recipe to enforce that, but it's probably better for concoct to do that natively.

The following NEW packages will be INSTALLED:

nomkl pkgs/free/linux-64::nomkl-3.0-0

The following packages will be UPDATED:

  blas                              pkgs/main::blas-1.0-mkl --> conda-forge::blas-1.1-openblas
  numpy              pkgs/main::numpy-1.11.3-py27h3dfced4_4 --> conda-forge::numpy-1.11.3-py27_blas_openblas_201
  scikit-learn       pkgs/free::scikit-learn-0.18.1-np111p~ --> pkgs/main::scikit-learn-0.19.1-py27_nomklh6479e79_0
  scipy                 pkgs/free::scipy-0.18.1-np111py27_0 --> pkgs/main::scipy-1.1.0-py27_nomklh9d22d0a_0
alneberg commented 5 years ago

Hmm, I guess I could. But this time it works in a clean conda installation (at least for me), so nomkl is not necessary to get concoct to run. It seems to run with both mkl and nomkl, so I would prefer not to lock down the dependencies too much.

I'm already quite impressed by how many different (and potentially conflicting) packages you've managed to bundle in one conda environment. I'm struggling to put together just one single package. But yes, great that it's working again!

ursky commented 5 years ago

Understandable. I updated the recipe on my end and it seems to work flawlessly. And yes, it's a pain, hence the hundreds of little unique issues I am constantly bombarded with...

By the way, feel free to advertise metawrap binning --concoct as a simple way to run the entire CONCOCT pipeline in one command: assembly indexing, alignment, contig cutting, quantitation, binning, merging, and outputting bin fasta files. Compared to CONCOCT, other binners are usually more user-friendly one-liners, so this could make running CONCOCT simpler for newer users.

semarpetrus commented 5 years ago

Hi,

This might not be directly related to the concoct issues discussed here, but I was not sure whether I should start another thread for it.

In the current code, the inputs for the concoct command are "${home}/${out}/work_files/concoct_depth.txt" and "${home}/${out}/work_files/assembly_10K.fa". This is problematic when passing a full path as the output to metawrap: the input paths become "/home/path/to/working/directory//home/path/to/output/work_files/assembly_10K.fa". I saw a note in the code saying that concoct used files in the home/working directory, but it seems that this may have changed in the recent release with the addition of these input parameters?

Could these be changed to "${out}/work_files/..." ?
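To illustrate the failure mode, here is a minimal sketch in Python (the metaWRAP script itself is bash, so this is not the actual fix). The function names and example paths are hypothetical. It contrasts plain string concatenation with `posixpath.join`, which drops the working directory when the output path is already absolute.

```python
import posixpath


def naive_concat(working_dir, out):
    """Mimic "${home}/${out}": always prefix the working directory."""
    return working_dir + "/" + out


def resolve_output_dir(working_dir, out):
    """Return a usable output directory for relative or absolute `out`.

    posixpath.join discards `working_dir` when `out` starts with '/'.
    """
    return posixpath.normpath(posixpath.join(working_dir, out))


if __name__ == "__main__":
    home = "/home/user/project"          # hypothetical working directory
    for out in ("results", "/data/results"):
        print(naive_concat(home, out), "vs", resolve_output_dir(home, out))
```

With an absolute output path, the naive version produces a doubled path like the one quoted above, while the joined version leaves the absolute path untouched.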

Thank you

ursky commented 5 years ago

Thanks for pointing that out. I'll fix the outdated variable handling before metawrap v1.1.7.

franciscozorrilla commented 5 years ago

Hello!

I have been playing around with the metaWRAP bin refinement module and have obtained some interesting results. Note that I implemented the binning tools myself and only make use of the refinement module of metaWRAP. For a typical sample I get the following:

[Image: binning_results]

[Image: binning_results]

I understand that I have used quite lax completeness and contamination parameters (40 and 25 respectively), and I plan to filter out bins at a later time. It looks like if I use a cutoff of 80% completion for filtering, I am better off using the original CONCOCT bins, although it does appear that metaWRAP reduces some contamination.

Do you think that my bin refinement parameters are too lax for the refinement to work properly? I tested with slightly more stringent parameters on another sample, but the results are even less promising:

[Image: binning_results]

Interested in hearing both of your opinions @alneberg @ursky

Best wishes, FZ

ursky commented 5 years ago

It is entirely possible that refinement may not improve results in every scenario. In your example, if metabat2 already has all the bins that concoct/maxbin2 produce, then no new information is being added to metabat2, so the refinement module does not improve the result. It's not common, but seeing how few bins you have makes me think this is a simpler binning challenge, so this would make sense. In general, metaWRAP will strongly prioritize low contamination, so you can see that it did so at some expense of completion.

One thing I can say for sure is to run the refinement module with the completion and contamination parameters that you actually want to use later, since the parameters define which bin qualities to prioritize. No such thing as "I'll filter them out later"; that's not how the refinement module is meant to be used. If you intend to only trust bins with 80% comp and 10% cont, then run with -c 80 -x 10. Also, running with min completion below 50% is not supported, since it can produce some funky results because some assumptions in the refinement algorithm may be violated.
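As a rough sketch of that advice, the helper below assembles a refinement command line with the final thresholds baked in and refuses a minimum completion below 50. Only -c and -x come from this thread. The `bin_refinement` subcommand name, the -o flag, and the -A/-B/-C flags for the input bin sets are written from memory of the metaWRAP usage text and should be treated as assumptions. The function name and directory names are hypothetical, and nothing is executed.

```python
import shlex


def refinement_command(out_dir, bin_dirs, min_completion=80, max_contamination=10):
    """Build an argv list for a metaWRAP refinement run (not executed here)."""
    if min_completion < 50:
        # Mirrors the note above: below 50% completion is not supported.
        raise ValueError("minimum completion below 50% is not supported")
    if not 1 <= len(bin_dirs) <= 3:
        raise ValueError("expected between one and three bin directories")
    argv = ["metawrap", "bin_refinement", "-o", out_dir]
    # Assumed flags for the input bin sets, one per binner.
    for flag, path in zip(("-A", "-B", "-C"), bin_dirs):
        argv += [flag, path]
    argv += ["-c", str(min_completion), "-x", str(max_contamination)]
    return argv


if __name__ == "__main__":
    cmd = refinement_command(
        "BIN_REFINEMENT",
        ["metabat2_bins", "maxbin2_bins", "concoct_bins"],  # hypothetical paths
    )
    print(shlex.join(cmd))
```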

franciscozorrilla commented 5 years ago

Thanks for the quick response! I will re-run with more stringent parameters as you suggest. The low number of bins in my samples is because I assemble and bin each sample individually, as I am dealing with gut microbiome data. The number of bins is actually quite normal/high (at least in the first sample's plot) when compared to the Segata or Finn publications, which average between ~8-15 bins per metagenome.

I look forward to also trying out the bin reassembly module next which looks very promising!

Best wishes, FZ