Closed nick-youngblut closed 3 years ago
I'm getting a different error when running the job locally:
***************************************************
..:: dRep dereplicate Step 1. Filter ::..
***************************************************
Will filter the genome list
Loading genomes from a list
Calculating genome info of genomes
100.00% of genomes passed length filtering
***************************************************
..:: dRep dereplicate Step 2. Cluster ::..
***************************************************
Running primary clustering
Running pair-wise MASH clustering
Will split genomes into 7 groups for primary clustering
Traceback (most recent call last):
File "/ebio/abt3_projects/Georg_animal_feces/bin/llg/.snakemake/conda/bb52dd38/bin/dRep", line 32, in <module>
Controller().parseArguments(args)
File "/ebio/abt3_projects/Georg_animal_feces/bin/llg/.snakemake/conda/bb52dd38/lib/python3.8/site-packages/drep/controller.py", line 100, in parseArguments
self.dereplicate_operation(**vars(args))
File "/ebio/abt3_projects/Georg_animal_feces/bin/llg/.snakemake/conda/bb52dd38/lib/python3.8/site-packages/drep/controller.py", line 48, in dereplicate_operation
drep.d_workflows.dereplicate_wrapper(kwargs['work_directory'],**kwargs)
File "/ebio/abt3_projects/Georg_animal_feces/bin/llg/.snakemake/conda/bb52dd38/lib/python3.8/site-packages/drep/d_workflows.py", line 37, in dereplicate_wrapper
drep.d_cluster.controller.d_cluster_wrapper(wd, **kwargs)
File "/ebio/abt3_projects/Georg_animal_feces/bin/llg/.snakemake/conda/bb52dd38/lib/python3.8/site-packages/drep/d_cluster/controller.py", line 179, in d_cluster_wrapper
GenomeClusterController(workDirectory, **kwargs).main()
File "/ebio/abt3_projects/Georg_animal_feces/bin/llg/.snakemake/conda/bb52dd38/lib/python3.8/site-packages/drep/d_cluster/controller.py", line 32, in main
self.run_primary_clustering()
File "/ebio/abt3_projects/Georg_animal_feces/bin/llg/.snakemake/conda/bb52dd38/lib/python3.8/site-packages/drep/d_cluster/controller.py", line 100, in run_primary_clustering
Mdb, Cdb, cluster_ret = drep.d_cluster.compare_utils.all_vs_all_MASH(self.Bdb, self.wd.get_dir('MASH'), **self.kwargs)
File "/ebio/abt3_projects/Georg_animal_feces/bin/llg/.snakemake/conda/bb52dd38/lib/python3.8/site-packages/drep/d_cluster/compare_utils.py", line 110, in all_vs_all_MASH
genome_chunks = run_mash_on_genome_chunks(genome_chunks, mash_exe, sketch_folder, MASH_folder, logdir, **kwargs)
File "/ebio/abt3_projects/Georg_animal_feces/bin/llg/.snakemake/conda/bb52dd38/lib/python3.8/site-packages/drep/d_cluster/compare_utils.py", line 180, in run_mash_on_genome_chunks
drep.thread_cmds(cmds, logdir=logdir, t=int(p))
File "/ebio/abt3_projects/Georg_animal_feces/bin/llg/.snakemake/conda/bb52dd38/lib/python3.8/site-packages/drep/__init__.py", line 56, in thread_cmds
pool.map(thread_cmd_wrapper, tups)
File "/ebio/abt3_projects/Georg_animal_feces/bin/llg/.snakemake/conda/bb52dd38/lib/python3.8/multiprocessing/pool.py", line 364, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/ebio/abt3_projects/Georg_animal_feces/bin/llg/.snakemake/conda/bb52dd38/lib/python3.8/multiprocessing/pool.py", line 771, in get
raise self._value
File "/ebio/abt3_projects/Georg_animal_feces/bin/llg/.snakemake/conda/bb52dd38/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/ebio/abt3_projects/Georg_animal_feces/bin/llg/.snakemake/conda/bb52dd38/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/ebio/abt3_projects/Georg_animal_feces/bin/llg/.snakemake/conda/bb52dd38/lib/python3.8/site-packages/drep/__init__.py", line 51, in thread_cmd_wrapper
run_cmd(*tup)
File "/ebio/abt3_projects/Georg_animal_feces/bin/llg/.snakemake/conda/bb52dd38/lib/python3.8/site-packages/drep/__init__.py", line 47, in run_cmd
call(cmd,stdout=sto, stderr=ste)
File "/ebio/abt3_projects/Georg_animal_feces/bin/llg/.snakemake/conda/bb52dd38/lib/python3.8/subprocess.py", line 340, in call
with Popen(*popenargs, **kwargs) as p:
File "/ebio/abt3_projects/Georg_animal_feces/bin/llg/.snakemake/conda/bb52dd38/lib/python3.8/subprocess.py", line 854, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/ebio/abt3_projects/Georg_animal_feces/bin/llg/.snakemake/conda/bb52dd38/lib/python3.8/subprocess.py", line 1702, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 7] Argument list too long: '/ebio/abt3_projects/Georg_animal_feces/bin/llg/.snakemake/conda/bb52dd38/bin/mash'
...which suggests that the "empty distance matrix" error was due to a lack of memory for my cluster job.
It appears that drep is calling mash with the paths to all genomes on one command line, which makes the command too long for my ~32000 genomes.
I guess that I'm stuck using fastANI independently of drep.
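For anyone hitting the same `[Errno 7] Argument list too long`: the operating system raises it when the arguments (plus environment) handed to a new process exceed a size limit. Below is a rough sketch of how one might estimate the size of such a command. The genome paths are made up for illustration, the fallback value is an arbitrary placeholder, and the real limit also depends on the size of the environment and on kernel details, so treat the numbers as an approximation only.

```python
import os


def argv_bytes(args):
    """Approximate bytes needed to pass `args` to a new process.

    Each argument is counted as its encoded length plus one NUL terminator.
    """
    return sum(len(os.fsencode(arg)) + 1 for arg in args)


def system_arg_limit(fallback=2 * 1024 * 1024):
    """Return the system's reported ARG_MAX, or a made-up fallback if unavailable."""
    try:
        return os.sysconf("SC_ARG_MAX")
    except (AttributeError, ValueError, OSError):
        return fallback


# Hypothetical genome paths, only for illustration.
paths = [f"/data/project/genomes/genome_{i:05d}.fna" for i in range(32000)]
needed = argv_bytes(["mash", "sketch"] + paths)
print(f"command needs ~{needed} bytes; reported limit is {system_arg_limit()} bytes")
```

Longer directory prefixes increase the total proportionally, so the same number of genomes can fit on one system and fail on another.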
Hi Nick,
For 32,000 genomes you'll definitely need to add the argument --multiround_primary_clustering. This requires dRep v3 if you don't already have that installed. More info on it is here: https://drep.readthedocs.io/en/latest/choosing_parameters.html#using-greedy-algorithms
The first error may have been caused by running out of RAM, or it could be a problem with mash failing silently. If you try this again, add the -d argument so that we can troubleshoot if it crashes again.
The second error (the local one) has to do with the command length limit for your bash setup. I'm not sure how to actually change the argument length limit for your bash profile, but lowering the deep argument --primary_chunksize could fix this problem if you hit it again. I've never had this problem with the default 5000, but lowering it to 3000 or so shouldn't result in any noticeable dip in performance.
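As a rough guide for picking a chunk size, one could bound it by the longest genome path. This sketch assumes that the paths of one chunk end up on a single mash command line (which the traceback suggests, but which is not confirmed here); `max_chunk_size` and the byte budget are hypothetical names and values, not part of dRep.

```python
import os


def max_chunk_size(paths, budget_bytes, fixed_args=()):
    """Conservative number of paths that fit on one command line.

    Uses the longest path as the per-path cost (plus one NUL terminator each)
    and subtracts the cost of the fixed leading arguments from the budget.
    """
    fixed = sum(len(os.fsencode(arg)) + 1 for arg in fixed_args)
    per_path = max(len(os.fsencode(path)) for path in paths) + 1
    return max((budget_bytes - fixed) // per_path, 0)


# Hypothetical example: paths of 9 characters cost 10 bytes each.
example_paths = [f"/g/{i:06d}" for i in range(100)]
print(max_chunk_size(example_paths, budget_bytes=105, fixed_args=["mash"]))  # prints 10
```

Leaving generous headroom in the budget is sensible, since the environment variables of the calling process count against the same limit.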
-Matt
Thanks for the heads up on --multiround_primary_clustering and --primary_chunksize!
The first error was due to a lack of memory.
The cluster admin won't change the max command length. I'm guessing that you are using shorter file paths than me, which is why the 5000 default is working for you. This is a general problem for software that only allows nargs="+" instead of allowing one input file with a list of paths. Thanks for implementing the latter in dRep!
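To illustrate the difference, here is a minimal argparse sketch of a tool that accepts either paths on the command line or a text file listing them. The parser, the `--genome-list` flag, and `resolve_genomes` are hypothetical and are not how dRep itself implements this.

```python
import argparse


def build_parser():
    """Parser for a hypothetical tool that takes many genome paths."""
    parser = argparse.ArgumentParser(description="Hypothetical genome tool")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("-g", "--genomes", nargs="+",
                       help="genome paths given directly on the command line")
    group.add_argument("--genome-list",
                       help="text file with one genome path per line")
    return parser


def resolve_genomes(args):
    """Return the genome paths, reading them from the list file if one was given."""
    if args.genome_list is not None:
        with open(args.genome_list) as handle:
            return [line.strip() for line in handle if line.strip()]
    return list(args.genomes)
```

With the list file, the command line stays short no matter how many genomes there are or how long their paths are.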
Just one small thing in regards to python cli dev: many people don't know that argparse allows for dashes throughout a param (e.g., --multiround-primary-clustering and --primary-chunksize), which is a bit easier to type. It's definitely personal preference though. I just thought I'd pass on that FYI, given that the developer is usually the one typing those params a ton (e.g., during all of the software testing), so little things like params that are slightly easier to type can make a difference.
Cool, thanks for the heads up! I don't want to change those parameters as they stand, as I don't want to mess with workflows that others have implemented using the current flags, but I'll keep that in mind for the future.
-Matt
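For reference, a small sketch of the argparse behaviour discussed above: dashes in a long option are converted to underscores in the attribute name, and an option can carry several spellings at once, so old flags could keep working. This is a standalone illustration, not dRep's actual parser.

```python
import argparse

parser = argparse.ArgumentParser()
# argparse derives the attribute name from the first long option,
# turning dashes into underscores: args.multiround_primary_clustering
parser.add_argument("--multiround-primary-clustering",
                    "--multiround_primary_clustering",
                    action="store_true")
# Both spellings map to the same attribute: args.primary_chunksize
parser.add_argument("--primary-chunksize", "--primary_chunksize",
                    type=int, default=5000)

new_style = parser.parse_args(["--multiround-primary-clustering",
                               "--primary-chunksize", "3000"])
old_style = parser.parse_args(["--multiround_primary_clustering",
                               "--primary_chunksize", "3000"])
print(new_style.primary_chunksize, old_style.primary_chunksize)  # prints 3000 3000
```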
Hello,
Sorry for commenting on this again. I am having a very similar problem here. I am trying to dereplicate across a series of MAGs and I am getting the same "Empty Distance Matrix Error", please see below.
I have tried both solutions you suggested before, --multiround_primary_clustering and --primary_chunksize, but they do not seem to help.
..:: dRep dereplicate Step 1. Filter ::..
***************************************************
Will filter the genome list
346 genomes were input to dRep
Calculating genome info of genomes
98.27% of genomes passed length filtering
Running prodigal
Running checkM
0.29% of genomes passed checkM filtering
***************************************************
..:: dRep dereplicate Step 2. Cluster ::..
***************************************************
Running primary clustering
Running pair-wise MASH clustering
Traceback (most recent call last):
File "/mnt/home/benucci/anaconda2/envs/drep/bin/dRep", line 32, in <module>
Controller().parseArguments(args)
File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/controller.py", line 100, in parseArguments
self.dereplicate_operation(**vars(args))
File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/controller.py", line 48, in dereplicate_operation
drep.d_workflows.dereplicate_wrapper(kwargs['work_directory'],**kwargs)
File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/d_workflows.py", line 37, in dereplicate_wrapper
drep.d_cluster.controller.d_cluster_wrapper(wd, **kwargs)
File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/d_cluster/controller.py", line 179, in d_cluster_wrapper
GenomeClusterController(workDirectory, **kwargs).main()
File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/d_cluster/controller.py", line 32, in main
self.run_primary_clustering()
File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/d_cluster/controller.py", line 100, in run_primary_clustering
Mdb, Cdb, cluster_ret = drep.d_cluster.compare_utils.all_vs_all_MASH(self.Bdb, self.wd.get_dir('MASH'), **self.kwargs)
File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/d_cluster/compare_utils.py", line 115, in all_vs_all_MASH
Cdb, cluster_ret = cluster_mash_database(Mdb, **kwargs)
File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/d_cluster/compare_utils.py", line 280, in cluster_mash_database
Cdb, linkage = drep.d_cluster.cluster_utils.cluster_hierarchical(linkage_db, linkage_method= P_Lmethod, \
File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/d_cluster/cluster_utils.py", line 114, in cluster_hierarchical
linkage = scipy.cluster.hierarchy.linkage(arr, method= linkage_method)
File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/scipy/cluster/hierarchy.py", line 1068, in linkage
n = int(distance.num_obs_y(y))
File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/scipy/spatial/distance.py", line 2555, in num_obs_y
raise ValueError("The number of observations cannot be determined on "
ValueError: The number of observations cannot be determined on an empty distance matrix.
This is my conda env:
# packages in environment at /mnt/home/benucci/anaconda2/envs/drep:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
biopython 1.78 py39h7f8727e_0 anaconda
blas 1.0 mkl
brotli 1.0.9 h5eee18b_7
brotli-bin 1.0.9 h5eee18b_7
bzip2 1.0.8 h7b6447c_0
c-ares 1.18.1 h7f98852_0 conda-forge
ca-certificates 2022.12.7 ha878542_0 conda-forge
certifi 2022.12.7 pyhd8ed1ab_0 conda-forge
checkm-genome 1.2.2 pyhdfd78af_1 bioconda
cycler 0.11.0 pyhd3eb1b0_0
dbus 1.13.18 hb2f20db_0 anaconda
dendropy 4.5.2 pyh3252c3a_0 bioconda
drep 3.4.0 pyhdfd78af_0 bioconda
expat 2.4.4 h295c915_0 anaconda
fastani 1.33 h0fdf51a_0 bioconda
fftw 3.3.9 h27cfd23_1
fontconfig 2.13.1 h6c09931_0 anaconda
fonttools 4.25.0 pyhd3eb1b0_0
freetype 2.12.1 h4a9f257_0
gettext 0.21.1 h27087fc_0 conda-forge
giflib 5.2.1 h7b6447c_0
glib 2.69.1 h4ff587b_1 anaconda
gsl 2.7 he838d99_0 conda-forge
gst-plugins-base 1.14.0 h8213a91_2 anaconda
gstreamer 1.14.0 h28cd5cc_2 anaconda
hmmer 3.3.2 h87f3376_2 bioconda
icu 58.2 he6710b0_3 anaconda
intel-openmp 2021.4.0 h06a4308_3561
jbig 2.1 h7f98852_2003 conda-forge
joblib 1.1.0 pyhd3eb1b0_0 anaconda
jpeg 9e h7f8727e_0
kiwisolver 1.4.2 py39h295c915_0 anaconda
krb5 1.19.2 hac12032_0 anaconda
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.38 h1181459_1
lerc 3.0 h295c915_0
libblas 3.9.0 12_linux64_mkl conda-forge
libbrotlicommon 1.0.9 h5eee18b_7
libbrotlidec 1.0.9 h5eee18b_7
libbrotlienc 1.0.9 h5eee18b_7
libcblas 3.9.0 12_linux64_mkl conda-forge
libclang 10.0.1 default_hb85057a_2 anaconda
libcurl 7.82.0 h7bff187_0 conda-forge
libdeflate 1.10 h7f98852_0 conda-forge
libedit 3.1.20210910 h7f8727e_0 anaconda
libev 4.33 h516909a_1 conda-forge
libevent 2.1.12 h8f2d780_0 anaconda
libffi 3.3 he6710b0_2
libgcc 7.2.0 h69d50b8_2
libgcc-ng 12.2.0 h65d4601_19 conda-forge
libgfortran-ng 11.2.0 h00389a5_1
libgfortran5 11.2.0 h1234567_1
libgomp 12.2.0 h65d4601_19 conda-forge
libidn2 2.3.4 h166bdaf_0 conda-forge
libllvm10 10.0.1 hbcb73fb_5 anaconda
libnghttp2 1.47.0 h727a467_0 conda-forge
libnsl 2.0.0 h5eee18b_0
libpng 1.6.37 hbc83047_0
libpq 12.9 h16c4e8d_3 anaconda
libssh2 1.10.0 haa6b8db_3 conda-forge
libstdcxx-ng 11.2.0 h1234567_1
libtiff 4.3.0 h542a066_3 conda-forge
libunistring 0.9.10 h7f98852_0 conda-forge
libuuid 1.41.5 h5eee18b_0
libwebp 1.2.4 h11a3e52_0
libwebp-base 1.2.4 h5eee18b_0
libxcb 1.15 h7f8727e_0 anaconda
libxkbcommon 1.0.1 hfa300c1_0 anaconda
libxml2 2.9.14 h74e7548_0 anaconda
libxslt 1.1.35 h4e12654_0 anaconda
libzlib 1.2.13 h166bdaf_4 conda-forge
lz4-c 1.9.3 h295c915_1
mash 1.1 0 bioconda
matplotlib 3.5.1 py39h06a4308_1 anaconda
matplotlib-base 3.5.1 py39ha18d171_1 anaconda
mkl 2021.4.0 h06a4308_640
mkl-service 2.4.0 py39h7f8727e_0 anaconda
mkl_fft 1.3.1 py39hd3c417c_0 anaconda
mkl_random 1.2.2 py39h51133e4_0 anaconda
mummer4 4.0.0rc1 pl5321h87f3376_3 bioconda
munkres 1.0.7 py_1 bioconda
ncurses 6.3 h5eee18b_3
nspr 4.33 h295c915_0 anaconda
nss 3.74 h0370c37_0 anaconda
numpy 1.23.1 py39h6c91a56_0 anaconda
numpy-base 1.23.1 py39ha15fc14_0 anaconda
openssl 1.1.1s h0b41bf4_1 conda-forge
packaging 21.3 pyhd3eb1b0_0
pandas 1.2.3 py39hde0f152_0 conda-forge
pcre 8.45 h295c915_0 anaconda
perl 5.32.1 2_h7f98852_perl5 conda-forge
pillow 9.2.0 py39hace64e9_1 anaconda
pip 22.1.2 py39h06a4308_0 anaconda
ply 3.11 py39h06a4308_0 anaconda
pplacer 1.1.alpha19 h9ee0642_2 bioconda
prodigal 2.6.3 hec16e2b_4 bioconda
pyparsing 3.0.4 pyhd3eb1b0_0 anaconda
pyqt 5.15.7 py39h6a678d5_1 anaconda
pyqt5-sip 12.11.0 py39h6a678d5_1 anaconda
pysam 0.19.0 py39h5030a8b_0 bioconda
python 3.9.12 h12debd9_1 anaconda
python-dateutil 2.8.2 pyhd3eb1b0_0
python_abi 3.9 2_cp39 conda-forge
pytz 2022.1 py39h06a4308_0 anaconda
qt-main 5.15.2 h327a75a_7 anaconda
qt-webengine 5.15.9 hd2b0992_4 anaconda
qtwebkit 5.212 h4eab89a_4 anaconda
readline 8.2 h5eee18b_0
scikit-learn 1.1.1 py39h6a678d5_0 anaconda
scipy 1.9.3 py39h14f4228_0
seaborn 0.11.2 pyhd3eb1b0_0 anaconda
setuptools 59.8.0 py39hf3d152e_1 conda-forge
sip 6.6.2 py39h6a678d5_0 anaconda
six 1.16.0 pyhd3eb1b0_1
sqlite 3.39.3 h5082296_0
threadpoolctl 2.2.0 pyh0d69192_0
tk 8.6.12 h1ccaba5_0
toml 0.10.2 pyhd3eb1b0_0 anaconda
tornado 6.1 py39h27cfd23_0 anaconda
tzdata 2022f h04d1e81_0
wget 1.20.3 ha56f1ee_1 conda-forge
wheel 0.37.1 pyhd3eb1b0_0
xz 5.2.6 h5eee18b_0
zlib 1.2.13 h166bdaf_4 conda-forge
zstd 1.5.2 ha4553b6_0
These are the dependencies:
(drep) [benucci@dev-amd20 code]$ dRep check_dependencies
mash.................................... all good (location = /mnt/home/benucci/anaconda2/envs/drep/bin/mash)
nucmer.................................. all good (location = /mnt/home/benucci/anaconda2/envs/drep/bin/nucmer)
checkm.................................. all good (location = /mnt/home/benucci/anaconda2/envs/drep/bin/checkm)
ANIcalculator........................... !!! ERROR !!! (location = None)
prodigal................................ all good (location = /mnt/home/benucci/anaconda2/envs/drep/bin/prodigal)
centrifuge.............................. !!! ERROR !!! (location = None)
nsimscan................................ !!! ERROR !!! (location = None)
fastANI................................. all good (location = /mnt/home/benucci/anaconda2/envs/drep/bin/fastANI)
And this is how I call it:
dRep dereplicate \
-p $cores \
--multiround_primary_clustering \
--primary_chunksize 3000 \
$cait_scratch/c08_binsDereplication_drep \
-g $cait_scratch/c07_aggregatedBins_dastool/dastool__DASTool_bins/*.fa
Thanks a lot for your help!
Gian
Hi @Gian77 - the problem is that you're only having a single genome pass the checkM filtering. You probably need to relax the checkM filtering criteria.
Best, MO
@MrOlm, WTH, you're right. Sorry, I did not check the output carefully. Thank you, Gian
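A quick back-of-the-envelope check of the numbers in the log shows why a single passing genome leads to this error. A condensed distance matrix over n items has n(n-1)/2 entries, so one genome yields none. The sketch assumes the checkM percentage is relative to the genomes that passed length filtering; taking it relative to all 346 inputs gives the same count.

```python
def n_pairwise_distances(n_genomes):
    """Number of entries in a condensed pairwise distance matrix."""
    return n_genomes * (n_genomes - 1) // 2


n_input = 346
n_length_pass = round(n_input * 0.9827)         # 98.27% passed length filtering
n_checkm_pass = round(n_length_pass * 0.0029)   # 0.29% passed checkM filtering

print(n_length_pass, n_checkm_pass, n_pairwise_distances(n_checkm_pass))  # prints 340 1 0
```

With zero pairwise distances there is nothing to cluster, which matches the ValueError raised from scipy's linkage in the traceback.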
I'm running drep dereplicate on ~32000 genomes, and getting the following error:
The specific command:
Any idea why drep is generating an "empty distance matrix"?
My conda env: