MrOlm / drep

Rapid comparison and dereplication of genomes

Error disk quota exceeded #109

Closed yifengyuan closed 3 years ago

yifengyuan commented 3 years ago

Hi,

I am trying to use dRep to reduce the 4616 UHGG genomes published in the Nature Biotechnology paper. Here are the commands I have been using:

dRep dereplicate -pa 0.85 -sa 0.9 -nc 0.30 -cm larger -p 32 -g ./*.fasta
dRep dereplicate -pa 0.85 -sa 0.9 -nc 0.30 -cm larger -p 32 -d --set_recursion 2000 -g ./*.fasta

I have tried several argument combinations. Only -pa 0.85 -sa 0.9 worked. dRep generated so many files that I got a 'disk quota exceeded' error with -pa 0.70 -sa 0.75 and with -pa 0.60 -sa 0.65. My quota usage rose from ~400000 to more than 2000000, which is my limit, and that caused the error.

I have tried to work with different groups of genomes. With my collection of genomes (4500 bacterial genomes), it shows the same error.

Could you advise, please?


Thu Mar  4 23:21:43 EST 2021
***************************************************
    ..:: dRep dereplicate Step 1. Filter ::..
***************************************************

Will filter the genome list
4,500 genomes were input to dRep
Calculating genome info of genomes
100.00% of genomes passed length filtering
Running prodigal
Running checkM
Setting Maximum Recursion depth to 2000
86.71% of genomes passed checkM filtering
***************************************************
    ..:: dRep dereplicate Step 2. Cluster ::..
***************************************************

Running primary clustering
Running pair-wise MASH clustering
18 primary clusters made
Running secondary clustering
Running 4167664 ANImf comparisons- should take ~ 52095.8 min
Traceback (most recent call last):
  File "/scratch/users/yfyuan/conda3/bin/dRep", line 32, in <module>
    Controller().parseArguments(args)
  File "/scratch/users/yfyuan/conda3/lib/python3.7/site-packages/drep/controller.py", line 100, in parseArguments
    self.dereplicate_operation(**vars(args))
  File "/scratch/users/yfyuan/conda3/lib/python3.7/site-packages/drep/controller.py", line 48, in dereplicate_operation
    drep.d_workflows.dereplicate_wrapper(kwargs['work_directory'],**kwargs)
  File "/scratch/users/yfyuan/conda3/lib/python3.7/site-packages/drep/d_workflows.py", line 37, in dereplicate_wrapper
    drep.d_cluster.controller.d_cluster_wrapper(wd, **kwargs)
  File "/scratch/users/yfyuan/conda3/lib/python3.7/site-packages/drep/d_cluster/controller.py", line 179, in d_cluster_wrapper
    GenomeClusterController(workDirectory, **kwargs).main()
  File "/scratch/users/yfyuan/conda3/lib/python3.7/site-packages/drep/d_cluster/controller.py", line 35, in main
    self.run_secondary_clustering()
  File "/scratch/users/yfyuan/conda3/lib/python3.7/site-packages/drep/d_cluster/controller.py", line 140, in run_secondary_clustering
    Ndb, Cdb, c2ret = drep.d_cluster.compare_utils.secondary_clustering(self.Bdb, self.MCdb, algorithm, self.wd.get_dir('data'), wd=self.wd, **self.kwargs)
  File "/scratch/users/yfyuan/conda3/lib/python3.7/site-packages/drep/d_cluster/compare_utils.py", line 294, in secondary_clustering
    ndb = compare_genomes(bdb, algorithm, data_folder, **kwargs)
  File "/scratch/users/yfyuan/conda3/lib/python3.7/site-packages/drep/d_cluster/compare_utils.py", line 351, in compare_genomes
    df = drep.d_cluster.external.run_pairwise_ANImf(genome_list, working_data_folder, **kwargs)
  File "/scratch/users/yfyuan/conda3/lib/python3.7/site-packages/drep/d_cluster/external.py", line 63, in run_pairwise_ANImf
    drep.thread_cmds(cmds, shell=True, logdir=logdir, t=int(p))
  File "/scratch/users/yfyuan/conda3/lib/python3.7/site-packages/drep/__init__.py", line 56, in thread_cmds
    pool.map(thread_cmd_wrapper, tups)
  File "/scratch/users/yfyuan/conda3/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/scratch/users/yfyuan/conda3/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
  File "/scratch/users/yfyuan/conda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/scratch/users/yfyuan/conda3/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/scratch/users/yfyuan/conda3/lib/python3.7/site-packages/drep/__init__.py", line 51, in thread_cmd_wrapper
    run_cmd(*tup)
  File "/scratch/users/yfyuan/conda3/lib/python3.7/site-packages/drep/__init__.py", line 32, in run_cmd
    sto = open(os.path.join(logdir + uniq_filename + '.STDOUT'), 'w')
OSError: [Errno 122] Disk quota exceeded: '/scratch/users/yfyuan/genomes/alm9019_dRep/dRep_6065_p2/log/cmd_logs/2021-03-06_12.31.04.876438.STDOUT'

MrOlm commented 3 years ago

Hello,

It seems like the problem is that dRep is making too many files for your disk quota. When running in debug mode (with -d), dRep does make an absurd number of files.
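
A rough check of the numbers in the log supports this. The sketch below is illustrative only: it assumes the reported limit of 2000000 is a file-count limit, and that debug mode leaves at least one log file per comparison (the traceback shows a `.STDOUT` file being opened under `log/cmd_logs`, but the exact number of files per command is not stated in this thread). The function names are made up for the example.

```python
# Figures copied from the log and the question above.
COMPARISONS = 4_167_664        # "Running 4167664 ANImf comparisons"
ESTIMATED_MINUTES = 52_095.8   # "should take ~ 52095.8 min"
FILE_LIMIT = 2_000_000         # limit reported in the question

# Assumption (not confirmed in this thread): each comparison leaves at
# least one log file behind when dRep runs in debug mode.
FILES_PER_COMPARISON = 1


def minimum_log_files(comparisons, files_per_comparison=FILES_PER_COMPARISON):
    """Lower bound on the number of log files written for the comparisons."""
    return comparisons * files_per_comparison


def minutes_to_days(minutes):
    """Convert a duration in minutes to days."""
    return minutes / (60 * 24)


min_files = minimum_log_files(COMPARISONS)
print(f"log files (lower bound): {min_files:,}")
print(f"exceeds limit of {FILE_LIMIT:,}: {min_files > FILE_LIMIT}")
print(f"estimated runtime: {minutes_to_days(ESTIMATED_MINUTES):.1f} days")
```

Under that assumption, the log files alone would be more than twice the limit before counting any other output, and the runtime estimate in the log works out to roughly 36 days.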

I would try re-running without -d. If you still have an error, please let me know.

Best, Matt

yifengyuan commented 3 years ago

Thanks. It works fine without the -d option.
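
For anyone hitting the same error, one way to confirm where the files are piling up is to count files per directory inside the dRep work directory. This is a minimal sketch, not part of dRep: `count_files_by_subdir` and `report` are hypothetical helper names, and the only layout assumed is the `log/cmd_logs` path that appears in the traceback above.

```python
import os
from collections import Counter


def count_files_by_subdir(root):
    """Map each directory under root (as a relative path) to the number
    of files that sit directly inside it."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        counts[os.path.relpath(dirpath, root)] += len(filenames)
    return counts


def report(root, top=10):
    """Print the directories holding the most files, then the total."""
    counts = count_files_by_subdir(root)
    for rel, n in counts.most_common(top):
        print(f"{n:>10}  {rel}")
    print(f"{sum(counts.values()):>10}  total")
```

Calling `report()` with the path to the work directory lists the busiest directories first. If `log/cmd_logs` dominates the list, the debug logs are the likely cause of the quota error.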