HannesStark / EquiBind

EquiBind: geometric deep learning for fast predictions of the 3D structure in which a small molecule binds to a protein
MIT License

Questions about reproducing #42

Closed lijiashan2020 closed 2 years ago

lijiashan2020 commented 2 years ago

Hi, when I run the command python train.py --config=configs_clean/RDKitCoords_flexible_self_docking.yml, the following error is always reported after part of the receptor data has been processed. However, when I reduce the training dataset to one tenth of the size used in the paper, training works normally. I am very confused by this. Do you know where the problem might be?

joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/.local/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 407, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "/.conda/envs/equidock/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
RuntimeError: invalid value in pickle
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/EquiBind/train.py", line 304, in <module>
    main_function()
  File "/EquiBind/train.py", line 295, in main_function
    train_wrapper(args)
  File "/EquiBind/train.py", line 141, in train_wrapper
    return train(args, run_dir)
  File "/EquiBind/train.py", line 169, in train
    train_data = PDBBind(device=device, complex_names_path=args.train_names,lig_predictions_name=args.train_predictions_name, is_train_data=True, **args.dataset_params)
  File "/EquiBind/datasets/pdbbind.py", line 127, in __init__
    self.process()
  File "/EquiBind/datasets/pdbbind.py", line 252, in process
    receptor_representatives = pmap_multi(get_receptor, zip(rec_paths, ligs), n_jobs=self.n_jobs, cutoff=self.chain_radius, desc='Get receptors')
  File "/EquiBind/commons/utils.py", line 46, in pmap_multi
    results = Parallel(n_jobs=n_jobs, verbose=verbose, timeout=None)(
  File "/.local/lib/python3.9/site-packages/joblib/parallel.py", line 1056, in __call__
    self.retrieve()
  File "/.local/lib/python3.9/site-packages/joblib/parallel.py", line 935, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/.local/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/.conda/envs/equidock/lib/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/.conda/envs/equidock/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/.local/lib/python3.9/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
    callback(self)
  File "/.local/lib/python3.9/site-packages/joblib/parallel.py", line 359, in __call__
    self.parallel.dispatch_next()
  File "/.local/lib/python3.9/site-packages/joblib/parallel.py", line 794, in dispatch_next
    if not self.dispatch_one_batch(self._original_iterator):
  File "/.local/lib/python3.9/site-packages/joblib/parallel.py", line 861, in dispatch_one_batch
    self._dispatch(tasks)
  File "/.local/lib/python3.9/site-packages/joblib/parallel.py", line 779, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/.local/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 531, in apply_async
    future = self._workers.submit(SafeFunction(func))
  File "/.local/lib/python3.9/site-packages/joblib/externals/loky/reusable_executor.py", line 177, in submit
    return super(_ReusablePoolExecutor, self).submit(
  File "/.local/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 1115, in submit
    raise self._flags.broken
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.

Anyway, thank you very much for your help. EquiBind is great work!

HannesStark commented 2 years ago

Hi! Thanks for the kind words! This sounds to me like you are running out of memory. You could split the preprocessing into multiple steps instead of loading all the data into memory at the same time.

lijiashan2020 commented 2 years ago

Thank you for your reply! I will try the method you mentioned.