cistrome / MIRA

Python package for analysis of multiomic single cell RNA-seq and ATAC-seq.

PicklingError: Could not pickle the task to send it to the workers. #32

Closed: Yansr3 closed this issue 7 months ago

Yansr3 commented 1 year ago

I'm trying to run the Bayesian tuner on a GPU in a high-throughput computing cluster. When I try to run the tuner with n_jobs greater than 1, I get this error:

Traceback (most recent call last):
  File "/var/lib/condor/execute/slot1/dir_3556563/tmp/wk/input/mira-GPU/lib/python3.11/site-packages/joblib/externals/loky/backend/queues.py", line 159, in feed
    obj = dumps(obj, reducers=reducers)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/condor/execute/slot1/dir_3556563/tmp/wk/input/mira-GPU/lib/python3.11/site-packages/joblib/externals/loky/backend/reduction.py", line 215, in dumps
    dump(obj, buf, reducers=reducers, protocol=protocol)
  File "/var/lib/condor/execute/slot1/dir_3556563/tmp/wk/input/mira-GPU/lib/python3.11/site-packages/joblib/externals/loky/backend/reduction.py", line 208, in dump
    _LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
  File "/var/lib/condor/execute/slot1/dir_3556563/tmp/wk/input/mira-GPU/lib/python3.11/site-packages/joblib/externals/cloudpickle/cloudpickle_fast.py", line 632, in dump
    return Pickler.dump(self, obj)
           ^^^^^^^^^^^^^^^^^^^^^^^
TypeError: cannot pickle 'weakref.ReferenceType' object
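For context on the failure mode (a standalone sketch, not MIRA's code): joblib's loky backend has to pickle the whole task, including any objects it captures, before sending it to worker processes, and objects that hold a weakref cannot be pickled. The Holder class below is hypothetical and only illustrates the mechanism behind the traceback above.

import weakref
from joblib import Parallel, delayed

class Holder:
    """Hypothetical stand-in for any object that carries a weakref internally."""
    def __init__(self):
        self._ref = weakref.ref(self)   # weakref objects cannot be pickled

def work(h):
    return type(h).__name__

holder = Holder()

# With n_jobs > 1 the default loky backend pickles the arguments to ship them
# to worker processes, which fails with:
#   PicklingError: Could not pickle the task to send it to the workers.
#   ... TypeError: cannot pickle 'weakref.ReferenceType' object   (on Python 3.11)
Parallel(n_jobs=2)(delayed(work)(holder) for _ in range(2))

# With n_jobs=1 everything runs in the calling process, nothing is pickled,
# and the call succeeds, matching the behaviour reported above.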

By setting n_jobs=1 I could get the job running, but it didn't speed up the tuning process compared to running without the GPU. I added model.to_gpu() after mira.topics.make_model. I checked the GPU information on the server as follows; GPU-Util is around 13%. The CUDA version on the server is 12.1. I was using mira-multiome 2.1.1a3 with Python 3.11 on a Linux system.

>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> torch.cuda.current_device()
0
>>> cur_dev = torch.cuda.current_device()
>>> torch.cuda.get_device_name(cur_dev)
'NVIDIA A100-SXM4-80GB'
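Since GPU-Util from nvidia-smi is only an instantaneous reading, a more direct way to confirm that the model's tensors actually moved to the device after model.to_gpu() is to query PyTorch's allocator from the same Python process (a generic diagnostic sketch, not MIRA-specific):

import torch

# GPU memory currently held by tensors created in this process. If this stays
# at (or near) zero after model.to_gpu(), the parameters never moved to the
# device and the tuning is still running on the CPU.
print(torch.cuda.memory_allocated() / 1024**2, "MiB allocated")
print(torch.cuda.memory_reserved() / 1024**2, "MiB reserved by the caching allocator")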

I also tried mira-multiome 2.1.0, but with torch 1.13.1, torch.cuda.device_count() returned 0.
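For the torch 1.13.1 case, it may be worth checking which build of torch is installed: the default PyPI wheel for that release was compiled against CUDA 11.7, and a CPU-only wheel reports zero devices even on a GPU node. A quick diagnostic (the exact wheel in your environment is an assumption):

import torch

print(torch.__version__)      # a '+cpu' suffix indicates a CPU-only wheel
print(torch.version.cuda)     # None for CPU-only builds, otherwise the CUDA toolkit version the wheel targets
print(torch.cuda.is_available())
print(torch.cuda.device_count())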

I'm hoping to get some help with this error, as well as some advice on how to confirm that the tuning is actually running on the GPU. Also, does the n_jobs parameter correspond to the number of GPUs required for the tuning, or to the number of trials run on a single GPU?

AllenWLynch commented 7 months ago

Apologies for completely missing this issue. I am publishing a fix for this early this week.