I'm trying to run the Bayesian tuner on GPU in a high-throughput computing cluster. When I run the tuner with n_jobs greater than 1, I get this error:
Traceback (most recent call last):
File "/var/lib/condor/execute/slot1/dir_3556563/tmp/wk/input/mira-GPU/lib/python3.11/site-packages/joblib/externals/loky/backend/queues.py", line 159, in feed
obj = dumps(obj, reducers=reducers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/lib/condor/execute/slot1/dir_3556563/tmp/wk/input/mira-GPU/lib/python3.11/site-packages/joblib/externals/loky/backend/reduction.py", line 215, in dumps
dump(obj, buf, reducers=reducers, protocol=protocol)
File "/var/lib/condor/execute/slot1/dir_3556563/tmp/wk/input/mira-GPU/lib/python3.11/site-packages/joblib/externals/loky/backend/reduction.py", line 208, in dump
_LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
File "/var/lib/condor/execute/slot1/dir_3556563/tmp/wk/input/mira-GPU/lib/python3.11/site-packages/joblib/externals/cloudpickle/cloudpickle_fast.py", line 632, in dump
return Pickler.dump(self, obj)
^^^^^^^^^^^^^^^^^^^^^^^
TypeError: cannot pickle 'weakref.ReferenceType' object
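I could reproduce this failure mode outside of MIRA with a minimal stdlib example: pickling any object graph that contains a weakref raises exactly this TypeError, which is presumably what happens when loky tries to serialize the task for the worker processes (this is just a sketch, not MIRA's actual object graph):

```python
# Minimal reproduction of the loky/joblib failure: weakrefs cannot be pickled.
import pickle
import weakref

class Anchor:
    pass  # a class instance we can take a weak reference to

a = Anchor()
holder = {"ref": weakref.ref(a)}  # a weakref buried inside an object graph

try:
    pickle.dumps(holder)
except TypeError as e:
    # On Python 3.11 the message is: cannot pickle 'weakref.ReferenceType' object
    # (wording varies slightly between Python versions)
    print(e)
```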
By setting n_jobs=1 I could get the job running, but it didn't speed up the tuning process compared to running without a GPU. I added model.to_gpu() after mira.topics.make_model. I checked GPU usage on the server: GPU-Util is around 13%. The CUDA version on the server is 12.1. I am using mira-multiome 2.1.1a3 with Python 3.11 on Linux.
I also tried mira-multiome 2.1.0, but with torch 1.13.1, torch.cuda.device_count() returns 0.
I'm hoping to get some help with the error, and also some advice on how to verify that the tuning is actually running on the GPU. Besides, does the n_jobs parameter equal the number of GPUs required for the tuning, or the number of trials on a single GPU?