Closed · edmund735 closed this 2 weeks ago

### 🐛 Bug

Hi,

When I try to run TQC hyperparameter optimization with multiple jobs (`n-jobs > 1`) on a GPU (this also happens with multiple CPU cores and `n-jobs = 1`), it gives me this error:

```
Sampled hyperparams: {'batch_size': 1024, 'buffer_size': 100000, 'ent_coef': 'auto', 'gamma': 0.9999, 'gradient_steps': 1, 'learning_rate': 0.004315216575412321, 'learning_starts': 0, 'policy_kwargs': {'log_std_init': -1.4239746627852474, 'n_quantiles': 31, 'net_arch': [64, 64], 'top_quantiles_to_drop_per_net': 25, 'use_sde': False}, 'target_entropy': 'auto', 'tau': 0.02, 'top_quantiles_to_drop_per_net': 25, 'train_freq': 1}
2024-04-07 15:55:23.577027: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:1883] could not synchronize on CUDA context: CUDA_ERROR_STREAM_CAPTURE_UNSUPPORTED: operation not permitted when stream is capturing :: Begin stack trace
_PyObject_MakeTpCall
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
_PyObject_MakeTpCall
_PyObject_MakeTpCall
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
PyObject_Call
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
PyObject_Call
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
PyObject_Call
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
_PyEval_EvalFrameDefault
PyObject_Call
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
PyObject_Call
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
_PyEval_EvalFrameDefault
clone
End stack trace
[I 2024-04-07 15:55:23,577] Trial 1 pruned.
[W 2024-04-07 15:55:23,606] Trial 3 failed with parameters: {'gamma': 1, 'learning_rate': 0.03739146141228411, 'batch_size': 256, 'buffer_size': 100000, 'learning_starts': 1000, 'train_freq': 1, 'tau': 0.02, 'log_std_init': -1.1735175685607313, 'net_arch': 'small', 'n_quantiles': 26, 'top_quantiles_to_drop_per_net': 24} because of the following error: XlaRuntimeError('INTERNAL: Failed to synchronize GPU for autotuning.').
jax.errors.SimplifiedTraceback: For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/.../.conda/envs/...1/lib/python3.10/site-packages/optuna/study/_optimize.py", line 196, in _run_trial
    value_or_values = func(trial)
  File "/scratch/network/.../.../rl-baselines3-zoo/rl_zoo3/exp_manager.py", line 793, in objective
    model.learn(self.n_timesteps, callback=callbacks, **learn_kwargs)  # type: ignore[arg-type]
  File "/home/.../.conda/envs/...1/lib/python3.10/site-packages/sbx/tqc/tqc.py", line 183, in learn
    return super().learn(
  File "/home/.../.conda/envs/...1/lib/python3.10/site-packages/stable_baselines3/common/off_policy_algorithm.py", line 347, in learn
    self.train(batch_size=self.batch_size, gradient_steps=gradient_steps)
  File "/home/.../.conda/envs/...1/lib/python3.10/site-packages/sbx/tqc/tqc.py", line 220, in train
    ) = self._train(
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to synchronize GPU for autotuning.
[W 2024-04-07 15:55:23,607] Trial 3 failed with value None.
[I 2024-04-07 15:55:44,276] Trial 2 finished with value: -149.14224087500003 and parameters: {'gamma': 0.995, 'learning_rate': 0.005830150992686316, 'batch_size': 2048, 'buffer_size': 10000, 'learning_starts': 0, 'train_freq': 8, 'tau': 0.01, 'log_std_init': -3.101106181907312, 'net_arch': 'medium', 'n_quantiles': 13, 'top_quantiles_to_drop_per_net': 1}. Best is trial 2 with value: -149.14224087500003.
[I 2024-04-07 15:55:44,442] Trial 0 finished with value: -1286.9111508125 and parameters: {'gamma': 0.995, 'learning_rate': 0.03380452664776398, 'batch_size': 128, 'buffer_size': 1000000, 'learning_starts': 1000, 'train_freq': 4, 'tau': 0.02, 'log_std_init': -3.2686941182290763, 'net_arch': 'big', 'n_quantiles': 45, 'top_quantiles_to_drop_per_net': 13}. Best is trial 2 with value: -149.14224087500003.
[I 2024-04-07 15:55:44,610] Trial 4 finished with value: -408.14151849999996 and parameters: {'gamma': 0.99, 'learning_rate': 0.022024554072114278, 'batch_size': 512, 'buffer_size': 100000, 'learning_starts': 1000, 'train_freq': 8, 'tau': 0.02, 'log_std_init': 0.9307981026739451, 'net_arch': 'big', 'n_quantiles': 35, 'top_quantiles_to_drop_per_net': 29}. Best is trial 2 with value: -149.14224087500003.
jax.errors.SimplifiedTraceback: For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/scratch/network/.../.../rl-baselines3-zoo/train_sbx.py", line 19, in
) = self._train( jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to synchronize GPU for autotuning.
```
### To Reproduce
### System Info

Describe the characteristics of your environment:
- Library installed through pip
- GPU models and configuration:

```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:0D:00.0 Off |                    0 |
| N/A   40C    P0              67W / 300W |   3508MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          On  | 00000000:B5:00.0 Off |                    0 |
| N/A   38C    P0              49W / 300W |      5MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
```

- Python 3.10.14
- PyTorch: pytorch 2.2.2 (py3.10_cuda12.1_cudnn8.9.2_0), pytorch-cuda 12.1, torchtriton 2.2.0
- Gym version: gymnasium 0.29.1
- Versions of any other relevant libraries: jax 0.4.25, jax-jumpy 1.0.0, jaxlib 0.4.23 (cuda118py310h8c47008_200, conda-forge)
### Additional context

I've noticed there's no bug when n-jobs=1, only when running multiple jobs. Maybe it's something to do with how Optuna runs multiple jobs?
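For what it's worth, here is a minimal sketch of why `n-jobs > 1` might behave differently. It assumes (true for Optuna 3.x, as far as I know) that `n_jobs > 1` parallelizes trials with a thread pool inside a single process, so every concurrent trial shares the same process-wide JAX backend and, on GPU, the same CUDA context; concurrent XLA compilation/autotuning then hits that shared context. The toy study below runs on CPU and only makes the shared-backend point visible; it is not the zoo's objective function.

```python
# Sketch only: with n_jobs > 1, Optuna runs trials in threads of ONE process,
# so every trial sees the same process-wide JAX backend (and, on GPU, the
# same CUDA context). Runs on CPU.
import threading

import jax
import jax.numpy as jnp
import optuna


def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    # Every thread reports the same device list: one shared JAX runtime.
    print(threading.current_thread().name, jax.devices())
    return float(jnp.sum(jnp.ones(8) * lr))


study = optuna.create_study()
study.optimize(objective, n_trials=4, n_jobs=2)  # n_jobs > 1 => thread pool
```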
### Checklist

---

**Reply:**

This might be related to Jax not handling multi-threading/multi-processing well.
You should probably have a look at distributed tuning using a shared database (I would recommend the log format): https://rl-baselines3-zoo.readthedocs.io/en/master/guide/tuning.html
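To make that suggestion concrete, below is a hedged sketch of the shared-study pattern using Optuna's journal ("log") storage. The study name, journal file name, and toy objective are placeholders (the zoo's own entry point wires up the real objective; see the linked tuning guide for its CLI flags). Launching this script several times in parallel, each with `n_jobs=1`, gives every worker its own process and therefore its own JAX/CUDA context, while trials are coordinated through the shared journal file:

```python
# Hedged sketch of distributed tuning via Optuna's journal storage
# (Optuna >= 3.1 API). File and study names are placeholders.
import optuna
from optuna.storages import JournalFileStorage, JournalStorage


def objective(trial: optuna.Trial) -> float:
    # Placeholder objective; the zoo's objective trains the actual agent.
    x = trial.suggest_float("x", -10.0, 10.0)
    return -((x - 2.0) ** 2)


storage = JournalStorage(JournalFileStorage("optuna_journal.log"))
study = optuna.create_study(
    study_name="tqc_study",
    storage=storage,
    load_if_exists=True,  # each worker process attaches to the same study
    direction="maximize",
)
# Run this script N times in parallel; each process keeps n_jobs=1 so it
# owns its JAX/CUDA context, avoiding the shared-context autotuning failure.
study.optimize(objective, n_trials=25, n_jobs=1)
```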