Unity-Technologies / ml-agents

The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.
https://unity.com/products/machine-learning-agents

Can't train agent. RuntimeError: CUDA error: device-side assert triggered #5818

Closed · Braeze closed this 1 year ago

Braeze commented 1 year ago

I had no problem training the agent a few days ago, but after a lot of changes to the environment it now crashes every time I try to train it. I don't know why this suddenly happens. The agent works fine in heuristic mode, but training fails on both CPU and GPU. A friend gets the same error running the environment on Windows 10.

Changes I made that might be the cause:

- Added a Ray Perception Sensor 3D
- Started using some imitation learning

The error can be found below. Steps to reproduce the behavior:

`mlagents-learn --force file.yaml`

The YAML file looks like this:

```yaml
behaviors:
  MoveToTables:
    trainer_type: ppo
    hyperparameters:
      batch_size: 256
      buffer_size: 1024
      learning_rate: 3.0e-4
      beta: 5.0e-4
      epsilon: 0.2
      lambd: 0.99
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      gail:
        strength: 0.5
        demo_path: Demos/TableAgentDemo.demo
    behavioral_cloning:
      strength: 0.5
      demo_path: Demos/TableAgentDemo.demo
    max_steps: 500000
    time_horizon: 64
    summary_freq: 10000
```
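Since the assert is raised asynchronously on the GPU, a quick way to see where it really comes from is to rerun with synchronous CUDA kernels, as the error text itself suggests. A minimal sketch, assuming the `mlagents.trainers.learn.main` entry point that appears in the traceback below (the environment variable must be set before torch initializes CUDA); treat this as a debugging sketch, not an official workflow:

```python
import os
import sys

# Force synchronous CUDA kernels so the device-side assert surfaces at the
# real call site instead of at a later, unrelated API call. Must be set
# before torch initializes CUDA.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Drive the same run the CLI would: mlagents-learn --force file.yaml.
# main() parses sys.argv via parse_command_line(), per the traceback below.
sys.argv = ["mlagents-learn", "--force", "file.yaml"]

from mlagents.trainers.learn import main

main()
```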

Full error below:

```
C:\Users\tobia\miniconda3\lib\site-packages\mlagents\trainers\torch\utils.py:320: UserWarning: The use of `x.T` on tensors of dimension other than 2 to reverse their shape is deprecated and it will throw an error in a future release. Consider `x.mT` to transpose batches of matrices or `x.permute(*torch.arange(x.ndim - 1, -1, -1))` to reverse the dimensions of a tensor. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\TensorShape.cpp:2985.)
  return (tensor.T * masks).sum() / torch.clamp(
C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:367: block: [0,0,0], thread: [9,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:367: block: [0,0,0], thread: [11,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
[... same assertion repeated for threads 20, 67, 68, 74, 84, 86, 87, 96, 110, 126, 41, 53, 56, 57, 61 ...]
Traceback (most recent call last):
  File "C:\Users\tobia\miniconda3\lib\site-packages\mlagents\trainers\trainer_controller.py", line 176, in start_learning
    n_steps = self.advance(env_manager)
  File "C:\Users\tobia\miniconda3\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "C:\Users\tobia\miniconda3\lib\site-packages\mlagents\trainers\trainer_controller.py", line 251, in advance
    trainer.advance()
  File "C:\Users\tobia\miniconda3\lib\site-packages\mlagents\trainers\trainer\rl_trainer.py", line 315, in advance
    if self._update_policy():
  File "C:\Users\tobia\miniconda3\lib\site-packages\mlagents\trainers\ppo\trainer.py", line 212, in _update_policy
    update_stats = self.optimizer.bc_module.update()
  File "C:\Users\tobia\miniconda3\lib\site-packages\mlagents\trainers\torch\components\bc\module.py", line 95, in update
    run_out = self._update_batch(mini_batch_demo, self.n_sequences)
  File "C:\Users\tobia\miniconda3\lib\site-packages\mlagents\trainers\torch\components\bc\module.py", line 183, in _update_batch
    self.optimizer.step()
  File "C:\Users\tobia\miniconda3\lib\site-packages\torch\optim\optimizer.py", line 113, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\tobia\miniconda3\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\tobia\miniconda3\lib\site-packages\torch\optim\adam.py", line 140, in step
    if self.defaults['capturable'] else torch.tensor(0.)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\tobia\miniconda3\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\tobia\miniconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\tobia\miniconda3\Scripts\mlagents-learn.exe\__main__.py", line 7, in <module>
  File "C:\Users\tobia\miniconda3\lib\site-packages\mlagents\trainers\learn.py", line 260, in main
    run_cli(parse_command_line())
  File "C:\Users\tobia\miniconda3\lib\site-packages\mlagents\trainers\learn.py", line 256, in run_cli
    run_training(run_seed, options, num_areas)
  File "C:\Users\tobia\miniconda3\lib\site-packages\mlagents\trainers\learn.py", line 132, in run_training
    tc.start_learning(env_manager)
  File "C:\Users\tobia\miniconda3\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "C:\Users\tobia\miniconda3\lib\site-packages\mlagents\trainers\trainer_controller.py", line 201, in start_learning
    self._save_models()
  File "C:\Users\tobia\miniconda3\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "C:\Users\tobia\miniconda3\lib\site-packages\mlagents\trainers\trainer_controller.py", line 80, in _save_models
    self.trainers[brain_name].save_model()
  File "C:\Users\tobia\miniconda3\lib\site-packages\mlagents\trainers\trainer\rl_trainer.py", line 185, in save_model
    model_checkpoint = self._checkpoint()
  File "C:\Users\tobia\miniconda3\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "C:\Users\tobia\miniconda3\lib\site-packages\mlagents\trainers\trainer\rl_trainer.py", line 157, in _checkpoint
    export_path, auxillary_paths = self.model_saver.save_checkpoint(
  File "C:\Users\tobia\miniconda3\lib\site-packages\mlagents\trainers\model_saver\torch_model_saver.py", line 58, in save_checkpoint
    torch.save(state_dict, f"{checkpoint_path}.pt")
  File "C:\Users\tobia\miniconda3\lib\site-packages\torch\serialization.py", line 379, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "C:\Users\tobia\miniconda3\lib\site-packages\torch\serialization.py", line 601, in _save
    storage = storage.cpu()
  File "C:\Users\tobia\miniconda3\lib\site-packages\torch\storage.py", line 112, in cpu
    return torch.UntypedStorage(self.size()).copy_(self, False)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
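The repeated `ScatterGatherKernel.cu` assertion means a gather/scatter op received an index outside the dimension it indexes; because CUDA reports the failure asynchronously, the Python traceback lands in `optimizer.step()` and again in `torch.save` rather than at the faulty op. A standalone repro of the same assert (hypothetical values, not from this run):

```python
import torch

# Gathering with an index >= the size of the indexed dimension trips the
# same "idx_dim >= 0 && idx_dim < index_size" device-side assert seen above.
logits = torch.randn(4, 3, device="cuda")                    # dim 1 has size 3
bad_idx = torch.tensor([[0], [1], [2], [3]], device="cuda")  # 3 is out of range
picked = torch.gather(logits, 1, bad_idx)  # kernel assert fires here

# The RuntimeError may only surface at the next synchronizing call, e.g.:
print(picked.cpu())
```

In this run the gather most plausibly happens while evaluating the demonstration actions in the BC/GAIL update (`bc\module.py` in the first traceback), which would point at a mismatch between the recorded demo and the current action/observation setup.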


Braeze commented 1 year ago

The issue happens only when the Ray Perception Sensor 3D is added.
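That fits the traceback: adding a Ray Perception Sensor 3D changes the agent's observation spec, so a `.demo` file recorded before the change no longer matches what the GAIL and behavioral cloning modules expect, and re-recording the demonstration is the obvious first thing to try. A sketch for comparing the two specs, assuming the internal helper `mlagents.trainers.demo_loader.load_demonstration` and the `behavior_specs`/`observation_specs` attributes of the low-level Python API (names may differ across releases):

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents.trainers.demo_loader import load_demonstration

# Spec baked into the old demonstration file.
demo_spec, _, n_steps = load_demonstration("Demos/TableAgentDemo.demo")

# Spec the current environment reports; file_name=None attaches to the
# Unity Editor once you press Play.
env = UnityEnvironment(file_name=None)
env.reset()
behavior_name = list(env.behavior_specs)[0]
env_spec = env.behavior_specs[behavior_name]

print("demo steps:", n_steps)
print("demo obs shapes:", [obs.shape for obs in demo_spec.observation_specs])
print("env  obs shapes:", [obs.shape for obs in env_spec.observation_specs])
print("demo actions:", demo_spec.action_spec)
print("env  actions:", env_spec.action_spec)
env.close()
```

If the observation shapes (or the action spec) differ between the two printouts, the old demo cannot drive the BC/GAIL updates against the new network.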

Soontosh commented 1 year ago

How did you fix this?

github-actions[bot] commented 1 year ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.