Unity-Technologies / ml-agents

The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.
https://unity.com/products/machine-learning-agents

terminate called after throwing an instance of 'c10::Error' #5459

Closed by caprinux 3 years ago

caprinux commented 3 years ago

Upon running the command mlagents-learn Config/TrainerConfig.yaml --run-id=run2 --train, it runs for about 5 seconds before terminating with the following errors:

 Version information:
  ml-agents: 0.26.0,
  ml-agents-envs: 0.26.0,
  Communicator API: 1.5.0,
  PyTorch: 1.7.1+rocm3.8
[WARNING] The --train option has been deprecated. Train mode is now the default. Use --inference to run in inference mode.
[INFO] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.
[INFO] Connected to Unity environment with package version 2.0.0-exp.1 and communication version 1.5.0
[INFO] Connected new brain: CatAndMouse?team=0
[INFO] Hyperparameters for behavior name CatAndMouse: 
    trainer_type:   ppo
    hyperparameters:    
      batch_size:   128
      buffer_size:  2048
      learning_rate:    0.0003
      learning_rate_schedule:   linear
    network_settings:   
      normalize:    False
      hidden_units: 512
      num_layers:   2
      vis_encode_type:  simple
      memory:   None
      goal_conditioning_type:   hyper
    reward_signals: 
      extrinsic:    
        gamma:  0.99
        strength:   1.0
        network_settings:   
          normalize:    False
          hidden_units: 128
          num_layers:   2
          vis_encode_type:  simple
          memory:   None
          goal_conditioning_type:   hyper
      curiosity:    
        gamma:  0.99
        strength:   0.02
        network_settings:   
          normalize:    False
          hidden_units: 256
          num_layers:   2
          vis_encode_type:  simple
          memory:   None
          goal_conditioning_type:   hyper
    init_path:  None
    keep_checkpoints:   5
    checkpoint_interval:    500000
    max_steps:  10000000
    time_horizon:   128
    summary_freq:   30000
    threaded:   False
    self_play:  None
    behavioral_cloning: None
terminate called after throwing an instance of 'c10::Error'
  what():  HIP error: hipErrorNoDevice
Exception raised from deviceCount at /pytorch/aten/src/ATen/hip/impl/HIPGuardImplMasqueradingAsCUDA.h:98 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f72b55a9d12 in /home/caprinux/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x57d4f1 (0x7f72b5fa24f1 in /home/caprinux/.local/lib/python3.8/site-packages/torch/lib/libtorch_hip.so)
frame #2: torch::autograd::Engine::start_device_threads() + 0x442 (0x7f72e7635252 in /home/caprinux/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x1247f (0x7f7324cfb47f in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: torch::autograd::Engine::initialize_device_threads_pool() + 0xd5 (0x7f72e7632785 in /home/caprinux/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: torch::autograd::Engine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>) + 0x2f (0x7f72e763afaf in /home/caprinux/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: torch::autograd::python::PythonEngine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>) + 0x3c (0x7f72f5f20ddc in /home/caprinux/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: torch::autograd::Engine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) + 0xacd (0x7f72e763a46d in /home/caprinux/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: torch::autograd::python::PythonEngine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) + 0x4e (0x7f72f5f20bde in /home/caprinux/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: THPEngine_run_backward(THPEngine*, _object*, _object*) + 0xe3f (0x7f72f5f21caf in /home/caprinux/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: PyCFunction_Call + 0x59 (0x5f2cc9 in /usr/bin/python3.8)
frame #11: _PyObject_MakeTpCall + 0x23f (0x5f30ff in /usr/bin/python3.8)
frame #12: _PyEval_EvalFrameDefault + 0x6246 (0x5705f6 in /usr/bin/python3.8)
frame #13: _PyEval_EvalCodeWithName + 0x26a (0x568d9a in /usr/bin/python3.8)
frame #14: _PyFunction_Vectorcall + 0x393 (0x5f5b33 in /usr/bin/python3.8)
frame #15: _PyEval_EvalFrameDefault + 0x57d7 (0x56fb87 in /usr/bin/python3.8)
frame #16: _PyEval_EvalCodeWithName + 0x26a (0x568d9a in /usr/bin/python3.8)
frame #17: _PyFunction_Vectorcall + 0x393 (0x5f5b33 in /usr/bin/python3.8)
frame #18: _PyEval_EvalFrameDefault + 0x906 (0x56acb6 in /usr/bin/python3.8)
frame #19: _PyEval_EvalCodeWithName + 0x26a (0x568d9a in /usr/bin/python3.8)
frame #20: _PyFunction_Vectorcall + 0x393 (0x5f5b33 in /usr/bin/python3.8)
frame #21: PyObject_Call + 0x62 (0x5f2702 in /usr/bin/python3.8)
frame #22: _PyEval_EvalFrameDefault + 0x1f82 (0x56c332 in /usr/bin/python3.8)
frame #23: _PyEval_EvalCodeWithName + 0x26a (0x568d9a in /usr/bin/python3.8)
frame #24: _PyFunction_Vectorcall + 0x393 (0x5f5b33 in /usr/bin/python3.8)
frame #25: _PyEval_EvalFrameDefault + 0x906 (0x56acb6 in /usr/bin/python3.8)
frame #26: _PyFunction_Vectorcall + 0x1b6 (0x5f5956 in /usr/bin/python3.8)
frame #27: _PyEval_EvalFrameDefault + 0x906 (0x56acb6 in /usr/bin/python3.8)
frame #28: _PyFunction_Vectorcall + 0x1b6 (0x5f5956 in /usr/bin/python3.8)
frame #29: _PyEval_EvalFrameDefault + 0x906 (0x56acb6 in /usr/bin/python3.8)
frame #30: _PyFunction_Vectorcall + 0x1b6 (0x5f5956 in /usr/bin/python3.8)
frame #31: PyObject_Call + 0x62 (0x5f2702 in /usr/bin/python3.8)
frame #32: _PyEval_EvalFrameDefault + 0x1f82 (0x56c332 in /usr/bin/python3.8)
frame #33: _PyEval_EvalCodeWithName + 0x26a (0x568d9a in /usr/bin/python3.8)
frame #34: _PyFunction_Vectorcall + 0x393 (0x5f5b33 in /usr/bin/python3.8)
frame #35: _PyEval_EvalFrameDefault + 0x906 (0x56acb6 in /usr/bin/python3.8)
frame #36: _PyFunction_Vectorcall + 0x1b6 (0x5f5956 in /usr/bin/python3.8)
frame #37: PyObject_Call + 0x62 (0x5f2702 in /usr/bin/python3.8)
frame #38: _PyEval_EvalFrameDefault + 0x1f82 (0x56c332 in /usr/bin/python3.8)
frame #39: _PyEval_EvalCodeWithName + 0x26a (0x568d9a in /usr/bin/python3.8)
frame #40: _PyFunction_Vectorcall + 0x393 (0x5f5b33 in /usr/bin/python3.8)
frame #41: _PyEval_EvalFrameDefault + 0x906 (0x56acb6 in /usr/bin/python3.8)
frame #42: _PyFunction_Vectorcall + 0x1b6 (0x5f5956 in /usr/bin/python3.8)
frame #43: _PyEval_EvalFrameDefault + 0x72f (0x56aadf in /usr/bin/python3.8)
frame #44: _PyFunction_Vectorcall + 0x1b6 (0x5f5956 in /usr/bin/python3.8)
frame #45: _PyEval_EvalFrameDefault + 0x72f (0x56aadf in /usr/bin/python3.8)
frame #46: _PyFunction_Vectorcall + 0x1b6 (0x5f5956 in /usr/bin/python3.8)
frame #47: _PyEval_EvalFrameDefault + 0x72f (0x56aadf in /usr/bin/python3.8)
frame #48: _PyEval_EvalCodeWithName + 0x26a (0x568d9a in /usr/bin/python3.8)
frame #49: PyEval_EvalCode + 0x27 (0x68cdc7 in /usr/bin/python3.8)
frame #50: /usr/bin/python3.8() [0x67e161]
frame #51: /usr/bin/python3.8() [0x67e1df]
frame #52: /usr/bin/python3.8() [0x67e281]
frame #53: PyRun_SimpleFileExFlags + 0x197 (0x67e627 in /usr/bin/python3.8)
frame #54: Py_RunMain + 0x212 (0x6b6e62 in /usr/bin/python3.8)
frame #55: Py_BytesMain + 0x2d (0x6b71ed in /usr/bin/python3.8)
frame #56: __libc_start_main + 0xf3 (0x7f7324d330b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #57: _start + 0x2e (0x5f96de in /usr/bin/python3.8)

Aborted (core dumped)

I am using Ubuntu 20.04 with AMD Ryzen 7 4800U with Radeon graphics.

Does anyone have any idea how I could fix this?

ervteng commented 3 years ago

This seems like a PyTorch issue (and not an ML-Agents one). I'd try asking on https://discuss.pytorch.org/ and posting your error there.
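
Before posting there, it may help to check what the installed PyTorch build reports on its own, outside of mlagents-learn. The log above shows a ROCm build (1.7.1+rocm3.8) failing with hipErrorNoDevice, which suggests that this build could not find a usable HIP device. The following is a minimal diagnostic sketch, not a fix. The helper name describe_torch_backend is hypothetical, and it assumes that torch.cuda.is_available() returns False instead of aborting when no device is found, which may not hold on every setup.

```python
import importlib.util


def describe_torch_backend():
    """Summarise the installed PyTorch build and whether it sees a GPU.

    Hypothetical helper for illustration; it is not part of ML-Agents or PyTorch.
    """
    info = {
        "torch_installed": False,
        "version": None,
        "hip_build": None,
        "gpu_available": None,
        "gpu_count": None,
    }
    # Skip the import entirely if PyTorch is not installed in this interpreter.
    if importlib.util.find_spec("torch") is None:
        return info

    import torch

    info["torch_installed"] = True
    info["version"] = torch.__version__
    # torch.version.hip is a version string on ROCm builds and None otherwise.
    info["hip_build"] = getattr(torch.version, "hip", None) is not None
    # ROCm builds expose HIP devices through the torch.cuda namespace.
    info["gpu_available"] = torch.cuda.is_available()
    info["gpu_count"] = torch.cuda.device_count()
    return info


if __name__ == "__main__":
    for key, value in describe_torch_backend().items():
        print(f"{key}: {value}")
```

If this reports a HIP build with no available GPU, the problem is reproducible without ML-Agents, and the printed values are worth including in the PyTorch forum post.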

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.