Unity-Technologies / ml-agents

The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.
https://unity.com/products/machine-learning-agents

Issues with devices using cuda #6158

Closed rullo16 closed 2 weeks ago

rullo16 commented 2 weeks ago

Hello, I have been trying to follow the hummingbird tutorial and train it on the GPU. However, when I start training I get an error saying that tensors were found on multiple devices, "cuda:0" and "cpu". I have tried the different InferenceDevices from the configuration but still get the same error.

This is the error:

Exception in thread Thread-2 (trainer_update_func):
Traceback (most recent call last):
  File "C:\Users\rullo\anaconda3\envs\ml_agents\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "C:\Users\rullo\anaconda3\envs\ml_agents\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\rullo\anaconda3\envs\ml_agents\lib\site-packages\mlagents\trainers\trainer_controller.py", line 297, in trainer_update_func
    trainer.advance()
  File "C:\Users\rullo\anaconda3\envs\ml_agents\lib\site-packages\mlagents\trainers\trainer\rl_trainer.py", line 293, in advance
    self._process_trajectory(t)
  File "C:\Users\rullo\anaconda3\envs\ml_agents\lib\site-packages\mlagents\trainers\ppo\trainer.py", line 91, in _process_trajectory
    ) = self.optimizer.get_trajectory_value_estimates(
  File "C:\Users\rullo\anaconda3\envs\ml_agents\lib\site-packages\mlagents\trainers\optimizer\torch_optimizer.py", line 190, in get_trajectory_value_estimates
    value_estimates, next_memory = self.critic.critic_pass(
  File "C:\Users\rullo\anaconda3\envs\ml_agents\lib\site-packages\mlagents\trainers\torch_entities\networks.py", line 487, in critic_pass
    value_outputs, critic_mem_out = self.forward(
  File "C:\Users\rullo\anaconda3\envs\ml_agents\lib\site-packages\mlagents\trainers\torch_entities\networks.py", line 499, in forward
    encoding, memories = self.network_body(
  File "C:\Users\rullo\anaconda3\envs\ml_agents\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\rullo\anaconda3\envs\ml_agents\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\rullo\anaconda3\envs\ml_agents\lib\site-packages\mlagents\trainers\torch_entities\networks.py", line 244, in forward
    encoding = self._body_endoder(encoded_self)
  File "C:\Users\rullo\anaconda3\envs\ml_agents\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\rullo\anaconda3\envs\ml_agents\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\rullo\anaconda3\envs\ml_agents\lib\site-packages\mlagents\trainers\torch_entities\layers.py", line 169, in forward
    return self.seq_layers(input_tensor)
  File "C:\Users\rullo\anaconda3\envs\ml_agents\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\rullo\anaconda3\envs\ml_agents\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\rullo\anaconda3\envs\ml_agents\lib\site-packages\torch\nn\modules\container.py", line 217, in forward
    input = module(input)
  File "C:\Users\rullo\anaconda3\envs\ml_agents\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\rullo\anaconda3\envs\ml_agents\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\rullo\anaconda3\envs\ml_agents\lib\site-packages\torch\nn\modules\linear.py", line 116, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)


HeyThisWasRandomlyMade commented 2 weeks ago

I've had this issue. There are two fixes:

  1. Disable threading in the config file for your agent.

  2. If you want to keep threading enabled, you can use a deprecated PyTorch API (so it may stop working in the future): add the line torch.set_default_tensor_type(torch.cuda.FloatTensor) to the file C:\Users\rullo\anaconda3\envs\ml_agents\lib\site-packages\mlagents\torch_utils\torch.py (assuming that's where it's located).
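For option 1, ML-Agents trainer configs have a per-behavior `threaded` setting. Here is a rough sketch of what turning it off looks like, assuming the YAML config has already been loaded into a Python dict. The behavior name `Hummingbird` and the helper `disable_threading` are made up for illustration and are not part of ML-Agents.

```python
from copy import deepcopy


def disable_threading(config: dict) -> dict:
    """Return a copy of a trainer config with `threaded` set to False for every behavior."""
    patched = deepcopy(config)
    for settings in patched.get("behaviors", {}).values():
        settings["threaded"] = False
    return patched


# Hypothetical config, shaped like the trainer YAML after loading it into a dict.
config = {
    "behaviors": {
        "Hummingbird": {  # hypothetical behavior name
            "trainer_type": "ppo",
            "threaded": True,
        }
    }
}

patched = disable_threading(config)
print(patched["behaviors"]["Hummingbird"]["threaded"])  # prints False
```

In practice it is probably simpler to edit the YAML file by hand and set `threaded: false` under your behavior.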

Anyway, here's an example of the fix that keeps threading enabled:

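# Excerpt from mlagents/torch_utils/torch.py; it relies on torch, TorchSettings,
# and logger as already imported/defined in that file.
def set_torch_config(torch_settings: TorchSettings) -> None:
    global _device

    # Use the configured device if there is one, otherwise prefer CUDA when available.
    if torch_settings.device is None:
        device_str = "cuda" if torch.cuda.is_available() else "cpu"
    else:
        device_str = torch_settings.device

    _device = torch.device(device_str)

    if _device.type == "cuda":
        torch.set_default_device(_device.type)
        torch.set_default_dtype(torch.float32)
        # Added line (the fix). Deprecated API, so it may stop working in a future PyTorch release.
        torch.set_default_tensor_type(torch.cuda.FloatTensor)
    else:
        torch.set_default_dtype(torch.float32)
    logger.debug(f"default Torch device: {_device}")

rullo16 commented 2 weeks ago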

Ok thank you for your reply, this solved my issue :)
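
For anyone landing here later: one possible explanation, not confirmed anywhere in this thread, is that the default device set with `torch.set_default_device` does not carry over to threads started afterwards, which would fit the fact that disabling threading makes the error go away. The sketch below is a way to check this on your own install. It uses the `meta` device as a stand-in for `cuda` so it also runs without a GPU, and the helper name `default_device_in_new_thread` is made up for this example.

```python
import threading

import torch


def default_device_in_new_thread() -> torch.device:
    """Return the device a freshly started thread uses for a new tensor."""
    result = {}

    def worker() -> None:
        # No explicit device here: the tensor lands on whatever default this thread sees.
        result["device"] = torch.zeros(1).device

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return result["device"]


# "meta" stands in for "cuda" so the check also runs on machines without a GPU.
torch.set_default_device("meta")
try:
    main_device = torch.zeros(1).device
    thread_device = default_device_in_new_thread()
finally:
    torch.set_default_device("cpu")  # restore the usual default

print(f"main thread: {main_device}, new thread: {thread_device}")
```

If the two devices differ, tensors created in the trainer thread would end up on the CPU while the model sits on the GPU, which is the kind of mismatch the RuntimeError above complains about.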