amitfishy opened this issue 2 years ago
Mmmh, full device support might be a bit wonky because, for reasons I won't get into, I've always run my experiments on CPU-only clusters. I wonder where the torch.cuda.FloatTensor is coming from; maybe there's some environment variable that defaults the creation of some tensors to the GPU? I'll have to look into this.
I think some of the parameters of the layers in the cnn case get stored on the GPU for some reason. Looking at this part
File "/home/fishy/python_ws/aais-baisero3/code/asym-rlpo/asym_rlpo/representations/gv.py", line 256, in forward
cnn_output = self.cnn(cnn_input)
and adding the following to that function
print('Device0: ', next(self.embedding.parameters()).device)
prints cuda at some point, which then causes the issue when the output is processed together with CPU-based tensors.
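For completeness, a fuller version of that check could look like this (just a sketch; report_devices is a name I made up, and I call it on self.embedding or self.cnn inside gv.py):

import torch.nn as nn

def report_devices(module: nn.Module, label: str = "module") -> None:
    # Print the device of every parameter, to spot CPU/GPU mixing.
    devices = {name: p.device for name, p in module.named_parameters()}
    for name, device in devices.items():
        print(f"{label}.{name}: {device}")
    if len({d.type for d in devices.values()}) > 1:
        print(f"WARNING: {label} has parameters on more than one device type")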
Maybe what is happening is that you're running the code in an environment which is set up such that only GPU devices are available? Sadly, I don't have much experience with the interplay between CPU and GPU devices.
Could you try setting the environment variable export CUDA_VISIBLE_DEVICES="" before running the script? Just to see if there is a way to force everything to go through the CPU.
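If setting it in the shell is inconvenient on your setup, the same thing can be done from Python, as long as it happens before CUDA gets initialized (a small sketch, not something the code currently does for you):

import os

# Hide all GPUs from PyTorch; this must run before CUDA is initialized,
# so keep it above the first import of torch.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch
print(torch.cuda.is_available())  # should now print False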
Yes, that works! I suppose just submitting jobs with only CPU devices should also work, right?
I was running everything on CPU-only machines, so yes, I would assume so.
Apparently, if GPU devices are available, some of the models are automatically created on the GPU, while the rest of the code does not worry about devices at all. If you can run on CPU for the time being, that would be great. I'll also note that I'm not sure the GPU saves all that much time compared to the CPU, since the RNN computations are sequential in nature anyway. But that's just a hunch.
I can come back and try to fix this at some point; it shouldn't be too hard overall, but I'll need to investigate a few things, and I definitely won't have the time in the next 1-2 weeks.
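When I do get to it, the fix will most likely amount to picking one device up front and threading it through model construction and the forward passes. Roughly this pattern (just a sketch of the intent, with placeholder modules, not the actual code):

import torch
import torch.nn as nn

# Choose a single device once, instead of letting submodules decide where they live.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3),
    nn.ReLU(),
    nn.Flatten(),
).to(device)

# Inputs must be moved to the same device before the forward pass.
x = torch.randn(1, 3, 7, 7, device=device)
y = model(x)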
Yes, I also don't think that running on the GPU will be much better. This is good enough for me, thank you!
So, I tried running this on a CPU-only cluster, but strangely it gets stuck during the training part of the loop with no error message (it just hangs and doesn't move forward). This only happens with the cnn option used for the state model instead of the fc option.
On my local machine it runs alright, so I'm not really sure where the problem comes from when running it on the cluster with CPU only.
Hi,
Again, just trying to reproduce things with the GV envs, I'm running the following:
python main_a2c.py ../gym-gridverse/gym_gridverse/registered_envs/gv_memory_four_rooms.7x7.yaml a2c --gv-state-grid-model-type cnn
But I get an error because my machine has both a GPU and a CPU, and the computations are not all being done on the same device. I'm not sure exactly how to fix this.