amitfishy opened this issue 2 years ago
Mmmh, full device support might be a bit wonky because, for reasons I won't get into, I've always run my experiments on CPU-only clusters. I wonder where the torch.cuda.FloatTensor is coming from; maybe there's some environment variable that defaults the creation of some tensors to the GPU? I'll have to look into this.
I think some of the parameters of the layers in the cnn case get stored on the GPU for some reason. Looking at this part
File "/home/fishy/python_ws/aais-baisero3/code/asym-rlpo/asym_rlpo/representations/gv.py", line 256, in forward
cnn_output = self.cnn(cnn_input)
and adding the following to that function
print('Device0: ', next(self.embedding.parameters()).device)
prints cuda at some point, which then causes the issue when the output is processed together with CPU-based tensors.
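For completeness, a fuller version of that check could look like this (just a sketch; report_devices is a name I made up, and I call it on self.embedding or self.cnn inside gv.py):

import torch.nn as nn

def report_devices(module: nn.Module, label: str = "module") -> None:
    # Print the device of every parameter, to spot CPU/GPU mixing.
    devices = {name: p.device for name, p in module.named_parameters()}
    for name, device in devices.items():
        print(f"{label}.{name}: {device}")
    if len({d.type for d in devices.values()}) > 1:
        print(f"WARNING: {label} has parameters on more than one device type")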
Maybe what is happening is that you're running the code in an environment which is set up such that only GPU devices are available? Sadly, I don't have much experience with the interplay between CPU and GPU devices.
Could you try setting the environment variable export CUDA_VISIBLE_DEVICES="" before running the script? Just to see if there is a way to force everything to go through the CPU.
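If setting it in the shell is inconvenient on your setup, the same thing can be done from Python, as long as it happens before CUDA gets initialized (a small sketch, not something the code currently does for you):

import os

# Hide all GPUs from PyTorch; this must run before CUDA is initialized,
# so keep it above the first import of torch.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch
print(torch.cuda.is_available())  # should now print False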
Yes, that works! I suppose just submitting jobs with only CPU devices should also work, right?
I was running everything on CPU-only machines, so yes, I would assume so.
Apparently, if GPU devices are available, some of the models are automatically created on the GPU, while the rest of the code does not worry about devices at all. If you can run on CPU for the time being, that would be great. I'll also note that I'm not sure the GPU saves all that much time compared to the CPU, since the RNN computations are sequential in nature anyway. But that's just a hunch.
I can come back and try to fix this at some point; it shouldn't be too hard overall, but I'll need to investigate a few things, and I definitely won't have the time in the next 1-2 weeks.
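When I do get to it, the fix will most likely amount to picking one device up front and threading it through model construction and the forward passes. Roughly this pattern (just a sketch of the intent, with placeholder modules, not the actual code):

import torch
import torch.nn as nn

# Choose a single device once, instead of letting submodules decide where they live.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3),
    nn.ReLU(),
    nn.Flatten(),
).to(device)

# Inputs must be moved to the same device before the forward pass.
x = torch.randn(1, 3, 7, 7, device=device)
y = model(x)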
Yes, I also don't think that running on the GPU will be much better. This is good enough for me, thank you!
So, I tried running this on a CPU-only cluster, but strangely it gets stuck during the training part of the loop with no error message (it just hangs and doesn't move forward). This only happens with the cnn option used for the state model instead of the fc option.
On my local machine it runs alright, so I'm not really sure where the problem comes from when running it on the cluster with CPU only.
Hi,
Again, just trying to reproduce things with the GV envs, I'm running the following:
python main_a2c.py ../gym-gridverse/gym_gridverse/registered_envs/gv_memory_four_rooms.7x7.yaml a2c --gv-state-grid-model-type cnn
But I get an error because my machine has both a GPU and a CPU, and the computations are not all being done on the same device. I'm not sure exactly how to fix this.