Closed by paullintilhac 2 months ago
Quick follow-up on this. I can get it to run on Google Colab by editing the file referenced above, /usr/local/lib/python3.10/dist-packages/ffcv/pipeline/graph.py, and changing line 333 from
if next_state.device.type != 'cuda'
to
if next_state.device != 'cuda:0'
Is anyone else experiencing this error with ffcv when trying to run the training code?
I created a corresponding issue in the ffcv repository: https://github.com/libffcv/ffcv/issues/380
You need to replace "cuda:0" with ch.device("cuda:0") in two places: https://github.com/MadryLab/datamodels/blob/61e590a6d857b31b6b11be10800f7c9bba6b400e/examples/cifar10/train_cifar.py#L58 and https://github.com/MadryLab/datamodels/blob/61e590a6d857b31b6b11be10800f7c9bba6b400e/examples/cifar10/train_cifar.py#L68
Thank you! That works. Do you know what the underlying cause of the issue was?
I believe it was an update in ffcv.
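For anyone wondering why the ch.device("cuda:0") change helps: judging from the error message and the check quoted from graph.py line 333, the code reads a .type attribute from the device, which a plain string does not have. The snippet below is only a sketch of that mechanism and needs neither ffcv nor a GPU. device_type is a hypothetical helper written for illustration, not ffcv's actual code, and device_obj is a stand-in that mimics the two attributes of ch.device("cuda:0") used here.

```python
from types import SimpleNamespace


def device_type(device):
    """Return the device family ('cuda', 'cpu', ...) for a string or a device-like object.

    Hypothetical helper for illustration only; not part of ffcv or PyTorch.
    """
    if isinstance(device, str):
        # Strings such as "cuda:0" carry the family before the optional ":index".
        return device.split(":", 1)[0]
    # Device objects such as ch.device("cuda:0") expose the family as `.type`.
    return device.type


# A bare string reproduces the AttributeError reported in this issue.
try:
    "cuda:0".type
except AttributeError as exc:
    error_message = str(exc)

# Stand-in for ch.device("cuda:0"): same `type` and `index` attributes.
device_obj = SimpleNamespace(type="cuda", index=0)

print(error_message)
print(device_type("cuda:0"), device_type(device_obj))
```

Passing ch.device("cuda:0") instead of the string, as suggested above, gives the check an object that actually has .type, so no package files need to be edited.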
I have tried to run the example in this repo both on my university's Slurm cluster and on Google Colab, and I keep getting the same error:
AttributeError: 'str' object has no attribute 'type'
I was able to get rid of this error by editing one of the Python package source files directly, but then it predictably gave me another error:
RuntimeError: No HIP GPUs are available.
I'm not sure how to solve that one, which raises the question of why the first error happens at all.

Steps to reproduce (the file system paths shown here are the ones I used on Colab; replace them with whatever download directory you use on your system):
I have tried many different conda environments, including Python 3.8 (as the repo suggests) and 3.9, CUDA 12.1 and 12.2, and ROCm 6 and 5.4. All of them give me one of the two errors above.
Any idea how I can get around this? Full stack trace: