Xilinx / logicnets

Apache License 2.0

PyTorch version doesn't support CUDA within the docker container #15

Closed Ali-Homsi closed 3 years ago

Ali-Homsi commented 3 years ago

Hello, I wanted to bring this issue to your attention. It looks like the PyTorch installed within the Docker container doesn't support CUDA, so when trying to run the script train.py with the --cuda flag, the following error is raised: `AssertionError: Torch not compiled with CUDA enabled`. The function torch.cuda.is_available() always returns False when run inside the container; however, it returns True outside of the container (I have PyTorch 1.9.0 and CUDA 11.1 installed on my machine).

In step 22 of the Docker build, I noticed that you are installing the CPU-only version of PyTorch. Is there a specific reason for that? Step 22/33 : RUN conda install -y pytorch==1.4.0 torchvision==0.5.0 cpuonly -c pytorch && conda clean -ya

I tried adding the CUDA toolkit to PyTorch by running conda install pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.1 -c pytorch, but even after that, torch.cuda.is_available() still returned False.

Upgrading to a newer version of PyTorch (1.5.0, for example) with cudatoolkit=10.2 by running conda install pytorch==1.5.0 torchvision==0.6.0 cudatoolkit=10.2 -c pytorch didn't solve the issue either: torch.cuda.is_available() still returned False, and the upgrade seems to lead to a runtime error: https://pastebin.com/Gx0BvV7U

All of these attempts were in the hope of getting the model to train on the GPU. Is there something that can be done about that?
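For reference, a quick way to tell whether an installed PyTorch wheel is a cpuonly build (as opposed to a CUDA build that simply can't see a GPU) is to compare torch.version.cuda with torch.cuda.is_available(). A minimal sketch; the diagnose helper is illustrative, not part of LogicNets:

```python
def diagnose(built_cuda, runtime_available):
    """Classify a CUDA problem from two torch values.

    built_cuda:        torch.version.cuda  (None for cpuonly builds)
    runtime_available: torch.cuda.is_available()
    """
    if built_cuda is None:
        return "cpuonly build: reinstall a CUDA-enabled PyTorch package"
    if not runtime_available:
        return ("built with CUDA {} but no usable GPU: check the NVIDIA "
                "driver and that the container was started with GPU access"
                .format(built_cuda))
    return "CUDA is available"

# usage inside the container (requires torch):
#   import torch
#   print(diagnose(torch.version.cuda, torch.cuda.is_available()))
```

Since the image installs the `cpuonly` package, torch.version.cuda is None there, which is why reinstalling with cudatoolkit alone may not help unless the cpuonly package is also removed.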

nickfraser commented 3 years ago

> In step 22 of the Docker build, I noticed that you are installing the CPU-only version of PyTorch. Is there a specific reason for that? Step 22/33 : RUN conda install -y pytorch==1.4.0 torchvision==0.5.0 cpuonly -c pytorch && conda clean -ya

Thanks for creating this issue. This is a known issue, and the lack of GPU support is intentional. We occasionally see large accuracy differences (~5-10%) between training on the CPU and the GPU (it even varies between GPU generations). This occurs even when we conform to PyTorch's suggested settings for reproducibility.
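For context, the "suggested settings for reproducibility" are the usual global seeding plus the deterministic cuDNN flags. A hedged sketch of what that typically looks like; seed_everything is an illustrative helper, not LogicNets code:

```python
import os
import random


def seed_everything(seed=0):
    """Apply the commonly recommended PyTorch reproducibility settings."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # seeds CPU (and, on CUDA builds, all GPUs)
        torch.backends.cudnn.deterministic = True  # use deterministic kernels
        torch.backends.cudnn.benchmark = False     # disable autotuning
    except ImportError:
        pass
```

Even with these flags set, some CUDA operations have no deterministic implementation, which is consistent with the CPU/GPU accuracy gaps described above.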

As a result, we've had to disable GPU support in the current release to avoid users being unable to reproduce our results. We continue to work on this issue.

You're welcome to add GPU support yourself and send a pull request.

> I tried adding the CUDA toolkit to PyTorch by running conda install pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.1 -c pytorch, but even after that, torch.cuda.is_available() still returned False.

> Upgrading to a newer version of PyTorch (1.5.0, for example) with cudatoolkit=10.2 by running conda install pytorch==1.5.0 torchvision==0.6.0 cudatoolkit=10.2 -c pytorch didn't solve the issue either: torch.cuda.is_available() still returned False, and the upgrade seems to lead to a runtime error: https://pastebin.com/Gx0BvV7U

This sounds like an issue with your Docker installation (your specific Docker version), your host environment (NVIDIA driver version), or the packages you're adding within the Docker environment (is the CUDA toolkit version you're installing compatible with your host driver?). Debugging this on your behalf is beyond the scope of this release, so I suggest you troubleshoot it on your side.
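One concrete compatibility check: each CUDA toolkit release has a minimum Linux driver version, and a container uses the host's driver, so the cudatoolkit installed inside must be paired with a new-enough driver outside. The minimums below are taken from NVIDIA's published compatibility tables to the best of my knowledge; verify them against the CUDA release notes for your exact versions:

```python
# Minimum Linux NVIDIA driver per CUDA toolkit release
# (per NVIDIA's CUDA compatibility tables; double-check for your release).
MIN_LINUX_DRIVER = {
    "10.1": (418, 39),
    "10.2": (440, 33),
    "11.1": (455, 23),
}


def driver_supports(cuda_version, driver_major, driver_minor):
    """Return True if the host driver meets the toolkit's minimum."""
    needed = MIN_LINUX_DRIVER.get(cuda_version)
    if needed is None:
        raise KeyError("unknown CUDA version: " + cuda_version)
    return (driver_major, driver_minor) >= needed

# e.g. a 450.x host driver is new enough for cudatoolkit=10.2,
# but not for cudatoolkit=11.1.
```

The host driver version is the first number reported in the header of nvidia-smi.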

Ali-Homsi commented 3 years ago

Thank you for the information