Closed mianasbat closed 3 years ago
@tdowrick could you please test it on your computer when you get time.
I've been having a problem sorting out my Docker installation, so I will get round to testing on Monday.
I've had a look through everything; a few notes below:
What OS have you tested on so far? I've been trying to run using Docker for Windows, but am having some issues, which may be Windows related, rather than anything wrong with the repo.
Can you modify the repo so that the example code for CPU/GPU is already included, either by directly adding the files or by adding them as git submodules? It seems an unnecessary step to have to clone a separate repository and manually copy it over.
The documentation could be a bit clearer in places, but I don't mind giving it all a proofread once we've got a final version in place.
@tdowrick Thank you for checking it out.
I tested on Mac, but it should work on any OS, because Docker sits above the OS (more or less). Ideally a Docker application should not have any compatibility issues, so let me know, or share screenshots, if you get the error again.
You are right, I also don't like that extra step, but there are two reasons. Adding the example code directly (or as a submodule) under src would populate that directory, and the user would have to delete the existing contents before adding their own. I will need to discuss it to see how we should do it. Yeah, I think there is a lot of scope for improvement. It will be nice to review it.
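For reference, the submodule route discussed above looks roughly like this. The repository URL and target path are placeholders; a throwaway local repository stands in for the separate example-code repo so the sketch runs anywhere:

```shell
set -e
tmp=$(mktemp -d)

# Stand-in for the separate example-code repository.
git init -q "$tmp/upstream"
git -C "$tmp/upstream" -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "example code"

# The project repository, pulling the example code in under src/.
# (protocol.file.allow is only needed for local-path submodules in newer git.)
git init -q "$tmp/consumer"
git -C "$tmp/consumer" -c protocol.file.allow=always \
    submodule --quiet add "$tmp/upstream" src

# A later clone would then fetch the example code with:
#   git submodule update --init --recursive
echo "submodule recorded in $tmp/consumer/.gitmodules"
```

With this in place, cloning the project and running `git submodule update --init` replaces the manual clone-and-copy step.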
$ docker run -v "$PWD/input_data:/usr/program/input_data" -v "$PWD/output_data:/usr/program/output_data" my-project
A new file is created successfully in ../output_data/output_file.txt
The command runs successfully, but the output file isn't actually created on my local drive.
$ ls
CONTRIBUTING.md Dockerfile input_data/ 'input_data;C'/ LICENSE.md output_data/ 'output_data;C'/ project/ README.md src/
The output_data directory is empty, and there are also two new directories 'input_data;C' and 'output_data;C' that I don't think should be there?
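A plausible explanation for those names (this is an assumption about Git Bash/MSYS behaviour, not something confirmed from its source): MSYS rewrites arguments that look like POSIX path lists into Windows form, joining entries with ';', and Docker then splits the -v argument on ':' again. A quick illustration of how that produces an 'input_data;C' fragment (paths are made up):

```python
# Hypothetical MSYS rewrite of the -v argument before docker.exe sees it:
posix_arg = "/c/work/input_data:/usr/program/input_data"
windows_arg = "C:\\work\\input_data;C:\\Program Files\\Git\\usr\\program\\input_data"

# Docker splits volume specs on ':' -- note the mangled middle piece,
# which matches the stray 'input_data;C' directory in the listing above:
parts = windows_arg.split(":")
print(parts)
```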
$ docker run -p 5000:5000 --gpus all my-project2
docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request: unknown.
This might be related to my CUDA installation?
I believe I fixed the missing output file creation in Issue #33 -- diff. Could you check that you are using the updated docs?
I think CUDA comes inside the container, so it should not be a CUDA problem. It looks like an NVIDIA driver issue: maybe Docker does not have permission to access the NVIDIA driver, or there is no driver installed.
On Windows, I would suggest running Docker as administrator if you are not already.
I will try running it on one of the PTs today or this weekend and will update here.
@tdowrick
I checked the CPU example on PT5 and it worked fine. I think the instructions had not been merged when you were testing. Give it a try and it should work.
The GPU example error that you got as a result of
docker run -p 5000:5000 --gpus all my-project2
could be fixed by installing nvidia-container-toolkit:
# Add the package repositories
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
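After the toolkit install, a quick smoke test is to run nvidia-smi inside a CUDA base image (the image tag below is an assumption; any CUDA base image will do). The guard keeps the snippet harmless on machines without Docker or a GPU:

```shell
# If the driver and container toolkit are wired up correctly,
# nvidia-smi prints the GPU table from inside the container.
if command -v docker >/dev/null 2>&1 \
    && docker run --rm --gpus all nvidia/cuda:10.2-base nvidia-smi 2>/dev/null; then
  gpu_status="gpu-ok"
else
  gpu_status="gpu-unavailable"
fi
echo "$gpu_status"
```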
Reference: https://github.com/NVIDIA/nvidia-docker/issues/1186
Now I got another error 😄
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=50 error=30 : unknown error
Traceback (most recent call last):
File "app.py", line 26, in <module>
model = load_model(loadmodel='PSMNet/trained_models/pretrained_model_KITTI2015.tar')
File "/usr/program/src/PSMNet/Test_img.py", line 64, in load_model
model.cuda()
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 304, in cuda
return self._apply(lambda t: t.cuda(device))
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 201, in _apply
module._apply(fn)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 201, in _apply
module._apply(fn)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 201, in _apply
module._apply(fn)
[Previous line repeated 2 more times]
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 223, in _apply
param_applied = fn(param)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 304, in <lambda>
return self._apply(lambda t: t.cuda(device))
File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 197, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (30) : unknown error at /pytorch/aten/src/THC/THCGeneral.cpp:50
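The traceback shows the crash happens the moment model.cuda() is called. A small defensive wrapper (a sketch; the surrounding names come from the traceback above, and the fallback behaviour is a suggestion, not the project's actual code) would turn the opaque runtime error into an actionable message:

```python
try:
    import torch
except ImportError:  # keeps the sketch runnable without PyTorch installed
    torch = None

def to_device(model):
    """Move the model to the GPU only when CUDA is actually usable."""
    if torch is not None and torch.cuda.is_available():
        return model.cuda()
    print("CUDA unavailable: check the NVIDIA driver / container toolkit")
    return model
```

In Test_img.py's load_model, this would replace the bare model.cuda() call.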
Checking the reason:
I figured out the CPU problem is related to this:
Running this command, with a / added before $PWD, solved the problem for me.
docker run -v /$PWD/input_data:/usr/program/input_data -v /$PWD/output_data:/usr/program/output_data my-project
If you can check that this command still works on Linux, we can update the docs.
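An alternative to the leading slash (assuming Git for Windows, whose MSYS layer honours the MSYS_NO_PATHCONV variable) is to disable path conversion for just this command; on Linux/macOS the variable is simply ignored, so one invocation covers every platform:

```shell
# MSYS_NO_PATHCONV=1 stops Git Bash rewriting the -v path arguments.
# The guard makes the snippet safe where docker or the image is absent.
if MSYS_NO_PATHCONV=1 docker run \
     -v "$PWD/input_data:/usr/program/input_data" \
     -v "$PWD/output_data:/usr/program/output_data" \
     my-project 2>/dev/null; then
  run_status="ran"
else
  run_status="skipped (no docker or image on this machine)"
fi
echo "$run_status"
```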
Thanks @tdowrick, I still have to check it, but yes, you are right, that makes perfect sense: the command itself is platform dependent because it is being interpreted by the Windows shell before Docker sees it. I will test it on Mac and Linux; if it works, we will update the documentation so it works on all platforms.
Maybe if we run it from Git Bash it will run fine.
I was using git bash when I had this problem.
I checked the GPU example on PT2 and noticed two things.
Anyway, I am updating the CUDA driver on PT5 and checking if it fixes the issue.
Okay, after the update I am still getting the same error mentioned above, so the issue is not because of CUDA.
To troubleshoot further, I executed the stereo reconstruction application directly on PT5, without Docker, and it worked... troubleshooting continues...
I spent quite some time exploring the error
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=50 error=30 : unknown error
I tested running the GPU example on all PTs. On white cube, PT2 and PT5 the GPU example works fine, but on PT1, PT3, PT4 and PT6 I get the same error.
Some forums suggest that it's a bug that can be fixed by a reboot. I rebooted PT6 this morning but still got the same error. Troubleshooting continues...
Okay, the issue is finally resolved. It was due to PyTorch and CUDA compatibility: PyTorch 1.4.0 is not compatible with CUDA 11.1. I changed the CUDA version to 10.2, kept the same PyTorch version, and the error disappeared. I will start updating the documentation now.
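The underlying rule, as I understand it, is that the driver's maximum supported CUDA version must be at least the toolkit version baked into the container. The helper below is an illustrative sketch of that constraint, not part of the project:

```python
def cuda_compatible(toolkit: str, driver_max: str) -> bool:
    """True when a container's CUDA toolkit can run under the host driver.

    A driver advertising CUDA support up to X.Y can generally run any
    toolkit <= X.Y; a toolkit newer than the driver supports fails,
    which is exactly the error seen above.
    """
    def parse(v: str):
        return tuple(int(p) for p in v.split("."))
    return parse(toolkit) <= parse(driver_max)

print(cuda_compatible("10.2", "11.1"))  # True  -- older toolkit, newer driver
print(cuda_compatible("11.1", "10.2"))  # False -- the failing combination above
```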
Great - let me know when it is ready and I can test on Windows again.
@tdowrick Please give both examples a test when you get time. I believe it should behave a bit better than before 😄
The GPU example still isn't working for me, with error:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request: unknown.
But I think this might just be because Docker for Windows doesn't fully support GPU virtualisation yet. They announced a preview build that has this functionality, but it also requires installing a preview build of Windows Subsystem for Linux, so in practice I'm not sure anyone will be using this functionality at the moment.
https://www.docker.com/blog/wsl-2-gpu-support-is-here/ https://developer.nvidia.com/blog/announcing-cuda-on-windows-subsystem-for-linux-2/
@tdowrick Yes, I will look into it, but I don't know whether nvidia-container-toolkit is available for Windows. As discussed on issue #37, we need to install nvidia-container-toolkit on the host machine to share the NVIDIA GPU properly. I will explore it and update here.
@tdowrick
I tried both examples on Windows 10 with Git Bash installed.
Example 1 worked fine for me, but in example 2 I got the same error you mentioned above.
It might be for two reasons:
1: Old GPUs may not be able to work with the NVIDIA container concept.
I also don't have a GPU at home, but I am checking whether nvidia-container-runtime is available for Windows or Mac.
Checking further.
Raised an issue with nvidia-container-runtime https://github.com/NVIDIA/nvidia-container-runtime/issues/133
Okay, so as I expected, the team confirmed that there is no nvidia-container-runtime for Windows or Mac.
So until we find another way, it's not possible to share the Windows GPU with Docker.
We can close the issue for now, if you all agree.
Yes, happy to close now, as long as the docs can be updated to make it clear that the GPU example won't run on Windows.
Run all the steps on multiple computers to see if everything is working; ideally both the CPU and GPU examples should work fine. Check the behaviour under different NVIDIA driver and CUDA versions.