UCL / scikit-surgerydocker

This repo describes, with a simple example, how to use Docker to containerise your project/algorithm:
https://scikit-surgerydocker.readthedocs.io/en/latest/

Test the steps #30

Closed mianasbat closed 3 years ago

mianasbat commented 3 years ago

Run all the steps on multiple computers to see if everything works. Ideally both CPU and GPU should work fine. Check the behaviour under different NVIDIA driver and CUDA versions.
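When recording each test machine's setup, the host driver and CUDA versions can be checked with the standard tools (this assumes the NVIDIA driver and, optionally, the CUDA toolkit are installed on the host):

```shell
# The driver version and the highest CUDA version it supports
# appear in the header of the nvidia-smi table.
nvidia-smi

# Version of the locally installed CUDA toolkit, if any.
nvcc --version
```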

mianasbat commented 3 years ago

@tdowrick could you please test it on your computer when you get time.

tdowrick commented 3 years ago

I've been having a problem sorting out my Docker installation, so will get round to testing on Monday.

tdowrick commented 3 years ago

I've had a look through everything; a few notes below:

  1. What OS have you tested on so far? I've been trying to run it using Docker for Windows, but am having some issues which may be Windows related rather than anything wrong with the repo.

  2. Can you modify the repo so that the example code for CPU/GPU is already included, either by directly adding the files or by adding them as git submodules? It seems an unnecessary step to have to clone a separate repository and manually copy it over.

  3. The documentation could be a bit clearer in places, but I don't mind giving it all a proof read once we've got a final version in place.
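The submodule approach suggested in (2) could look something like this (the example repository URL and the `src/example` target path are placeholders, not the project's actual layout):

```shell
# Add the example code as a submodule under src/ (URL is hypothetical).
git submodule add https://github.com/UCL/scikit-surgery-example.git src/example

# Anyone cloning the project afterwards fetches the examples in one step:
git clone --recurse-submodules https://github.com/UCL/scikit-surgerydocker.git
```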

mianasbat commented 3 years ago

@tdowrick Thank you for checking it out.

  1. I tested on Mac, but it should work on any OS, since Docker sits above the OS (more or less). Ideally the Docker application should not have any compatibility issues, so let me know or share screenshots if you get the error again.

  2. You are right, I also don't like that extra step, but there are two reasons:

    1. A user will always add their own code, so they will be more interested in finding out how to transfer their code into the appropriate locations.
    2. If I include one or both examples, I have to put them in src, which will clutter the place, and the user will have to delete the existing files before adding their own. I will need to discuss this to decide how we should do it.
  3. Yeah, I think there is a lot of scope for improvement. It would be nice to review it.

tdowrick commented 3 years ago

CPU example:

$ docker run -v "$PWD/input_data:/usr/program/input_data" -v "$PWD/output_data:/usr/program/output_data" my-project
A new file is created successfully in ../output_data/output_file.txt

The command runs successfully, but the output file isn't actually created on my local drive.

$ ls
 CONTRIBUTING.md   Dockerfile   input_data/  'input_data;C'/   LICENSE.md   output_data/  'output_data;C'/   project/   README.md   src/

The output_data directory is empty, and there are also two new directories 'input_data;C' and 'output_data;C' that I don't think should be there?

GPU example:

$ docker run -p 5000:5000 --gpus all my-project2
docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request: unknown.

This might be related to my CUDA installation?

mianasbat commented 3 years ago

I think I fixed the missing output file in issue #33 -- diff. Could you check whether you are using the updated docs?

I think CUDA comes inside the container, so it should not be a CUDA problem. It looks like an NVIDIA driver issue: maybe Docker doesn't have permission to access the NVIDIA driver, or there is no driver installed.

I would suggest running Docker as administrator if it isn't already.

I will try running it on one of the PTs today or this weekend and will update here.

mianasbat commented 3 years ago

@tdowrick
I checked the CPU example on PT5 and it worked fine. I think the instructions were not yet merged when you were testing. Give it another try and it should work.

mianasbat commented 3 years ago

The GPU example error that you got from

docker run -p 5000:5000 --gpus all my-project2

could be fixed by installing nvidia-container-toolkit:

# Detect the distribution (e.g. ubuntu18.04) used in the repository URL below
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)

# Add the package repositories
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Reference: https://github.com/NVIDIA/nvidia-docker/issues/1186
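After installing the toolkit, a quick smoke test (the CUDA image tag here is only an example) is to run nvidia-smi from inside a base container:

```shell
# If the toolkit and driver are working, this prints the usual GPU table
# from inside the container; if not, it reproduces the error above.
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
```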

mianasbat commented 3 years ago

Now I got another error 😄

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=50 error=30 : unknown error
Traceback (most recent call last):
  File "app.py", line 26, in <module>
    model = load_model(loadmodel='PSMNet/trained_models/pretrained_model_KITTI2015.tar')
  File "/usr/program/src/PSMNet/Test_img.py", line 64, in load_model
    model.cuda()
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 304, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 201, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 201, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 201, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 223, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 304, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 197, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (30) : unknown error at /pytorch/aten/src/THC/THCGeneral.cpp:50

Checking the reason:

tdowrick commented 3 years ago

I figured out the CPU problem is related to this:

https://stackoverflow.com/questions/50608301/docker-mounted-volume-adds-c-to-end-of-windows-path-when-translating-from-linux

Running this command, with a / added before $PWD, solved the problem for me.

docker run -v /$PWD/input_data:/usr/program/input_data -v /$PWD/output_data:/usr/program/output_data my-project

If you can check this command still works on linux, we can update the docs.
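A possible alternative workaround, assuming the mangling comes from Git Bash's MSYS path conversion, is to disable the conversion for a single command rather than prefixing the paths:

```shell
# MSYS_NO_PATHCONV=1 tells Git for Windows not to rewrite Unix-style
# paths, so the container-side paths are passed through untouched.
MSYS_NO_PATHCONV=1 docker run \
    -v "$PWD/input_data:/usr/program/input_data" \
    -v "$PWD/output_data:/usr/program/output_data" \
    my-project
```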

mianasbat commented 3 years ago

Thanks @tdowrick, I still have to check it, but you are right, that makes perfect sense: Docker itself is platform independent, but here the command is being executed from a Windows shell. I will test it on Mac and Linux; if it works, we will update the documentation so that it works on all platforms.

Maybe if we run it from Git Bash it will run fine.

tdowrick commented 3 years ago

I was using git bash when I had this problem.

mianasbat commented 3 years ago

I checked the GPU example on PT2 and noticed two things:

  1. nvidia-container-toolkit is required on the machine running Docker.
  2. With the latest CUDA 11, the GPU example worked fine on PT2. So I think the only difference between PT5 and PT2 is the CUDA version: PT5 has CUDA 9. Since that is quite old, it may not support Docker (not sure).

Anyway, I am updating the CUDA driver on PT5 and checking whether it fixes the issue.

mianasbat commented 3 years ago

Okay, after the update I am still getting the same error mentioned above, so the issue is not caused by CUDA.

To troubleshoot further, I executed the stereo reconstruction application directly on PT5 without Docker, and it worked... troubleshooting continues...

mianasbat commented 3 years ago

I spent quite some time exploring the error

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=50 error=30 : unknown error

I tested running the GPU example on all PTs. On White Cube, PT2 and PT5 the GPU example works fine, but on PT1, PT3, PT4 and PT6 I get the same error.

Some forums suggest it's a bug that can be fixed by a reboot. I rebooted PT6 this morning but still got the same error. Troubleshooting continues...

mianasbat commented 3 years ago

Okay, so the issue is finally resolved. It was due to PyTorch/CUDA compatibility: PyTorch 1.4.0 is not compatible with CUDA 11.1. I changed the CUDA version to 10.2 with the same PyTorch version and the error disappeared. I will start updating the documentation now.
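A quick way to confirm which CUDA build of PyTorch ends up inside the image (assuming the image lets you override the command and has python on the path) is:

```shell
# Prints the PyTorch version, the CUDA version it was built against,
# and whether a GPU is actually visible to it.
docker run --rm --gpus all my-project2 \
    python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```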

tdowrick commented 3 years ago

Great - let me know when it is ready and I can test on Windows again.

mianasbat commented 3 years ago

@tdowrick Please give both examples a test when you get time. I believe it should behave a bit better than before 😄

tdowrick commented 3 years ago

The GPU example still isn't working for me, with error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request: unknown.

But I think this might just be because Docker for Windows doesn't fully support GPU virtualisation yet. They announced a preview build with this functionality, but it also requires installing a preview build of the Windows Subsystem for Linux, so in practice I'm not sure anyone will be using this functionality at the moment.

https://www.docker.com/blog/wsl-2-gpu-support-is-here/ https://developer.nvidia.com/blog/announcing-cuda-on-windows-subsystem-for-linux-2/

mianasbat commented 3 years ago

@tdowrick Yes, I will look into it, but I don't know whether nvidia-container-toolkit is available for Windows. As discussed in issue #37, we need to install nvidia-container-toolkit on the host machine to share the NVIDIA GPU properly. I will explore it and update.

mianasbat commented 3 years ago

@tdowrick
I tried both examples on Windows 10 with Git Bash installed.
Example 1 worked fine for me, but with example 2 I got the same error you mentioned above.
It might be for two reasons:

  1. The old GPUs may not work with the NVIDIA container concept.
  2. nvidia-container-runtime may be missing on Windows. I tried to find nvidia-container-runtime for Windows, but I think the package is only available for Linux.

I don't have a GPU at home, but I am checking whether nvidia-container-runtime is available for Windows or Mac.

Checking further.

mianasbat commented 3 years ago

Raised an issue with nvidia-container-runtime https://github.com/NVIDIA/nvidia-container-runtime/issues/133

mianasbat commented 3 years ago

Okay, so as I expected, the team confirmed that there is no nvidia-container-runtime for Windows or Mac.

So until we find another way, it is not possible to share a Windows GPU with Docker.

We can close the issue for now, if you all agree.

tdowrick commented 3 years ago

Yes, happy to close now, as long as the docs can be updated to make it clear that the GPU example won't run on Windows.