Closed silvandeleemput closed 4 years ago
I think this is not an issue. The current container runs successfully on GC, which only has Turing cards at the moment, right?
I think you confused NVIDIA drivers with CUDA toolkits a bit. Older CUDA toolkit versions (like the one we are currently running in the processor container) still work on newer cards as long as the NVIDIA driver version is higher than the version listed in Table 1 here: https://docs.nvidia.com/deploy/cuda-compatibility/index.html.
We are running the tensorflow_pytorch_diag_dicomloader_python2 base image here, which is derived from nvidia/cuda:9.1-cudnn7-runtime-ubuntu16.04, i.e. CUDA 9.1. So as long as James is using an NVIDIA driver version more recent than 390.46, we are fine. He is definitely using a fairly new driver, because older driver versions support neither newer CUDA versions nor Turing cards. So I think we are fine.
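The driver-vs-toolkit check described above can be sketched as follows. This is a simplified illustration, not an official NVIDIA tool; the minimum driver versions are copied from Table 1 of the compatibility docs for the two toolkits mentioned in this thread, and the version parsing is a naive assumption:

```python
# Sketch: check whether an installed NVIDIA driver is new enough for a
# given CUDA toolkit, per Table 1 of NVIDIA's CUDA compatibility docs.
# The minimums below are assumptions copied from that table (Linux).

CUDA_MIN_DRIVER = {
    "9.1": "390.46",   # CUDA 9.1 (what the base image ships)
    "10.0": "410.48",  # first toolkit with Turing support
}

def parse_version(v: str) -> tuple:
    """Turn a dotted version string like '390.46' into (390, 46)."""
    return tuple(int(part) for part in v.split("."))

def driver_ok(cuda_version: str, driver_version: str) -> bool:
    """True if the driver meets the minimum for this CUDA toolkit."""
    minimum = CUDA_MIN_DRIVER[cuda_version]
    return parse_version(driver_version) >= parse_version(minimum)

# A Turing-era driver easily satisfies the CUDA 9.1 minimum:
print(driver_ok("9.1", "418.67"))   # True
print(driver_ok("9.1", "384.81"))   # False
```

In practice the driver version would come from something like `nvidia-smi` on the host; the point is only that the toolkit inside the container sets a floor on the host driver version, not a ceiling.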
See section 1.4 here: https://docs.nvidia.com/cuda/turing-compatibility-guide/index.html
If you agree, please close this issue.
Upgrading to Python3 and using the diagdicomloader are still relevant, but can be in separate issues.
Yes, grand challenge only uses Turing cards right now.
Thanks @jmsmkn !
@silvandeleemput, please comment, see above (I forgot to mention you previously).
@cjacobs1 I see that my previous analysis in the post above is indeed incorrect, since Docker images do not include the drivers. In principle things should then run fine (as we have observed). However, I inspected and tested the latest image on various machines and found that while it runs fine on cards like the GTX 1080 Ti, it raises some errors on RTX 2080 Ti cards:
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=844 error=11 : invalid argument
These errors appear to be non-critical, since the correct output is still generated.
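One way to back the "correct output is still generated" claim is to compare checksums of the outputs produced on the two card types. A minimal sketch (the output file names are hypothetical, not from this repository):

```python
import hashlib

def file_sha256(path: str) -> str:
    """Stream a file through SHA-256 so large outputs need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical output files from runs on a GTX 1080 Ti and an RTX 2080 Ti:
# assert file_sha256("output_gtx1080ti.nii.gz") == file_sha256("output_rtx2080ti.nii.gz")
```

Note that bitwise equality may be too strict for floating-point outputs across different GPU architectures; a tolerance-based comparison of the decoded values may be more appropriate.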
I expect that these errors are related to the PyTorch version used, and that reinstalling a newer version of PyTorch in the Dockerfile might be all that's needed, but that doesn't seem necessary at the moment.
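If the upgrade does become necessary at some point, reinstalling PyTorch on top of the base image could look roughly like this. This is an untested sketch: the version pin is a guess, and whether a newer wheel is compatible with Python 2 and the CUDA 9.x runtime in this image would need to be verified first:

```dockerfile
FROM doduo1.umcn.nl/uokbaseimage/tensorflow_pytorch_diag_dicomloader_python2:2

# Hypothetical: replace the bundled PyTorch with a newer build.
# The exact version that still supports Python 2 and the CUDA runtime
# shipped in this image must be checked before relying on this.
RUN pip install --upgrade "torch==1.4.0"
```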
Split the issues into #44 and #45, closing this for now.
The current Dockerfile is incompatible with the modern Turing cards since the compute capabilities don't match: https://docs.nvidia.com/deploy/cuda-compatibility/index.html
The drivers in the container target CUDA 9.1.85 (driver 390.46), but Turing architectures require at least driver 410.48.
Since the Dockerfile relies on doduo1.umcn.nl/uokbaseimage/tensorflow_pytorch_diag_dicomloader_python2:2, upgrading can become quite involved.