Closed silvandeleemput closed 4 years ago
I think this is not an issue. The current container runs successfully on GC, which only has Turing cards at the moment, right?
I think you confused NVIDIA drivers with CUDA toolkits a bit. Older CUDA toolkit versions (like the one we are currently running in the processor container) still work on newer cards as long as the NVIDIA driver version is higher than the version listed in Table 1 here: https://docs.nvidia.com/deploy/cuda-compatibility/index.html.
We are running the tensorflow_pytorch_diag_dicomloader_python2 base image here, which is derived from nvidia/cuda:9.1-cudnn7-runtime-ubuntu16.04, i.e. CUDA 9.1. So as long as James is using an NVIDIA driver version more recent than 390.46, we are fine. He is definitely using a fairly new driver, because older driver versions support neither newer CUDA versions nor Turing cards. So I think we are fine.
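The driver-vs-toolkit check described above can be sketched as follows. This is a simplified illustration, not an official NVIDIA tool; the minimum driver versions are copied from Table 1 of the compatibility docs for the two toolkits mentioned in this thread, and the version parsing is a naive assumption:

```python
# Sketch: check whether an installed NVIDIA driver is new enough for a
# given CUDA toolkit, per Table 1 of NVIDIA's CUDA compatibility docs.
# The minimums below are assumptions copied from that table (Linux).

CUDA_MIN_DRIVER = {
    "9.1": "390.46",   # CUDA 9.1 (what the base image ships)
    "10.0": "410.48",  # first toolkit with Turing support
}

def parse_version(v: str) -> tuple:
    """Turn a dotted version string like '390.46' into (390, 46)."""
    return tuple(int(part) for part in v.split("."))

def driver_ok(cuda_version: str, driver_version: str) -> bool:
    """True if the driver meets the minimum for this CUDA toolkit."""
    minimum = CUDA_MIN_DRIVER[cuda_version]
    return parse_version(driver_version) >= parse_version(minimum)

# A Turing-era driver easily satisfies the CUDA 9.1 minimum:
print(driver_ok("9.1", "418.67"))   # True
print(driver_ok("9.1", "384.81"))   # False
```

In practice the driver version would come from something like `nvidia-smi` on the host; the point is only that the toolkit inside the container sets a floor on the host driver version, not a ceiling.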
See section 1.4 here: https://docs.nvidia.com/cuda/turing-compatibility-guide/index.html
If you agree, please close this issue.
Upgrading to Python3 and using the diagdicomloader are still relevant, but can be in separate issues.
Yes, grand challenge only uses Turing cards right now.
Thanks @jmsmkn !
@silvandeleemput, please comment, see above (I forgot to mention you previously).
@cjacobs1 I see that my previous analysis in the post above is indeed incorrect, since Docker images do not include the drivers. In principle things should then run fine (as we have observed). However, I inspected and tested the latest image on various machines and found that while it runs fine on cards like the GTX 1080 Ti, it raises some errors on RTX 2080 Ti cards:
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=844 error=11 : invalid argument
These errors appear to be non-critical, since the correct output is still generated.
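One way to back the "correct output is still generated" claim is to compare checksums of the outputs produced on the two card types. A minimal sketch (the output file names are hypothetical, not from this repository):

```python
import hashlib

def file_sha256(path: str) -> str:
    """Stream a file through SHA-256 so large outputs need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical output files from runs on a GTX 1080 Ti and an RTX 2080 Ti:
# assert file_sha256("output_gtx1080ti.nii.gz") == file_sha256("output_rtx2080ti.nii.gz")
```

Note that bitwise equality may be too strict for floating-point outputs across different GPU architectures; a tolerance-based comparison of the decoded values may be more appropriate.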
I expect that these errors are related to the PyTorch version used, and that reinstalling a newer version of PyTorch in the Dockerfile might be all that's needed, but that doesn't seem necessary at the moment.
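If the upgrade does become necessary at some point, reinstalling PyTorch on top of the base image could look roughly like this. This is an untested sketch: the version pin is a guess, and whether a newer wheel is compatible with Python 2 and the CUDA 9.x runtime in this image would need to be verified first:

```dockerfile
FROM doduo1.umcn.nl/uokbaseimage/tensorflow_pytorch_diag_dicomloader_python2:2

# Hypothetical: replace the bundled PyTorch with a newer build.
# The exact version that still supports Python 2 and the CUDA runtime
# shipped in this image must be checked before relying on this.
RUN pip install --upgrade "torch==1.4.0"
```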
Split the issues into #44 and #45, closing this for now.
The current Dockerfile is incompatible with the modern Turing cards since the compute capabilities don't match: https://docs.nvidia.com/deploy/cuda-compatibility/index.html
The drivers in the container target CUDA 9.1.85 (driver 390.46), but Turing architectures require at least driver 410.48.
Since the Dockerfile relies on doduo1.umcn.nl/uokbaseimage/tensorflow_pytorch_diag_dicomloader_python2:2, upgrading can become quite involved.