CHTC / templates-GPUs

Template job submissions using GPUs in CHTC

NVIDIA GPU Cloud container example #9

Closed: agitter closed this issue 4 years ago

agitter commented 4 years ago

This example serves as a proof of concept that we can run jobs that use containers from NVIDIA GPU Cloud (#4). The example uses PyTorch because we already have example scripts and input data for PyTorch, but it strongly suggests that we could also run the NGC versions of GROMACS and other containers.
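For anyone skimming this, a minimal sketch of a docker universe submit file that pulls an NGC image is shown below. The image tag, script name, input file, and resource requests are illustrative placeholders, not the exact contents of docker/pytorch_ngc.

```
# Sketch only: a docker universe job that pulls a PyTorch image from the
# NVIDIA GPU Cloud registry (nvcr.io). Adjust the tag, script, inputs, and
# resource requests to match the actual example in docker/pytorch_ngc.
universe = docker
docker_image = nvcr.io/nvidia/pytorch:20.02-py3

executable = run_pytorch.sh
transfer_input_files = main.py
should_transfer_files = YES
when_to_transfer_output = ON_EXIT

output = job.out
error = job.err
log = job.log

request_gpus = 1
request_cpus = 1
request_memory = 8GB
request_disk = 20GB

queue
```

One nice property of the docker universe here is that the execute node pulls the image straight from nvcr.io, so we do not have to host or rebuild anything ourselves.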

One difficulty when testing this was that I initially tried the newest version of the container, 20.06-py3, which is built on CUDA 11.0. Adding a requirement for CUDA >= 11.0 to my submit file (sketched below) did not match any GPU servers. Is it possible we could actually run this version on our GPU servers? The NVIDIA requirements state:

> Release 20.06 is based on CUDA 11.0.167, which requires NVIDIA driver release 450.36. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 418.xx or 440.30. The CUDA driver's compatibility package only supports particular drivers.

If our servers have NVIDIA driver >= 450.36, they should be able to run the container even without CUDA 11.0 installed.
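For reference, the requirement I added looked roughly like the following. The `Target.CUDADriverVersion` attribute is the one CHTC's GPU documentation has used for the CUDA version supported by a host's driver, so treat the exact attribute name as an assumption to verify against what the pool currently advertises.

```
# Sketch of the submit file requirement that did not match any GPU servers.
# CUDADriverVersion advertises the CUDA level supported by the host's NVIDIA
# driver, not the driver number itself (driver 450.36 corresponds to CUDA 11.0).
request_gpus = 1
requirements = (Target.CUDADriverVersion >= 11.0)
```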

We can also consider documenting the availability of these NGC containers on the CHTC website, or we can curate relevant containers in #4 so that facilitators can direct researchers to NGC when it has a container that may help them.

As a bookkeeping note, this branch is based on agitter:conda because I wanted to reuse the shared PyTorch examples. It will be easier to review after we merge #8. All of the changed files are in docker/pytorch_ngc.

agitter commented 4 years ago

#8 was merged, so I rebased and force-pushed to clean up the commit history here.