ecmwf-lab / ai-models-graphcast


RuntimeError: Unknown backend gpu #16

Closed: JalSuth closed this issue 2 weeks ago

JalSuth commented 6 months ago

Hey everyone,

I've run into a few issues running GraphCast, mostly related to JAX or GPU usage.

When running the command: ai-models --input cds --download-assets --date 20240415 graphcast

I get the following error:

RuntimeError: Unknown backend gpu
For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.

Even when I set JAX_PLATFORMS=cpu, I receive the same error.
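For reference, a minimal way to see what JAX itself reports (a sketch, assuming jax and jaxlib are importable in the same environment that ai-models uses):

```python
# Minimal environment check (assumes jax and jaxlib are installed in this environment).
import jax
import jaxlib

print("jax version:   ", jax.__version__)
print("jaxlib version:", jaxlib.__version__)

# Which platform JAX falls back to by default ('cpu', 'gpu', 'tpu', ...).
print("default backend:", jax.default_backend())

# The devices JAX can actually see; a CPU-only jaxlib lists only CPU devices.
print("devices:", jax.devices())
```

If jax.devices() only lists CPU devices, the installed jaxlib was built without CUDA support, which is a common cause of the "Unknown backend gpu" error.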

Any guidance would be appreciated.

Regards

ncubukcu commented 5 months ago

If you want to run on GPU, the first thing to do is check your GPU driver and CUDA version. For NVIDIA, for example, I run nvidia-smi to check that. Based on that information, you should install the following (assuming you have CUDA==12.2):

pip install nvidia-cuda-cupti-cu12==12.2.131
pip install nvidia-cuda-nvcc-cu12==12.2.140
pip install nvidia-cuda-nvrtc-cu12==12.2.140
pip install nvidia-cuda-runtime-cu12==12.2.140
pip install --upgrade "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

This did it for me. You will then need to make sure you have enough GPU memory. I found it extremely difficult to port this code to AWS GPU images; maybe you will have better luck with your system.
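As a sanity check after these installs, something along these lines (a sketch, assuming the CUDA 12 wheels above installed without errors) should report a GPU backend and run a small computation on it:

```python
# Sanity check that JAX picked up the CUDA backend
# (a sketch; assumes the CUDA 12 wheels above installed cleanly).
import jax
import jax.numpy as jnp

print("default backend:", jax.default_backend())  # should not be 'cpu' here
print("devices:", jax.devices())                  # one entry per visible GPU

# Small matmul; with a GPU backend this runs on the default GPU device.
x = jnp.ones((2048, 2048))
y = (x @ x).block_until_ready()
print("matmul ok, checksum:", float(y.sum()))
```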

If you run the code on CPU, then these two worked for me:

pip install jax==0.4.23
pip install jaxlib==0.4.23
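One detail worth noting about the JAX_PLATFORMS=cpu route mentioned above: the variable only takes effect if it is set before JAX initializes its backend, e.g. exported in the shell before launching ai-models, or set at the very top of a script. A minimal sketch:

```python
import os

# JAX_PLATFORMS must be set before JAX initializes its backend,
# so it has to come before the first `import jax` (or be exported in the shell).
os.environ["JAX_PLATFORMS"] = "cpu"

import jax
import jax.numpy as jnp

print("default backend:", jax.default_backend())  # expect 'cpu'
print(jnp.arange(5) * 2)                           # tiny computation to confirm it runs
```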

Hope this helps.

See below for the output of nvidia-smi (Wed May 15 00:19:37 2024):

NVIDIA-SMI 535.129.03   Driver Version: 535.129.03   CUDA Version: 12.2

GPU  Name                  Bus-Id             Temp  Perf  Pwr:Usage/Cap  Memory-Usage     GPU-Util  Compute M.
0    Tesla V100-SXM2-16GB  00000000:00:1B.0   48C   P0    56W / 300W     0MiB / 16384MiB  0%        Default
1    Tesla V100-SXM2-16GB  00000000:00:1C.0   48C   P0    56W / 300W     0MiB / 16384MiB  0%        Default
2    Tesla V100-SXM2-16GB  00000000:00:1D.0   45C   P0    58W / 300W     0MiB / 16384MiB  0%        Default
3    Tesla V100-SXM2-16GB  00000000:00:1E.0   47C   P0    55W / 300W     0MiB / 16384MiB  2%        Default

(Persistence-M off, display off, ECC 0, and MIG N/A on all four GPUs.)

Processes: no running processes found.