ecmwf-lab / ai-models

Apache License 2.0

Issue with GPU memory allocation for pangu and graphcast #48

Open Lucas-Hardy opened 2 weeks ago

Lucas-Hardy commented 2 weeks ago

Hi,

I'm trying to set up GraphCast and Pangu to run on a 3060 12 GB GPU and am getting memory allocation errors for both models.

Pangu:

2024-07-05 14:59:18,484 INFO Writing results to pangu_outputs/20240626_1200_6h_pangu.grib
2024-07-05 14:59:18,485 INFO Loading pressure fields from CDS
2024-07-05 14:59:18,814 INFO Loading surface fields from CDS
2024-07-05 14:59:18,840 INFO Using device 'GPU'. The speed of inference depends greatly on the device.
2024-07-05 14:59:18,840 INFO ONNXRuntime providers: ['CUDAExecutionProvider', 'CPUExecutionProvider']
2024-07-05 14:59:24,177 INFO Loading pangu_assets/pangu_weather_24.onnx: 5 seconds.
2024-07-05 14:59:29,314 INFO Loading pangu_assets/pangu_weather_6.onnx: 5 seconds.
2024-07-05 14:59:29,822 INFO Writing step 0: 0.5 second.
2024-07-05 14:59:29,822 INFO Model initialisation: 11 seconds
2024-07-05 14:59:29,822 INFO Starting inference for 1 steps (6h).
2024-07-05 14:59:30.391755898 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running FusedMatMul node. Name:'/b1/MatMul/MatmulTransposeFusion//MatMulScaleFusion/' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:376 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 1851310080

2024-07-05 14:59:30,391 INFO Elapsed: 0.6 second.
2024-07-05 14:59:30,391 INFO Average: 0.6 second per step.
2024-07-05 14:59:30,391 INFO Total time: 11 seconds.
Traceback (most recent call last):
  File "/home/ock/anaconda3/envs/ai-models-pangu/bin/ai-models", line 8, in <module>
    sys.exit(main())
  File "/home/ock/anaconda3/envs/ai-models-pangu/lib/python3.10/site-packages/ai_models/__main__.py", line 358, in main
    _main(sys.argv[1:])
  File "/home/ock/anaconda3/envs/ai-models-pangu/lib/python3.10/site-packages/ai_models/__main__.py", line 306, in _main
    run(vars(args), unknownargs)
  File "/home/ock/anaconda3/envs/ai-models-pangu/lib/python3.10/site-packages/ai_models/__main__.py", line 331, in run
    model.run()
  File "/home/ock/anaconda3/envs/ai-models-pangu/lib/python3.10/site-packages/ai_models_panguweather/model.py", line 107, in run
    output, output_surface = ort_session_6.run(
  File "/home/ock/anaconda3/envs/ai-models-pangu/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running FusedMatMul node. Name:'/b1/MatMul/MatmulTransposeFusion//MatMulScaleFusion/' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:376 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 1851310080
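For the ONNX Runtime side, one thing that may be worth trying is capping the CUDA BFC arena through provider options, so the arena stops growing before the 12 GB card runs out. This is only a sketch: `gpu_mem_limit` and `arena_extend_strategy` are documented `CUDAExecutionProvider` options, but the 10 GiB figure is a guess, and `ai-models-panguweather` builds its own session internally, so using this would mean patching its `model.py` rather than calling it from user code:

```python
# Hypothetical sketch: cap ONNX Runtime's CUDA arena so it cannot grow
# past what a 12 GB card can actually serve (10 GiB leaves some headroom).
providers = [
    ("CUDAExecutionProvider", {
        "gpu_mem_limit": 10 * 1024 ** 3,              # arena cap, in bytes
        "arena_extend_strategy": "kSameAsRequested",  # grow only as needed
    }),
    "CPUExecutionProvider",  # fall back to CPU kernels if CUDA fails
]

# This list would then be passed when the session is created, e.g.:
#   ort.InferenceSession("pangu_assets/pangu_weather_6.onnx", providers=providers)
print(providers[0][1]["gpu_mem_limit"])
```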

GraphCast:

2024-07-05 14:42:35.208814: I external/xla/xla/stream_executor/cuda/cuda_driver.cc:1558] failed to allocate 2.97GiB (3189473280 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-05 14:42:35,226 INFO Doing full rollout prediction in JAX: 57 seconds.
2024-07-05 14:42:35,226 INFO Total time: 1 minute.
jax.errors.SimplifiedTraceback: For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ock/anaconda3/envs/ai-models-graphcast/bin/ai-models", line 8, in <module>
    sys.exit(main())
  File "/home/ock/anaconda3/envs/ai-models-graphcast/lib/python3.10/site-packages/ai_models/__main__.py", line 358, in main
    _main(sys.argv[1:])
  File "/home/ock/anaconda3/envs/ai-models-graphcast/lib/python3.10/site-packages/ai_models/__main__.py", line 306, in _main
    run(vars(args), unknownargs)
  File "/home/ock/anaconda3/envs/ai-models-graphcast/lib/python3.10/site-packages/ai_models/__main__.py", line 331, in run
    model.run()
  File "/home/ock/anaconda3/envs/ai-models-graphcast/lib/python3.10/site-packages/ai_models_graphcast/model.py", line 240, in run
    output = self.model(
  File "/home/ock/anaconda3/envs/ai-models-graphcast/lib/python3.10/site-packages/ai_models_graphcast/model.py", line 114, in <lambda>
    return lambda **kw: fn(**kw)[0]
MemoryError: std::bad_alloc

I am using CUDA 12.4 with Pangu and 12.3 with GraphCast; I have also tried CUDA 11, but it does not recognise my GPU. I am using cuDNN 8.9.7.29. I have also tried setting XLA_PYTHON_CLIENT_PREALLOCATE=false, setting XLA_PYTHON_CLIENT_MEM_FRACTION to smaller values, and setting XLA_PYTHON_CLIENT_ALLOCATOR=platform. The model also runs fine on the CPU, just very slowly. Is there a fix for this, or is it simply that my GPU does not have enough VRAM?
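One thing worth double-checking with those XLA variables: they only take effect if they are in the environment before `jax` is imported for the first time; setting them afterwards has no effect on the already-created client. A minimal sketch (the `jax` import is left commented out here):

```python
import os

# XLA memory settings must land in the environment *before* the first
# `import jax`, otherwise the GPU client has already been configured.
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"   # allocate on demand
os.environ["XLA_PYTHON_CLIENT_ALLOCATOR"] = "platform"  # free memory eagerly

# import jax  # only now would JAX pick these settings up
```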

Thanks.

decadeneo commented 2 weeks ago

I ran into the same problem with Pangu today on my device (4060, 32 GB GPU), but yesterday the model worked. So weird.

decadeneo commented 2 weeks ago

I tried creating a new Python environment; after `pip install ai-models`, I installed onnxruntime via conda, suspecting that the issue might be related to the numpy version (the version running smoothly for me is 2.0.0).
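If numpy is the suspect, a quick sanity check of the active environment is easy to run. Note the `>= (2, 0)` threshold below is only decadeneo's working report (2.0.0), not a documented requirement of ai-models:

```python
import numpy as np

# decadeneo reports the models running smoothly with numpy 2.0.0;
# flag any mismatch so the environment difference is visible up front.
parts = tuple(int(p) for p in np.__version__.split(".")[:2])
status = ("matches the working report" if parts >= (2, 0)
          else "differs from the working report (2.0.0)")
print(f"numpy {np.__version__}: {status}")
```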