NVIDIA / earth2mip

Earth-2 Model Intercomparison Project (MIP) is a python framework that enables climate researchers and scientists to inter-compare AI models for weather and climate.
https://nvidia.github.io/earth2mip/
Apache License 2.0

🐛[BUG]: Failed to allocate memory for requested buffer of size 1851310080 #193

Open melodicdeath opened 1 month ago

melodicdeath commented 1 month ago

Version

source - main

On which installation method(s) does this occur?

Pip

Describe the issue

I ran the example 02_model_comparison:

```python
print("Running Pangu inference")
pangu_ds = inference_ensemble.run_basic_inference(
    pangu_inference_model,
    n=24,  # Note we run 24 steps here because Pangu is at 6 hour dt (6 day forecast)
    data_source=pangu_data_source,
    time=time,
)
pangu_ds.to_netcdf(f"{output_dir}/pangu_inference_out.nc")
print(pangu_ds)
```


```
RuntimeError                              Traceback (most recent call last)
in ()
      1 print("Running Pangu inference")
----> 2 pangu_ds = inference_ensemble.run_basic_inference(
      3     pangu_inference_model,
      4     n=24,  # Note we run 24 steps here because Pangu is at 6 hour dt (6 day forecast)
      5     data_source=pangu_data_source,

5 frames
/usr/local/lib/python3.10/dist-packages/earth2mip/inference_ensemble.py in run_basic_inference(model, n, data_source, time)
    284     arrays = []
    285     times = []
--> 286     for k, (time, data, _) in enumerate(model(time, x)):
    287         arrays.append(data.cpu().numpy())
    288         times.append(time)

/usr/local/lib/python3.10/dist-packages/earth2mip/networks/pangu.py in __call__(self, time, x, normalize, restart)
    247             dt = torch.tensor(self.time_step.total_seconds())
    248             x1 += self.source(x1, time1) * dt
--> 249         x1 = self.model_6(x1)
    250         yield time1, x1, restart_data
    251

/usr/local/lib/python3.10/dist-packages/earth2mip/networks/pangu.py in __call__(self, x)
    142
    143     def __call__(self, x):
--> 144         return self.forward(x)
    145
    146     def to(self):

/usr/local/lib/python3.10/dist-packages/earth2mip/networks/pangu.py in forward(self, x)
    156         pl = pl.resize(*pl_shape)
    157         sl = surface[0]
--> 158         plo, slo = self.model(pl, sl)
    159         return torch.cat(
    160             [

/usr/local/lib/python3.10/dist-packages/earth2mip/networks/pangu.py in __call__(self, fields_pl, fields_sfc)
    122         output = bind_output("output", like=fields_pl)
    123         output_sfc = bind_output("output_surface", like=fields_sfc)
--> 124         self.ort_session.run_with_iobinding(binding)
    125         return output, output_sfc
    126

/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py in run_with_iobinding(self, iobinding, run_options)
    329         :param run_options: See :class:`onnxruntime.RunOptions`.
    330         """
--> 331         self._sess.run_with_iobinding(iobinding._iobinding, run_options)
    332
    333     def get_tuning_results(self):

RuntimeError: Error in execution: Non-zero status code returned while running BiasSoftmax node. Name:'BiasSoftmax' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:376 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 1851310080
```

I'm not sure what went wrong. Using the same environment to load pangu_weather_6.onnx directly and run inference, the results are normal.

Environment details

```shell
Kaggle, GPU T4 * 2

!pip install ort-nightly-gpu --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ort-cuda-12-nightly/pypi/simple/

$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   49C    P0             26W /   70W |   13623MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       Off |   00000000:00:05.0 Off |                    0 |
| N/A   38C    P8              9W /   70W |       3MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
```
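For context on the numbers involved (my own arithmetic, not an official diagnosis): the buffer ONNX Runtime failed to allocate is about 1.72 GiB, while GPU 0 in the nvidia-smi output already shows 13623 MiB of 15360 MiB in use, leaving roughly that same amount free:

```python
# Rough arithmetic on the figures reported above (illustrative only).
buffer_bytes = 1851310080            # size ONNX Runtime failed to allocate
buffer_gib = buffer_bytes / 2**30    # bytes -> GiB

total_mib = 15360                    # T4 memory per nvidia-smi
used_mib = 13623                     # already in use on GPU 0
free_gib = (total_mib - used_mib) / 1024

print(f"requested buffer: {buffer_gib:.2f} GiB")
print(f"free on GPU 0:    {free_gib:.2f} GiB")
```

The requested buffer is slightly larger than the memory still free on GPU 0, which is consistent with the BFCArena allocation failure.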
melodicdeath commented 1 month ago

Sorry, it's not a bug. It works after the following changes:

  1. Install the optional dependencies for Pangu weather: `pip install .[pangu]`
  2. Change `n` from 24 to 12.
  3. Load only `pangu_weather_6.onnx`: `pangu.load_6(package)`

That's it.
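Step 2 halves the forecast length, and with it the output accumulated in memory: at Pangu's 6-hour time step, 24 steps is a 6-day forecast and 12 steps a 3-day one. A small hypothetical helper (`steps_for_forecast` is not part of earth2mip) makes the relationship between horizon and step count explicit:

```python
def steps_for_forecast(forecast_hours: int, dt_hours: int = 6) -> int:
    """Number of model steps needed to cover a forecast horizon,
    given the model's time step (6 h for the Pangu model used here)."""
    return forecast_hours // dt_hours

print(steps_for_forecast(144))  # 6-day horizon -> 24 steps (the original run)
print(steps_for_forecast(72))   # 3-day horizon -> 12 steps (the reduced run)
```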