ecmwf-lab / ai-models-graphcast

Apache License 2.0

Graphcast run fails (for me) at dict_merge #1

Closed: Dadoof closed this issue 10 months ago

Dadoof commented 12 months ago

Good day all,

Giving graphcast a go, nearly there, but ended up with this:

ai-models --date 20230906 --time 0000 --input cds --assets grc graphcast

  File "/usr/local/lib/python3.10/site-packages/cfgrib/dataset.py", line 753, in open_fieldset
    return open_from_index(filtered_index, read_keys, time_dims, extra_coords, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/cfgrib/dataset.py", line 729, in open_from_index
    dimensions, variables, attributes, encoding = build_dataset_components(
  File "/usr/local/lib/python3.10/site-packages/cfgrib/dataset.py", line 683, in build_dataset_components
    dict_merge(variables, coord_vars)
  File "/usr/local/lib/python3.10/site-packages/cfgrib/dataset.py", line 614, in dict_merge
    raise DatasetBuildError(
cfgrib.dataset.DatasetBuildError: key present and new value is different: key='time' value=Variable(dimensions=('time',), data=array([1693936800,1693958400])) new_value=Variable(dimensions=('time',), data=array([1693893600, 1693936800]))
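To make the mismatch in that error easier to read, the numbers in the two `time` arrays can be decoded. The sketch below assumes they are seconds since the Unix epoch in UTC; the helper name `to_utc` is only illustrative.

```python
from datetime import datetime, timezone

# Values copied from the DatasetBuildError message above.
existing = [1693936800, 1693958400]  # 'value' in the error
incoming = [1693893600, 1693936800]  # 'new_value' in the error


def to_utc(seconds):
    """Format epoch seconds as a UTC date and time (assumes Unix epoch seconds)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d %H:%M")


existing_utc = [to_utc(s) for s in existing]
incoming_utc = [to_utc(s) for s in incoming]

print("value    :", existing_utc)
print("new_value:", incoming_utc)
```

Under that assumption, one group of fields carries times of 18:00 on 5 September and 00:00 on 6 September, while the other carries 06:00 and 18:00 on 5 September. The two `time` coordinates disagree, which is what `dict_merge` is refusing to combine.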

Full log file (with debug set to on) below: gcd.txt

A log file with the output of the build process (showing which packages got installed) below: gcb.txt

idharssi2020 commented 11 months ago

I get the same error with the command:

ai-models --input cds --date 20230101 --time 0000 graphcast

The error reported is:

cfgrib.dataset.DatasetBuildError: key present and new value is different: key='time' value=Variable(dimensions=('time',), data=array([1672509600, 1672531200])) new_value=Variable(dimensions=('time',), data=array([1672466400, 1672509600]))

jacob-radford commented 11 months ago

Confirming I get the same error with @idharssi2020's command. Any ideas here?

mchantry commented 11 months ago

Thanks for the feedback, we are investigating this issue and will provide an update soon.

Dadoof commented 11 months ago

I did a bit of digging. One thing I did find out is that in the process to download the initial conditions, the precipitation is a little different than the other variables.

The precipitation is the 5-6-hour forecast from 6 hours previous. All others are analyses.

1:0:d=23092600:LSM:kpds5=172:kpds6=1:kpds7=0:TR=0:P1=0:P2=0:TimeU=1:sfc:anl:type=analysis:NAve=0
2:3114840:d=23092600:2T:kpds5=167:kpds6=1:kpds7=0:TR=0:P1=0:P2=0:TimeU=1:sfc:anl:type=analysis:NAve=0
3:5191440:d=23092600:MSL:kpds5=151:kpds6=1:kpds7=0:TR=0:P1=0:P2=0:TimeU=1:sfc:anl:type=analysis:NAve=0
4:7268040:d=23092600:10U:kpds5=165:kpds6=1:kpds7=0:TR=0:P1=0:P2=0:TimeU=1:sfc:anl:type=analysis:NAve=0
5:9344640:d=23092600:10V:kpds5=166:kpds6=1:kpds7=0:TR=0:P1=0:P2=0:TimeU=1:sfc:anl:type=analysis:NAve=0
6:11421240:d=23092518:TP:kpds5=228:kpds6=1:kpds7=0:TR=4:P1=5:P2=6:TimeU=1:sfc:5-6hr acc:type=9:NAve=0
7:13497840:d=23092600:Z:kpds5=129:kpds6=1:kpds7=0:TR=0:P1=0:P2=0:TimeU=1:sfc:anl:type=analysis:NAve=0
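The difference between the precipitation record and the analyses can be seen by pulling the reference time and forecast window out of the inventory. This is a rough sketch, not part of ai-models: it assumes the `d=` field is a reference time in YYMMDDHH form and that `P2=` is the end of the forecast window in hours (consistent with `TimeU=1`). The function name `summarize` is made up for illustration.

```python
from datetime import datetime, timedelta

# Two records copied from the inventory above: one analysis, one forecast.
inventory = [
    "2:3114840:d=23092600:2T:kpds5=167:kpds6=1:kpds7=0:TR=0:P1=0:P2=0:TimeU=1:sfc:anl:type=analysis:NAve=0",
    "6:11421240:d=23092518:TP:kpds5=228:kpds6=1:kpds7=0:TR=4:P1=5:P2=6:TimeU=1:sfc:5-6hr acc:type=9:NAve=0",
]


def summarize(record):
    """Return (name, reference time, valid time) for one inventory record."""
    fields = record.split(":")
    name = fields[3]
    # Assumption: 'd=' holds the reference time as YYMMDDHH.
    reference = datetime.strptime(fields[2][2:], "%y%m%d%H")
    # Assumption: 'P2=' is the end of the forecast window, in hours.
    p2 = next(int(f[3:]) for f in fields if f.startswith("P2="))
    return name, reference, reference + timedelta(hours=p2)


summaries = [summarize(r) for r in inventory]
for name, reference, valid in summaries:
    print(f"{name}: reference {reference:%Y-%m-%d %H:%M}, valid {valid:%Y-%m-%d %H:%M}")
```

Both records end up valid at 00:00 on 26 September 2023, but TP has a reference time six hours earlier (18:00 on 25 September), whereas 2T has a reference time equal to its valid time.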

Dadoof commented 11 months ago

Hello all,

It does appear that the error I, and others, posted comes from the precipitation fields in the grib files as downloaded from CDS. What I get from CDS is the 5-6 hour forecast with an end time that matches the analysis time. It appears that this is not what the code wants.

Commenting out 'tp' in /.local/lib/python3.10/site-packages/ai_models_graphcast/convert.py got me around the error above, but then the model does not work as the field is missing.

To the ECMWF folks: I am pretty sure that the download of the precip 'analysis' is what needs fixing.

b8raoult commented 11 months ago

We have updated the code. Can you try again?

Dadoof commented 11 months ago

Hello there,

I gave it a go, but no change in the results. I am wondering if I am not getting the latest version.

I do this:

git clone https://github.com/ecmwf-lab/ai-models.git
sudo pip3 install ai-models

(Does this actually use the git clone contents?)

Then later this:

pip3 install ai-models-graphcast
git clone https://github.com/ecmwf-lab/ai-models-graphcast.git
cd ai-models-graphcast
pip3 install -r requirements-gpu.txt -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

Unsure if I need any special tag to get the revised code.

Regards, Brian E.

b8raoult commented 11 months ago

If you clone the code, don't install from PyPI as well.

cd ~/git
git clone https://github.com/ecmwf-lab/ai-models.git
cd ai-models
pip3 install --upgrade -e .

and

cd ~/git
git clone https://github.com/ecmwf-lab/ai-models-graphcast.git
cd ai-models-graphcast
pip3 install --upgrade -e .
pip3 install -r requirements-gpu.txt -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

b8raoult commented 11 months ago

If you have already cloned the code:

cd ~/git/some-repo
git pull
pip3 install --upgrade -e .

Dadoof commented 11 months ago

Good afternoon (morning here in Colorado),

I have made the changes you suggested. They did indeed get me to the most recent/latest version. Ran things on this hardware: NVIDIA Tesla V100 (on AWS)

Ran using this command: ai-models --date 20230927 --time 0000 --input cds --assets grc graphcast

Sadly, I ran out of memory. I shall try again on beefier hardware.

Sardingfish commented 11 months ago

Hello, I also encountered this problem: "jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory allocating 163104825552 bytes." That equates to roughly 152 GiB of memory required. Does this mean that PCs can't run graphcast?
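For reference, the conversion behind that figure is plain arithmetic on the byte count quoted in the error message:

```python
requested_bytes = 163104825552  # figure from the RESOURCE_EXHAUSTED message

gib = requested_bytes / 2**30   # binary gibibytes
gb = requested_bytes / 10**9    # decimal gigabytes

print(f"{gib:.1f} GiB, {gb:.1f} GB")  # prints: 151.9 GiB, 163.1 GB
```

So the single allocation is about 152 GiB in binary units, or about 163 GB in decimal units.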

idharssi2020 commented 11 months ago

I get a similar error message. I'm trying the suggestions in https://jax.readthedocs.io/en/latest/gpu_memory_allocation.html

Setting the following appears to fix the memory issues:

export XLA_PYTHON_CLIENT_PREALLOCATE=false
export XLA_PYTHON_CLIENT_ALLOCATOR=platform

Dadoof commented 10 months ago

To all: I also did those exports above, and can confirm that doing so, plus increasing my memory, led to success.
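If the run is launched from a Python script rather than a shell, the same two settings can be placed in the environment handed to the child process. This is only a sketch: the helper `jax_memory_env` is a made-up name, and the commented-out `subprocess.run` call is a hypothetical way of invoking the command from this thread.

```python
import os


def jax_memory_env(base=None):
    """Return a copy of the environment with the two XLA settings applied."""
    env = dict(os.environ if base is None else base)
    env["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"
    env["XLA_PYTHON_CLIENT_ALLOCATOR"] = "platform"
    return env


env = jax_memory_env()

# Hypothetical usage: pass `env` to the child process, for example
# subprocess.run(["ai-models", "--input", "cds", "--date", "20230101",
#                 "--time", "0000", "graphcast"], env=env)
```

Setting the variables this way, or with `export` in the shell, means they are in place before JAX starts up in the process that runs the model.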

Sardingfish commented 10 months ago

Can I ask what your total memory was after you increased it?

idharssi2020 commented 10 months ago

[image]

An example plot from my near-real-time running of graphcast.

Dadoof commented 10 months ago

In reply to Sardingfish's question, "Can I ask what your total memory was after you increased it?":

I have been running on Amazon AWS. I was not able to do so with their 'p3' instance, but was with their 'p5' instance.

Dadoof commented 10 months ago

As it does work on the p5 instance, this ticket is hereby closed.