ecmwf-lab / ai-models-graphcast


Running on more than 1 core #5

Closed. Dadoof closed this issue 10 months ago.

Dadoof commented 10 months ago

Good day all,

I am currently attempting to run GraphCast on a node which has 8 GPUs, each with 32 GB of memory. That totals enough memory for the run. My problem is that I cannot get GraphCast to run on more than a single GPU.

After first setting:

export XLA_PYTHON_CLIENT_PREALLOCATE=false
export XLA_PYTHON_CLIENT_ALLOCATOR=platform
export OMP_NUM_THREADS=8

I believe that to run GraphCast on all 8 GPUs, I would use this command:

ai-models --debug --num-threads 8 --date 20231201 --time 0000 --input cds --assets grc graphcast

But it crashes with an out-of-memory error, and the reported memory limit equals what I have on ONE GPU, not the sum of all 8.
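For what it's worth, a quick way to check whether JAX even enumerates all of the GPUs on the node (a minimal sketch, assuming the CUDA-enabled jax that this plugin uses; device names will vary by platform):

```python
# Quick check that JAX enumerates every GPU on the node.
# Assumes a CUDA-enabled jax install; output varies by platform.
import jax

devices = jax.devices()
print(f"JAX sees {len(devices)} device(s):")
for d in devices:
    print(f"  {d.platform}:{d.id}  {d.device_kind}")
```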

Do I need to do something more? Run the command above under mpirun? Set some other environment variables?

mchantry commented 10 months ago

Hi Brian,

Multi-GPU is not currently supported. Could you explain your use case?

Thanks,
Mat

Dadoof commented 10 months ago

Hi there,

I am unsure how to respond properly to the use-case question. I will offer my best guess below.

From an AWS EC2 instance of the "g5.24xlarge" type (NVIDIA A10G Tensor Core GPUs, 24 GB of memory per GPU; see https://aws.amazon.com/ec2/instance-types/g5/),

I type this command:

ai-models --debug --num-threads 8 --date 20231201 --time 0000 --input cds --assets grc graphcast

And the model crashes due to running out of memory. If I run on a different EC2 instance, the p5, it does fine, as the p5 has 80 GB of memory per GPU.

For cost reasons, I would rather run on the g5 than the p5, but I cannot get GraphCast to use the memory of all available GPUs on the node, only the one GPU assigned to the run. The --num-threads option seems to have no effect on this.

It might not be possible to do what I am attempting, but I thought I would ask before assuming that.
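One caveat I am aware of (stated as my understanding, not something from the ai-models docs): the XLA_PYTHON_CLIENT_* variables are only read when JAX first initializes its backend, so they have to be in the environment before jax is imported. A minimal sketch of setting them from Python instead of the shell:

```python
# The XLA client flags must be set *before* jax is imported;
# once the backend initializes, changing them has no effect.
import os

os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"
os.environ["XLA_PYTHON_CLIENT_ALLOCATOR"] = "platform"

import jax  # the backend picks up the flags set above

print(jax.devices())
```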

Thanks, Brian E.

mchantry commented 10 months ago

Short answer: not easily. You could in theory divide the GraphCast model across GPUs, but there is currently no support for this in ai-models, or, more importantly, in the GraphCast codebase that drives this plugin. If GraphCast adds support for this in their codebase (https://github.com/google-deepmind/graphcast), then we could add support to utilise it.
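For background: in JAX this would mean sharding the model's parameters and activations over a device mesh, which has to happen inside the GraphCast code itself; a wrapper like ai-models cannot switch it on from the outside. A minimal, hypothetical sketch of what JAX sharding looks like (illustrative only, not GraphCast code):

```python
# Hypothetical illustration of sharding an array across GPUs with
# JAX; this is NOT how ai-models or GraphCast currently behave.
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec

mesh = Mesh(np.array(jax.devices()), axis_names=("gpu",))

# Split a large array along its first axis, one slice per device,
# so no single GPU has to hold the whole thing in memory.
x = jnp.zeros((8 * 1024, 1024))
x = jax.device_put(x, NamedSharding(mesh, PartitionSpec("gpu", None)))

print(x.sharding)  # shows the layout across devices
```

Doing this for a whole model means annotating every parameter and intermediate value in this way, which is why the support has to come from upstream.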

Dadoof commented 10 months ago

Good day,

This is consistent with what I had found so far: at present there is no mechanism to split GraphCast across GPUs, so the only way to run it is to have a lot of memory on the one GPU that does the run.
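In case it helps anyone who lands on this issue later, a quick way to see how much memory that one GPU actually offers (as I understand it, memory_stats() is only populated on some backends, so treat this as a sketch rather than a guaranteed API on every platform):

```python
# Report the memory limit of the first (and, for this plugin,
# only) GPU used for the run; keys may differ across backends.
import jax

dev = jax.devices()[0]
stats = dev.memory_stats() or {}
limit = stats.get("bytes_limit")
if limit is not None:
    print(f"{dev.device_kind}: {limit / 2**30:.1f} GiB usable")
else:
    print(f"{dev.device_kind}: memory stats not available")
```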

Thanks for your insights. I appreciate getting confirmation from people who know more about all this than I do.

Best regards, Brian E.