NVIDIA / earth2mip

Earth-2 Model Intercomparison Project (MIP) is a Python framework that enables climate researchers and scientists to inter-compare AI models for weather and climate.
https://nvidia.github.io/earth2mip/
Apache License 2.0

🚀[FEA]: Running on Multi GPU A100 #190

Closed manmeet3591 closed 2 months ago

manmeet3591 commented 2 months ago

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Critical (currently preventing usage)

Please provide a clear description of problem you would like to solve.

How can I run earth2mip on multiple A100 GPUs? I want to do distributed training. What changes would be required in the following code?

import datetime

from earth2mip.inference_medium_range import score_deterministic
import numpy as np

# time_loop and data_source are assumed to be defined earlier in the script
scores = score_deterministic(
    time_loop,
    data_source=data_source,
    n=10,
    initial_times=[datetime.datetime(2018, 1, 1)],
    # fill in zeros for time-mean, will typically be grabbed from data.
    time_mean=np.zeros((7, 721, 1440)),
)
scores

Dimensions: (lead_time: 11, channel: 7, initial_time: 1)
Coordinates:
  * lead_time (lead_time) timedelta64[ns] 0 days 00:00:00 ... 5 days 00:...
  * channel (channel)

array([ 0. , 150.83014446, 212.07880612, 304.98592282, 381.36510987,
        453.31516952, 506.01464974, 537.11092269, 564.79603347, 557.22871627,
        586.44691243])
Coordinates:
  * lead_time (lead_time) timedelta64[ns] 0 days 00:00:00 ... 5 days 00:00:00
    channel

Describe any alternatives you have considered

No response

nbren12 commented 2 months ago

score_deterministic supports multiple GPUs via torch.distributed.

The script should work pretty much out of the box if you run it with MPI or torchrun and initialize torch.distributed before scoring begins, e.g.

torch.distributed.init_process_group()
# the rest of the script
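
To make the initialization step a bit more concrete, here is a minimal sketch of what the top of such a script might look like when launched with torchrun. The helper names `torchrun_env` and `init_distributed` are hypothetical and not part of earth2mip; the sketch only assumes the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables that torchrun exports, and the `nccl` backend, which is the usual choice for NVIDIA GPUs.

```python
import os


def torchrun_env(environ=os.environ):
    """Read the rank variables that torchrun exports for each worker process.

    Falls back to single-process values when the variables are absent.
    """
    return {
        "rank": int(environ.get("RANK", 0)),
        "local_rank": int(environ.get("LOCAL_RANK", 0)),
        "world_size": int(environ.get("WORLD_SIZE", 1)),
    }


def init_distributed(backend="nccl"):
    """Hypothetical helper: pin this process to one GPU, then join the group."""
    import torch  # imported lazily so torchrun_env works without torch installed

    info = torchrun_env()
    # Give each process on the node its own GPU.
    torch.cuda.set_device(info["local_rank"])
    # With no explicit init_method, the default env:// rendezvous reads
    # MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE, which torchrun sets.
    torch.distributed.init_process_group(backend=backend)
    return info
```

A script that calls `init_distributed()` before `score_deterministic` could then be started with something like `torchrun --nproc_per_node=<number of GPUs> script.py`, where `script.py` stands in for your own file.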

For parallel use, I might recommend using https://github.com/NVIDIA/earth2mip/blob/86b11fe4ba2f19641802112e8b0ba6b962123130/earth2mip/inference_medium_range.py#L254 instead. This will save one CSV file per rank, with separate scores for each (initial_time, lead_time) pair. Unlike score_deterministic, this approach does not average the scores over time.
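
If you go the per-rank CSV route and still want time-averaged numbers, you can combine the files afterwards. The sketch below is only an illustration: the file pattern and the column names `lead_time` and `value` are assumptions, not the format earth2mip actually writes, so adjust them to match your output. It computes a plain arithmetic mean per lead time, which may not match how score_deterministic aggregates.

```python
import csv
import glob
import os
from collections import defaultdict


def average_over_initial_time(pattern, key_cols=("lead_time",), value_col="value"):
    """Average a score column over all rows that share the same key columns.

    `pattern` is a glob matching the per-rank CSV files. The default column
    names are hypothetical placeholders.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                # Group rows by the key columns, e.g. one group per lead time.
                key = tuple(row[c] for c in key_cols)
                sums[key] += float(row[value_col])
                counts[key] += 1
    # Keys are tuples of strings, exactly as read from the CSV files.
    return {key: sums[key] / counts[key] for key in sums}
```

For example, `average_over_initial_time(os.path.join("output", "scores_rank*.csv"))` would return one mean per lead time across all ranks, assuming the files live in a directory named `output` and follow that naming scheme.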

nbren12 commented 2 months ago

Closing since the feature already exists.