🚀[FEA]: Running on Multi GPU A100

manmeet3591 commented 2 months ago

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Critical (currently preventing usage)

Please provide a clear description of problem you would like to solve.

How can I run earth2mip on multiple A100 GPUs, i.e. I want to do distributed training? What changes would be required in the following code?

import datetime

import numpy as np
from earth2mip.inference_medium_range import score_deterministic

scores = score_deterministic(
    time_loop,
    data_source=data_source,
    n=10,
    initial_times=[datetime.datetime(2018, 1, 1)],
    # fill in zeros for time-mean, will typically be grabbed from data.
    time_mean=np.zeros((7, 721, 1440)),
)
scores
Dimensions: (lead_time: 11, channel: 7, initial_time: 1)
Coordinates:
  * lead_time (lead_time) timedelta64[ns] 0 days 00:00:00 ... 5 days 00:...
  * channel (channel)

array([ 0. , 150.83014446, 212.07880612, 304.98592282, 381.36510987, 453.31516952, 506.01464974, 537.11092269, 564.79603347, 557.22871627, 586.44691243])
Coordinates:
  * lead_time (lead_time) timedelta64[ns] 0 days 00:00:00 ... 5 days 00:00:00
    channel

Describe any alternatives you have considered

No response

nbren12 commented 2 months ago

score_deterministic supports multiple GPUs using torch.distributed.

The script should work pretty much out of the box if you run it with MPI or torchrun and initialize torch.distributed before scoring begins, e.g.

torch.distributed.init_process_group()
# the rest of the script
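A slightly fuller version of that initialization might look like the sketch below. It assumes the script is launched with torchrun, which exports RANK, LOCAL_RANK and WORLD_SIZE to every worker process. The helper names `torchrun_env` and `init_distributed` are hypothetical and not part of earth2mip, and pinning each process to one GPU by LOCAL_RANK is a common convention rather than something this thread confirms earth2mip requires.

```python
import os


def torchrun_env(environ=os.environ):
    """Read the rank variables that torchrun exports to each worker.

    Falls back to single-process values when the variables are absent.
    """
    return {
        "rank": int(environ.get("RANK", 0)),
        "local_rank": int(environ.get("LOCAL_RANK", 0)),
        "world_size": int(environ.get("WORLD_SIZE", 1)),
    }


def init_distributed():
    """Initialize torch.distributed for a torchrun-launched process.

    Hypothetical helper: call it once at the top of the scoring script.
    """
    # Imported lazily so torchrun_env() stays usable without PyTorch installed.
    import torch
    import torch.distributed as dist

    env = torchrun_env()
    if torch.cuda.is_available():
        # Assumption: one process per GPU, selected by the local rank.
        torch.cuda.set_device(env["local_rank"])
        backend = "nccl"
    else:
        backend = "gloo"
    # The default init_method reads MASTER_ADDR / MASTER_PORT from the
    # environment, which torchrun also sets.
    dist.init_process_group(backend=backend)
    return env
```

With this in place, the script would call `init_distributed()` before `score_deterministic(...)` and be started with something like `torchrun --nproc_per_node=8 script.py`, where `script.py` and the process count are placeholders.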

For parallel use, I might recommend using https://github.com/NVIDIA/earth2mip/blob/86b11fe4ba2f19641802112e8b0ba6b962123130/earth2mip/inference_medium_range.py#L254 instead. This will save one CSV file per rank with separate scores for each (initial_time, lead_time) pair. Unlike score_deterministic, this approach does not time-average these scores.
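If a time average is still wanted afterwards, the per-rank files can be combined in a post-processing step. The sketch below is only an illustration: the column layout of the CSV files is not shown in this thread, so the column names `lead_time`, `channel` and `value` are placeholders to be replaced with whatever the files actually contain. The function `average_over_initial_times` is hypothetical and takes a plain arithmetic mean over all rows that share the same key.

```python
import csv
from collections import defaultdict


def average_over_initial_times(paths, key_cols=("lead_time", "channel"), value_col="value"):
    """Combine per-rank CSV files and average `value_col` over initial times.

    Rows are grouped by `key_cols`; every other column (for example the
    initial time) is averaged out. Column names are placeholders.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for path in paths:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                key = tuple(row[col] for col in key_cols)
                sums[key] += float(row[value_col])
                counts[key] += 1
    # Arithmetic mean per (lead_time, channel) group.
    return {key: sums[key] / counts[key] for key in sums}
```

A typical call would pass the list of per-rank files, for example `average_over_initial_times(sorted(glob.glob("scores/*.csv")))`, where the directory name is again a placeholder.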

nbren12 commented 2 months ago

Closing since the feature already exists.

NVIDIA / earth2mip