NIEHS / beethoven

BEETHOVEN is: Building an Extensible, rEproducible, Test-driven, Harmonized, Open-source, Versioned, ENsemble model for air quality
https://niehs.github.io/beethoven/

Multi-GPU strategies #344

Closed · sigmafelix closed 4 months ago

sigmafelix commented 4 months ago

Since our HPC nodes have multiple GPUs, we need a proper strategy for training base models across them. Multi-GPU training is neither automatic nor cost-free, and we may need to work in both R and Python to achieve it.

XGBoost and MLP models can leverage GPUs, so my quick idea for using all GPUs in a targets-friendly way (albeit less elegant than a torch-native approach) is to specify the CUDA device for each model fit by branching over device numbers, for example `tar_target(char_cuda_device, c("cuda:0", "cuda:1", "cuda:2", "cuda:3"))` when a node has four CUDA devices. A minimal sketch of that branching follows below.
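A rough sketch of what this could look like in a `_targets.R` pipeline, assuming a four-GPU node; the `dt_train` data target and the `device` argument passed to `fit_base_xgb` are illustrative assumptions, not the actual beethoven interface:

```r
# Minimal sketch: dynamic branching over CUDA devices with targets.
# fit_base_xgb(), dt_train, and the device argument are assumptions
# for illustration only.
library(targets)

list(
  # One branch per CUDA device on a four-GPU node
  tar_target(char_cuda_device, c("cuda:0", "cuda:1", "cuda:2", "cuda:3")),
  tar_target(
    fit_xgb,
    fit_base_xgb(data = dt_train, device = char_cuda_device),
    pattern = map(char_cuda_device)  # each branch pinned to one device
  )
)
```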

sigmafelix commented 4 months ago

Branching is implemented by separating `learn_rate` from the other hyperparameters in `fit_base_brulee` and `fit_base_xgb`. Each branched model object should be smaller, which will lessen GPU memory pressure. A sketch of this separation is given below.
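As a hedged sketch of that separation, `learn_rate` becomes its own target and each value maps to one branch, while the remaining hyperparameters stay fixed; the argument names, values, and `dt_train` are assumptions and may differ from the real `fit_base_xgb` signature:

```r
# Minimal sketch: learn_rate split out for dynamic branching; argument
# names, hyperparameter values, and dt_train are illustrative assumptions.
library(targets)

list(
  tar_target(num_learn_rate, c(0.01, 0.05, 0.1, 0.3)),
  tar_target(list_hyperparams, list(trees = 1000, tree_depth = 6)),
  tar_target(
    fit_xgb,
    fit_base_xgb(
      data = dt_train,
      learn_rate = num_learn_rate,    # varies per branch
      hyperparams = list_hyperparams  # fixed across branches
    ),
    pattern = map(num_learn_rate)
  )
)
```

Each branch then holds a fitted model for a single `learn_rate` value rather than one object covering the full grid, which is what reduces per-object size and GPU memory pressure.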