IBM / terratorch

a Python toolkit for fine-tuning Geospatial Foundation Models (GFMs).
Apache License 2.0

Allow for multiple GPU training #143

Open romeokienzler opened 2 months ago

romeokienzler commented 2 months ago

Is your feature request related to a problem? Please describe.

Cannot use multiple GPUs.

Describe the solution you'd like

Allow multi-GPU training.

Reported by @biancazadrozny

takaomoriyama commented 1 month ago

Testing in a multi-GPU, multi-node environment (CCC/LSF), using blaunch.sh and ddp_wrapper.py.

Example: 4 nodes, each with 8 cores and 2 GPUs:

$ jbsub -q x86_6h -cores 4x8+2 -require a100,infiniband -mem 40G blaunch.sh $PWD/ddp_wrapper.py terratorch fit --config sen1floods11_vit.yaml --trainer.num_nodes 4
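
Neither blaunch.sh nor ddp_wrapper.py is shown in this thread, so the following is only a rough sketch of the kind of work such a wrapper typically has to do, not the actual script. It assumes the job's host list is available in LSF's `LSB_MCPU_HOSTS` variable (layout `hostA 8 hostB 8 ...`) and that the training process picks up its rendezvous settings from `MASTER_ADDR`, `MASTER_PORT` and `NODE_RANK`. The function name `ddp_env_from_lsf`, the port 29500 and the host names are hypothetical.

```python
def ddp_env_from_lsf(mcpu_hosts, hostname, port=29500):
    """Derive per-node rendezvous variables from an LSF host list.

    mcpu_hosts follows the LSB_MCPU_HOSTS layout: "hostA 8 hostB 8 ...".
    Assumption: the first listed host acts as the rendezvous (master) node.
    """
    fields = mcpu_hosts.split()
    hosts = fields[0::2]  # even positions are host names, odd positions are slot counts
    # Compare short names so "node03" matches "node03.example.com".
    short_names = [h.split(".")[0] for h in hosts]
    short = hostname.split(".")[0]
    if short not in short_names:
        raise ValueError(f"{hostname} is not in the LSF host list: {hosts}")
    return {
        "MASTER_ADDR": hosts[0],
        "MASTER_PORT": str(port),
        "NODE_RANK": str(short_names.index(short)),
    }


# Hypothetical 4-node allocation, seen from the third node.
example = ddp_env_from_lsf(
    "node01 8 node02 8 node03 8 node04 8", "node03.example.com"
)
print(example)
```

On a real node, a wrapper along these lines would call the function with `os.environ["LSB_MCPU_HOSTS"]` and `socket.gethostname()`, export the result into the environment, and then start the `terratorch fit ...` command.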

romeokienzler commented 3 days ago

@takaomoriyama please create a PR and add the launch script(s) to the examples folder.