Closed carbonscott closed 4 months ago
a6f216fd6733e49cdead61e4ca8b55d38e3f6f25 adds a function to initialize the distributed environment variables on Summit.
5ce3180aa87c0ae60298b777202ad286e14c1aee updates the BSUB template.
Tests confirmed the changes work!
https://docs.olcf.ornl.gov/software/python/pytorch_frontier.html#torchrun indicates that `torchrun` is not optimized for running distributed PyTorch processes on Summit (IBM LSF job scheduler).
@frobnitzem has suggested the following changes as a solution:
Need to adapt this code to the script `train/train.fsdp.py`.
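
For reference, here is a minimal sketch of what initializing the distributed environment variables without `torchrun` could look like. This is not the code suggested by @frobnitzem and not necessarily what the commits above implement. It assumes the processes are started by an Open MPI-style launcher that exports `OMPI_COMM_WORLD_RANK`, `OMPI_COMM_WORLD_SIZE`, and `OMPI_COMM_WORLD_LOCAL_RANK`, and that LSF provides a host file through `LSB_DJOB_HOSTFILE` in which login and batch nodes can be recognized by name. The function names, the default port `29500`, and the host-filtering rule are all hypothetical.

```python
import os
from typing import Dict, Iterable, Mapping, Optional

# Assumed mapping from launcher-provided variables to the names that
# torch.distributed reads when initialized with the env:// method.
_OMPI_TO_TORCH = {
    "OMPI_COMM_WORLD_RANK": "RANK",
    "OMPI_COMM_WORLD_SIZE": "WORLD_SIZE",
    "OMPI_COMM_WORLD_LOCAL_RANK": "LOCAL_RANK",
}


def pick_master_addr(hosts: Iterable[str]) -> str:
    """Pick one compute host deterministically so every rank agrees on it.

    Assumption: login and batch nodes contain "login" or "batch" in their names.
    """
    compute = sorted(
        {h.strip() for h in hosts
         if h.strip() and "login" not in h and "batch" not in h}
    )
    if not compute:
        raise RuntimeError("no compute hosts found in the host list")
    return compute[0]


def derive_dist_env(
    env: Mapping[str, str],
    hosts: Optional[Iterable[str]] = None,
    master_port: str = "29500",  # hypothetical default; any free port works
) -> Dict[str, str]:
    """Translate launcher variables into RANK/WORLD_SIZE/LOCAL_RANK/MASTER_*."""
    out: Dict[str, str] = {}
    for src, dst in _OMPI_TO_TORCH.items():
        if src not in env:
            raise KeyError(f"{src} is not set; was the job started by the launcher?")
        out[dst] = env[src]

    if "MASTER_ADDR" in env:
        # Respect an address that the job script already exported.
        out["MASTER_ADDR"] = env["MASTER_ADDR"]
    else:
        if hosts is None:
            # Assumption: LSF lists one host name per slot in this file.
            with open(env["LSB_DJOB_HOSTFILE"]) as f:
                hosts = f.read().split()
        out["MASTER_ADDR"] = pick_master_addr(hosts)

    out["MASTER_PORT"] = env.get("MASTER_PORT", master_port)
    return out


def init_dist_env() -> None:
    """Export the derived variables into this process's environment."""
    os.environ.update(derive_dist_env(os.environ))
```

Under these assumptions, `train/train.fsdp.py` would call `init_dist_env()` once per process before `torch.distributed.init_process_group(...)`, and the BSUB template would start the script with the scheduler's launcher rather than `torchrun`.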