FAIR-Chem / fairchem

FAIR Chemistry's library of machine learning methods for chemistry
https://opencatalystproject.org/

Add option to launch distributed runs locally with >1 GPU #733

Closed: rayg1234 closed this 3 weeks ago

rayg1234 commented 4 weeks ago

Add an option to launch distributed runs locally with more than one GPU. This is useful for testing parallel algorithms locally. It uses the torch elastic launch API, which spawns Python multiprocessing workers under the hood.

This is equivalent to calling our application with torchrun (i.e., `torchrun fairchem ...`), but it makes the interface cleaner so we don't need to maintain two launchers. Note: torchrun itself just calls the elastic launch API under the hood.
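For reference, a minimal sketch of what launching through the elastic API looks like (the entrypoint `runner_wrapper` and the config values are illustrative assumptions, not this PR's actual code):

```python
# Minimal sketch: launching a local multi-process run through torch's
# elastic launch API, which is what torchrun uses internally.
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

def runner_wrapper(config: dict) -> None:
    # Hypothetical per-process entrypoint; each spawned worker runs this
    # with its own RANK / LOCAL_RANK environment variables set.
    ...

num_gpus = 2
launch_config = LaunchConfig(
    min_nodes=1,
    max_nodes=1,
    nproc_per_node=num_gpus,          # one process per local GPU
    rdzv_backend="c10d",              # local rendezvous, no external store
    rdzv_endpoint="localhost:29500",  # any free local port works here
    max_restarts=0,
)
# Spawns num_gpus processes, each calling runner_wrapper({"mode": "train"}).
elastic_launch(launch_config, runner_wrapper)({"mode": "train"})
```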

There's a known issue where LMDB environments cannot be pickled (which multiprocessing requires); this can be worked around by setting num_workers to 0, which is acceptable for local-mode testing.
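For illustration, a minimal sketch of why the workaround helps (the dataset class is a stand-in, not fairchem's actual LMDB dataset):

```python
# With num_workers > 0, DataLoader spawns worker processes and pickles
# the dataset; an open lmdb.Environment holds OS handles and cannot be
# pickled. num_workers=0 keeps data loading in the main process.
import lmdb
from torch.utils.data import DataLoader, Dataset

class LmdbBackedDataset(Dataset):
    """Stand-in for a dataset that holds an open LMDB environment."""

    def __init__(self, path: str):
        self.env = lmdb.open(path, readonly=True, lock=False)

    def __len__(self) -> int:
        return self.env.stat()["entries"]

    def __getitem__(self, idx: int) -> bytes:
        with self.env.begin() as txn:
            return txn.get(str(idx).encode())

loader = DataLoader(
    LmdbBackedDataset("data.lmdb"),  # placeholder path
    batch_size=8,
    num_workers=0,  # >0 would try to pickle self.env and fail
)
```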

Examples:

To run locally on 2 GPUs with distributed:

```
fairchem --debug --mode train --identifier gp_test --config-yml src/fairchem/experimental/rgao/configs/equiformer_v2_N\@8_L\@4_M\@2_31M.yml --amp --distributed --num-gpus=2
```

To run locally without distributed:

```
fairchem --debug --mode train --identifier gp_test --config-yml src/fairchem/experimental/rgao/configs/equiformer_v2_N\@8_L\@4_M\@2_31M.yml --amp
```

Testing:

Added a simple test to test_cli.py for now, which mocks the runner; tests that exercise actual end-to-end runs should be added later (see the sketch below).
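A rough sketch of what such a mocked test could look like (the patch target `fairchem.core._cli.runner_wrapper` and the `main(argv)` signature are assumptions for illustration, not the PR's actual test):

```python
# Hypothetical sketch: exercise the CLI's argument parsing and dispatch
# without running real training, by mocking out the runner.
from unittest.mock import patch

from fairchem.core._cli import main  # module referenced in the Codecov report

def test_cli_runs_with_mocked_runner():
    argv = [
        "--debug",
        "--mode", "train",
        "--config-yml", "tests/configs/test_config.yml",  # placeholder path
    ]
    with patch("fairchem.core._cli.runner_wrapper") as mock_runner:
        main(argv)                 # parse args and dispatch to the runner
        assert mock_runner.called  # the CLI reached the (mocked) runner
```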

codecov[bot] commented 4 weeks ago

Codecov Report

Attention: Patch coverage is 76.47059% with 4 lines in your changes missing coverage. Please review.

| Files | Coverage Δ |
| --- | --- |
| src/fairchem/core/_cli.py | 64.51% <86.66%> (+18.68%) ⬆️ |
| src/fairchem/core/common/distutils.py | 30.00% <0.00%> (ø) |
misko commented 3 weeks ago

This is awesome! Exactly what we need 🤩 LGTM!