Closed: rayg1234 closed this pull request 3 weeks ago
Attention: Patch coverage is 76.47059% with 4 lines in your changes missing coverage. Please review.
Files | Coverage Δ | |
---|---|---|
src/fairchem/core/_cli.py | 64.51% <86.66%> (+18.68%) :arrow_up: | |
src/fairchem/core/common/distutils.py | 30.00% <0.00%> (ø) | |
This is awesome! Exactly what we need 🤩 LGTM!
Add an option to launch distributed runs locally with >1 GPU. Useful for testing parallel algorithms locally. This uses the torch elastic API, which just spawns Python multiprocesses under the hood.
This is equivalent to calling our application with torchrun, i.e. `torchrun fairchem ...`, but makes the interface cleaner so we don't need to work with two launchers. Note: torchrun just calls the elastic launch API under the hood.

There's a bug where LMDBs cannot be pickled (needed for multiprocessing); this is resolvable by setting num_workers to 0, which is OK for local-mode testing.
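For context, a minimal sketch of what launching through the elastic API looks like on a single node (the `train_entrypoint` and `launch_local` names are illustrative, not the PR's actual code):

```python
# Minimal sketch, assuming a single-node local launch; train_entrypoint
# and launch_local are hypothetical names, not fairchem's actual code.
from torch.distributed.launcher.api import LaunchConfig, elastic_launch


def train_entrypoint(config: dict) -> None:
    # Each spawned worker gets RANK / LOCAL_RANK / WORLD_SIZE env vars
    # from the elastic agent and can init torch.distributed from them.
    ...


def launch_local(num_gpus: int, config: dict) -> None:
    launch_config = LaunchConfig(
        min_nodes=1,
        max_nodes=1,
        nproc_per_node=num_gpus,      # one worker process per GPU
        rdzv_backend="c10d",          # local rendezvous, no external store
        rdzv_endpoint="localhost:0",  # port 0 = pick any free port
        max_restarts=0,
    )
    # elastic_launch returns a callable; invoking it spawns the workers
    # and blocks until they all exit.
    elastic_launch(launch_config, train_entrypoint)(config)
```

This is the same code path torchrun goes through, just invoked programmatically from our own CLI.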
Examples:

To run locally on 2 GPUs with distributed:

```
fairchem --debug --mode train --identifier gp_test --config-yml src/fairchem/experimental/rgao/configs/equiformer_v2_N\@8_L\@4_M\@2_31M.yml --amp --distributed --num-gpus=2
```

To run locally without distributed:

```
fairchem --debug --mode train --identifier gp_test --config-yml src/fairchem/experimental/rgao/configs/equiformer_v2_N\@8_L\@4_M\@2_31M.yml --amp
```
Testing:

Added a simple test in test_cli.py for now which mocks the runner; should add tests later for actual simple runs. A sketch of the mocked approach is shown below.
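A minimal sketch of what such a mocked test could look like (the patch target `fairchem.core._cli.Runner` and the `main()` entrypoint signature are assumptions for illustration, not necessarily the PR's actual test code):

```python
# Minimal sketch of a mocked CLI test; the patch target and main()
# signature are assumptions, not necessarily the PR's actual test code.
import sys
from unittest.mock import patch

from fairchem.core._cli import main


def test_cli_mocks_runner(monkeypatch):
    # Simulate invoking the CLI without launching a real training run.
    monkeypatch.setattr(
        sys,
        "argv",
        ["fairchem", "--debug", "--mode", "train", "--config-yml", "test.yml"],
    )
    with patch("fairchem.core._cli.Runner") as mock_runner:
        main()
        mock_runner.assert_called_once()
```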