Test multi-GPU training and get it working if not working

yawenzzzz commented 1 month ago

Looks like there are some issues with the multi-GPU run:

With single GPU -

2024-10-11T18:42:25.848991169Z INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
2024-10-11T18:42:25.849301623Z INFO:lightning.pytorch.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
2024-10-11T18:42:26.544851974Z INFO: 
2024-10-11T18:42:26.544920501Z   | Name         | Type             | Params | Mode 
2024-10-11T18:42:26.544924074Z ----------------------------------------------------------
2024-10-11T18:42:26.544927193Z 0 | model        | MultiTaskModel   | 87.9 M | train
2024-10-11T18:42:26.544932171Z 1 | val_metrics  | MetricCollection | 0      | train
2024-10-11T18:42:26.544936524Z 2 | test_metrics | MetricCollection | 0      | train
2024-10-11T18:42:26.544942009Z ----------------------------------------------------------
2024-10-11T18:42:26.544945813Z 87.9 M    Trainable params
2024-10-11T18:42:26.544949919Z 0         Non-trainable params
2024-10-11T18:42:26.544971533Z 87.9 M    Total params
2024-10-11T18:42:26.544974303Z 351.764   Total estimated model params size (MB)
2024-10-11T18:42:26.544977350Z 480       Modules in train mode
2024-10-11T18:42:26.544980352Z 0         Modules in eval mode
2024-10-11T18:42:26.544993302Z INFO:lightning.pytorch.callbacks.model_summary:
2024-10-11T18:42:26.544997514Z   | Name         | Type             | Params | Mode 
2024-10-11T18:42:26.545001811Z ----------------------------------------------------------
2024-10-11T18:42:26.545005899Z 0 | model        | MultiTaskModel   | 87.9 M | train
2024-10-11T18:42:26.545008368Z 1 | val_metrics  | MetricCollection | 0      | train
2024-10-11T18:42:26.545010744Z 2 | test_metrics | MetricCollection | 0      | train
2024-10-11T18:42:26.545013195Z ----------------------------------------------------------
2024-10-11T18:42:26.545017609Z 87.9 M    Trainable params
2024-10-11T18:42:26.545021678Z 0         Non-trainable params
2024-10-11T18:42:26.545026100Z 87.9 M    Total params
2024-10-11T18:42:26.545031384Z 351.764   Total estimated model params size (MB)
2024-10-11T18:42:26.545034164Z 480       Modules in train mode
2024-10-11T18:42:26.545036577Z 0         Modules in eval mode
2024-10-11T18:42:26.547792366Z got 167 examples in split val

With multi-GPU -

2024-10-11T18:37:52.621589524Z INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
2024-10-11T18:37:52.621892550Z INFO:lightning.pytorch.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
2024-10-11T18:37:52.622332016Z INFO: LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
2024-10-11T18:37:52.622342639Z INFO:lightning.pytorch.accelerators.cuda:LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
2024-10-11T18:37:53.246377508Z [rank1]: Traceback (most recent call last):
2024-10-11T18:37:53.246417299Z [rank1]:   File "<frozen runpy>", line 198, in _run_module_as_main
2024-10-11T18:37:53.246420109Z [rank1]:   File "<frozen runpy>", line 88, in _run_code
2024-10-11T18:37:53.246422038Z [rank1]:   File "/opt/rslearn_projects/rslp/docker_entrypoint.py", line 27, in <module>
2024-10-11T18:37:53.246424128Z [rank1]:     main()
2024-10-11T18:37:53.246425837Z [rank1]:   File "/opt/rslearn_projects/rslp/docker_entrypoint.py", line 22, in main
2024-10-11T18:37:53.246427483Z [rank1]:     rslp.rslearn_main.main()
2024-10-11T18:37:53.246429043Z [rank1]:   File "/opt/rslearn_projects/rslp/rslearn_main.py", line 23, in main
2024-10-11T18:37:53.246430707Z [rank1]:     rslearn.main.main()
2024-10-11T18:37:53.246432380Z [rank1]:   File "/opt/rslearn_projects/rslearn/rslearn/main.py", line 641, in main
2024-10-11T18:37:53.246434104Z [rank1]:     handler()
2024-10-11T18:37:53.246436033Z [rank1]:   File "/opt/rslearn_projects/rslearn/rslearn/main.py", line 606, in model_fit
2024-10-11T18:37:53.246437833Z [rank1]:     model_handler()
2024-10-11T18:37:53.246540292Z [rank1]:   File "/opt/rslearn_projects/rslearn/rslearn/main.py", line 593, in model_handler
2024-10-11T18:37:53.246561039Z [rank1]:     RslearnLightningCLI(
2024-10-11T18:37:53.246563721Z [rank1]:   File "/opt/conda/lib/python3.11/site-packages/lightning/pytorch/cli.py", line 394, in __init__
2024-10-11T18:37:53.246566477Z [rank1]:     self._run_subcommand(self.subcommand)
2024-10-11T18:37:53.246569102Z [rank1]:   File "/opt/conda/lib/python3.11/site-packages/lightning/pytorch/cli.py", line 701, in _run_subcommand
2024-10-11T18:37:53.246571862Z [rank1]:     fn(**fn_kwargs)
2024-10-11T18:37:53.246574852Z [rank1]:   File "/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
2024-10-11T18:37:53.246577938Z [rank1]:     call._call_and_handle_interrupt(
2024-10-11T18:37:53.246580784Z [rank1]:   File "/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
2024-10-11T18:37:53.246583941Z [rank1]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
2024-10-11T18:37:53.246586565Z [rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-10-11T18:37:53.246589446Z [rank1]:   File "/opt/conda/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
2024-10-11T18:37:53.246592199Z [rank1]:     return function(*args, **kwargs)
2024-10-11T18:37:53.246594616Z [rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
2024-10-11T18:37:53.246596869Z [rank1]:   File "/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
2024-10-11T18:37:53.246599293Z [rank1]:     self._run(model, ckpt_path=ckpt_path)
2024-10-11T18:37:53.246607779Z [rank1]:   File "/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 961, in _run
2024-10-11T18:37:53.246610362Z [rank1]:     call._call_callback_hooks(self, "on_fit_start")
2024-10-11T18:37:53.246613503Z [rank1]:   File "/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 218, in _call_callback_hooks
2024-10-11T18:37:53.246616185Z [rank1]:     fn(trainer, trainer.lightning_module, *args, **kwargs)
2024-10-11T18:37:53.246618623Z [rank1]:   File "/opt/rslearn_projects/rslp/lightning_cli.py", line 36, in on_fit_start
2024-10-11T18:37:53.246621113Z [rank1]:     run_id = wandb.run.id
2024-10-11T18:37:53.246623879Z [rank1]:              ^^^^^^^^^^^^
2024-10-11T18:37:53.246626541Z [rank1]: AttributeError: 'NoneType' object has no attribute 'id'
2024-10-11T18:37:53.301602821Z INFO: 
2024-10-11T18:37:53.301634535Z   | Name         | Type             | Params | Mode 
2024-10-11T18:37:53.301636785Z ----------------------------------------------------------
2024-10-11T18:37:53.301639505Z 0 | model        | MultiTaskModel   | 87.9 M | train
2024-10-11T18:37:53.301671902Z 1 | val_metrics  | MetricCollection | 0      | train
2024-10-11T18:37:53.301674170Z 2 | test_metrics | MetricCollection | 0      | train
2024-10-11T18:37:53.301678328Z ----------------------------------------------------------
2024-10-11T18:37:53.301680743Z 87.9 M    Trainable params
2024-10-11T18:37:53.301682599Z 0         Non-trainable params
2024-10-11T18:37:53.301685212Z 87.9 M    Total params
2024-10-11T18:37:53.301687615Z 351.764   Total estimated model params size (MB)
2024-10-11T18:37:53.301689905Z 480       Modules in train mode
2024-10-11T18:37:53.301692283Z 0         Modules in eval mode
2024-10-11T18:37:53.301709401Z INFO:lightning.pytorch.callbacks.model_summary:
2024-10-11T18:37:53.301711650Z   | Name         | Type             | Params | Mode 
2024-10-11T18:37:53.301713963Z ----------------------------------------------------------
2024-10-11T18:37:53.301716281Z 0 | model        | MultiTaskModel   | 87.9 M | train
2024-10-11T18:37:53.301718434Z 1 | val_metrics  | MetricCollection | 0      | train
2024-10-11T18:37:53.301720842Z 2 | test_metrics | MetricCollection | 0      | train
2024-10-11T18:37:53.301722863Z ----------------------------------------------------------
2024-10-11T18:37:53.301724878Z 87.9 M    Trainable params
2024-10-11T18:37:53.301727477Z 0         Non-trainable params
2024-10-11T18:37:53.301729536Z 87.9 M    Total params
2024-10-11T18:37:53.301731617Z 351.764   Total estimated model params size (MB)
2024-10-11T18:37:53.301733566Z 480       Modules in train mode
2024-10-11T18:37:53.301735628Z 0         Modules in eval mode
2024-10-11T18:37:53.793554261Z got 167 examples in split val
2024-10-11T18:38:00.035387474Z INFO: [rank: 1] Child process with PID 114 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
2024-10-11T18:38:00.035710878Z INFO:lightning.fabric.strategies.launchers.subprocess_script:[rank: 1] Child process with PID 114 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
2024-10-11T18:38:00.235769507Z /opt/conda/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
2024-10-11T18:38:00.235830952Z   warnings.warn('resource_tracker: There appear to be %d '

This uses the same config; I only changed gpu_count in launch_beaker.py. Command: python -m rslp.launch_beaker --config_path landsat/recheck_landsat_labels/phase123_config.yaml

yawenzzzz commented 1 month ago

Fixed two issues:

wandb init

2024-10-11T20:10:11.640908017Z [rank1]:     run_id = wandb.run.id
2024-10-11T20:10:11.640909595Z [rank1]:              ^^^^^^^^^^^^
2024-10-11T20:10:11.640910986Z [rank1]: AttributeError: 'NoneType' object has no attribute 'id'

Fix:

@rank_zero_only
def on_fit_start(self, trainer, pl_module):

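This decorator ensures that on_fit_start only runs on the main process (rank 0). As the traceback shows, wandb.run is None on rank 1, so reading wandb.run.id there raises the AttributeError. Restricting the hook to rank 0 avoids that, and it also keeps multiple processes in a multi-GPU run from accessing or modifying shared resources like W&B runs.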
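To make the effect of the rank-zero gating concrete, here is a minimal sketch that runs without Lightning or W&B. Everything in it is a stand-in: `rank_zero_only_sketch`, `FakeRun`, `Callback`, and `simulate` are hypothetical names, the decorator is a simplified imitation and not Lightning's actual implementation, and it assumes the launcher exports `LOCAL_RANK` for each process.

```python
import functools
import os


def rank_zero_only_sketch(fn):
    """Simplified stand-in for rank-zero gating (not Lightning's real code)."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Assumption: each process has LOCAL_RANK set by the launcher.
        if int(os.environ.get("LOCAL_RANK", "0")) == 0:
            return fn(*args, **kwargs)
        return None  # non-zero ranks skip the call entirely

    return wrapper


class FakeRun:
    """Hypothetical stand-in for a W&B run object."""

    id = "abc123"


class Callback:
    def __init__(self, run):
        self.run = run  # None on ranks where no run was initialized
        self.run_id = None

    @rank_zero_only_sketch
    def on_fit_start(self):
        # Without the decorator this line fails when self.run is None.
        self.run_id = self.run.id


def simulate(local_rank, run):
    """Run the hook as if we were the process with the given local rank."""
    os.environ["LOCAL_RANK"] = str(local_rank)
    cb = Callback(run)
    cb.on_fit_start()
    return cb.run_id
```

Calling `simulate(0, FakeRun())` reads the run id as before, while `simulate(1, None)` returns without touching the missing run, which is the behaviour the fix relies on.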

DDP training with unused params

RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value `strategy='ddp_find_unused_parameters_true'` or by setting the flag in the strategy with `strategy=DDPStrategy(find_unused_parameters=True)`.

Fix:

# Configure DDP strategy with find_unused_parameters=True
c.trainer.strategy = jsonargparse.Namespace(
    {
        "class_path": "lightning.pytorch.strategies.DDPStrategy",
        "init_args": jsonargparse.Namespace(
            {
                "find_unused_parameters": True
            }
        ),
    }
)
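For readers unfamiliar with the `class_path` / `init_args` convention used in the fix: the CLI resolves such a pair by importing the class and calling it with the given arguments, so the setting above should amount to the `strategy=DDPStrategy(find_unused_parameters=True)` form suggested by the error message. The sketch below illustrates that resolution step only. `instantiate` is a hypothetical helper, not jsonargparse's implementation, and it uses a standard-library class as the target so it runs without Lightning installed.

```python
import importlib


def instantiate(spec):
    """Resolve a {'class_path': ..., 'init_args': ...} spec into an object.

    Simplified illustration only; the real CLI logic handles much more
    (validation, nested specs, type checking).
    """
    module_name, _, class_name = spec["class_path"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**spec.get("init_args", {}))


# Stand-in spec using a stdlib class so the sketch runs anywhere.
demo = instantiate(
    {"class_path": "datetime.timedelta", "init_args": {"seconds": 90}}
)
print(demo.total_seconds())
```

With Lightning available, the same spec shape with `"lightning.pytorch.strategies.DDPStrategy"` and `{"find_unused_parameters": True}` would be expected to build the strategy object that the fix configures.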

yawenzzzz commented 1 month ago

beaker run: https://beaker.org/ex/01J9YMRH2DBEGJ8J2HXQAN5VTF/tasks/01J9YMRH2MVTWKFKSHJGV03WAQ/job/01J9YMRH7D0Q364CZ5WKG3VXF4

yawenzzzz commented 1 month ago

Resolved in rslearn_projects: https://github.com/allenai/rslearn_projects/pull/37

allenai / rslearn

Test multi-GPU training and get it working if not working #49