favyen2 closed this 1 month ago
Fixed two issues:
2024-10-11T20:10:11.640908017Z [rank1]: run_id = wandb.run.id
2024-10-11T20:10:11.640909595Z [rank1]: ^^^^^^^^^^^^
2024-10-11T20:10:11.640910986Z [rank1]: AttributeError: 'NoneType' object has no attribute 'id'
Fix:
from lightning.pytorch.utilities import rank_zero_only

@rank_zero_only
def on_fit_start(self, trainer, pl_module):
This decorator ensures that the on_fit_start method only executes on the main process (rank 0). This matters in a multi-GPU setup because wandb.init is only called on rank 0, so on every other rank wandb.run is None and accessing wandb.run.id raises the AttributeError shown above. More generally, the decorator prevents multiple processes from trying to access or modify shared resources like W&B runs.
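Conceptually, the decorator works like the minimal sketch below. This is a simplified stand-in, not Lightning's actual implementation: here the rank is read from the RANK environment variable that distributed launchers typically set.

```python
import functools
import os


def rank_zero_only(fn):
    """Simplified stand-in for Lightning's rank_zero_only decorator.

    The real decorator resolves the global rank set up by the distributed
    launcher; this sketch just reads the RANK environment variable.
    """
    @functools.wraps(fn)
    def wrapped(*args, **kwargs):
        if int(os.environ.get("RANK", "0")) == 0:
            return fn(*args, **kwargs)
        return None  # non-zero ranks skip the call instead of touching shared state

    return wrapped


@rank_zero_only
def get_run_id(run):
    # On rank 0, `run` is the initialized W&B run; on other ranks wandb.run
    # would be None, which is exactly what produced the AttributeError above.
    return run.id
```

With this guard, rank 1 returns None instead of dereferencing a missing run, while rank 0 still reads the id normally.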
RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value `strategy='ddp_find_unused_parameters_true'` or by setting the flag in the strategy with `strategy=DDPStrategy(find_unused_parameters=True)`.
Fix:
# Configure DDP strategy with find_unused_parameters=True
c.trainer.strategy = jsonargparse.Namespace(
    {
        "class_path": "lightning.pytorch.strategies.DDPStrategy",
        "init_args": jsonargparse.Namespace(
            {"find_unused_parameters": True}
        ),
    }
)
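For reference, the same strategy override expressed as a Lightning CLI YAML fragment (assuming the project's trainer section is loaded from a config file) would look like:

```yaml
trainer:
  strategy:
    class_path: lightning.pytorch.strategies.DDPStrategy
    init_args:
      find_unused_parameters: true
```

This is the class_path/init_args form that jsonargparse resolves into an instantiated DDPStrategy, matching the Namespace construction above.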
Resolved in rslearn_projects: https://github.com/allenai/rslearn_projects/pull/37
Looks like there are some issues with multi-GPU runs:
With single GPU -
With multi-GPU -
Both runs use the same config; only gpu_count in launch_beaker.py was changed. Command: python -m rslp.launch_beaker --config_path landsat/recheck_landsat_labels/phase123_config.yaml