Lightning-AI / pytorch-lightning


SaveConfigCallback.save_config conflicts with DDP #19754

Open KeplerWang opened 3 months ago

KeplerWang commented 3 months ago

Bug description

The default config_path is not what I want when using WandbLogger, so I created my own SaveConfigCallback subclass called WandBSaveConfigCallback. But when starting training with DDP, I encounter a RuntimeError. Details are as follows:

  1. Start a training run from LightningCLI and pass WandBSaveConfigCallback, which extends SaveConfigCallback and implements the save_config function (the hook left for custom save paths, I guess) to modify the config_path of config.yaml. The code is listed below.

```python
# main.py
from lightning.pytorch.cli import LightningCLI, ArgsType

from utils import WandBSaveConfigCallback


def cli_main(args: ArgsType = None):
    cli = LightningCLI(
        seed_everything_default=42,
        save_config_callback=WandBSaveConfigCallback,
        save_config_kwargs={'save_to_log_dir': False},
        args=args,
    )


if __name__ == '__main__':
    cli_main()
```

```python
# utils.py (WandBSaveConfigCallback)
import os

from lightning.pytorch.cli import SaveConfigCallback
from lightning.fabric.utilities.cloud_io import get_filesystem

class WandBSaveConfigCallback(SaveConfigCallback):
    def save_config(self, trainer, pl_module, stage):
        ##### only changed the config_path
        assert trainer.log_dir is not None
        assert not self.save_to_log_dir, 'save_to_log_dir must be False for WandBSaveConfigCallback'

        log_dir = os.path.join(trainer.log_dir, trainer.logger.name, trainer.logger.version)
        config_path = os.path.join(log_dir, self.config_filename)
        #####
        #### the code below is the same as in SaveConfigCallback.setup
        fs = get_filesystem(log_dir)

        if not self.overwrite:
            # check if the file exists on rank 0
            file_exists = fs.isfile(config_path) if trainer.is_global_zero else False
            # broadcast whether to fail to all ranks
            file_exists = trainer.strategy.broadcast(file_exists)
            if file_exists:
                raise RuntimeError(
                    f"{self.__class__.__name__} expected {config_path} to NOT exist. Aborting to avoid overwriting"
                    " results of a previous run. You can delete the previous config file,"
                    " set `LightningCLI(save_config_callback=None)` to disable config saving,"
                    ' or set `LightningCLI(save_config_kwargs={"overwrite": True})` to overwrite the config file.'
                )

        if trainer.is_global_zero:
            # save only on rank zero to avoid race conditions.
            # the `log_dir` needs to be created as we rely on the logger to do it usually
            # but it hasn't logged anything at this point
            fs.makedirs(log_dir, exist_ok=True)
            self.parser.save(
                self.config, config_path, skip_none=False, overwrite=self.overwrite, multifile=self.multifile
            )
```

  2. Set the trainer.yaml with devices=2/3/4 and strategy='ddp', like this:

    accelerator: gpu
    devices: 2
    num_nodes: 1
    strategy: ddp
    precision: 16-mixed
    logger:
      class_path: lightning.pytorch.loggers.WandbLogger
      init_args:
        save_dir: lightning_logs
        project: test
        offline: true
    max_epochs: 4
    log_every_n_steps: 1
    check_val_every_n_epoch: 1
    deterministic: true
  3. After running `python main.py fit --model cfg/model.yaml --data cfg/data.yaml --trainer cfg/trainer.yaml`, I get the following RuntimeError: DDP expects same model across all ranks, but Rank 1 has 2 params, while rank 0 has inconsistent 4 params.
  4. If I comment out these two lines in main.py, everything works fine:
    save_config_callback=WandBSaveConfigCallback,
    save_config_kwargs={'save_to_log_dir': False},
  5. Alternatively, if I create a new class WandBSaveConfigCallback_2 (extending SaveConfigCallback and reimplementing the setup logic to modify the config_path), everything also works; a rough sketch of that approach follows this list.
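
For reference, here is a minimal sketch of what such a setup override might look like. It reuses the path logic from WandBSaveConfigCallback above and the checks from its save_config; it illustrates the workaround in the last point and is not the exact code from the linked repo.

```python
# utils.py (WandBSaveConfigCallback_2) -- rough sketch of the setup-override workaround
import os

from lightning.pytorch.cli import SaveConfigCallback
from lightning.fabric.utilities.cloud_io import get_filesystem


class WandBSaveConfigCallback_2(SaveConfigCallback):
    def setup(self, trainer, pl_module, stage):
        # setup() is an ordinary callback hook, so it runs on every rank
        assert trainer.log_dir is not None
        log_dir = os.path.join(trainer.log_dir, trainer.logger.name, trainer.logger.version)
        config_path = os.path.join(log_dir, self.config_filename)
        fs = get_filesystem(log_dir)

        if not self.overwrite:
            # check on rank 0, then broadcast the result -- every rank reaches this collective call
            file_exists = fs.isfile(config_path) if trainer.is_global_zero else False
            file_exists = trainer.strategy.broadcast(file_exists)
            if file_exists:
                raise RuntimeError(
                    f"{self.__class__.__name__} expected {config_path} to NOT exist."
                    " Delete it or set overwrite=True."
                )

        if trainer.is_global_zero:
            # write only on rank 0 to avoid race conditions
            fs.makedirs(log_dir, exist_ok=True)
            self.parser.save(
                self.config, config_path, skip_none=False, overwrite=self.overwrite, multifile=self.multifile
            )
```

The difference from the save_config override shown earlier is that setup is called on every rank, so the broadcast is entered by all processes together.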

What version are you seeing the problem on?

v2.1

How to reproduce the bug

See the bug description above.
GitHub repo:
https://github.com/KeplerWang/lightning_test_example

Error messages and logs

Seed set to 42
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[rank: 0] Seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[rank: 1] Seed set to 42
[rank: 1] Seed set to 42
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
wandb: WARNING Path lightning_logs/wandb/ wasn't writable, using system temp directory.
wandb: WARNING `resume` will be ignored since W&B syncing is set to `offline`. Starting a new run with run id yx4ql3i4.
wandb: WARNING Path lightning_logs/wandb/ wasn't writable, using system temp directory
wandb: Tracking run with wandb version 0.15.8
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Traceback (most recent call last):
  File "/data/ldap_shared/synology_shared/wzc/Projects/stage_proj/lightning_test_example/main.py", line 16, in <module>
    cli_main()
  File "/data/ldap_shared/synology_shared/wzc/Projects/stage_proj/lightning_test_example/main.py", line 7, in cli_main
    cli = LightningCLI(
  File "/ldap_shared/synology_shared/wzc/miniconda3/envs/vision/lib/python3.9/site-packages/lightning/pytorch/cli.py", line 386, in __init__
    self._run_subcommand(self.subcommand)
  File "/ldap_shared/synology_shared/wzc/miniconda3/envs/vision/lib/python3.9/site-packages/lightning/pytorch/cli.py", line 677, in _run_subcommand
    fn(**fn_kwargs)
  File "/ldap_shared/synology_shared/wzc/miniconda3/envs/vision/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/ldap_shared/synology_shared/wzc/miniconda3/envs/vision/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/ldap_shared/synology_shared/wzc/miniconda3/envs/vision/lib/python3.9/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
  File "/ldap_shared/synology_shared/wzc/miniconda3/envs/vision/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/ldap_shared/synology_shared/wzc/miniconda3/envs/vision/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 965, in _run
    self.strategy.setup(self)
  File "/ldap_shared/synology_shared/wzc/miniconda3/envs/vision/lib/python3.9/site-packages/lightning/pytorch/strategies/ddp.py", line 168, in setup
    self.configure_ddp()
  File "/ldap_shared/synology_shared/wzc/miniconda3/envs/vision/lib/python3.9/site-packages/lightning/pytorch/strategies/ddp.py", line 277, in configure_ddp
    self.model = self._setup_model(self.model)
  File "/ldap_shared/synology_shared/wzc/miniconda3/envs/vision/lib/python3.9/site-packages/lightning/pytorch/strategies/ddp.py", line 190, in _setup_model
    return DistributedDataParallel(module=model, device_ids=device_ids, **self._ddp_kwargs)
  File "/ldap_shared/synology_shared/wzc/miniconda3/envs/vision/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/ldap_shared/synology_shared/wzc/miniconda3/envs/vision/lib/python3.9/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: DDP expects same model across all ranks, but Rank 1 has 2 params, while rank 0 has inconsistent 4 params.

  | Name  | Type   | Params
---------------------------------
0 | model | Linear | 2     
---------------------------------
2         Trainable params
0         Non-trainable params
2         Total params
0.000     Total estimated model params size (MB)
[rank: 1] Child process with PID 175075 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
[1]    175058 killed     python main.py fit --model cfg/model.yaml --data cfg/data.yaml --trainer

Environment

Current environment

```
* CUDA:
  - GPU:
    - NVIDIA GeForce RTX 4090
    - NVIDIA GeForce RTX 4090
    - NVIDIA GeForce RTX 4090
    - NVIDIA GeForce RTX 4090
  - available: True
  - version: 11.3
* Lightning:
  - lightning: 2.1.2
  - lightning-cloud: 0.5.38
  - lightning-utilities: 0.9.0
  - pytorch-lightning: 2.1.4
  - torch: 1.12.1+cu113
  - torchaudio: 0.12.1
  - torchmetrics: 0.11.4
  - torchvision: 0.13.1+cu113
* System:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor: x86_64
  - python: 3.9.16
  - release: 5.15.0-101-generic
  - version: #111~20.04.1-Ubuntu SMP Mon Mar 11 15:44:43 UTC 2024
```

More info

No response

carlesoctav commented 3 months ago

I have the same issue (with the XLA strategy).

brod4910 commented 2 months ago

Hello, for anyone finding this issue in the future: I ran into the same problem while implementing a custom SaveConfigCallback. It stemmed from the custom callback using the @rank_zero_only decorator when setting up the state needed for pushing to a cloud server. The obvious reason I did this was so that each GPU process wouldn't create the state required for pushing to the server, since only one was needed.

However, when using DDP, the extra "parameters" created by the callback would only get created on the rank-zero process, resulting in the parameter-count mismatch. Removing this decorator and instead wrapping the on_{stage}_end functions with @rank_zero_only fixed the issue. Every GPU creates the state this way, but only rank zero pushes to the server. A sketch of the resulting pattern is below.
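
To make that concrete, here is a minimal sketch of the pattern; `_init_upload_state` and `_push_to_server` are hypothetical placeholders for the cloud-upload logic, not part of Lightning or of my actual callback.

```python
# Sketch of the fix described above: build the state on every rank,
# gate only the network I/O behind rank zero.
from lightning.pytorch.cli import SaveConfigCallback
from lightning.pytorch.utilities import rank_zero_only


class CloudSaveConfigCallback(SaveConfigCallback):
    def setup(self, trainer, pl_module, stage):
        super().setup(trainer, pl_module, stage)
        # NOT rank-zero-only: every rank builds the same state, so the model
        # (and its parameter count) stays identical across processes.
        self._init_upload_state(pl_module)

    @rank_zero_only
    def on_fit_end(self, trainer, pl_module):
        # only rank zero actually talks to the server
        self._push_to_server()

    def _init_upload_state(self, pl_module):
        ...  # placeholder: create whatever state the upload needs

    def _push_to_server(self):
        ...  # placeholder: upload the saved config / artifacts
```

In my case the relevant hooks were the on_{stage}_end ones; the important part is that anything touching the model exists on all ranks and only the upload itself is rank-zero-gated.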

Hope this helps!