Open KeplerWang opened 3 months ago
I have the same issue (with the XLA strategy).
Hello, for anyone finding this issue in the future: I ran into the same problem whilst implementing a custom `SaveConfigCallback`. It stemmed from the custom callback using the `@rank_zero_only` decorator when setting up state for pushing to a cloud server. The obvious reason I did this was so that each GPU process wouldn't create the state required for pushing to the server, since only one copy was needed.
However, when using DDP, the extra "parameters" created by the callback would only get created on the rank-zero process, resulting in the parameter count mismatch. Removing this decorator from the setup code and instead wrapping the `on_{stage}_end` functions with `@rank_zero_only` fixed the issue. Every GPU creates the state this way, but only rank zero pushes to the server.
Hope this helps!
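A minimal sketch of the before/after pattern described above (the `PushStateCallback` name and the build/push helpers are hypothetical, not from the original comment):

```python
from lightning.pytorch import LightningModule, Trainer
from lightning.pytorch.callbacks import Callback
from lightning.pytorch.utilities import rank_zero_only


class PushStateCallback(Callback):
    # Broken variant: decorating setup() with @rank_zero_only means the state
    # (and any parameters it registers) exists only on rank 0, so DDP sees
    # models with different parameter counts across ranks:
    #
    #     @rank_zero_only
    #     def setup(self, trainer, pl_module, stage): ...

    def setup(self, trainer: Trainer, pl_module: LightningModule, stage: str) -> None:
        # Fixed variant: every rank builds the same state, keeping models identical.
        self.state = self._build_state(pl_module)

    @rank_zero_only
    def on_fit_end(self, trainer: Trainer, pl_module: LightningModule) -> None:
        # Only rank zero performs the side effect; this touches no model state,
        # so restricting it to one process is safe.
        self._push_to_server(self.state)

    def _build_state(self, pl_module):  # hypothetical placeholder for the real logic
        return {"num_params": sum(p.numel() for p in pl_module.parameters())}

    def _push_to_server(self, state):  # hypothetical placeholder for the real upload
        print(f"pushing {state} to the server")
```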
Bug description
The `config_path` is not desired while using `WandbLogger`, so I create my own `SaveConfigCallback` called `WandBSaveConfigCallback`. But when starting training with DDP, I encounter a `RuntimeError`. Details are as follows:

1. I use `LightningCLI` and pass it `WandBSaveConfigCallback` (which extends `SaveConfigCallback` and implements the `save_config` func, the interface left for a different saving path, I guess) to modify the `config_path` of `config.yaml`. The code is listed below.

```python
from lightning.pytorch.cli import ArgsType, LightningCLI

from utils import WandBSaveConfigCallback


def cli_main(args: ArgsType = None):
    cli = LightningCLI(
        seed_everything_default=42,
        save_config_callback=WandBSaveConfigCallback,
        save_config_kwargs={'save_to_log_dir': False},
        args=args,
    )


if __name__ == '__main__':
    cli_main()
```
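The callback itself wasn't included above; here is a minimal sketch of its intended shape, assuming `save_config` writes the config into the W&B run directory (the `trainer.logger.experiment.dir` target and the exact `parser.save` call are my assumptions, not the original code):

```python
import os

from lightning.pytorch import LightningModule, Trainer
from lightning.pytorch.cli import SaveConfigCallback
from lightning.pytorch.loggers import WandbLogger


class WandBSaveConfigCallback(SaveConfigCallback):
    """Save the CLI config into the W&B run directory instead of trainer.log_dir."""

    def save_config(self, trainer: Trainer, pl_module: LightningModule, stage: str) -> None:
        # In Lightning 2.x, SaveConfigCallback.setup() calls save_config on global
        # rank zero only, so this method must not create extra model state.
        if not isinstance(trainer.logger, WandbLogger):
            return
        run_dir = trainer.logger.experiment.dir  # assumed target directory
        # self.parser, self.config, and self.config_filename are populated
        # by the SaveConfigCallback base class.
        config_path = os.path.join(run_dir, self.config_filename)
        self.parser.save(self.config, config_path, skip_none=False, overwrite=True)
```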
2. When I set `devices=2/3/4` and `strategy='ddp'` in `trainer.yaml` and launch like this: `python main.py fit --model cfg/model.yaml --data cfg/data.yaml --trainer cfg/trainer.yaml`, I encounter: `RuntimeError: DDP expects same model across all ranks, but Rank 1 has 2 params, while rank 0 has inconsistent 4 params.`
3. When I run `main.py` with a single device, everything is OK.
4. When I use `WandBSaveConfigCallback_2` (which extends `SaveConfigCallback` and reimplements the `setup` logic to modify the `config_path`; see the sketch below), it's all OK, too.
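For contrast, a sketch of what the working `setup`-override variant from item 4 might look like (again an assumption, not the original code; the key point is that `setup` runs on every rank and only the filesystem write is restricted to rank zero):

```python
import os

from lightning.pytorch import LightningModule, Trainer
from lightning.pytorch.cli import SaveConfigCallback


class WandBSaveConfigCallback_2(SaveConfigCallback):
    def setup(self, trainer: Trainer, pl_module: LightningModule, stage: str) -> None:
        # Runs identically on all ranks, so no per-rank model state diverges.
        if trainer.is_global_zero:
            run_dir = trainer.logger.experiment.dir  # assumed W&B run directory
            config_path = os.path.join(run_dir, self.config_filename)
            self.parser.save(self.config, config_path, skip_none=False, overwrite=True)
```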
What version are you seeing the problem on?

v2.1
How to reproduce the bug
Error messages and logs
Environment
Current environment
```
* CUDA:
    - GPU:
        - NVIDIA GeForce RTX 4090
        - NVIDIA GeForce RTX 4090
        - NVIDIA GeForce RTX 4090
        - NVIDIA GeForce RTX 4090
    - available: True
    - version: 11.3
* Lightning:
    - lightning: 2.1.2
    - lightning-cloud: 0.5.38
    - lightning-utilities: 0.9.0
    - pytorch-lightning: 2.1.4
    - torch: 1.12.1+cu113
    - torchaudio: 0.12.1
    - torchmetrics: 0.11.4
    - torchvision: 0.13.1+cu113
* System:
    - OS: Linux
    - architecture:
        - 64bit
        - ELF
    - processor: x86_64
    - python: 3.9.16
    - release: 5.15.0-101-generic
    - version: #111~20.04.1-Ubuntu SMP Mon Mar 11 15:44:43 UTC 2024
```

More info
No response