Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Another profiling tool is already active #19983

Open zhaohm14 opened 2 months ago

zhaohm14 commented 2 months ago

Bug description

When I pass profiler='advanced' while creating a trainer, this error is raised inside trainer.fit():

ValueError: Another profiling tool is already active

Everything works fine if I use profiler='simple' instead.

What version are you seeing the problem on?

master

How to reproduce the bug

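import lightning as L
from lightning.pytorch.callbacks import ModelCheckpoint, ModelSummary
from lightning.pytorch.loggers import WandbLogger

# config, model, train_loader and val_loader are defined elsewhere.
# profiler='advanced' is presumably passed in through config.train.trainer.
trainer = L.Trainer(
    default_root_dir=config.train.save_dir,
    callbacks=[
        ModelCheckpoint(
            dirpath=config.train.save_dir,
            every_n_train_steps=config.train.save_step,
            save_top_k=config.train.save_ckpt_keep_num,
            mode='max',
            monitor='global_step'
        ),
        ModelSummary(max_depth=9)
    ],
    logger=WandbLogger(log_model="all"),
    **config.train.trainer
)
if config.train.resume_from_ckpt:
    # Resume training from an existing checkpoint.
    trainer.fit(
        model=model,
        train_dataloaders=train_loader,  # TODO: does the dataloader need to be passed?
        val_dataloaders=val_loader,
        ckpt_path=config.train.resume_from_ckpt
    )
else:
    trainer.fit(
        model=model,
        train_dataloaders=train_loader,
        val_dataloaders=val_loader
    )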

Error messages and logs

# Error messages and logs here please

Environment

Current environment
```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
```

More info

No response

cc @carmocca

awaelchli commented 1 month ago

The explanation for why this happens is here: https://github.com/python/cpython/issues/110770#issuecomment-1759986100

The AdvancedProfiler in Lightning enables multiple profilers in a nested fashion. Python apparently never supported this, but it did not complain about it until Python 3.12, where it now raises the error above. To resolve this, the AdvancedProfiler will have to be reworked somehow, so there is some work needed here.
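
To illustrate the behavior described above outside of Lightning, here is a minimal sketch that only uses the standard-library cProfile module. The helper name `nested_profilers_supported` is made up for this example and is not part of Lightning or Python. It assumes that no other profiling tool is already running in the process.

```python
import cProfile


def nested_profilers_supported() -> bool:
    """Return True if a second cProfile.Profile can be enabled while one is active.

    Hypothetical helper for illustration only.
    """
    outer = cProfile.Profile()
    inner = cProfile.Profile()
    outer.enable()
    try:
        try:
            # On Python 3.12+ this is expected to raise
            # "ValueError: Another profiling tool is already active".
            inner.enable()
        except ValueError:
            return False
        # Only disable the inner profiler if it was actually enabled.
        inner.disable()
        return True
    finally:
        # Always release the outer profiler so later profiling still works.
        outer.disable()


if __name__ == "__main__":
    print("nested cProfile profilers supported:", nested_profilers_supported())
```

Running this on different Python versions should show whether the interpreter accepts two active profilers at once, which is the situation the AdvancedProfiler ends up in.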

zhaohm14 commented 1 month ago

Thanks a lot for your help!

awaelchli commented 1 month ago

I'd like to keep the issue open, because we need to work on this (help from community appreciated).
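
Until the AdvancedProfiler is reworked, one possible stopgap is to fall back to the simple profiler on affected Python versions, since profiler='simple' was reported to work. This is only a sketch: `pick_profiler` is a hypothetical helper, not part of Lightning, and the 3.12 cutoff is taken from the explanation above.

```python
import sys


def pick_profiler(preferred: str = "advanced") -> str:
    """Hypothetical helper: choose a profiler name that is expected to work.

    Falls back to "simple" when "advanced" is requested on Python 3.12+,
    where nested profilers raise a ValueError.
    """
    if preferred == "advanced" and sys.version_info >= (3, 12):
        return "simple"
    return preferred


# Usage (assumes Lightning is installed and imported as L):
# trainer = L.Trainer(profiler=pick_profiler("advanced"))
```

The simple profiler gives less detail than the advanced one, so this trades profiling depth for a training run that does not crash.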