Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Profiler Error: record.__enter__() missing 1 required #19253

Open AnnaTrainingG opened 8 months ago

AnnaTrainingG commented 8 months ago

Bug description

[screenshot]

What version are you seeing the problem on?

v2.1

How to reproduce the bug

Just set the profiler like this:

schedule_config = {'wait': 10, 'warmup': 1, 'active': 1, 'repeat': 1}
profiler = PyTorchProfiler(
    dirpath=save_dir,
    filename=filename,
    record_shapes=True,
    with_stack=True,
    with_flops=True,
    with_modules=True,
    schedule=torch.profiler.schedule(**schedule_config),
)
trainer = pl.Trainer(
    ....
    profiler=profiler)

Error messages and logs

[screenshot]

Environment

Current environment
```
- Lightning Component: Trainer
- PyTorch Lightning Version: 2.0.8
#- Lightning App Version (e.g., 0.5.2):
- PyTorch Version (e.g., 2.1.1):
- Python version (e.g., 3.9):
- OS: Linux
- CUDA/cuDNN version: 11.8
#- GPU models and configuration:
- How you installed Lightning: pip
- Running environment of LightningApp (e.g. local, cloud): cloud
```

More info

No response

cc @carmocca

awaelchli commented 8 months ago

@niuliling123 Can you provide more information please. We can't action this ticket otherwise.

AnnaTrainingG commented 8 months ago

Hello, when I use the profiler in Lightning, I get this error:

in lightning/pytorch/profilers/pytorch.py", line 74
    record.__enter__()
TypeError: __enter__() missing 1 required positional argument: 'self'

The __enter__() in record_function must be called as self.__enter__().

AnnaTrainingG commented 8 months ago

schedule_config = {'wait': 10, 'warmup': 1, 'active': 1, 'repeat': 1}
profiler = PyTorchProfiler(
    dirpath=save_dir,
    filename=filename,
    record_shapes=True,
    with_stack=True,
    with_flops=True,
    with_modules=True,
    schedule=torch.profiler.schedule(**schedule_config),
)
trainer = pl.Trainer(
    ....
    profiler=profiler)

awaelchli commented 8 months ago

Could you update this code example here to demonstrate your issue? I added your profiler settings, but so far this doesn't reproduce (Lightning 2.1, PyTorch 2.1.1):

import torch
from lightning.pytorch import LightningModule, Trainer
from torch.utils.data import DataLoader, Dataset

from lightning.pytorch.profilers import PyTorchProfiler

class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)

def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    schedule_config = {'wait': 10, 'warmup': 1, 'active': 1, 'repeat': 1}
    profiler = PyTorchProfiler(
        dirpath="profile_here",
        filename="profile",
        record_shapes=True,
        with_stack=True,
        with_flops=True,
        with_modules=True,
        schedule=torch.profiler.schedule(**schedule_config),
    )

    model = BoringModel()
    trainer = Trainer(max_epochs=3, profiler=profiler)
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)

if __name__ == "__main__":
    run()

AnnaTrainingG commented 8 months ago

Thank you so much. I also wrote a demo, and it runs fine. But my module still fails. After checking and printing, I found the following:

When record_name is model._orig_mod [torch param], the record is a record_function;

after that, the record is <class 'contextlib.nullcontext'>.

def _start_recording_forward(self, _: nn.Module, input: Tensor, record_name: str) -> Tensor:
    # Add [pl][module] in name for pytorch profiler to recognize
    record = record_function("[pl][module]" + record_name)
    record.__enter__()  # record is <class 'contextlib.nullcontext'>
    self._records[record_name] = record
    return input

# Note: record_name = model._orig_mod here, which is not a string
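The reported message is what Python raises when `__enter__` is called on a context-manager class instead of an instance, which fits `record` being `<class 'contextlib.nullcontext'>`. The standard-library sketch below reproduces only that symptom. It does not show how the class ends up in place of `record_function` inside Lightning, and `enter_error_message` is a hypothetical helper written for this illustration.

```python
import contextlib


def enter_error_message(ctx):
    """Call __enter__ the way the hook does and return the TypeError text, if any."""
    try:
        ctx.__enter__()
    except TypeError as exc:
        return str(exc)
    return None


# An instance enters without error, so the helper returns None.
print(enter_error_message(contextlib.nullcontext()))

# The bare class has no bound `self`, so __enter__() raises TypeError.
print(enter_error_message(contextlib.nullcontext))
```

The second call should report a missing positional argument `self`. The exact prefix of the message varies between Python versions.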

awaelchli commented 8 months ago

I see, this is because you are trying to run the torch profiler with a compiled module (torch.compile). If you want to use the profiler, you will have to temporarily comment out the torch.compile call in your program.
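Rather than commenting the call out by hand, the compile step can be put behind a flag. The sketch below is only one way to do that. `maybe_compile` and the `PROFILE` environment variable are hypothetical names, not part of Lightning or PyTorch, and the real call to `torch.compile` appears only in a comment so the snippet runs without PyTorch installed.

```python
import os
from typing import Callable, TypeVar

M = TypeVar("M")


def maybe_compile(model: M, compile_fn: Callable[[M], M], profiling: bool) -> M:
    """Hypothetical helper: return the model uncompiled while profiling."""
    if profiling:
        return model
    return compile_fn(model)


# Hypothetical switch: setting PROFILE=1 in the environment enables profiling.
PROFILING = os.environ.get("PROFILE", "0") == "1"

# Assumed usage in a training script:
#   model = maybe_compile(BoringModel(), torch.compile, PROFILING)
#   trainer = Trainer(profiler=profiler if PROFILING else None)
```

With this arrangement a profiling run never sees a compiled module, and normal runs keep the compiled one.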

AnnaTrainingG commented 8 months ago

You are great, and I have solved the problem according to what you said, thank you very much!