AdvancedProfiler: manually calling `stop` causes crash #17333

Open tbenst opened 1 year ago

tbenst commented 1 year ago

Bug description

Manually calling the stop method of AdvancedProfiler triggers a crash (SimpleProfiler, PyTorchProfiler, and PassThroughProfiler all work correctly). I have written some new test functions to demonstrate this and am happy to open a PR with these tests if helpful, although I'm not sure how to fix the bug.

What version are you seeing the problem on?


How to reproduce the bug

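import pytorch_lightning as pl
from pytorch_lightning.demos.boring_classes import BoringModel
from pytorch_lightning.profilers import AdvancedProfiler


def test_manual_profiler_call(profiler, tmpdir):

  class MyModel(BoringModel):
    def on_validation_epoch_start(self):
      profiler.start("validation loop")

    def on_validation_epoch_end(self) -> None:
      profiler.stop("validation loop")
      # Assumed from the reply below: the original snippet also called describe() here.
      profiler.describe()

  model = MyModel()
  trainer = pl.Trainer(
      max_epochs=1, limit_train_batches=1, limit_val_batches=1,
      # The posted snippet is cut off at this point; the remaining lines are assumed.
      profiler=profiler,
  )
  trainer.fit(model)

tmpdir = "/tmp"
profiler = AdvancedProfiler(dirpath=tmpdir, filename="AdvancedProfiler")
test_manual_profiler_call(profiler, tmpdir)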

Error messages and logs

ValueError                                Traceback (most recent call last)

<ipython-input-14-86b2ebcd0a37> in <cell line: 2>()
      1 profiler = AdvancedProfiler(dirpath=tmpdir, filename="AdvancedProfiler")
----> 2 test_manual_profiler_call(profiler, tmpdir)

14 frames

/usr/local/lib/python3.9/dist-packages/pytorch_lightning/profilers/ in stop(self, action_name)
     67         pr = self.profiled_actions.get(action_name)
     68         if pr is None:
---> 69             raise ValueError(f"Attempting to stop recording an action ({action_name}) which was never started.")
     70         pr.disable()

ValueError: Attempting to stop recording an action ([LightningModule]MyModel.on_validation_epoch_end) which was never started.


More info

I believe this is a MRE of bug

cc @carmocca @nbcsm @guotuofeng

bkiat1123 commented 1 year ago

profiler.stop does not trigger the crash; profiler.describe does. profiler.describe calls the teardown method, which closes the open file and stream. AdvancedProfiler has an additional step in teardown: it empties profiled_actions.

def teardown(self, stage: Optional[str]) -> None:
        self.profiled_actions = {}

profiled_actions is emptied inside on_validation_epoch_end, so when the profiler then tries to stop the on_validation_epoch_end action, it raises an error.

ValueError: Attempting to stop recording an action ([LightningModule]MyModel.on_validation_epoch_end) which was never started.
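
To make the sequence described above easier to follow, here is a minimal sketch that reproduces it without Lightning. `ToyProfiler` and `run_hook` are hypothetical stand-ins, not the library's code: the toy stores a start timestamp per action instead of a real profile, and `run_hook` only mimics the way the trainer wraps each hook in a start/stop pair.

```python
import time


class ToyProfiler:
    """Simplified stand-in for the behaviour described above (not Lightning's code)."""

    def __init__(self):
        self.profiled_actions = {}

    def start(self, action_name):
        self.profiled_actions[action_name] = time.perf_counter()

    def stop(self, action_name):
        started = self.profiled_actions.get(action_name)
        if started is None:
            raise ValueError(
                f"Attempting to stop recording an action ({action_name}) which was never started."
            )
        return time.perf_counter() - started

    def teardown(self):
        # The step that causes the problem: every recorded action is dropped.
        self.profiled_actions = {}

    def describe(self):
        # Reporting is omitted; only the teardown side effect matters here.
        self.teardown()


def run_hook(profiler, hook_name, hook):
    """Mimic the trainer wrapping a hook in start/stop (assumed behaviour)."""
    profiler.start(hook_name)
    hook()  # the hook body calls describe(), which clears profiled_actions
    profiler.stop(hook_name)  # the entry for hook_name is gone, so this raises


profiler = ToyProfiler()
try:
    run_hook(profiler, "on_validation_epoch_end", profiler.describe)
    error = None
except ValueError as exc:
    error = str(exc)
print(error)
```

In this sketch the manual start/stop pair is never the failing call. The failure comes from the outer stop that runs after the hook returns, which matches the action name in the error message.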

In your case, you can just remove the profiler.describe call from your code. The Trainer will run profiler.describe in _call_teardown_hook during the post-training step.

Actually, calling AdvancedProfiler's describe directly does not work at the moment. If you call it in the middle of profiling, it breaks. If you call it after training, you get an empty summary, because profiled_actions has already been emptied by the Trainer's post-training teardown.
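
If you need intermediate results before teardown, one possible approach is to render the stats yourself from the recorded profiles instead of calling describe. The sketch below uses only the standard library and assumes the values in profiled_actions are cProfile.Profile objects; `render_stats` is a hypothetical helper, not part of Lightning.

```python
import cProfile
import io
import pstats


def render_stats(profiled_actions, line_limit=5):
    """Return {action_name: report text} without clearing profiled_actions."""
    reports = {}
    for action_name, pr in profiled_actions.items():
        stream = io.StringIO()
        # pstats.Stats calls pr.create_stats(), which disables the profile,
        # so only use this on actions that have already been stopped.
        stats = pstats.Stats(pr, stream=stream).strip_dirs().sort_stats("cumulative")
        stats.print_stats(line_limit)
        reports[action_name] = stream.getvalue()
    return reports


# Stand-in for one recorded action.
pr = cProfile.Profile()
pr.enable()
sum(range(100_000))
pr.disable()

profiled_actions = {"validation loop": pr}
reports = render_stats(profiled_actions)
print(reports["validation loop"])
```

Because nothing is torn down, the dictionary still holds every action afterwards, and the Trainer's own teardown can run as usual at the end of training.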