Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Keep User-Defined Order of Callbacks #15026

Open · wistuba opened this issue 1 year ago

wistuba commented 1 year ago

Proposed refactor

Provide a way to disable the automatic reordering, or remove it entirely (see https://github.com/Lightning-AI/lightning/blob/7fed7a12c56c6b0e08f0223f68d7d311ea7ac90f/src/pytorch_lightning/trainer/connectors/callback_connector.py#L88). The Trainer could take an additional argument, e.g. reorder_arguments=True.
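
For illustration, a minimal sketch of how the proposed flag could look in use. Note that reorder_arguments is only the name suggested above and does not exist in the current Trainer API, and ReportMetrics is a stand-in for an HPO callback:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import Callback, ModelCheckpoint

class ReportMetrics(Callback):
    """Stand-in for a callback that reports metrics to an HPO scheduler."""

# With the hypothetical reorder_arguments=False, the Trainer would keep this exact
# order, so ModelCheckpoint's hooks run before ReportMetrics' and a checkpoint
# exists by the time the HPO scheduler can decide to kill the job.
trainer = Trainer(
    callbacks=[ModelCheckpoint(), ReportMetrics()],
    reorder_arguments=False,
)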

Motivation

It is not always the case that ModelCheckpoint needs to be the last callback. For example, a hyperparameter optimization library might externally kill the Python job based on reported feedback before we get a chance to checkpoint.

Pitch

Let me use the exact same order of callbacks I've passed to the trainer.

Additional context

Related: #10260

cc @borda @awaelchli @justusschock

carmocca commented 1 year ago

wrt the motivation, is the hyperparameter optimization library integrated as a Callback? Why does it not save a checkpoint when it signals to kill the job?

wistuba commented 1 year ago

Yes, the communication with the library works via a callback. Please check this example from Ray Tune: https://docs.ray.io/en/latest/_modules/ray/tune/integration/pytorch_lightning.html
The callback simply reports the metric; the HPO library does not provide any signals to the Python code. When using Hyperband, the HPO scheduler might decide to kill the Python job directly. The library cannot checkpoint since it has no access to the model and might not even run on the same machine.

carmocca commented 1 year ago

The TuneReportCheckpointCallback and _TuneCheckpointCallback should subclass Checkpoint so that they get put last when re-ordered:
https://pytorch-lightning.readthedocs.io/en/stable/common/checkpointing_expert.html#writing-your-own-checkpoint-class
https://github.com/Lightning-AI/lightning/blob/0b04aa879f39504ce83d0fc355bd141aba801d6d/src/pytorch_lightning/callbacks/checkpoint.py

This should solve your issue.
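
A minimal sketch of that suggestion, assuming only the public Checkpoint base class; HPOReportCallback is a made-up name, and the print is a placeholder for a real report to the HPO scheduler:

from pytorch_lightning.callbacks import Checkpoint

class HPOReportCallback(Checkpoint):
    """Subclassing Checkpoint groups this callback with the checkpointing
    callbacks, so the Trainer keeps it at the end when it reorders."""

    def on_validation_end(self, trainer, pl_module):
        metrics = {name: float(value) for name, value in trainer.callback_metrics.items()}
        print(metrics)  # placeholder: a real integration would report these to the HPO scheduler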

wistuba commented 1 year ago

I see how this works and of course I could create something like

from typing import List
from pytorch_lightning.callbacks import Callback

class KeepMyOrderPlease(Callback):
    def __init__(self, callbacks: List[Callback]):
        super().__init__()
        self._callbacks = callbacks  # wrapped callbacks, kept in the order given

and then pass

callbacks = KeepMyOrderPlease(my_callbacks)

to the trainer class as a callback.

However, this issue was not created to find a workaround for a specific problem, but rather to make the use of callbacks more elegant. Why is there a need for reordering at all, and why not leave the order to the user? It does not seem very OO to merge two callbacks into one even though they have nothing to do with each other.

Furthermore, even though the Ray Tune example adds checkpointing, this wouldn't be my use case. I want to provide an HPO Callback class and leave it to the user how they checkpoint, or whether they do at all. Again, I'm not looking for a hack here.

carmocca commented 1 year ago

The checkpointing callbacks need to go last because the rest of the callbacks can save state. If another kind of Callback ran after them, its state wouldn't get saved with the checkpoints.

This is why we added the empty Checkpoint class; it's the one we use internally to know whether a Callback should go at the end.

For the other kinds of callbacks, the order passed to the Trainer is preserved.
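
Roughly, the reordering amounts to something like the sketch below (a simplification, not the exact implementation in callback_connector.py):

from typing import List

from pytorch_lightning.callbacks import Callback, Checkpoint

def reorder(callbacks: List[Callback]) -> List[Callback]:
    not_checkpoints = [cb for cb in callbacks if not isinstance(cb, Checkpoint)]
    checkpoints = [cb for cb in callbacks if isinstance(cb, Checkpoint)]
    # relative order inside each group is preserved; Checkpoint instances go last
    return not_checkpoints + checkpoints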

wistuba commented 1 year ago

Are you saying that if my callback extends Checkpoint, it will not get reordered and can go last? It doesn't actually need to checkpoint anything, and I can still use another Checkpoint callback alongside it?

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

abhay-agarwal commented 1 year ago

> The checkpointing callbacks need to go last because the rest of the callbacks can save state. If another kind of Callback ran after them, its state wouldn't get saved with the checkpoints.
>
> This is why we added the empty Checkpoint class; it's the one we use internally to know whether a Callback should go at the end.
>
> For the other kinds of callbacks, the order passed to the Trainer is preserved.

There are many instances where you want a callback to occur after model checkpointing, for example a task that remotely executes that checkpoint as part of some other job. The alternative is to place such calls in a different hook such as on_train_epoch_start, but that breaks the callback paradigm entirely. I would want a way to have callbacks run in the order the user placed them in.
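
For what it's worth, one way to express a run-after-checkpointing callback today is the Checkpoint-subclass approach from earlier in the thread: relative order within the checkpoint group is preserved, so listing the callback after ModelCheckpoint keeps it running afterwards. A minimal sketch, with shutil.copy standing in for whatever remote execution or upload is actually needed:

import shutil

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import Checkpoint, ModelCheckpoint

class PushCheckpoint(Checkpoint):
    """Listed after ModelCheckpoint and also a Checkpoint subclass, so it keeps
    running after the checkpoint has been written."""

    def on_validation_end(self, trainer, pl_module):
        ckpt = trainer.checkpoint_callback  # the ModelCheckpoint instance below
        if ckpt is not None and ckpt.best_model_path:
            shutil.copy(ckpt.best_model_path, "/tmp/mirror.ckpt")  # stand-in for a real upload

trainer = Trainer(callbacks=[ModelCheckpoint(), PushCheckpoint()])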

carmocca commented 1 year ago

The Trainer needs a mechanism to force a hook order (opt-in). https://github.com/Lightning-AI/lightning/issues/17131 suggests this.

Feel free to leave your thoughts or a proposal there.

abhay-agarwal commented 1 year ago

We are going with a callback that extends the Checkpoint stub class. My recommendation to unblock this issue (which is going to remain a sticking point for the community IMO -- it's one of those "opaque magical things that frameworks do" that can piss off devs) is as follows:

  1. Keep whatever you have in place for 2.0-level workarounds.
  2. Implement a set of stub callbacks -- PreCheckpointCallback, CheckpointingCallback, PostCheckpointingCallback, PreLightningModuleCallback, PostLightningModuleCallback -- that operate with stricter guarantees (a hypothetical sketch follows below).
  3. In the next breaking release (i.e. 3.0), possibly adjust all callbacks in the standard library to subclass these stubs.
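
To make item 2 concrete, a purely hypothetical sketch: none of these classes exist in pytorch_lightning, the names are the ones proposed above, and the docstrings are just one possible reading of them:

from pytorch_lightning.callbacks import Callback

class PreCheckpointCallback(Callback):
    """Hooks guaranteed to run before any checkpointing callback."""

class CheckpointingCallback(Callback):
    """Marker for callbacks that write checkpoints (today's Checkpoint)."""

class PostCheckpointingCallback(Callback):
    """Hooks guaranteed to run after all checkpointing callbacks."""

class PreLightningModuleCallback(Callback):
    """Hooks guaranteed to run before the LightningModule's own hooks."""

class PostLightningModuleCallback(Callback):
    """Hooks guaranteed to run after the LightningModule's own hooks."""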

I am unfortunately unable to participate in the OSS code repo here, but I appreciate your work on this framework (which we use heavily)!