Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Runtime events with support for custom handlers #8895

Open yifuwang opened 3 years ago

yifuwang commented 3 years ago

🚀 Feature

Motivation

For large-scale Lightning deployments, it is very useful to collect certain types of trainer runtime events into an analytics system, so that engineers can monitor the overall health of the deployment and obtain useful information for troubleshooting failed or stuck jobs.
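To make the idea concrete, here is a minimal sketch of what such a structured runtime event could look like. All names here (`RuntimeEvent`, `EventSeverity`, the example event names) are hypothetical, not an existing Lightning API:

```python
# Hypothetical sketch: a structured runtime event a trainer could emit
# instead of (or alongside) a free-form log line.
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict


class EventSeverity(Enum):
    INFO = "info"
    WARNING = "warning"
    ERROR = "error"


@dataclass
class RuntimeEvent:
    name: str                       # e.g. "checkpoint_saved", "train_epoch_end"
    severity: EventSeverity = EventSeverity.INFO
    timestamp: float = field(default_factory=time.time)
    metadata: Dict[str, Any] = field(default_factory=dict)


# Structured metadata stays machine-readable, unlike an interpolated log string.
event = RuntimeEvent("checkpoint_saved", metadata={"epoch": 3, "global_step": 1200})
```

Because each event carries typed severity and a metadata dict, an analytics backend can aggregate and query them without parsing log text.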

Currently, Lightning relies mainly on logging to communicate trainer runtime information to users. However, logging is not ideal for large-scale deployments for the following reasons:

Pitch

Alternatives

Additional context

A similar prior feature request: https://github.com/PyTorchLightning/pytorch-lightning/issues/8186



ananthsub commented 3 years ago

I think this is a great proposal! Making the trainer easier to debug will tremendously benefit both users and developers.

Regarding the event & event handlers, such a design was also employed for torchelastic here: https://pytorch.org/elastic/0.2.0rc1/events.html (since upstreamed to torch.distributed.elastic)

While Lightning cannot assume the usage of torchelastic, this framing can be beneficial for how we design such a system in Lightning, especially since torchelastic is a mature system.

The idea of defining events (with names, types/priority levels, and metadata), event handlers that subscribe to these events, and a registration system like fsspec's to easily enable custom backends that don't belong in the core is very compelling.
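A sketch of the fsspec-style registration idea described above, so custom handler backends can be registered from user code without living in core Lightning. Everything here (`EventHandler`, `register_handler`, `get_handler`) is illustrative, not an actual Lightning API:

```python
# Hypothetical sketch of an fsspec-style registry for event handlers.
from typing import Callable, Dict

_HANDLER_REGISTRY: Dict[str, Callable[[], "EventHandler"]] = {}


class EventHandler:
    """Base class: subscribers override handle() to consume events."""

    def handle(self, event: dict) -> None:
        raise NotImplementedError


def register_handler(name: str, factory: Callable[[], EventHandler]) -> None:
    # Like fsspec's protocol registry: map a short name to a handler factory.
    _HANDLER_REGISTRY[name] = factory


def get_handler(name: str) -> EventHandler:
    return _HANDLER_REGISTRY[name]()


# A custom backend defined and registered entirely outside "core":
class StdoutHandler(EventHandler):
    def handle(self, event: dict) -> None:
        print(f"[{event.get('severity', 'info')}] {event['name']}")


register_handler("stdout", StdoutHandler)
```

Users could then select a backend by name in configuration, the same way fsspec resolves `s3://` or `gs://` protocols to filesystem implementations.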

From the end user POV, I imagine we could do something like change the debug level and immediately have the full trace of execution across different ranks. This will make debugging issues like collective hangs so much easier, since we can trace through a logfile and see which rank was out of sync, and when exactly it failed. This would definitely establish a stronger foundation than what we have today.
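One possible shape for that per-rank trace: a handler that appends every event to a rank-tagged JSONL file, so after a collective hang the files can be diffed to find the rank that fell out of sync. This is a hypothetical sketch; it only assumes the `RANK` environment variable that launchers like torchrun set:

```python
# Hypothetical per-rank trace handler: every event is appended to a
# rank-tagged logfile for post-mortem comparison across ranks.
import json
import os
import time


class RankTraceHandler:
    def __init__(self, trace_dir: str = "."):
        # torchrun and similar launchers export RANK; default to 0 locally.
        self.rank = int(os.environ.get("RANK", 0))
        self.path = os.path.join(trace_dir, f"events_rank{self.rank}.jsonl")

    def handle(self, event_name: str, **metadata) -> None:
        record = {"rank": self.rank, "time": time.time(),
                  "name": event_name, **metadata}
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
```

Comparing the tail of `events_rank0.jsonl` against `events_rank1.jsonl` would show which collective call the lagging rank never reached.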

Internally in the trainer, there is this abstraction called InternalDebugger: https://github.com/PyTorchLightning/pytorch-lightning/blob/e0605472306d6b95bf2616ab88f8c29f4498402e/pytorch_lightning/utilities/debugging.py#L43

which, coincidentally, defines a similar notion of events. However, it does not support event handlers, as all the events accumulate on the trainer. This forfeits a lot of benefits, like being able to record events from the LightningModule, DataModule, callbacks, or other components.

One thing that will come up is when to define an event handler versus when to use a Lightning Logger: do you see these as sitting at the same level? Could loggers be extended to serve as these event handlers? Or do you see these as completely separate?
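One possible answer to the question above: an adapter that lets any existing Lightning Logger act as an event handler by forwarding numeric event metadata through `log_metrics(metrics, step)`, the method Lightning's logger base class defines. The adapter itself (`LoggerEventHandler`) is a hypothetical sketch and is duck-typed so it makes no further assumptions about logger internals:

```python
# Hypothetical adapter: treat any object exposing log_metrics(metrics, step)
# (as Lightning loggers do) as a consumer of runtime events.
from typing import Any, Dict, Optional


class LoggerEventHandler:
    def __init__(self, logger: Any):
        self.logger = logger  # e.g. TensorBoardLogger, or any compatible object

    def handle(self, name: str, metadata: Dict[str, Any],
               step: Optional[int] = None) -> None:
        # Forward only numeric metadata, namespaced under the event name;
        # non-numeric fields are dropped since log_metrics expects numbers.
        metrics = {f"{name}/{k}": v for k, v in metadata.items()
                   if isinstance(v, (int, float))}
        if metrics:
            self.logger.log_metrics(metrics, step)
```

Under this framing, loggers become one possible event backend rather than a separate, parallel mechanism.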

cc @carmocca, since I think we'd discussed this a while back as well. cc @awaelchli, since we were discussing the internal debugger and removing it from the trainer.

carmocca commented 3 years ago

Some potential requirements for the events could be:

discussing the internal debugger and removing it from the trainer

IMO the InternalDebugger should be fully removed. It's just there to avoid writing better tests / better structured code :P

ananthsub commented 3 years ago

IMO the InternalDebugger should be fully removed. It's just there to avoid writing better tests / better structured code :P

+1

tchaton commented 3 years ago

@carmocca Yes, the internal debugger should be fully removed. It caused more pain than it helped :)

edward-io commented 3 years ago

Quick update on this: I'm working on an internal prototype and will post a follow-up with a more detailed design.

yifuwang commented 3 years ago

cc @daniellepintz: the exception collection feature you are looking into should most likely use the same plugin mechanism described in this issue.

tchaton commented 3 years ago

Dear @edward-io,

Could you sync with @carmocca about your design (maybe a quick call)? I would like to make sure we start to collaborate with proper pairing on both sides.

Best, T.C

edward-io commented 3 years ago

@tchaton, sure, I will reach out to @carmocca when I have something to share :)

edward-io commented 3 years ago

Hi all, here's the proposal doc for events: https://docs.google.com/document/d/1DwmkGODQRgx0-QQL5HsO_okHI3iRjl1dallyCy85QmE/edit#heading=h.6cp7sicq2fv. Please share your thoughts on the doc :) Thanks!