Runtime events with support for custom handlers

yifuwang commented 3 years ago

🚀 Feature

Motivation

For large-scale Lightning deployments, it is very useful to collect certain types of trainer runtime events into an analytics system, so that engineers can monitor the overall health of the deployment and obtain useful information for troubleshooting failed or stuck jobs.

Currently, Lightning relies mainly on logging to communicate trainer runtime information to users. However, logging is not ideal for large-scale deployments for the following reasons:

- Many training jobs in large-scale deployments are automated, long-running batch jobs. It is impractical to expect users to notice certain warnings (e.g. an API deprecation warning) in a timely fashion.
- The volume of events that is desirable for debugging may be too verbose for logging.
- Logging is generally less structured.

Pitch

- Introduce the concept of a runtime event in Lightning.
- Emit runtime events at appropriate places (e.g. deprecated API usage).
- Provide a mechanism to register custom handlers for runtime events.
  - Ideally, we'd like a mechanism to transparently register default handlers that don't belong in the core trainer (unlike a Plugin, which users must explicitly pass to the Trainer). This could make Lightning more attractive to larger organizations that rely on shared tooling.
  - The backend registration mechanism of fsspec is an ideal candidate solution: https://filesystem-spec.readthedocs.io/en/latest/developer.html#implementing-a-backend

Alternatives

Additional context

A similar prior feature request: https://github.com/PyTorchLightning/pytorch-lightning/issues/8186


ananthsub commented 3 years ago

I think this is a great proposal! Making the trainer easier to debug will tremendously benefit both users and developers.

Regarding the event & event handlers, such a design was also employed for torchelastic here: https://pytorch.org/elastic/0.2.0rc1/events.html (since upstreamed to torch.distributed.elastic)

While Lightning cannot assume the usage of torchelastic, this framing can be beneficial for how we design such a system in Lightning, especially since torchelastic is a mature system.

The idea of defining events (with names, types/priority levels, and metadata), event handlers that subscribe to these events, and a registration system like fsspec's to easily enable custom backends that don't belong in the core is very compelling.

From the end user POV, I imagine we could do something like change the debug level and immediately have the full trace of execution across different ranks. This will make debugging issues like collective hangs so much easier, since we can trace through a logfile and see which rank was out of sync, and when exactly it failed. This would definitely establish a stronger foundation than what we have today.
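As a rough illustration of the debugging workflow described above, the sketch below takes a merged trace of `(rank, step)` records and reports which ranks stopped advancing. The trace format and the function names are assumptions made for this example only; a real event stream would carry more fields.

```python
from typing import Dict, Iterable, List, Tuple


def last_step_per_rank(events: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Return the highest step recorded for each rank."""
    last: Dict[int, int] = {}
    for rank, step in events:
        last[rank] = max(step, last.get(rank, step))
    return last


def lagging_ranks(events: Iterable[Tuple[int, int]]) -> List[int]:
    """Return the ranks whose last recorded step is behind the furthest rank."""
    last = last_step_per_rank(events)
    if not last:
        return []
    front = max(last.values())
    return sorted(rank for rank, step in last.items() if step < front)


# Hypothetical trace: rank 1 stops reporting after step 1.
trace = [(0, 1), (1, 1), (2, 1), (0, 2), (2, 2), (0, 3), (2, 3)]
stalled = lagging_ranks(trace)
```

Here `stalled` identifies rank 1 as the one that fell out of sync, which is the kind of answer one would want when diagnosing a collective hang.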

Internally, the trainer has an abstraction called InternalDebugger: https://github.com/PyTorchLightning/pytorch-lightning/blob/e0605472306d6b95bf2616ab88f8c29f4498402e/pytorch_lightning/utilities/debugging.py#L43

Coincidentally, it defines a similar notion of events. However, it does not support event handlers, as all the events accumulate on the trainer. This loses out on a lot of benefits, like being able to record events from the lightning module, datamodule, callbacks, or other components.

One thing which will come up is when to define an event handler vs when to use a Lightning Logger - do you see these as sitting on the same level? Could loggers be extended to be these event handlers? Or do you see these as completely separate?

cc @carmocca since I think we'd discussed this a while back as well
cc @awaelchli since we were discussing the internal debugger and removing it from the trainer

carmocca commented 3 years ago

Some potential requirements for the events could be:

- Supporting structured data: dataclasses/dictionaries/named tuples
- Supporting text data: human-readable messages
- Supporting handlers that write to files: JSON, plain text
- Integrating with profilers
- Supporting logging any byte data (artifacts): either the events are able to serialize these, or there's a mechanism to link saved data to an event
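A few of these requirements can be sketched together: a dataclass event carrying both a human-readable message and a structured payload, a handler that writes one JSON object per line, and an optional path that links an event to a separately saved artifact. All names (`Event`, `JsonLinesHandler`, `artifact_path`) are hypothetical and only illustrate one possible shape.

```python
import io
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional, TextIO


@dataclass
class Event:
    """Hypothetical event with text, structured data, and an artifact link."""
    name: str
    message: str = ""                      # human-readable text
    payload: Dict[str, Any] = field(default_factory=dict)  # structured data
    artifact_path: Optional[str] = None    # link to saved byte data, if any


class JsonLinesHandler:
    """Write each event as one JSON object per line to a text stream."""

    def __init__(self, stream: TextIO) -> None:
        self._stream = stream

    def __call__(self, event: Event) -> None:
        self._stream.write(json.dumps(asdict(event), sort_keys=True) + "\n")


# Usage: write to an in-memory buffer; a real handler would open a file.
buffer = io.StringIO()
handler = JsonLinesHandler(buffer)
handler(Event("checkpoint_saved", "Saved checkpoint", {"epoch": 3},
              artifact_path="checkpoints/epoch=3.ckpt"))
record = json.loads(buffer.getvalue())
```

Linking artifacts by path keeps the event itself small and JSON-serializable, at the cost of the handler not owning the artifact's lifetime.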

discussing the internal debugger and removing it from the trainer

IMO the InternalDebugger should be fully removed. It's just there to avoid writing better tests / better structured code :P

ananthsub commented 3 years ago

IMO the InternalDebugger should be fully removed. It's just there to avoid writing better tests / better structured code :P

+1

tchaton commented 3 years ago

@carmocca Yes, the internal debugger should be fully removed. It caused more pain than it helped :)

edward-io commented 3 years ago

Quick update on this: I'm working on an internal prototype and will post a follow-up with a more detailed design.

yifuwang commented 3 years ago

cc: @daniellepintz the exception collection feature you are looking into should most likely use the same plugin mechanism as described in the issue.

tchaton commented 3 years ago

Dear @edward-io,

Could you sync with @carmocca about your design (maybe a quick call)? I would like to make sure we start collaborating with proper pairing on both sides.

Best, T.C

edward-io commented 3 years ago

@tchaton, sure, I will reach out to @carmocca when I have something to share :)

edward-io commented 3 years ago

Hi all, here's the proposal doc for events: https://docs.google.com/document/d/1DwmkGODQRgx0-QQL5HsO_okHI3iRjl1dallyCy85QmE/edit#heading=h.6cp7sicq2fv. Please share your thoughts on the doc :) Thanks!
