yifuwang opened 3 years ago
I think this is a great proposal! Making the trainer easier to debug will tremendously benefit both users and developers.
Regarding the event & event handlers, such a design was also employed for torchelastic here: https://pytorch.org/elastic/0.2.0rc1/events.html (since upstreamed to torch.distributed.elastic)
While Lightning cannot assume the usage of torchelastic, this framing can be beneficial for how we design such a system in Lightning, especially since torchelastic is a mature system.
The idea of defining events (with names, types/priority levels, and metadata), event handlers that subscribe to these events, and a registration system like fsspec's to easily enable custom backends that don't belong in the core is very compelling.
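As a rough illustration of that shape (all names here are hypothetical, not an existing Lightning or torchelastic API), the event / handler / registry trio could look something like:

```python
# Hypothetical sketch of the event / handler / registry idea; none of
# these names are existing Lightning or torchelastic APIs.
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    name: str
    priority: int = 0                      # e.g. DEBUG=0 ... CRITICAL=4
    metadata: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)


# Registry keyed by event name, in the spirit of fsspec's
# register_implementation() mechanism for custom backends.
_HANDLERS: Dict[str, List[Callable[[Event], None]]] = {}


def register_handler(event_name: str, handler: Callable[[Event], None]) -> None:
    _HANDLERS.setdefault(event_name, []).append(handler)


def emit(event: Event) -> None:
    for handler in _HANDLERS.get(event.name, []):
        handler(event)


# A custom backend that doesn't live in core could just be a handler:
collected: List[Event] = []
register_handler("train_batch_end", collected.append)
emit(Event("train_batch_end", metadata={"rank": 0, "batch_idx": 7}))
```

The key property is that subscribers (analytics backends, debuggers, loggers) stay out of the core trainer and are attached purely through registration.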
From the end user POV, I imagine we could do something like change the debug level and immediately have the full trace of execution across different ranks. This will make debugging issues like collective hangs so much easier, since we can trace through a logfile and see which rank was out of sync, and when exactly it failed. This would definitely establish a stronger foundation than what we have today.
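To make that concrete (again with hypothetical names, not a real Lightning API): if each rank emitted structured event lines tagged with rank and timestamp, a post-mortem on a collective hang could be as simple as comparing the last event seen per rank:

```python
# Hypothetical sketch: one JSON log line per trainer event, tagged with
# rank and timestamp, so per-rank traces can be compared after a hang.
import json
import time


def emit_event(name: str, rank: int, **metadata) -> str:
    """Serialize one trainer event as a JSON log line."""
    record = {"ts": time.time(), "rank": rank, "event": name, **metadata}
    return json.dumps(record)


# Rank 0 reached the all-reduce for step 42; rank 1 is still at batch
# start -- the last event per rank shows exactly who fell out of sync.
trace = [
    emit_event("all_reduce_enter", rank=0, step=42),
    emit_event("batch_start", rank=1, step=42),
]
last_event_by_rank = {}
for line in trace:
    rec = json.loads(line)
    last_event_by_rank[rec["rank"]] = rec["event"]
```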
Internally in the trainer, there is an abstraction called InternalDebugger: https://github.com/PyTorchLightning/pytorch-lightning/blob/e0605472306d6b95bf2616ab88f8c29f4498402e/pytorch_lightning/utilities/debugging.py#L43
which, coincidentally, defines a similar notion of events. However, it does not support event handlers, as all the events accumulate on the trainer. This loses out on a lot of benefits, like being able to record events from the lightning module, datamodule, callbacks, or other components.
One thing which will come up is when to define an event handler vs when to use a Lightning Logger - do you see these as sitting on the same level? Could loggers be extended to be these event handlers? Or do you see these as completely separate?
cc @carmocca since I think we'd discussed this a while back as well
cc @awaelchli since we were discussing the internal debugger and removing it from the trainer
Some potential requirements for the events could be:
> discussing the internal debugger and removing it from the trainer
IMO the InternalDebugger should be fully removed. It's just there to avoid writing better tests / better structured code :P
+1
@carmocca Yes, the internal debugger should be fully removed. It caused more pain than it helped :)
Quick update on this, I'm working on an internal prototype and will post a follow up with more detailed design.
cc: @daniellepintz the exception collection feature you are looking into should most likely use the same plugin mechanism as described in the issue.
Dear @edward-io,
Could you sync with @carmocca about your design (maybe a quick call)? I would like to make sure we start to collaborate with proper pairing on both sides.
Best, T.C
@tchaton, sure, will reach out to @carmocca when I have something to share :)
Hi all, here's the proposal doc for events: https://docs.google.com/document/d/1DwmkGODQRgx0-QQL5HsO_okHI3iRjl1dallyCy85QmE/edit#heading=h.6cp7sicq2fv. Please share your thoughts on the doc :) Thanks!
🚀 Feature
Motivation
For large scale Lightning deployments, it is very useful to collect certain types of trainer runtime events into an analytics system, so engineers can monitor the overall health of the deployment and obtain useful information for troubleshooting failed or stuck jobs.
Currently, Lightning mainly relies on logging for communicating trainer runtime information to the users. However, logging is not ideal for large scale deployments for the following reasons:
Pitch
Plugin
which requires users to explicitly pass to Trainer
) which don’t belong in the core trainer directly. This could make Lightning more attractive for larger organizations that rely on shared tooling. fsspec
is an ideal candidate solution: https://filesystem-spec.readthedocs.io/en/latest/developer.html#implementing-a-backend

Alternatives
Additional context
A similar prior feature request: https://github.com/PyTorchLightning/pytorch-lightning/issues/8186
If you enjoy Lightning, check out our other projects! ⚡
Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, finetuning and solving problems with deep learning
Bolts: Pretrained SOTA Deep Learning models, callbacks and more for research and production with PyTorch Lightning and PyTorch
Lightning Transformers: Flexible interface for high performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.