[RFC] Profiler Metrics

tchaton commented 3 years ago

🚀 Feature

Lightning profilers generate summaries, which are important for analysing code execution and finding bottlenecks. However, it might be useful to expose these metrics to users, so that they can act on execution speed, for example by logging it.

Motivation

Provide an interface for the Profiler to share their metrics with the LoggerConnector.

Pitch

Alternatives

Additional context

ananthsub commented 3 years ago

+1 - This would be very useful. It came up in our overview here: https://docs.google.com/document/d/1xHU7-iQSpp9KJTjI3As2EM0mfNHHr37WZYpDpwLkivA/edit#heading=h.thyk5srjrhp7 / #7740

I think there are 2 options:

Option 1: The profiler stores a list of records/events after profile is yielded, and these records can be fetched by callers
Option 2: The profiler accepts a Logger and uses it to call log_metrics inside of profile

Pros for option 1:

It keeps the profiler self-contained: there is no dependence on external components

Cons:

Requires additional orchestration in the trainer to fetch profiler records & push them to the logger connector
Unclear what the buffering policy is or when records are cleared. Is there a memory issue?
What should be the schema for these profiler records? Dict[str, float] as the payload?
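To make option 1 concrete, here is a minimal sketch. It assumes a hypothetical RecordingProfiler class and a hypothetical fetch_records method, neither of which is existing Lightning API, and it uses Dict[str, float] as the record payload purely as one possible answer to the schema question above. Clearing the buffer on fetch is likewise just one possible buffering policy.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List


class RecordingProfiler:
    """Hypothetical stand-in for a profiler; not the Lightning API."""

    def __init__(self) -> None:
        self._records: List[Dict[str, float]] = []

    @contextmanager
    def profile(self, action_name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Buffer one record once the profiled block has finished.
            self._records.append({action_name: time.perf_counter() - start})

    def fetch_records(self) -> List[Dict[str, float]]:
        # Assumed policy: hand the buffer to the caller and start a fresh one,
        # so memory does not grow as long as someone keeps fetching.
        records, self._records = self._records, []
        return records


profiler = RecordingProfiler()
with profiler.profile("training_step"):
    sum(range(1000))

# The trainer (or any other caller) would fetch and forward these records.
records = profiler.fetch_records()
```

The profiler stays self-contained here, but something else still has to call fetch_records and push the result to the logger connector, which is the orchestration cost listed above.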

Pros for option 2:

Cons:

We have to plumb data for logger APIs through the profiler (e.g. what step to log for the metrics)
Requires attaching the logger & profiler in the trainer, so some orchestration is still required
Unclear if all loggers work with all profilers
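A rough sketch of option 2, for comparison. ListLogger and LoggingProfiler are hypothetical names used only for illustration. The only thing assumed of the logger is a log_metrics(metrics, step) method, and the step attribute on the profiler is an assumption that stands in for the plumbing mentioned in the first con.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List, Optional, Tuple


class ListLogger:
    """Hypothetical logger that just remembers what it was asked to log."""

    def __init__(self) -> None:
        self.logged: List[Tuple[Dict[str, float], Optional[int]]] = []

    def log_metrics(self, metrics: Dict[str, float], step: Optional[int] = None) -> None:
        self.logged.append((dict(metrics), step))


class LoggingProfiler:
    """Hypothetical profiler that pushes each duration to a logger."""

    def __init__(self, logger: ListLogger) -> None:
        self.logger = logger
        # Assumption: someone (e.g. the trainer) keeps this in sync.
        self.step = 0

    @contextmanager
    def profile(self, action_name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            # Log directly from inside profile, as described in option 2.
            self.logger.log_metrics({f"profiler/{action_name}": elapsed}, step=self.step)


logger = ListLogger()
profiler = LoggingProfiler(logger)
profiler.step = 3
with profiler.profile("training_step"):
    sum(range(1000))
```

Nothing is buffered in this variant, but the profiler now needs to know about the logger and the current step.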

@tchaton - another option is if we're specifically looking to calculate latencies, we could have a Timer alongside the profiler, and push the timer data to the loggers. I think the timer would have the exact same API as the profiler, but with a restricted set of what's actually calculated/returned. I wonder how we could fold this in. Here's a very related issue: https://github.com/PyTorchLightning/pytorch-lightning/issues/8817

tchaton commented 3 years ago

Yes, @ananthsub.

Thanks for describing your thoughts there.

I believe solution 2 would be more scalable in the future. We could add support for SimpleProfiler &/or AdvancedProfiler first.

Best, T.C

kaushikb11 commented 3 years ago

@ananthsub Since we will initially only support SimpleProfiler for logging profiler metrics, I'm not really a fan of changing the SimpleProfiler interface to support it. We would also need to connect the logger to the profiler, and making that specific to SimpleProfiler would turn ugly.

As a user, I would implement logging of profiler metrics like this: https://github.com/PyTorchLightning/pytorch-lightning/commit/568d18960e4e7fb68813f079e3442bd5266a61c3. It is a first simple POC. It could be easily configurable, and everyone is familiar with the callback interface.
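For readers who do not want to open the commit, the general shape of a callback-based approach might look like the sketch below. This is not the code from the linked commit. ProfilerMetricsCallback, FakeLogger, and the log_every_n_steps parameter are hypothetical, the hook signature is simplified compared to a real Lightning callback, and the trainer and profiler are plain stand-in objects. The sketch assumes the profiler exposes a mapping from action name to a list of durations.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Dict, List, Optional, Tuple


class FakeLogger:
    """Stand-in logger that remembers what it was asked to log."""

    def __init__(self) -> None:
        self.logged: List[Tuple[Dict[str, float], Optional[int]]] = []

    def log_metrics(self, metrics: Dict[str, float], step: Optional[int] = None) -> None:
        self.logged.append((dict(metrics), step))


class ProfilerMetricsCallback:
    """Hypothetical callback; hook signature is simplified for illustration."""

    def __init__(self, log_every_n_steps: int = 1) -> None:
        self.log_every_n_steps = log_every_n_steps

    def on_train_batch_end(self, trainer, batch_idx: int) -> None:
        if (batch_idx + 1) % self.log_every_n_steps != 0:
            return
        # Assumption: the profiler maps action name -> list of durations.
        durations = trainer.profiler.recorded_durations
        metrics = {
            f"profiler/{name}": sum(values) / len(values)
            for name, values in durations.items()
            if values
        }
        trainer.logger.log_metrics(metrics, step=batch_idx)


# Stand-in trainer wiring, only to show the data flow.
profiler = SimpleNamespace(recorded_durations=defaultdict(list))
profiler.recorded_durations["training_step"] += [0.25, 0.75]
logger = FakeLogger()
trainer = SimpleNamespace(profiler=profiler, logger=logger)

ProfilerMetricsCallback().on_train_batch_end(trainer, batch_idx=0)
```

With this shape the profiler interface is left unchanged, and what gets logged (here the mean duration per action) is a decision made in the callback.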

Wdyt?

Lightning-AI / pytorch-lightning

[RFC] Profiler Metrics #9041