Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

[RFC] Profiler Metrics #9041

Open tchaton opened 3 years ago

tchaton commented 3 years ago

🚀 Feature

Lightning profilers generate summaries that are important for analysing code execution and finding bottlenecks. However, it would be useful to expose these measurements as metrics, so that users can act on execution speed, for example by logging it.

Motivation

Provide an interface for profilers to share their metrics with the LoggerConnector.

Pitch

Alternatives

Additional context



ananthsub commented 3 years ago

+1 - This would be very useful. It came up in our overview here: https://docs.google.com/document/d/1xHU7-iQSpp9KJTjI3As2EM0mfNHHr37WZYpDpwLkivA/edit#heading=h.thyk5srjrhp7 / #7740

I think there are 2 options:

Pros for option 1:

Cons:

Pros for option 2:

Cons:

@tchaton - another option: if we're specifically looking to calculate latencies, we could have a Timer alongside the profiler and push the timer data to the loggers. I think the timer would have the exact same API as the profiler, but with a restricted set of what's actually calculated and returned. I wonder how we could fold this in. Here's a closely related issue: https://github.com/PyTorchLightning/pytorch-lightning/issues/8817
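To make the Timer idea concrete, here is a minimal sketch in plain Python. It is only an illustration under assumptions: the `Timer` class, its `metrics()` method, the `timer/..._mean_s` key format, and the `DictLogger` stand-in are all hypothetical names, not existing Lightning APIs. The `start`/`stop`/`profile` methods are modeled loosely on the profiler interface, and the logger is assumed to expose only `log_metrics(metrics, step)`.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Timer:
    """Hypothetical timer with a profiler-like start/stop/profile API.

    It records only wall-clock durations, i.e. a restricted subset of
    what a full profiler would compute.
    """

    def __init__(self):
        self._starts = {}
        self.durations = defaultdict(list)

    def start(self, action_name):
        self._starts[action_name] = time.perf_counter()

    def stop(self, action_name):
        started = self._starts.pop(action_name, None)
        if started is None:
            raise ValueError(f"stop() called for {action_name!r} without start()")
        self.durations[action_name].append(time.perf_counter() - started)

    @contextmanager
    def profile(self, action_name):
        self.start(action_name)
        try:
            yield action_name
        finally:
            self.stop(action_name)

    def metrics(self):
        """Return the mean latency per action as a flat dict for a logger."""
        return {
            f"timer/{name}_mean_s": sum(values) / len(values)
            for name, values in self.durations.items()
            if values
        }


class DictLogger:
    """Stand-in logger; assumes only a log_metrics(metrics, step) method."""

    def __init__(self):
        self.history = []

    def log_metrics(self, metrics, step=None):
        self.history.append((step, dict(metrics)))


timer = Timer()
logger = DictLogger()
for step in range(3):
    with timer.profile("training_step"):
        sum(range(1000))  # placeholder for real work
    # Push the timer data to the logger after every step.
    logger.log_metrics(timer.metrics(), step=step)
```

Keeping the timer's surface identical to the profiler's would let the same call sites drive either object; how the two would actually be wired together is the open question in this thread.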

tchaton commented 3 years ago

Yes, @ananthsub.

Thanks for describing your thoughts there.

I believe solution 2 would be more scalable in the future. We could add support for SimpleProfiler and/or AdvancedProfiler first.

Best, T.C

kaushikb11 commented 3 years ago

@ananthsub As we will initially support only SimpleProfiler for logging profiler metrics, I'm not really a fan of changing the SimpleProfiler interface to support it. We would also need to connect the logger to the profiler, and making that specific to SimpleProfiler would turn ugly.

As a user, I would implement logging of profiler metrics like this: https://github.com/PyTorchLightning/pytorch-lightning/commit/568d18960e4e7fb68813f079e3442bd5266a61c3. It's a first simple POC. It could be easily configurable, and everyone is familiar with the callback interface.
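For readers who want a rough picture of the callback approach, the following is a sketch under stated assumptions and is not the code from the linked commit. `ProfilerMetricsCallback`, `FakeProfiler`, `FakeLogger`, and `FakeTrainer` are hypothetical stand-ins written in plain Python so the example runs without Lightning installed. It assumes the profiler keeps a `recorded_durations` mapping from action name to a list of durations, and that the trainer exposes `profiler`, `logger`, and `global_step`.

```python
from collections import defaultdict


class FakeProfiler:
    """Stand-in profiler; assumes durations are stored per action name."""

    def __init__(self):
        self.recorded_durations = defaultdict(list)


class FakeLogger:
    """Stand-in logger; assumes only a log_metrics(metrics, step) method."""

    def __init__(self):
        self.history = []

    def log_metrics(self, metrics, step=None):
        self.history.append((step, dict(metrics)))


class FakeTrainer:
    """Stand-in trainer holding the objects the callback reads from."""

    def __init__(self, profiler, logger):
        self.profiler = profiler
        self.logger = logger
        self.global_step = 0


class ProfilerMetricsCallback:
    """Hypothetical callback that forwards profiler durations to the logger."""

    def __init__(self, actions=None, prefix="profiler"):
        self.actions = set(actions) if actions is not None else None
        self.prefix = prefix

    def on_train_epoch_end(self, trainer, pl_module=None):
        durations = getattr(trainer.profiler, "recorded_durations", None)
        if not durations:
            return  # profiler does not expose durations, or nothing was recorded
        metrics = {}
        for name, values in durations.items():
            if self.actions is not None and name not in self.actions:
                continue
            if values:
                metrics[f"{self.prefix}/{name}_mean"] = sum(values) / len(values)
                metrics[f"{self.prefix}/{name}_total"] = sum(values)
        if metrics:
            trainer.logger.log_metrics(metrics, step=trainer.global_step)


profiler = FakeProfiler()
profiler.recorded_durations["training_step"] += [0.5, 1.5]
profiler.recorded_durations["validation_step"] += [2.0]

trainer = FakeTrainer(profiler, FakeLogger())
trainer.global_step = 10

# Only log the actions the user asked for.
callback = ProfilerMetricsCallback(actions=["training_step"])
callback.on_train_epoch_end(trainer)
```

The `actions` filter is one way the callback could be made configurable; because the callback only reads from the profiler, the profiler interface itself stays unchanged.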

Wdyt?