Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

Automated benchmarking and reporting #183

Open · riccardofelluga opened this issue 2 months ago

riccardofelluga commented 2 months ago

🚀 Feature

I would like to have automated benchmarks for selected models to allow for performance tracking.

Work items

Automated benchmarking in this context means two things:

- The benchmarking suite should make it easy for developers to write benchmark scripts and report metrics.
- The suite should be runnable from CI and produce a summary of the benchmark results. As output, I would like an easy-to-read summary highlighting differences in metrics such as iteration time and memory usage (a rough sketch follows below).
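For illustration, a minimal sketch of the kind of summary step I have in mind, assuming the suite exports pytest-benchmark-style JSON (the script, file names, and threshold below are hypothetical, not existing code):

```python
# Hypothetical summary script: compare mean iteration times between two
# pytest-benchmark JSON files (e.g. a baseline and the current CI run).
import json
import sys


def load_means(path):
    """Map benchmark name -> mean iteration time from a pytest-benchmark JSON file."""
    with open(path) as f:
        data = json.load(f)
    return {b["name"]: b["stats"]["mean"] for b in data["benchmarks"]}


def main(baseline_path, current_path, threshold=0.05):
    baseline = load_means(baseline_path)
    current = load_means(current_path)
    for name in sorted(baseline.keys() & current.keys()):
        old, new = baseline[name], current[name]
        change = (new - old) / old
        # Flag benchmarks whose mean time moved by more than the threshold.
        marker = "!!" if abs(change) > threshold else "  "
        print(f"{marker} {name}: {old:.4f}s -> {new:.4f}s ({change:+.1%})")


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```

A CI job could run something like `python summarize.py baseline.json current.json` after the benchmark step and post the output as a job summary or comment. Memory usage would need a separate collection step, since pytest-benchmark only records timing statistics by default.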

cc @crcrpar

IvanYashchuk commented 2 months ago

Automation is always interesting. Could you please expand on what you mean by "automated" specifically? What are the manual steps that you'd like to see automated?

riccardofelluga commented 2 months ago

Sure! I've updated the description with more info.

xwang233 commented 2 months ago

How is this different from running

    pytest thunder/benchmarks/targets.py
    python thunder/benchmarks/distributed.py

which we already run nightly, with the benchmark data collected?
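(For context, a hedged sketch: assuming targets.py uses the pytest-benchmark fixture, the nightly run could persist its results to JSON via pytest-benchmark's `--benchmark-json` flag, so they can later be diffed against a baseline; the output filename is hypothetical.)

```python
# Sketch: run the existing benchmark suite and export results to JSON.
# Assumes targets.py uses the pytest-benchmark fixture; "nightly.json" is
# a hypothetical output path.
import subprocess

subprocess.run(
    [
        "pytest",
        "thunder/benchmarks/targets.py",
        "--benchmark-json=nightly.json",  # pytest-benchmark's JSON export flag
    ],
    check=True,
)
```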

cc @crcrpar @tfogal

IvanYashchuk commented 1 month ago

@riccardofelluga, can you answer Xiao's question above? Detailed information on what you have in mind and what you would like to achieve would be very helpful here.

riccardofelluga commented 1 month ago

@IvanYashchuk We are still in the pre-design-review phase, so the ideas around this issue are being consolidated; I would have preferred to reply once there is a more concrete idea/proposal. In the meantime, sorry for the late reply @xwang233. In the current state, some benchmarks are not being reported yet, so one objective of this issue is to add those. Another objective is precisely to explore what is and is not being benchmarked, and then take action based on that. I will add more details once the OKR is sorted out.