Open jeromeku opened 3 months ago
@jeromeku Are you expecting something like a kernel_dataframe with a callstack column = ["aten:op1", "aten:op", "module name", ...]?
Does the call_stack logic help to achieve something similar to your request? https://github.com/facebookresearch/HolisticTraceAnalysis/blob/main/hta/common/call_stack.py
It should be able to link from the kernel up to the operators (and likely user annotations like profiler.profile).
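For illustration only, a minimal sketch of the kind of dataframe being described here; the column names and call-stack entries are hypothetical, not HTA's actual schema or output:

```python
import pandas as pd

# Hypothetical layout: one row per GPU kernel, with the CPU-side call stack of
# operators and user annotations that led to its launch. Column names are
# assumptions for illustration, not HTA's real schema.
kernel_df = pd.DataFrame(
    [
        {
            "kernel_name": "volta_sgemm_128x64_tn",
            "duration_us": 123.4,
            "callstack": ["my_range", "nn.Linear", "aten::linear", "aten::mm"],
        }
    ]
)
print(kernel_df)
```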
Something like the "Events" view in nsys, where you can see a trace of kernels by time, grouped by nvtx range. See this example from this thread.

Essentially what you see when you do prof.key_averages().table(), except: given a region annotated with record_function('my_range'), I should see a top-level my_range followed by the entire call stack of operators and the kernels they ultimately dispatch to, ordered by time, along with other collected stats.
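For reference, a minimal sketch of what the profiler gives today (toy nn.Linear model, CUDA device assumed); the flat key_averages table lists my_range as a single row rather than nesting the operators and kernels dispatched under it, which is the gap described above:

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(32, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with record_function("my_range"):
        model(x)
    torch.cuda.synchronize()

# Flat per-op summary: "my_range" appears as one row, but the kernels it
# ultimately dispatched are not grouped beneath it.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

# The exported trace retains annotation, operator, and kernel events for
# offline analysis (used in the sketch further below).
prof.export_chrome_trace("trace.json")
```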
Motivation and context
Is it possible to correlate kernel distribution with ranges annotated either through torch.cuda.nvtx or torch.profiler.profile? The use case is model architecture optimization. I'd like to understand where the bottlenecks are in a model's forward / backward passes and where the opportunities are for kernel fusion, CUDA graphs, etc. Exporting a Chrome / TensorBoard trace can be helpful for visualizing such areas when model regions are annotated with torch.profiler.record_function (or nvtx), but it would be helpful to have this information available for further analysis as a dataframe.

Description
It would be useful to have the kernel breakdown by annotation range aggregated into a dataframe to further investigate problematic modules and layers within the model:
- aten / torch ops that dispatched these kernels
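As a rough sketch (under stated assumptions, not an HTA API), the requested per-annotation kernel breakdown could be approximated by walking the Chrome trace exported in the earlier example and attributing each GPU kernel to the user annotation that encloses its launch; the event categories ("user_annotation", "cuda_runtime", "kernel") and the "correlation" field match recent PyTorch/Kineto exports but may differ across versions:

```python
import json

import pandas as pd

with open("trace.json") as f:  # produced by prof.export_chrome_trace above
    events = json.load(f)["traceEvents"]

# Event categories below are assumptions about the Kineto export format.
annotations = [e for e in events if e.get("cat") == "user_annotation"]
launches = {
    e["args"]["correlation"]: e
    for e in events
    if e.get("cat") == "cuda_runtime" and "correlation" in e.get("args", {})
}
kernels = [e for e in events if e.get("cat") == "kernel"]


def enclosing_annotation(launch):
    """Find the record_function range whose CPU-side interval contains this launch."""
    for a in annotations:
        if (
            a.get("pid") == launch.get("pid")
            and a.get("tid") == launch.get("tid")
            and a["ts"] <= launch["ts"] <= a["ts"] + a.get("dur", 0)
        ):
            return a["name"]
    return "<unannotated>"


rows = []
for k in kernels:
    launch = launches.get(k.get("args", {}).get("correlation"))
    rows.append(
        {
            "annotation": enclosing_annotation(launch) if launch else "<unknown>",
            "kernel_name": k["name"],
            "duration_us": k.get("dur", 0),
        }
    )

df = pd.DataFrame(rows)
# Kernel time aggregated per annotation range: the dataframe requested above.
print(df.groupby(["annotation", "kernel_name"])["duration_us"].sum())
```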
Alternatives
No response
Additional context
No response