Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

Provide debugging traces and options as an ENV variable or JIT option #304

Open parthmannan opened 5 months ago

parthmannan commented 5 months ago

🚀 Feature

An environment variable that dumps the various Thunder-provided debug traces to a log file. This could support multiple levels, e.g. export THUNDER_DEBUG=<option>:

0/'' : Disable
1/'trace' : Enable and dump the Thunder-generated trace. Could be limited to the trace after the delete-last-used pass
2/'nvfuser_region' : Enable and dump the nvFuser-captured regions, in addition to 1
3/'nvfuser_code' : Enable and dump the nvFuser-generated CUDA kernel code, in addition to 1 and 2
4/'torch_compile_debug' : Enable torch.compile debug logging (TORCH_COMPILE_DEBUG=1)

This is only a narrow example of the possible debug log levels. Each of these logs could go to a separate log file.
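A minimal sketch of how such a variable might be parsed; the level names mirror the proposal above, and none of this is an existing Thunder API:

```python
import os

# Hypothetical mapping from THUNDER_DEBUG values to numeric debug levels.
# Both the numeric and the string spellings from the proposal are accepted;
# unknown values fall back to 0 (disabled).
_LEVELS = {
    "": 0, "0": 0,
    "trace": 1, "1": 1,
    "nvfuser_region": 2, "2": 2,
    "nvfuser_code": 3, "3": 3,
    "torch_compile_debug": 4, "4": 4,
}


def debug_level(env=None):
    """Resolve the requested debug level from THUNDER_DEBUG."""
    env = os.environ if env is None else env
    return _LEVELS.get(env.get("THUNDER_DEBUG", "").lower(), 0)


def nvfuser_region_dump_enabled(env=None):
    # Each level implies all lower ones, so checks are simple comparisons.
    return debug_level(env) >= 2
```

With this shape, `THUNDER_DEBUG=nvfuser_code` would enable the trace dump, the region dump, and the kernel-code dump at once.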

Motivation

Today, getting the trace and other debugging information requires adding code that captures the trace and prints it after running a model iteration with the inputs.

cc - @mruberry

cc @carmocca @apaz-cli

crcrpar commented 4 months ago

@kshitij12345 do you think https://github.com/Lightning-AI/lightning-thunder/blob/21a222b180009616a4cc48176958b4506894a330/thunder/core/transforms.py#L468 would let us write a simple callback that just saves the given traces?
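Such a trace-saving callback might be sketched as follows; the argument shape, the file-naming scheme, and the directory name are assumptions for illustration, not the actual add_post_optimization_transform signature:

```python
import itertools
import os

# Hypothetical sequence counter so successive traces get distinct file names.
_trace_ids = itertools.count()


def save_trace_callback(trace, log_dir="thunder_debug"):
    """Persist the string form of a trace and return it unchanged.

    Thunder traces print as Python programs, so writing str(trace) to a
    .py file gives a readable dump.
    """
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, f"trace_{next(_trace_ids):03d}.py")
    with open(path, "w") as f:
        f.write(str(trace))
    # A transform is expected to hand the (possibly modified) trace back.
    return trace
```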

kshitij12345 commented 4 months ago

I see three issues with using add_post_optimization_transform

https://github.com/Lightning-AI/lightning-thunder/blob/21a222b180009616a4cc48176958b4506894a330/thunder/__init__.py#L603-L610

Also, using add_post_optimization_transform would still require some changes to training code.