Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

Provide debugging traces and options as an ENV variable or JIT option #304

Open parthmannan opened 5 months ago

parthmannan commented 5 months ago

🚀 Feature

An environment variable that dumps the various Thunder-provided debug traces to a log file. This could support multiple levels, e.g. export THUNDER_DEBUG=<option>:

0/'' : Disable
1/'trace' : Enable and dump the Thunder-generated trace. Could be limited to the trace after the delete-last-used pass
2/'nvfuser_region' : Enable and dump the nvFuser-captured regions, in addition to 1
3/'nvfuser_code' : Enable and dump the nvFuser-generated CUDA kernel code, in addition to 1 and 2
4/'torch_compile_debug' : Enable torch.compile debug logging (TORCH_COMPILE_DEBUG=1)

This is only a narrow example of the possible debug log levels. Each of these logs could go to a separate log file.
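A minimal sketch of how such a variable might be parsed; the level names mirror the proposal above, and none of this is an existing Thunder API:

```python
import os

# Hypothetical mapping from THUNDER_DEBUG values to numeric debug levels.
# Both the numeric and the string spellings from the proposal are accepted;
# unknown values fall back to 0 (disabled).
_LEVELS = {
    "": 0, "0": 0,
    "trace": 1, "1": 1,
    "nvfuser_region": 2, "2": 2,
    "nvfuser_code": 3, "3": 3,
    "torch_compile_debug": 4, "4": 4,
}


def debug_level(env=None):
    """Resolve the requested debug level from THUNDER_DEBUG."""
    env = os.environ if env is None else env
    return _LEVELS.get(env.get("THUNDER_DEBUG", "").lower(), 0)


def nvfuser_region_dump_enabled(env=None):
    # Each level implies all lower ones, so checks are simple comparisons.
    return debug_level(env) >= 2
```

With this shape, `THUNDER_DEBUG=nvfuser_code` would enable the trace dump, the region dump, and the kernel-code dump at once.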

Motivation

Today, getting the trace and other debugging information requires adding code that captures the trace and prints it after running a model iteration with the inputs.

cc - @mruberry

cc @carmocca @apaz-cli

crcrpar commented 4 months ago

@kshitij12345 do you think https://github.com/Lightning-AI/lightning-thunder/blob/21a222b180009616a4cc48176958b4506894a330/thunder/core/transforms.py#L468 would let us write a simple callback that just saves the given traces?
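Such a trace-saving callback might be sketched as follows; the argument shape, the file-naming scheme, and the directory name are assumptions for illustration, not the actual add_post_optimization_transform signature:

```python
import itertools
import os

# Hypothetical sequence counter so successive traces get distinct file names.
_trace_ids = itertools.count()


def save_trace_callback(trace, log_dir="thunder_debug"):
    """Persist the string form of a trace and return it unchanged.

    Thunder traces print as Python programs, so writing str(trace) to a
    .py file gives a readable dump.
    """
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, f"trace_{next(_trace_ids):03d}.py")
    with open(path, "w") as f:
        f.write(str(trace))
    # A transform is expected to hand the (possibly modified) trace back.
    return trace
```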

kshitij12345 commented 4 months ago

I see three issues with using add_post_optimization_transform

https://github.com/Lightning-AI/lightning-thunder/blob/21a222b180009616a4cc48176958b4506894a330/thunder/__init__.py#L603-L610

Also, using add_post_optimization_transform would still require some changes to training code.