Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

[Feature request] Optional debugging option to get trace with information on tensor strides along with tensor shapes #614

Open parthmannan opened 1 week ago

parthmannan commented 1 week ago

🚀 Feature Request

Currently, computation traces include the generated tensor shapes as comments next to each computation, like:

t908 = torch.nn.functional.linear(t907, t19, t17)  # t908: "cuda:0 bf16[1, 2048, 4096]"
    # t908 = ltorch.linear(t907, t19, t17)  # t908: "cuda:0 bf16[1, 2048, 4096]"
      # t908 = prims.linear(t907, t19, t17)  # t908: "cuda:0 bf16[1, 2048, 4096]"

However, there are situations where stride information becomes necessary for debugging, such as #583, where a stride difference was causing an illegal memory access in one of the executors. My understanding is that Thunder has consciously decided not to include stride information, in order to let backends manage strides on their own and to avoid constraining constructed traces with stride requirements. This feature request does not require changing that.

Given a set of fixed input tensors, could there be a way of generating computation traces with both tensor shapes and strides? It could be a requirement that such a trace is only generated after one full iteration has executed (so that strides can be recorded), or only after a failed execution.
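To make the request concrete, here is a minimal sketch (hypothetical helper names, not Thunder API) of the extra information a stride-annotated trace comment would carry. For a contiguous tensor the strides can be derived from the shape alone; non-contiguous outputs are exactly the cases where recording the real runtime stride matters:

```python
def contiguous_strides(shape):
    """Row-major (C-contiguous) strides, in elements, for a given shape."""
    strides = []
    acc = 1
    for dim in reversed(shape):
        strides.append(acc)
        acc *= dim
    return tuple(reversed(strides))


def annotate(name, device, dtype, shape, strides=None):
    """Format a trace comment like Thunder's, extended with strides.

    If no runtime strides were recorded, assume contiguity.
    """
    strides = strides if strides is not None else contiguous_strides(shape)
    return f'{name}: "{device} {dtype}{list(shape)} strides={list(strides)}"'


# The t908 line from the trace above, annotated with (assumed contiguous) strides:
print(annotate("t908", "cuda:0", "bf16", (1, 2048, 4096)))
```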

cc @carmocca @apaz-cli

t-vi commented 1 week ago

Hi @parthmannan , thank you for filing this!

Let me pick your brain a bit: Maybe we could have an advanced debugging tutorial where we define a symbol with an impl printing tensor information (such as stride) and a transformation that inserts calling that symbol for every TensorProxy result. Would that help you?
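The transformation being proposed can be sketched in miniature (illustrative data structures and names only, not Thunder's real transform API): walk the instructions of a trace and, after every instruction that produces a tensor result, insert a call to a debug symbol whose impl prints the tensor's shape and strides at runtime:

```python
def insert_debug_prints(trace):
    """Given a trace as a list of dicts like {"op": "linear", "result": "t908"},
    return a new instruction list with a 'debug_print' call inserted after
    every instruction that produces a result."""
    out = []
    for instr in trace:
        out.append(instr)
        if instr.get("result"):
            # The debug symbol's impl would read shape/stride from the
            # concrete tensor when the trace executes.
            out.append({"op": "debug_print", "args": [instr["result"]]})
    return out


# A toy two-instruction trace, before and after the pass:
trace = [
    {"op": "linear", "result": "t908"},
    {"op": "gelu", "result": "t909"},
]
for instr in insert_debug_prints(trace):
    print(instr)
```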

parthmannan commented 1 week ago

That sounds pretty useful and should satisfy the requirement. Would calling this transformation generate a full computation trace for every TensorProxy result with the required tensor information?

And I am guessing this is far more useful than just having a trace with strides, as one can define a symbol that prints any attribute of a tensor, like `requires_grad`, etc.?

(Side comment) Re: tutorial - While a tutorial would be a great start, it does require users to modify model execution code. The easiest debugging method is simply enabling debug logs via environment variables, like TORCH_COMPILE_DEBUG or CUBLASLT_LOG_LEVEL. This doesn't need to be an option for something niche like strides, but in general our debugging is a little more complex, with users required to call functions to grab traces, print them out, etc.
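The env-variable style being described could look like the following sketch (THUNDER_DEBUG_STRIDES is a hypothetical variable name; Thunder does not define it): debug output is gated on an environment variable rather than requiring any changes to the model script.

```python
import os


def stride_debug_enabled():
    # Hypothetical flag: any value other than empty or "0" enables logging.
    return os.environ.get("THUNDER_DEBUG_STRIDES", "0") not in ("", "0")


def maybe_log_strides(name, shape, strides):
    """Print a stride-debug line only when the env flag is set."""
    if stride_debug_enabled():
        print(f"[stride-debug] {name}: shape={shape} strides={strides}")


# Example: flip the flag on and log one tensor's info.
os.environ["THUNDER_DEBUG_STRIDES"] = "1"
maybe_log_strides("t908", (1, 2048, 4096), (8388608, 4096, 1))
```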

mruberry commented 4 days ago

triage review: