Open IvanYashchuk opened 3 days ago
Oh, maybe on Colab with T4 GPU Thunder decides not to use Flash Attention and decomposes the SDPA call...
Right, there is no sdpafx_grad_forward_scaled_dot_product_efficient_attention in the trace. So Colab's estimate makes sense with the decomposition Thunder produces. What can we do to ask Thunder to produce the same trace as would be on H100 when running on T4 or with no GPU at all?
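To illustrate why a decomposed attention trace can report a higher peak than a fused one, here is a toy sketch. It is not Thunder's `get_alloc_memory`: the event format, the function names, and the tensor sizes are all assumptions made for the example. It only tracks allocations and frees and records the largest running total.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical trace format (not Thunder's): ("alloc", name, nbytes) or ("free", name, 0).
Event = Tuple[str, str, int]


def estimate_peak_bytes(events: Iterable[Event]) -> int:
    """Walk the events in order and return the largest number of live bytes."""
    live: Dict[str, int] = {}
    current = 0
    peak = 0
    for kind, name, nbytes in events:
        if kind == "alloc":
            live[name] = nbytes
            current += nbytes
            peak = max(peak, current)
        elif kind == "free":
            current -= live.pop(name)
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return peak


def attention_events(batch: int, heads: int, seq: int, head_dim: int,
                     itemsize: int, fused: bool) -> List[Event]:
    """Toy allocation pattern for one attention call (sizes are illustrative)."""
    qkv = batch * heads * seq * head_dim * itemsize
    events: List[Event] = [("alloc", "q", qkv), ("alloc", "k", qkv), ("alloc", "v", qkv)]
    if not fused:
        # The decomposed path materializes the (seq x seq) score matrices.
        scores = batch * heads * seq * seq * itemsize
        events += [("alloc", "scores", scores),
                   ("alloc", "probs", scores),
                   ("free", "scores", 0)]
    events.append(("alloc", "out", qkv))
    if not fused:
        events.append(("free", "probs", 0))
    return events


fused_peak = estimate_peak_bytes(attention_events(1, 8, 1024, 64, 2, fused=True))
decomposed_peak = estimate_peak_bytes(attention_events(1, 8, 1024, 64, 2, fused=False))
print(fused_peak, decomposed_peak)
```

In this toy model the gap comes entirely from the intermediate score tensors, which is one plausible reason the same model gives different estimates when the SDPA call is decomposed.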
That would be very cool. I guess one would need to override some behavior in the checkers.
Right now the claim made by checkers is twofold:
We probably need to decouple the two, so that only the first check is done, irrespective of the second. It could be a flag we pass or something else.
To my mind, it would be feasible to include some form of "assume this hardware" in compile data and then divert queries for hardware properties to that, falling back to real hardware.
It would be great to have a dict that we can set from the outside that describes the hardware capabilities, so backends can use that in the checker functions.
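A minimal sketch of what such an externally set dict could look like, with a fallback to the real hardware when no override is present. Every name here (`assume_hardware`, `hardware_property`, `flash_attention_checker`, the `"compute_capability"` key) is hypothetical and does not exist in Thunder, and the SM 8.0 threshold is an assumption chosen for the example.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical override table that callers can set from the outside.
_assumed_hardware: Dict[str, Any] = {}


def assume_hardware(props: Optional[Dict[str, Any]]) -> None:
    """Replace the assumed hardware description; pass None to clear it."""
    _assumed_hardware.clear()
    if props:
        _assumed_hardware.update(props)


def hardware_property(name: str, query_real: Callable[[], Any]) -> Any:
    """Return the assumed value if one is set, otherwise query the real hardware."""
    if name in _assumed_hardware:
        return _assumed_hardware[name]
    return query_real()


def flash_attention_checker(query_real_capability: Callable[[], Any]) -> bool:
    """Toy checker: assumes a fused kernel needs compute capability >= 8.0."""
    capability = hardware_property("compute_capability", query_real_capability)
    return tuple(capability) >= (8, 0)


def t4_capability():
    # Stand-in for a real query on a T4 (compute capability 7.5).
    return (7, 5)


on_real_t4 = flash_attention_checker(t4_capability)

# Pretend to be an H100 (compute capability 9.0) while still running on the T4.
assume_hardware({"compute_capability": (9, 0)})
as_if_h100 = flash_attention_checker(t4_capability)

assume_hardware(None)  # back to querying the real device
```

In a real backend the fallback would be something like `torch.cuda.get_device_capability()`; it is passed in as a callable here so the sketch runs without a GPU.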
In general, the more we avoid relying on the actual underlying hardware to reason about the computation, the better it is. This should be an explicit goal.
I agree that the checkers should be more controllable and there should be a way to describe the target hardware. It could also be useful in a future export scenario. Let's create a separate issue to track accomplishing this goal.
For this particular operation, it's even more complicated because we defer to PyTorch to tell us which scaled_dot_product_attention backend to use: https://github.com/Lightning-AI/lightning-thunder/blob/5fc67dcba844554c8a2390ef8775594e61f18737/thunder/executors/sdpaex.py#L654-L661 However, it shouldn't be too hard to force the selection of a particular backend in the checker: https://github.com/Lightning-AI/lightning-thunder/blob/5fc67dcba844554c8a2390ef8775594e61f18737/thunder/executors/sdpaex.py#L683-L699
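One possible shape for forcing the backend, sketched in plain Python. The `Backend` enum, `select_backend`, and `sdpa_checker` are hypothetical stand-ins rather than the code in `sdpaex.py`, and the rule that the checker accepts anything except the math fallback is an assumption.

```python
import enum
from typing import Callable, Optional


class Backend(enum.Enum):
    # Hypothetical labels mirroring the usual SDPA choices.
    FLASH = "flash"
    EFFICIENT = "efficient"
    MATH = "math"


def select_backend(ask_pytorch: Callable[[], Backend],
                   forced: Optional[Backend] = None) -> Backend:
    """Use the forced backend when given; otherwise defer to the runtime query."""
    return forced if forced is not None else ask_pytorch()


def sdpa_checker(ask_pytorch: Callable[[], Backend],
                 forced: Optional[Backend] = None) -> bool:
    """Toy checker: claim the op only when a fused backend is selected."""
    return select_backend(ask_pytorch, forced) is not Backend.MATH


def t4_choice() -> Backend:
    # Stand-in for what the runtime might report on a GPU without fused kernels.
    return Backend.MATH


claimed_by_default = sdpa_checker(t4_choice)
claimed_when_forced = sdpa_checker(t4_choice, forced=Backend.FLASH)
```

The runtime query is only consulted when nothing is forced, so the resulting trace would no longer depend on the device the checker happens to run on.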
🐛 Bug
get_alloc_memory has the potential to be a great tool for estimating the effect of trace transformations on memory usage. For those who don't know about this function, it lives here: https://github.com/Lightning-AI/lightning-thunder/blob/8c5905fd1a93145e690791a7c7a3c3e10b16b32b/thunder/examine/memory_caculation.py#L120-L137
I've tried generating the execution trace with fake CUDA tensors so that the final trace can be analyzed without actually executing it, even on low-memory GPUs like those in Colab:
The estimated peak memory for this trace is, for some reason, different on Colab and locally.
On Colab I see:
and locally:
On Colab litgpt, nvfuser, and Thunder are installed with:
cc @apaz-cli