Lightning-AI / lightning-thunder


Support for reshape+in-place #957

Closed: IvanYashchuk closed this issue 2 weeks ago

IvanYashchuk commented 1 month ago

🚀 Feature

torch.reshape (and every operation that uses reshape internally) is interesting because, depending on PyTorch's internal conditions, it may or may not return a tensor view. The PyTorch documentation says:

reshape(), reshape_as() and flatten() can return either a view or new tensor, user code shouldn’t rely on whether it’s view or not.
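For illustration, here is a minimal sketch of both behaviors in plain PyTorch (standard Eager semantics, not Thunder-specific):

```python
import torch

# Contiguous input: reshape returns a view, so an in-place write
# through the result is visible in the original tensor.
a = torch.zeros(2, 3)
a.reshape(6).add_(1)
print(a)  # all ones: the reshape aliased `a`

# Non-contiguous input: reshape must copy, so the same in-place
# write is silently lost.
b = torch.zeros(2, 3).t()  # the transpose makes `b` non-contiguous
b.reshape(6).add_(1)
print(b)  # still all zeros: the reshape copied
```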

@jjsjann123 discovered that with the current functionalization enabled, Thunder's behavior doesn't match PyTorch's, silently producing unexpected results, and that it matches when functionalization is disabled (https://github.com/Lightning-AI/lightning-thunder/issues/657#issuecomment-2274293433).
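A minimal sketch of the problematic pattern (a hypothetical repro; the actual divergence is described in the linked comment):

```python
import torch
import thunder

def f(x):
    y = x.reshape(-1)
    y.add_(1)  # in-place op on the reshape result
    return y

x = torch.zeros(2, 2)
jf = thunder.jit(f)
jf(x)
# In Eager this particular reshape returns a view, so `x` is mutated too.
# With functionalization enabled, the compiled function may not replay
# that side effect on `x`, silently diverging from Eager.
```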

Short-term solution: Thunder should raise an error whenever reshape+in-place occurs in the user script, irrespective of whether the reshape can be safely resolved to a non-viewing reshape. There should be an option to turn this error into a warning for developers. What do people think about this?
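To make the proposal concrete, a minimal sketch of such a check over a generic list of (op_name, input_names, output_names) records; both the names and the trace representation here are hypothetical, not Thunder's actual API:

```python
import warnings

class ReshapeInplaceError(RuntimeError):
    pass

# Hypothetical trace format: (op_name, input_names, output_names).
def check_reshape_inplace(ops, error=True):
    reshape_outs = set()
    for name, ins, outs in ops:
        if name in ("reshape", "reshape_as", "flatten"):
            reshape_outs.update(outs)
        elif name.endswith("_") and reshape_outs.intersection(ins):
            msg = (f"in-place op '{name}' applied to a reshape result; "
                   "reshape may return either a view or a copy, so the "
                   "side effect is unreliable. Use Tensor.view or an "
                   "out-of-place op instead.")
            if error:
                raise ReshapeInplaceError(msg)
            warnings.warn(msg)

# Raises ReshapeInplaceError for the pattern from this issue:
check_reshape_inplace([("reshape", ["x"], ["y"]), ("add_", ["y"], ["y"])])
```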

For a long-term solution, I see two paths:

  1. Thunder acts as a linter for PyTorch code and raises an error when reshape+in-place results in a side effect observable outside the computation function. Users should be directed to use torch.view or torch.reshape_copy instead (see the sketch after this list).
  2. Thunder mimics PyTorch's behavior. Even though relying on reshape+in-place side effects is discouraged, behaving like PyTorch may smooth Thunder adoption.
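Under option 1, the error message would steer users from the flagged pattern to an explicit choice between two safe spellings, roughly:

```python
import torch

x = torch.zeros(2, 2)

y = x.reshape(-1)
y.add_(1)              # flagged: whether this mutates `x` is unreliable

# Safe alternatives the error message could suggest:
y = x.view(-1)         # raises if a view is impossible, so the aliasing
y.add_(1)              # (and hence the mutation of `x`) is guaranteed

y = x.reshape(-1) + 1  # out-of-place: never mutates `x`
```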

I'd like people to focus on what the user experience with Thunder should be, not on how technically challenging one solution is relative to another. Maybe others can come up with more alternatives?

cc @tfogal, @lantiga, @t-vi, @crcrpar

jjsjann123 commented 4 weeks ago

user code shouldn’t rely on whether it’s view or not.

I'm wondering how this is perceived by users today. Extreme cases are still tricky to resolve; with fake tensors, how reliably can torch's analysis distinguish reshape-as-view from reshape-as-copy? cc'ing @kiya00

IvanYashchuk commented 4 weeks ago

Fake tensors do propagate stride information, but they may do it differently from Eager, and in any case we don't have stride information for the inputs, so we cannot initialize correct fake tensors.
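For reference, a small sketch of what fake tensors do propagate (FakeTensorMode is an internal PyTorch API; a recent PyTorch is assumed):

```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    x = torch.empty(2, 3)       # fake tensor: strides are assumed, not observed
    y = x.t().reshape(6)        # meta kernels propagate sizes and strides
    print(y.shape, y.stride())  # plausible metadata, but only as trustworthy
                                # as the guessed input strides
```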

There are other ways to reliably determine whether PyTorch Eager produces a copy or a view. I'd like the discussion here to focus on the UX and the expected behavior, not the technical challenges.

IvanYashchuk commented 4 weeks ago

I'm wondering how this is perceived by users today?

PyTorch has no documentation similar to NumPy's on viewing vs. non-viewing reshape (https://numpy.org/doc/stable/user/basics.copies.html#how-to-tell-if-the-array-is-a-view-or-a-copy), so non-advanced users can be confused about why they sometimes observe side effects and sometimes don't.
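For comparison, NumPy's documented way to tell is the array's .base attribute; PyTorch only has the undocumented internal Tensor._base, e.g.:

```python
import torch

a = torch.zeros(2, 3)
print(a.reshape(6)._base is a)  # True: this reshape returned a view of `a`
print(a.t().reshape(6)._base)   # None: this reshape returned a copy
```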

kiya00 commented 3 weeks ago

I didn't exclude the operators that return views from the auto registration since, as Ivan mentioned, the stride information is not used. And now I find that I didn't add the tensor view operators appropriately to _syms_returning_runtime_dependently_views and _syms_returning_views (e.g. torch.Tensor.reshape_as is auto-registered and has the same reshape+in-place problem). Do you think it makes more sense to remove them, or to add them to the two lists above and let the corresponding passes handle them?

mruberry commented 3 weeks ago

Triage review: an error seems like a good solution for now; it will be interesting to see whether this comes up in practice.

IvanYashchuk commented 2 weeks ago

In the Lightning AI Slack #thunder channel @t-vi wrote:

Personally, I would error if an in-place op comes right after a reshape, with a stern message educating the user.

@lantiga added:

Same, I would error. Most correct PyTorch code will have worked its way to discovering that in-place on views led to a bug and will have fixed it, so we should just error and save users the trouble.

@mruberry's comment:

I think the only options in this case are to error or to mimic PyTorch's behavior; otherwise we'd have a silent incorrectness issue.

Let's raise an error then. I'll create a separate issue to track it. Thanks to everyone who participated in the discussion!