Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

Add allow_cpu_scalar_tensors for shape/device check and broadcast functions #1095

Closed · kiya00 closed this 2 months ago

kiya00 commented 2 months ago
Before submitting

- [ ] Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
- [ ] Did you read the [contributor guideline](https://github.com/Lightning-AI/pytorch-lightning/blob/main/.github/CONTRIBUTING.md), Pull Request section?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

What does this PR do?

Fixes #941.

PR review

Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

kiya00 commented 2 months ago

Hi @t-vi @mruberry , PyTorch can run binary operators on a cpu scalar tensor and a cuda tensor, but Thunder can't (an example here). I tried to simply add the allow_cpu_scalar_tensors flag, similar to torch, to solve this problem, but it affects the main code path and breaks a lot of test cases. Do you think this is the right way to fix the bug?
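For reference, a minimal sketch of the failing pattern (based on the example quoted later in this thread; `thunder.jit` is the public entry point, and the exact error depends on the Thunder version):

```python
import torch
import thunder

def fn(x):
    # a 0-dim cpu tensor mixed with a cuda tensor works in eager PyTorch,
    # but Thunder's shape/device checks currently reject the combination
    return x * torch.tensor(0.5, dtype=x.dtype)

jfn = thunder.jit(fn)
x = torch.randn(2, 2, device="cuda")
out = jfn(x)  # fails without special handling for cpu scalar tensors
```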

t-vi commented 2 months ago

Thank you, @kiya00 , for looking into this! I think having an allow_cpu_scalar_tensors flag is a good ingredient for the solution, but in addition we need to take scalar.item() and then process it, treating it as a number. The tricky part will be that this needs to either

I imagine that either option will not be terribly easy.

The alternative (quite inferior, to my mind) is to move it to the GPU and broadcast it so that it is treated as a tensor, which is not desirable.

jjsjann123 commented 2 months ago

we cannot leak the .item() value into the trace

QQ: what's the issue with having item in the trace? I thought we do have PrimIDs.ITEM, which does exactly that?

Specifically, what does this mean in the example Yan has below?

def fun(x):
    x = x * torch.tensor(0.5, dtype=x.dtype)
    return x

Are we trying to not have the item() inside torch.mul?

t-vi commented 2 months ago

QQ: what's the issue with having item in the trace? I thought we do have PrimIDs.ITEM, which does exactly that?

Sorry, I was too terse: we can have .item() in the trace, but we cannot have the returned value as a constant in the trace. So if we want to have .item() in the trace, the return value needs to be a number proxy.
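To illustrate the distinction, a hand-written sketch in the style of the trace shown later in this thread (not actual Thunder output):

```python
# Fine: .item() is recorded in the trace and its result stays symbolic
#   s = prims.item(y)      # s is a NumberProxy, not a concrete Python float
#   t0 = prims.mul(x, s)   # t0: "cuda:0 f32[2, 2]"
#
# Not fine: the concrete value leaks into the trace as a constant
#   t0 = prims.mul(x, 0.5) # 0.5 is baked in; the trace is wrong if y changes
```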

jjsjann123 commented 2 months ago

QQ: what's the issue with having item in the trace? I thought we do have PrimIDs.ITEM, which does exactly that?

Sorry, I was too terse: we can have .item() in the trace, but we cannot have the returned value as a constant in the trace. So if we want to have .item() in the trace, the return value needs to be a number proxy.

Got'ya.

Do we need to lift cpu scalar tensor into numberproxy? If that's going to be treated as a dynamic thing, why not just leave it as-is? Looking at the changes in this draft PR, why would we have an extra field to allow cpu scalar? Are we seeing any errors when we stopped broadcasting cpu scalar tensors?

My naive understanding is that the changes here should be the default behavior. PyTorch has been treating cpu scalar tensors as scalars. So I would expect backends to play well with that already. cc'ing @Priya2698 for visibility.

t-vi commented 2 months ago

Do we need to lift cpu scalar tensor into numberproxy? If that's going to be treated as a dynamic thing, why not just leave it as-is?

To my mind, PyTorch's handling of cpu scalar -> scalar is implicit, similar to the upcasting or broadcasting of tensors. It then takes the overloads with Scalar args, not the Tensor ones: for torch.add(a, b), PyTorch will call a 2-tensor vectorized_elementwise kernel if b is a number or a cpu scalar tensor, and a 3-tensor kernel if b is a cuda tensor (different ones depending on whether there is broadcasting).

So if we want to mirror this, we would want the prims with number arguments to be called in the case of a scalar cpu tensor.
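A quick eager-mode illustration of the implicit handling described above (assuming a CUDA device is available): PyTorch accepts a 0-dim cpu tensor next to a cuda tensor but rejects a dimensioned cpu tensor:

```python
import torch

a = torch.randn(2, 2, device="cuda")

# 0-dim cpu tensor: dispatched like a Scalar argument, so this works
b = torch.tensor(0.5)
print(torch.add(a, b).device)  # cuda:0

# dimensioned cpu tensor: no implicit transfer, this raises a device-mismatch error
c = torch.full((2, 2), 0.5)
try:
    torch.add(a, c)
except RuntimeError as e:
    print(e)
```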

(Also, I'm wondering what the implicitness means e.g. for CUDAGraphs.)

Maybe a summary:

kiya00 commented 2 months ago

Do we need to lift cpu scalar tensor into numberproxy? If that's going to be treated as a dynamic thing, why not just leave it as-is? Looking at the changes in this draft PR, why would we have an extra field to allow cpu scalar? Are we seeing any errors when we stopped broadcasting cpu scalar tensors?

Before the changes in this draft, the binary operators broadcast the cpu scalar tensor and checked for the same device and same shape. The changes try to skip those checks for cpu scalar tensors in the meta-function propagation, and at runtime, since the torch/nvfuser backends can take care of binary ops mixing a cpu scalar tensor and a cuda tensor (is that right?). Then maybe we can leave it as-is instead of taking out the value as a numberproxy? What do you think? @t-vi @jjsjann123
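To make the intent concrete, here is a rough standalone sketch of what skipping those checks could look like; the helper names are hypothetical and not Thunder's actual utilities:

```python
import torch

def is_cpu_scalar_tensor(t: torch.Tensor) -> bool:
    # hypothetical helper: a 0-dim tensor living on the cpu
    return t.ndim == 0 and t.device.type == "cpu"

def check_same_device(*tensors: torch.Tensor, allow_cpu_scalar_tensors: bool = True) -> None:
    # cpu scalar tensors are exempt from the same-device requirement,
    # mirroring how PyTorch dispatches them like Scalar arguments
    checked = [t for t in tensors if not (allow_cpu_scalar_tensors and is_cpu_scalar_tensor(t))]
    devices = {t.device for t in checked}
    if len(devices) > 1:
        raise RuntimeError(f"Expected all tensors to be on the same device, found {devices}")
```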

mruberry commented 2 months ago

Hey @kiya00!

It seems like we have two options for dealing with this issue:

  1. We can treat cpu tensors with zero dimensions like numbers in most cases
  2. We can convert scalar cpu tensors to numbers in torch operations

The first option requires that the primitives be able to understand some cpu tensors as numbers. This is not a totally unreasonable request to make of them. Instead of "allow_cpu_scalar_tensors", an argument name like "treat_cpu_scalar_tensors_as_numbers" might be more accurate. We should be careful to note that type promotion is different for cpu tensors with zero dimensions than for numbers, because the former have an actual tensor dtype. Our current type promotion probably handles this incorrectly, but maybe we could update

https://github.com/Lightning-AI/lightning-thunder/blob/f945409580e555b010f5272eb19dcba744ae3196/thunder/core/dtypes.py#L311

so that it returns a weak dtype for any cpu scalar tensor? We want the behavior to mimic PyTorch's in cases like:

a = torch.randn(1, dtype=torch.float16).squeeze()
b = torch.randn((2, 2), dtype=torch.bfloat16)

# The tensor dtype of b is used because a is a cpu scalar tensor
a + b
# tensor([[-0.5312, -3.5781],
#         [-1.3906,  0.1211]], dtype=torch.bfloat16)

# The dtype of a is the result, not float32 as would be the case if 1. + 2 were turned into a tensor
a + 2
# tensor(1.1211, dtype=torch.float16)
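The promotion rule the updated `to_dtype` would need to reproduce can be checked against eager PyTorch with `torch.result_type` (shown here only to confirm the expected behavior, not as the Thunder implementation):

```python
import torch

a = torch.randn(1, dtype=torch.float16).squeeze()  # 0-dim cpu tensor
b = torch.randn((2, 2), dtype=torch.bfloat16)

print(torch.result_type(a, b))  # torch.bfloat16: the dimensioned tensor's dtype wins
print(torch.result_type(a, 2))  # torch.float16: a Python int doesn't upgrade the dtype
```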

Alternatively, we could change the torch operations or maybe the clang operations to convert scalar cpu tensors into number proxies. There are a couple challenges with this, however. First, it requires adding code to a lot of operations. Second, the conversion to a number would lose the tensor dtype information, because our number proxies currently only have number types (fyi @jjsjann123, it could be interesting to think about changing this). Converting tensors to number proxies might also require updating the conversion from thunder objects to torch objects, because a numberproxy with a tensor dtype would have to be wrapped in a cpu tensor before being returned to the practitioner.

I think either approach is possible. I guess my question is how easy it would be for executors, like nvFuser, to use a scalar cpu tensor as a number (@kevinstephano @jjsjann123). If that is easy, then the approach in this PR, with treat_cpu_scalar_tensors_as_numbers, is probably the simpler one. We should add tests to ensure it works and file an issue for nvfuser support. Maybe there are some custom tests we can write and extend some OpInfo sample inputs?

If we take this approach, then will operations like adding a cuda tensor and a cpu scalar tensor always be dispatched to the PyTorch executor and not the nvFuser executor, because one of the inputs is a cpu scalar tensor?
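As a rough illustration of such a test, a standalone sketch (it doesn't use the OpInfo machinery, assumes a CUDA device, and goes through the public `thunder.jit` entry point):

```python
import torch
import thunder

def test_binary_op_with_cpu_scalar_tensor():
    def fn(x, y):
        return x * y

    jfn = thunder.jit(fn)
    x = torch.randn(2, 2, device="cuda")
    y = torch.tensor(0.5)  # 0-dim cpu tensor

    expected = fn(x, y)
    actual = jfn(x, y)
    torch.testing.assert_close(actual, expected)
    assert actual.device == x.device  # result should stay on the cuda device
```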

jjsjann123 commented 2 months ago

Thanks a lot for the awesome summary and listing out options @t-vi @mruberry :clap:

I guess my question is how easy would it be for executors, like nvFuser, to use a scalar cpu tensor as a number?

nvfuser today already supports scalar cpu tensors. Hence my earlier suggestion to just leave it as-is and update the broadcast/type promotion in the proxy to properly handle cpu scalar tensors. :)

another question I have is re: allow_cpu_scalar_tensors vs treat_cpu_scalar_tensors_as_numbers

I don't see the reason for having this field in the first place. Shouldn't it be a generic feature for promotion to handle? IMHO, if there are operations that don't allow cpu scalar tensors, the assert should happen inside the specific operation instead.

"do like PyTorch and make it explicit in the trace" has worked well for us with broadcasting and upcasting

Re: @t-vi 's earlier comment, I don't disagree with this. However, for the cpu-scalar-specific handling, I don't see a clear win in explicitly converting it to a number proxy in the trace. I expect that's a lot more work to tackle up front, compared with fixing broadcast/type promotion, which can be done incrementally.

mruberry commented 2 months ago

Thanks a lot for the awesome summary and listing out options @t-vi @mruberry 👏

I guess my question is how easy would it be for executors, like nvFuser, to use a scalar cpu tensor as a number?

nvfuser today already supports scalar cpu tensors. Hence my earlier suggestion to just leave it as-is and update the broadcast/type promotion in the proxy to properly handle cpu scalar tensors. :)

Great!

another question I have is re: allow_cpu_scalar_tensors vs treat_cpu_scalar_tensors_as_numbers

I don't see the reason for having this field in the first place. Shouldn't it be a generic feature for promotion to handle? IMHO, if there are operations that don't allow cpu scalar tensors, the assert should happen inside the specific operation instead.

I agree; having no option (or defaulting the option to on if it's required for some reason) sounds good.

kiya00 commented 2 months ago

Hi all, thank you all for your valuable insights and helpful tips!

another question I have is re: allow_cpu_scalar_tensors vs treat_cpu_scalar_tensors_as_numbers

I don't see the reason for having this field in the first place. Shouldn't it be a generic feature for promotion to handle? IMHO, if there are operations that don't allow cpu scalar tensors, the assert should happen inside the specific operation instead.

I agree; having no option (or defaulting the option to on if it's required for some reason) sounds good.

One example where a cpu scalar tensor does need to be broadcast is when it is used in a list of cpu tensor indices (like in index_put). Maybe it's more flexible to keep this field and set its default to True, since these are general utility functions, and operations that don't allow cpu scalar tensors can still use them?
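For example, in eager PyTorch a 0-dim cpu index tensor does get broadcast against the other index tensors in advanced indexing (which lowers to index_put_ on assignment):

```python
import torch

a = torch.zeros(3, 3)
rows = torch.tensor(1)       # 0-dim cpu tensor used as an index
cols = torch.tensor([0, 2])  # 1-dim index tensor

# the 0-dim index is broadcast against the 1-dim one
a[rows, cols] = 5.0
print(a[1])  # tensor([5., 0., 5.])
```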

If we take this approach, then will operations like adding a cuda tensor and a cpu scalar tensor always be dispatched to the PyTorch executor and not the nvFuser executor, because one of the inputs is a cpu scalar tensor?

Thanks for the point! After modifying the check function for nvfuser, it can be fused into nvfuser:

@torch.no_grad()
@no_autocast
def augmented_forward_fn(x, y):
  # x: "cuda:0 f32[2, 2]"
  # y: "cpu f32[]"
  [t0] = nvFusion0(x, y)
    # t0 = prims.mul(x, y)  # t0: "cuda:0 f32[2, 2]"
  return {'output': t0, 'flat_args': [x, y], 'flat_output': (t0,)}, ((x, y), ())

We should add tests to ensure it works and file an issue for nvfuser support. Maybe there are some custom tests we can write and extend some OpInfo sample inputs?

Test cases mixing a cuda tensor and a cpu scalar tensor were added to the binary op tests, and many of them failed (result shapes didn't match, or results ended up on different devices); after debugging them one by one, they were fixed with some small pieces of code. There are still 2 tests that fail on torch-nightly but pass on torch 2.4.0, and I haven't figured out why:

FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_pow_nvfuser_cuda_thunder.dtypes.int64 - AssertionError: Tensor-likes are not equal!

Mismatched elements: 1 / 16 (6.2%)
Greatest absolute difference: 9223372036854775807 at index (2, 3)
Greatest relative difference: 1.0 at index (2, 3)
FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_pow_torch_cuda_thunder.dtypes.int8 - RuntimeError: "reciprocal_cuda" not implemented for 'Char'
= 2 failed, 6108 passed, 1127 skipped, 115 xfailed, 108 xpassed, 136258 warnings in 1059.23s (0:17:39) =

But I think maybe we could do a round of review first since I'll be on vacation for the next 2 weeks? :) @mruberry @jjsjann123 @t-vi @IvanYashchuk

mruberry commented 2 months ago

I think the test error is unrelated:

b = tensor([[0, 0, 0, 0],
        ...
        [0, 0, 0, 0]], device='cuda:0')

>       lambda a, b: comp(a, b, equal_nan=True),
    )
E   AssertionError: Tensor-likes are not equal!
E   
E   Mismatched elements: 1 / 16 (6.2%)
E   Greatest absolute difference: 9223372036854775807 at index (2, 3)
E   Greatest relative difference: 1.0 at index (2, 3)

thunder/tests/test_ops.py:84: AssertionError

That's quite a big error! I wonder what's going on with that test. I filed https://github.com/Lightning-AI/lightning-thunder/issues/1130 to track it.

kiya00 commented 2 months ago

Hi @IvanYashchuk , could you take a look if it's ready to merge?