Add a validation pass - Githubissues

tfogal commented 8 months ago

We should have a validation pass to ensure that the input program makes sense. For example, the following program:

from nvfuser import FusionDefinition, DataType

def nvfuser_fusion_id1(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[-1,-1], contiguity=[True,True], dtype=DataType.Float, is_cpu=False, stride_order=[1,0])
    T1 = fd.define_tensor(shape=[-1,-1], contiguity=[True,True], dtype=DataType.BFloat16, is_cpu=False, stride_order=[1,0])
    T2 = fd.ops.add(T0, T1)
    fd.add_output(T2)

with FusionDefinition() as fd:
    nvfuser_fusion_id1(fd)

import torch
inputs = [
    torch.randn((4,), dtype=torch.float32, device='cuda:0').as_strided((2, 2), (2, 1)),
    torch.randn((4,), dtype=torch.float32, device='cuda:0').as_strided((2, 2), (2, 1)),
]   
fd.execute(inputs)

crashes when run with:

RuntimeError: Expected T1_g[ iS2{i3}, iS3{i4} ] to be bound to a tensor of dtype __bfloat, but got a tensor of dtype float

We should fail more gracefully:

we should not tear down the whole process when we fail; report an error back up instead
the user can't do anything with references like iS2{i3}; error messages more at the level of the program are what they would want

Incomplete list of what we should consider in a validation pass:

[ ] Check operand types
[ ] Ensure support matches the GPU targeted: we can't create bf16 kernels on volta hardware
[ ] Ensure support matches the GPU targeted: we can't create fp8 kernels on ampere hardware

naoyam commented 8 months ago

I think we automatically insert a necessary cast operation. The error in this case is about the T1 input being float32.

For the fp type support, that should be relatively trivial.

More generally, though, it's a hard problem to report a meaning error message in Python from a C++ error. Is anybody familiar with general design guidelines? For example, should raising more specific exceptions help? What should be the protocol of error reporting look like between C++ and Python? Currently, the C++ side just uses NVF_ERROR, which throws an nvfuser::nvfError exception. Maybe we should have some more subclasses of this exception, like nvfSchedulingError, nvfLoweringError?

kevinstephano commented 8 months ago

These are cases that are only identifiable at runtime as we tried to catch definition time issues in Python to make the errors better for the ones we could, easily. We probably need to investigate potentially how exceptions get translated to python. Maybe this is looking into how people do exceptions through bindings or changing the exception error type from C++.

Another issue is that the Tensor identifiers between C++ and Python do not match up.

tfogal commented 8 months ago

I think we automatically insert a necessary cast operation. The error in this case is about the T1 input being float32.

oops, yep! Thanks Naoya. Edited to fix.

tfogal commented 8 months ago

More generally, though, it's a hard problem to report a meaning error message in Python from a C++ error. Is anybody familiar with general design guidelines? For example, should raising more specific exceptions help? What should be the protocol of error reporting look like between C++ and Python? Currently, the C++ side just uses NVF_ERROR, which throws an nvfuser::nvfError exception. Maybe we should have some more subclasses of this exception, like nvfSchedulingError, nvfLoweringError?

Yeah, it's definitely hard!

I think new types are useful insofar as a user wants to handle them differently. For example in POSIX, EAGAIN vs. EACCES make sense as distinct errors because the user may reasonably want to retry in the first error, vs. give up in the second. In our case, it might then depend on whether the client has any recourse as to what to do about it.

Tagging @nouiz as I know he's been having discussions on better error handling APIs as of late. What are your thoughts, Frédéric?

naoyam commented 8 months ago

I separated out the validation issue as #1760.

nouiz commented 8 months ago

I think the C/C++ code shouldn't use error code anywhere. The important is to have a detailed error string and pass it around. XLA use absl::Status instead of exception. I like that, but the important is error string from where the error happen and making sure it is reported back to the user. Anything that does this is a pretty good state. This should trigger a Python exception in the end with that string.

It is hard for higher level to add the correct detail when the error happen at a low level. And when they try to do this, it is a maintenance night-mare as this can change at all release (and isn't always well documented).

So I would go to make sure the lowest place that detect the error create detailed error string and make sure it is passed above. Error code can be used to select the right python exception type. But not more then that.

naoyam commented 7 months ago

As for the original issue, @jacobhinkle improved the error message in #1784. This is where we validate required device properties:

https://github.com/NVIDIA/Fuser/blob/main/csrc/executor.cpp#L394-L405

The more general issue of better reporting from nvfuser to a higher level framework remains.

kevinstephano commented 7 months ago

I will note I opened an issue on the Python API, we might want to look into validating some of these attributes at the definition level so the error messages are nicer. #1856

NVIDIA / Fuser

Add a validation pass #1758