Open tfogal opened 8 months ago
I think we automatically insert a necessary cast operation. The error in this case is about the T1
input being float32
.
For the fp type support, that should be relatively trivial.
More generally, though, it's a hard problem to report a meaning error message in Python from a C++ error. Is anybody familiar with general design guidelines? For example, should raising more specific exceptions help? What should be the protocol of error reporting look like between C++ and Python? Currently, the C++ side just uses NVF_ERROR
, which throws an nvfuser::nvfError
exception. Maybe we should have some more subclasses of this exception, like nvfSchedulingError
, nvfLoweringError
?
These are cases that are only identifiable at runtime as we tried to catch definition time issues in Python to make the errors better for the ones we could, easily. We probably need to investigate potentially how exceptions get translated to python. Maybe this is looking into how people do exceptions through bindings or changing the exception error type from C++.
Another issue is that the Tensor identifiers between C++ and Python do not match up.
I think we automatically insert a necessary cast operation. The error in this case is about the T1 input being float32.
oops, yep! Thanks Naoya. Edited to fix.
More generally, though, it's a hard problem to report a meaning error message in Python from a C++ error. Is anybody familiar with general design guidelines? For example, should raising more specific exceptions help? What should be the protocol of error reporting look like between C++ and Python? Currently, the C++ side just uses NVF_ERROR, which throws an nvfuser::nvfError exception. Maybe we should have some more subclasses of this exception, like nvfSchedulingError, nvfLoweringError?
Yeah, it's definitely hard!
I think new types are useful insofar as a user wants to handle them differently. For example in POSIX, EAGAIN
vs. EACCES
make sense as distinct errors because the user may reasonably want to retry in the first error, vs. give up in the second. In our case, it might then depend on whether the client has any recourse as to what to do about it.
Tagging @nouiz as I know he's been having discussions on better error handling APIs as of late. What are your thoughts, Frédéric?
I separated out the validation issue as #1760.
I think the C/C++ code shouldn't use error code anywhere. The important is to have a detailed error string and pass it around. XLA use absl::Status instead of exception. I like that, but the important is error string from where the error happen and making sure it is reported back to the user. Anything that does this is a pretty good state. This should trigger a Python exception in the end with that string.
It is hard for higher level to add the correct detail when the error happen at a low level. And when they try to do this, it is a maintenance night-mare as this can change at all release (and isn't always well documented).
So I would go to make sure the lowest place that detect the error create detailed error string and make sure it is passed above. Error code can be used to select the right python exception type. But not more then that.
As for the original issue, @jacobhinkle improved the error message in #1784. This is where we validate required device properties:
https://github.com/NVIDIA/Fuser/blob/main/csrc/executor.cpp#L394-L405
The more general issue of better reporting from nvfuser to a higher level framework remains.
I will note I opened an issue on the Python API, we might want to look into validating some of these attributes at the definition level so the error messages are nicer. #1856
We should have a validation pass to ensure that the input program makes sense. For example, the following program:
crashes when run with:
We should fail more gracefully:
iS2{i3}
; error messages more at the level of the program are what they would wantIncomplete list of what we should consider in a validation pass: