[Open] jjsjann123 opened this issue 2 years ago
cc'ing @kevinstephano on the python_definition issue, which spits out a fusion script that misses the const python scalar arguments to batch_norm.
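For reference, these are the const python scalars on the ATen op that a printed fusion script would need to carry along (exactly which ones the dumped script dropped is my reading of the report, not something verified here):

```python
import torch

x = torch.randn(2, 3, 4, 4)   # fp16 on CUDA in the failing benchmark; fp32 here to stay self-contained
weight = torch.randn(3)
bias = torch.randn(3)
running_mean = torch.zeros(3)
running_var = torch.ones(3)

# native_batch_norm takes three const python scalars after the tensor inputs:
# training (bool), momentum (float), eps (float). A fusion definition printed
# without them cannot be replayed faithfully.
out, save_mean, save_invstd = torch.native_batch_norm(
    x, weight, bias, running_mean, running_var,
    True,    # training
    0.1,     # momentum
    1e-5,    # eps
)
```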
Are we sure the math issue isn't an indicator that something else is broken given we consciously decided not to implement FP16 math ops in the runtime as they should be upcasted?
Sounds like Jie is going to look into this in the frontend.
Are we sure the math issue isn't an indicator that something else is broken given we consciously decided not to implement FP16 math ops in the runtime as they should be upcasted?
Sort of yes and no.
The lack of fp16 support is real: when a user explicitly defines something like the script we have here, codegen is going to scream and fail.
The patch I have in #2104 added explicit type promotion for nvprims.native_batch_norm, so at least when running through the dynamo stack we won't run into these problems any more.
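This is not the actual #2104 diff, but conceptually the promotion it adds looks something like the sketch below (names are illustrative): upcast reduced-precision floating inputs to fp32 before the prim, then cast the normalized output back to the input dtype.

```python
import torch

# Illustrative sketch only, not the real nvprims reference for #2104. The idea:
# make the fp32 compute and the fp16 output dtype explicit in the trace instead
# of asking codegen to do fp16 math.
def batch_norm_with_explicit_promotion(x, weight, bias, running_mean, running_var,
                                       training, momentum, eps):
    low_precision = x.dtype in (torch.float16, torch.bfloat16)
    compute_dtype = torch.float32 if low_precision else x.dtype

    def up(t):
        return t.to(compute_dtype) if t is not None and t.is_floating_point() else t

    out, save_mean, save_invstd = torch.native_batch_norm(
        up(x), up(weight), up(bias), running_mean, running_var,
        training, momentum, eps,
    )
    # The output dtype matches the original input; the saved stats stay in fp32.
    return out.to(x.dtype), save_mean, save_invstd
```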
I thought codegen auto-promotes operations to FP32 when given half inputs. This might be an issue with segmentation.
No, I think I'm wrong and we auto demote intermediates to FP16 in segmentation.
No, I think I'm wrong and we auto demote intermediates to FP16 in segmentation. Sorry I missed your comment @csarofeen, but that only happens when intermediates in the original fusion definition are marked as fp16. Anyway, there are two problems we have with fp16 support:
- We need dtype correctness on outputs (linking #2115). We'd be better off having something done in the integration as a catch-all safety net; see the sketch after this list.
- fp16 math is likely needed as well: a) Ruberry made an argument for explicit user control of the math type; whether we prioritize it or not is a different question. b) IIUC, fp16 math could be perf-critical in the longer term.
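A minimal sketch of what such a catch-all net could look like at the integration boundary, assuming we already know the dtypes the original graph promised (the function and its arguments are hypothetical, not existing integration code):

```python
import torch

def enforce_output_dtypes(outputs, expected_dtypes):
    """Hypothetical last-resort guard: cast fusion outputs back to the dtypes
    the original graph promised, so an fp32 intermediate is never handed back
    where eager mode produced fp16."""
    fixed = []
    for out, expected in zip(outputs, expected_dtypes):
        if isinstance(out, torch.Tensor) and expected is not None and out.dtype != expected:
            out = out.to(expected)
        fixed.append(out)
    return fixed
```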
Just to break it down simply for myself there's: (1) Type promotion - defines what precision a computation should take place in. (2) Output type - defines what precision the output should be in. (1) and (2) don't have to match.
I think there are actually three issues here because of fp16/bf16: (1) Type promotion - promotes input types before computations, which can impact the output type. This doesn't use user-defined casts, but would use primTorch casts. (2) Output type - explicitly specifies the output type of a computation, independent of the compute type. (3) AMP behavior - ...
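The three can be teased apart with plain eager PyTorch; the autocast example below uses the CPU/bf16 path only to keep the snippet self-contained:

```python
import torch

a = torch.ones(4, dtype=torch.float16)
b = torch.ones(4, dtype=torch.float32)

# (1) Type promotion: mixed-precision inputs are promoted before the compute,
#     which also determines the default output type.
print((a + b).dtype)              # torch.float32

# (2) Output type: an explicit out= (or trailing cast) pins the result dtype
#     independently of the promoted compute type.
out = torch.empty(4, dtype=torch.float16)
torch.add(a, b, out=out)
print(out.dtype)                  # torch.float16

# (3) AMP: autocast rewrites the *input* dtypes per op, e.g. matmuls run in
#     reduced precision regardless of what the user passed in.
with torch.autocast("cpu", dtype=torch.bfloat16):
    c = torch.mm(b.reshape(2, 2), b.reshape(2, 2))
print(c.dtype)                    # torch.bfloat16
```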
If we just think about the program involving fp16/bf16, is it that for every operator with only fp16 or bf16 inputs we should do: fp16 inputs -> promote inputs to fp32 -> compute -> set output type to fp16
Unless we have an explicit out type (treating a cast as a set + specified out type); then we actually convert that out type to a non-fp16/bf16 value.
The optimization we effectively want is: if we have a series of the above: fp16 inputs -> promote inputs to fp32 -> compute -> set output type to fp16 -> fp16 inputs -> promote inputs to fp32 -> compute -> set output type to fp16
We want to translate it to: fp16 inputs -> promote inputs to fp32 -> compute -> compute -> set output type to fp16
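Written out with explicit casts in eager terms, the before/after is:

```python
import torch

x = torch.randn(8, dtype=torch.float16)

# Unfused: each op upcasts, computes in fp32, then stores its result back as fp16.
t1 = torch.relu(x.float()).half()
t2 = torch.tanh(t1.float()).half()

# Fused: the intermediate half()/float() pair cancels, so the chain stays in
# fp32 end-to-end and only the final output is set to fp16 (also skipping the
# rounding the intermediate fp16 store would have introduced).
t2_fused = torch.tanh(torch.relu(x.float())).half()
```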
Side note: this seems to be very similar to what people want/need in quantization.
So the real question here is how do we want to do the above optimization, because nvFuser only has explicit casts, except for the segmentation rule, which may or may not put an intermediate at segmentation boundaries as fp16/bf16 depending on how that value's computed.
If we just think about the program involving fp16/bf16, is it that for every operator with only fp16 or bf16 inputs we should do: fp16 inputs -> promote inputs to fp32 -> compute -> set output type to fp16
I don't have a problem with this as the default behavior. Users / frameworks might have a stronger opinion on us respecting the compute math type, but that's a different conversation, and it could also be refactored once we add fp16 math support.
The optimization we effectively want is: if we have a series of the above:
fp16 inputs -> promote inputs to fp32 -> compute -> set output type to fp16 -> fp16 inputs -> promote inputs to fp32 -> compute -> set output type to fp16
We want to translate it to: fp16 inputs -> promote inputs to fp32 -> compute -> compute -> set output type to fp16
Do we want to handle this in the integration? We are given an FX graph, so we can cancel out neighboring casts, even propagating casts if we want to go extra fancy. I'm also leaning towards having this done inside the codegen IR; would a transformation like this at the codegen level be tricky to pull off? cc'ing @naoyam @zasdfgbnm
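A rough sketch of the cast-cancelling idea as an FX pass on the graph the integration receives (this is not the nvprims integration code; a real dynamo/nvprims graph may express casts as prims.convert_element_type rather than Tensor.to, so the pattern matched below is purely illustrative):

```python
import torch
import torch.fx

def cancel_redundant_casts(gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
    """Remove x.to(fp16) nodes whose only consumers immediately cast back to fp32.

    Illustrative only: matches Tensor.to call_method nodes; a real pass would
    match whatever cast op the traced graph actually uses.
    """
    for node in list(gm.graph.nodes):
        if node.op == "call_method" and node.target == "to" and node.args[1:] == (torch.float16,):
            users = list(node.users)
            if users and all(
                u.op == "call_method" and u.target == "to" and u.args[1:] == (torch.float32,)
                for u in users
            ):
                src = node.args[0]  # the fp32 producer feeding the down-cast
                for u in users:
                    u.replace_all_uses_with(src)   # skip the up-cast entirely
                    gm.graph.erase_node(u)
                gm.graph.erase_node(node)          # the down-cast is now dead
    gm.graph.lint()
    gm.recompile()
    return gm

def f(x):
    y = torch.relu(x.to(torch.float32)).to(torch.float16)
    return torch.tanh(y.to(torch.float32)).to(torch.float16)

gm = cancel_redundant_casts(torch.fx.symbolic_trace(f))
print(gm.graph)  # the intermediate fp16/fp32 round trip is gone
```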
🐛 Describe the bug
For the issues I'm seeing in our benchmark, I think we can work around them by making batch_norm promotion explicit in the graph. So I'll try to work around it in nvprim/primtorch.
Note: fp16 math can be explicitly requested through our python API. Currently we don't have codegen support for that, and it's going to be a real problem.
Repro script and error message below.
Versions
torchbenchPerf
commit 124fba69b1101cece6d6d0b781c32a11e481dbc7 (HEAD, csarofeen/torchbenchPerf)
Author: Ivan Yashchuk <ivan.yashchuk@aalto.fi>
Date:   Mon Oct 17 16:34:38 2022 +0300