hansely opened this issue 2 years ago
@hansely I do not think this is a known issue.
Can you please provide us with reproduction instructions? This is absolutely necessary in order to start working on it.
Hi @atamazov . You'll have to build MIVisionX first. You can either build it on your system or use the docker.
After building MIVisionX, you can try running the unit tests for scale layer & batch norm layer.
Scale layer unit test : Link. BatchNorm layer unit test : Link.
Please let me know if you don't have access to this repository. The output of the unit test includes Inf values for the scale layer, whereas the output of the batch norm layer matches the golden results.
@hansely Could you please rule out an issue on your side, as you mentioned, since both functions call the same kernel?
Both slice & batchnorm call the same function, but with different parameters. Slice passes nullptr for mean & variance and 0 for epsilon.
@hansely Can you please provide us the input arguments in both the slice and batchnorm calls? You may dump them in both calls.
Batchnorm:
miopenBatchNormalizationForwardInference(miopenHandle, miopenBNSpatial, &data->alpha, &data->beta, data->input_desc, data->input_mem, data->output_desc, data->output_mem, data->bnScaleBiasMeanVarDesc, data->bnScale, data->bnBias, data->bnMean, data->bnVariance, data->eps)
Slice:
miopenBatchNormalizationForwardInference(miopenHandle, miopenBNSpatial, &data->alpha, &data->beta, data->input_desc, data->input_mem, data->output_desc, data->output_mem, data->bnScaleBiasMeanVarDesc, data->bnScale, data->bnBias, nullptr, nullptr, 0)
FYI: Providing an epsilon of 0.00001 instead of 0 removed the Inf & NaN values from the output, but the output still did not match the expected results.
@hansely I looked at your input arguments. I would suggest not using the batchnorm op for Slice. Batchnorm divides by the square root of the variance; if the variance is zero, that results in Inf & NaN values, which is expected in this case. Even if we add an epsilon of 0.00001, the input values are still changed by the normalization.
@muralinr I understand. However, on miopen documentation it says passing nullptr for mean & variance is allowed. They will just not be used.
@hansely We will change documentation. Batchnorm code uses mean and variance.
@atamazov @junliume How do I change documentation here? I need to update the line saying that nullptr is not allowed for mean and variance.
@muralinr @atamazov You can change it here: https://github.com/ROCmSoftwarePlatform/MIOpen/blob/df658f382c5af02133836a8fa5700a23b58c2ca0/include/miopen/miopen.h#L2410
Thanks. Closing the issue.
@muralinr
However, on miopen documentation it says passing nullptr for mean & variance is allowed. They will just not be used.
~If this is not allowed, then MIOpen should MIOPEN_THROW with error message.~
We will change documentation. Batchnorm code uses mean and variance.
Let's close the issue when:
@muralinr Are you saying that the `else` branch is not needed here:
https://github.com/ROCmSoftwarePlatform/MIOpen/blob/df658f382c5af02133836a8fa5700a23b58c2ca0/src/ocl/batchnormocl.cpp#L175 ... https://github.com/ROCmSoftwarePlatform/MIOpen/blob/df658f382c5af02133836a8fa5700a23b58c2ca0/src/ocl/batchnormocl.cpp#L224-L227 ...
No code change is needed. We just need to update the current documentation saying that "If either estimatedMean, or estimatedVariance are null pointers then the values for the mean and variance will be calculated from input data and input values will be updated with this calculated mean and variance." This will avoid confusion.
@atamazov Do you want me to raise a PR with comment changed here on develop? MIOpen/include/miopen/miopen.h
@muralinr
No code change is needed. We just need to update the current documentation saying that "If either estimatedMean, or estimatedVariance are null pointers then the values for the mean and variance will be calculated from input data and input values will be updated with this calculated mean and variance." This will avoid confusion.
Does this mean that having Inf & NaN as output values is a normal behavior?
Is passing nullptr to mean & variance an allowed behavior? If not, why is there an `else` part in the code?
Would you be able to create sample test cases to test the `else` part?
@muralinr
@atamazov Do you want me to raise a PR with comment changed here on develop?
I think yes, because you said that:
We will change documentation...
@muralinr
No code change is needed. We just need to update the current documentation saying that "If either estimatedMean, or estimatedVariance are null pointers then the values for the mean and variance will be calculated from input data and input values will be updated with this calculated mean and variance." This will avoid confusion.
Does this mean that having Inf & NaN as output values is a normal behavior? Is passing nullptr to mean & variance an allowed behavior? If not, why is there an `else` part in the code? Would you be able to create sample test cases to test the `else` part?
@hansely We will phase out the current batchnorm code soon and are in the process of implementing new batchnorm code. We are currently focused on blocking issues. I will update the documentation with the correct API usage.
@muralinr
Does this mean that having Inf & NaN as output values is a normal behavior? Is passing nullptr to mean & variance an allowed behavior?
These questions are good ones. AFAIU these have nothing to do with the BN implementation, but are closely related to the BN API. Can you provide answers please? Maybe the best solution is to include the answers in the documentation in the upcoming PR ;)
@muralinr I think we should raise a PR to update the documentation first. The mentioned questions are good ones too, when we are implementing the new BNs. Thanks!
Sure folks. I will raise a PR to update the documentation first.
This PR is raised to update API documentation. https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1447
@muralinr
The mentioned questions are good ones too, when we are implementing the new BNs.
Excuse me, Murali, but I still have one Q: Is current implementation correct?
If it is correct, then the new BN shall just replicate existing behavior, and the questions have nothing to do with it.
Am I missing something?
We will add more error checks in the new BN implementation to avoid NaNs for cases when epsilon is zero and the calculated variance is 0. I will try to update the current BN code itself to report epsilon issues and avoid NaNs.
@muralinr When do you expect the new implementation to be released? Do you have at least the documentation that explains the new BN implementation? Our team(MIVIsionX) might also need to update BN & scale layer code accordingly based on the new implementation.
@junliume Can you please comment on new BN implementation readiness date?
Epsilon is specifically designed to prevent numerical instability. It isn't recommended to have it be zero. In addition, recalculating mean and variance during inference is extremely inefficient.
@hansely @junliume Internal ticket has been created to fix documentation. Thanks!
The `miopenBatchNormalizationForwardInference` function is failing when nullptr is passed for mean & variance.
https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/blob/master/amd_openvx_extensions/amd_nn/src/batch_normalization_layer.cpp#L121
The above function works fine, but
https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/blob/master/amd_openvx_extensions/amd_nn/src/scale_layer.cpp#L103
this function returns Inf values.
They both call the same function (`miopenBatchNormalizationForwardInference`), but the second one passes nullptr for mean & variance. I'm still working on my end to see if the issue is on our side, but wanted to check if this is a known issue in MIOpen. I was able to find some open tickets regarding Inf & NaN values, so this issue might be related to those?