ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/

miopenBatchNormalizationForwardInference() failing #1441

Open hansely opened 2 years ago

hansely commented 2 years ago

The miopenBatchNormalizationForwardInference function fails when nullptr is passed for mean & variance.

https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/blob/master/amd_openvx_extensions/amd_nn/src/batch_normalization_layer.cpp#L121

The above function works fine but

https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/blob/master/amd_openvx_extensions/amd_nn/src/scale_layer.cpp#L103

this function returns Inf values.

They both call the same function (miopenBatchNormalizationForwardInference) but the second one passes nullptr for mean & variance.
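
Condensed, the two call patterns contrast like this (a rough sketch only; the variable names are illustrative rather than the actual MIVisionX fields, and descriptor/buffer setup is omitted):

```cpp
#include <miopen/miopen.h>

// Sketch of the two call sites. Both layers issue the same MIOpen call; the
// scale layer simply omits the running statistics (nullptr) and passes eps = 0.
miopenStatus_t run_bn_inference(miopenHandle_t handle,
                                miopenTensorDescriptor_t inDesc, const void* in,
                                miopenTensorDescriptor_t outDesc, void* out,
                                miopenTensorDescriptor_t bnDesc,
                                void* scale, void* bias,
                                void* mean, void* variance, double eps,
                                bool asScaleLayer)
{
    float alpha = 1.0f, beta = 0.0f;
    if(!asScaleLayer)
        // Batch norm layer: real statistics and a small epsilon -> matches golden output.
        return miopenBatchNormalizationForwardInference(
            handle, miopenBNSpatial, &alpha, &beta, inDesc, in, outDesc, out,
            bnDesc, scale, bias, mean, variance, eps);
    // Scale layer: statistics omitted, epsilon = 0 -> Inf values in the output.
    return miopenBatchNormalizationForwardInference(
        handle, miopenBNSpatial, &alpha, &beta, inDesc, in, outDesc, out,
        bnDesc, scale, bias, nullptr, nullptr, 0.0);
}
```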

I'm still investigating on my end to see whether the issue is on our side, but I wanted to check whether this is a known issue in MIOpen. I found some open tickets regarding Inf & NaN values, so this issue might be related to those.

atamazov commented 2 years ago

@hansely I do not think this is a known issue.

Can you please provide us with reproduction instructions? This is absolutely necessary in order to start working on it.

hansely commented 2 years ago

Hi @atamazov. You'll have to build MIVisionX first; you can either build it on your system or use the Docker container.

After building MIVisionX, you can try running the unit tests for scale layer & batch norm layer.

Scale layer unit test: Link. BatchNorm layer unit test: Link.

Please let me know if you don't have authorization for this repository. The output of the scale layer unit test includes Inf values, whereas the output of the batch norm layer matches the golden results.

muralinr commented 2 years ago

@hansely Could you please rule out an issue on your side, as you mentioned, since both functions call the same kernel?

hansely commented 2 years ago

Both slice & batchnorm call the same function, but with different parameters. Slice passes nullptr for mean & variance and 0 for epsilon.

muralinr commented 2 years ago

@hansely Can you please provide us with the input arguments in both the slice and batchnorm calls? You may dump them in both calls.

hansely commented 2 years ago

Batchnorm:

miopenBatchNormalizationForwardInference(miopenHandle, miopenBNSpatial, &data->alpha, &data->beta, data->input_desc, data->input_mem, data->output_desc, data->output_mem, data->bnScaleBiasMeanVarDesc, data->bnScale, data->bnBias, data->bnMean, data->bnVariance, data->eps)

Slice:

miopenBatchNormalizationForwardInference(miopenHandle, miopenBNSpatial, &data->alpha, &data->beta, data->input_desc, data->input_mem, data->output_desc, data->output_mem, data->bnScaleBiasMeanVarDesc, data->bnScale, data->bnBias, nullptr, nullptr, 0)

FYI: Providing an epsilon value of 0.00001 instead of 0 removed the Inf & NaN values from the output, but the result still did not match the expected output.

muralinr commented 2 years ago

@hansely I looked at your input arguments. I would suggest not using the batchnorm op for Slice. Batchnorm divides by the square root of the variance; if the variance is zero, this results in Inf & NaN values, which is expected in this case. Even if we add an epsilon value of 0.00001, the input values are changed.
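
For reference, inference-mode batch normalization computes (the standard formula, not MIOpen-specific notation):

$$ y = \gamma \, \frac{x - \hat{\mu}}{\sqrt{\hat{\sigma}^{2} + \epsilon}} + \beta $$

so with a zero variance and a zero epsilon the denominator vanishes and the output becomes Inf/NaN, and even with a small epsilon the input is still shifted by the mean and rescaled by $1/\sqrt{\hat{\sigma}^{2}+\epsilon}$ rather than reproducing the plain $\gamma x + \beta$ a scale layer expects.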

hansely commented 2 years ago

@muralinr I understand. However, the MIOpen documentation says that passing nullptr for mean & variance is allowed; they will just not be used.

muralinr commented 2 years ago

> @muralinr I understand. However, the MIOpen documentation says that passing nullptr for mean & variance is allowed; they will just not be used.

@hansely We will change the documentation. The batchnorm code uses mean and variance.

@atamazov @junliume How do I change the documentation here? I need to update the line to say that nullptr is not allowed for mean and variance.

https://rocmsoftwareplatform.github.io/MIOpen/doc/html/batchnorm.html?highlight=miopenbatchnorm#miopenbatchnormalizationforwardinference

junliume commented 2 years ago

@muralinr @atamazov You can change it here: https://github.com/ROCmSoftwarePlatform/MIOpen/blob/df658f382c5af02133836a8fa5700a23b58c2ca0/include/miopen/miopen.h#L2410

hansely commented 2 years ago

Thanks. Closing the issue.

atamazov commented 2 years ago

@muralinr

> However, the MIOpen documentation says that passing nullptr for mean & variance is allowed; they will just not be used.

~If this is not allowed, then MIOpen should MIOPEN_THROW with an error message.~

> We will change the documentation. The batchnorm code uses mean and variance.

Let's close the issue when:

atamazov commented 2 years ago

@muralinr Are you saying that the else branch is not needed here:

https://github.com/ROCmSoftwarePlatform/MIOpen/blob/df658f382c5af02133836a8fa5700a23b58c2ca0/src/ocl/batchnormocl.cpp#L175 ... https://github.com/ROCmSoftwarePlatform/MIOpen/blob/df658f382c5af02133836a8fa5700a23b58c2ca0/src/ocl/batchnormocl.cpp#L224-L227 ...

muralinr commented 2 years ago

No code change is needed. We just need to update the current documentation to say: "If either estimatedMean or estimatedVariance is a null pointer, then the values for the mean and variance will be calculated from the input data, and the input values will be updated with this calculated mean and variance." This will avoid confusion.
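
As a CPU-side conceptual sketch of that wording (not MIOpen's actual kernel; the NCHW layout and names below are illustrative), the nullptr path amounts to reducing a per-channel mean and variance over the input and then normalizing with them:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Conceptual sketch only: spatial-mode inference when estimatedMean and
// estimatedVariance are nullptr. Statistics are reduced per channel over
// N*H*W and then used to normalize the same input.
void bn_spatial_with_computed_stats(const std::vector<float>& x, std::vector<float>& y,
                                    const std::vector<float>& scale,
                                    const std::vector<float>& bias,
                                    std::size_t N, std::size_t C, std::size_t H, std::size_t W,
                                    double eps)
{
    const std::size_t plane = H * W;
    for(std::size_t c = 0; c < C; ++c)
    {
        double sum = 0.0, sumsq = 0.0;
        for(std::size_t n = 0; n < N; ++n)
            for(std::size_t i = 0; i < plane; ++i)
            {
                const double v = x[(n * C + c) * plane + i];
                sum += v;
                sumsq += v * v;
            }
        const double cnt  = double(N * plane);
        const double mean = sum / cnt;
        const double var  = sumsq / cnt - mean * mean;  // 0 for a constant channel
        const double inv  = 1.0 / std::sqrt(var + eps); // Inf when var == 0 and eps == 0
        for(std::size_t n = 0; n < N; ++n)
            for(std::size_t i = 0; i < plane; ++i)
            {
                const std::size_t idx = (n * C + c) * plane + i;
                y[idx] = float(scale[c] * (x[idx] - mean) * inv + bias[c]);
            }
    }
}
```

This also shows why the scale layer's output cannot match a plain scale*x + bias even with a non-zero epsilon: the input is first centered and rescaled by the computed statistics.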

muralinr commented 2 years ago

@atamazov Do you want me to raise a PR with the comment changed here on develop? MIOpen/include/miopen/miopen.h

hansely commented 2 years ago

@muralinr

> No code change is needed. We just need to update the current documentation to say: "If either estimatedMean or estimatedVariance is a null pointer, then the values for the mean and variance will be calculated from the input data, and the input values will be updated with this calculated mean and variance." This will avoid confusion.

Does this mean that having Inf & NaN output values is normal behavior? Is passing nullptr for mean & variance allowed? If not, why is there an else branch in the code? Would you be able to create sample test cases to test the else branch?

atamazov commented 2 years ago

@muralinr

> @atamazov Do you want me to raise a PR with the comment changed here on develop?

I think yes, because you said that:

> We will change the documentation...

muralinr commented 2 years ago

> @muralinr

> No code change is needed. We just need to update the current documentation to say: "If either estimatedMean or estimatedVariance is a null pointer, then the values for the mean and variance will be calculated from the input data, and the input values will be updated with this calculated mean and variance." This will avoid confusion.

> Does this mean that having Inf & NaN output values is normal behavior? Is passing nullptr for mean & variance allowed? If not, why is there an else branch in the code? Would you be able to create sample test cases to test the else branch?

@hansely We will phase out the current batchnorm code soon and are in the process of implementing a new batchnorm code. We are currently handling blocking issues. I will update the documentation with the correct API usage.

atamazov commented 2 years ago

@muralinr

> Does this mean that having Inf & NaN output values is normal behavior? Is passing nullptr for mean & variance allowed?

These are good questions. AFAIU they have nothing to do with the BN implementation, but are closely related to the BN API. Can you provide answers please? Maybe the best solution is to include the answers in the documentation in the upcoming PR ;)

junliume commented 2 years ago

@muralinr I think we should raise a PR to update the documentation first. The mentioned questions are good ones too, to be addressed when we implement the new BNs. Thanks!

muralinr commented 2 years ago

> @muralinr I think we should raise a PR to update the documentation first. The mentioned questions are good ones too, to be addressed when we implement the new BNs. Thanks!

Sure folks. I will raise a PR to update the documentation first.

muralinr commented 2 years ago

This PR was raised to update the API documentation: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1447

atamazov commented 2 years ago

@muralinr

> The mentioned questions are good ones too, to be addressed when we implement the new BNs.

Excuse me, Murali, but I still have one question: is the current implementation correct?

If it is correct, then the new BN should just replicate the existing behavior, and the questions have nothing to do with it.

Am I missing something?

muralinr commented 2 years ago

> @muralinr

> The mentioned questions are good ones too, to be addressed when we implement the new BNs.

> Excuse me, Murali, but I still have one question: is the current implementation correct?

> If it is correct, then the new BN should just replicate the existing behavior, and the questions have nothing to do with it.

> Am I missing something?

We will add more error checks in the new BN implementation to avoid NaNs in cases where epsilon is zero and the calculated variance is 0. I will try to update the current BN code itself to report epsilon issues and avoid NaNs.
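
As a sketch of the kind of check being discussed (hypothetical code, not what MIOpen currently does; the helper name is made up, and MIOPEN_THROW/miopenStatusBadParm come from MIOpen's internal error handling):

```cpp
// Hypothetical validation sketch: if the caller asks the library to derive the
// statistics from the input, a non-positive epsilon can lead to a division by
// zero for constant channels, so reject it up front with a clear message.
inline void CheckBnInferenceArgs(const void* estimatedMean,
                                 const void* estimatedVariance,
                                 double epsilon)
{
    if((estimatedMean == nullptr || estimatedVariance == nullptr) && epsilon <= 0.0)
        MIOPEN_THROW(miopenStatusBadParm,
                     "epsilon must be > 0 when mean/variance are computed from the input");
}
```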

hansely commented 2 years ago

@muralinr When do you expect the new implementation to be released? Do you at least have documentation that explains the new BN implementation? Our team (MIVisionX) might also need to update the BN & scale layer code accordingly, based on the new implementation.

muralinr commented 2 years ago

> @muralinr When do you expect the new implementation to be released? Do you at least have documentation that explains the new BN implementation? Our team (MIVisionX) might also need to update the BN & scale layer code accordingly, based on the new implementation.

@junliume Can you please comment on the readiness date for the new BN implementation?

daniellowell commented 2 years ago

Epsilon is specifically designed to prevent numerical instability; it isn't recommended to set it to zero. In addition, recalculating the mean and variance during inference is extremely inefficient.
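
To make the efficiency point concrete: when the estimated statistics are supplied, inference batch norm collapses per channel to a single precomputable affine transform (standard algebra, nothing MIOpen-specific):

$$ y = a\,x + b, \qquad a = \frac{\gamma}{\sqrt{\hat{\sigma}^{2} + \epsilon}}, \qquad b = \beta - \frac{\gamma\,\hat{\mu}}{\sqrt{\hat{\sigma}^{2} + \epsilon}} $$

whereas passing nullptr forces a full reduction over the input tensor on every call just to recover $\hat{\mu}$ and $\hat{\sigma}^{2}$.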

ppanchad-amd commented 5 months ago

@hansely @junliume An internal ticket has been created to fix the documentation. Thanks!