ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/

lstm_test: backward data fails if the workspace is not zeroed out #2532

Open amberhassaan opened 1 year ago

amberhassaan commented 1 year ago

In function verify_backward_data_lstm<T>::gpu() we seem to rely, apparently by accident, on the workspace being zeroed out. We create a std::vector for the workspace only so that handle.Write() can create the GPU buffer workspace_dev from it. Copying this vector has the subtle side effect of zeroing out the GPU buffer, presumably because the vector's elements are value-initialized to zero. If the workspace is not zeroed out, verify_backward_data_lstm fails when it compares the ::gpu workspace with the ::cpu workspace.
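
The effect described above can be reproduced on the host alone. The following is a minimal sketch, not MIOpen code: make_stale_buffer, write_from_host_vector and all_zero are hypothetical helpers, and a byte vector filled with 0xAB stands in for an uninitialized device allocation. It assumes the test's host vector is constructed with a size only, so its elements are value-initialized.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for a device allocation: raw storage pre-filled with a non-zero
// byte pattern so that stale contents are visible. (Hypothetical helper.)
std::vector<unsigned char> make_stale_buffer(std::size_t bytes)
{
    return std::vector<unsigned char>(bytes, 0xAB);
}

// Mimics "create a host vector, then copy it into the device buffer".
// std::vector<T>(count) value-initializes every element, so the copy
// overwrites whatever the destination held with zeros.
template <class T>
void write_from_host_vector(std::vector<unsigned char>& device, std::size_t count)
{
    assert(device.size() >= count * sizeof(T));
    std::vector<T> host(count); // every element is T{}, i.e. zero
    std::memcpy(device.data(), host.data(), count * sizeof(T));
}

// Returns true if every byte of the buffer is zero.
bool all_zero(const std::vector<unsigned char>& buf)
{
    for(unsigned char b : buf)
    {
        if(b != 0)
            return false;
    }
    return true;
}
```

A buffer from make_stale_buffer reports all_zero(...) == false until write_from_host_vector<float> has run on it. That is the hidden dependency: allocating the workspace directly on the device, without the host copy, would leave the stale bytes in place.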

CC: @JehandadKhan

amberhassaan commented 1 year ago

I should add that the output tensor dx of the backward pass also fails verification when the workspace is not zeroed out. CC: @JehandadKhan

shurale-nkn commented 1 year ago

@JehandadKhan @amberhassaan The workspace contents should not be part of the verification at all. The library uses this buffer to store intermediate calculations, which may differ between solvers, and that is normal.
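
One way to express that policy in a test is to compare only the output tensors. This is a rough sketch under stated assumptions: BackwardResult, rms_error and verify_outputs_only are hypothetical names rather than MIOpen's test API, and the tolerance is arbitrary.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for what one backward-data run produces.
struct BackwardResult
{
    std::vector<float> dx;        // output tensor: part of the API contract
    std::vector<float> workspace; // scratch space: contents are solver-specific
};

// RMS difference between two equally sized ranges, scaled by the largest
// magnitude seen in either range (unscaled if both are all zeros).
double rms_error(const std::vector<float>& a, const std::vector<float>& b)
{
    assert(a.size() == b.size());
    if(a.empty())
        return 0.0;
    double sq_diff = 0.0;
    double max_mag = 0.0;
    for(std::size_t i = 0; i < a.size(); ++i)
    {
        const double x = a[i];
        const double y = b[i];
        sq_diff += (x - y) * (x - y);
        max_mag = std::max({max_mag, std::fabs(x), std::fabs(y)});
    }
    const double rms = std::sqrt(sq_diff / static_cast<double>(a.size()));
    return max_mag > 0.0 ? rms / max_mag : rms;
}

// Verifies dx only; the workspace is deliberately left out of the comparison.
bool verify_outputs_only(const BackwardResult& cpu,
                         const BackwardResult& gpu,
                         double tolerance = 1e-5)
{
    return rms_error(cpu.dx, gpu.dx) <= tolerance;
}
```

With this check, two runs whose workspaces differ still pass as long as dx matches. A dx mismatch is still reported, so the dx failure described earlier in the thread would not be hidden by it.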

If you find an error in the library, please include a code example or a command line that reproduces it.

ppanchad-amd commented 6 months ago

@amberhassaan Is this ticket still relevant? Thanks!

amberhassaan commented 6 months ago

I believe it is.