Open iq136boy opened 6 months ago
Hi @iq136boy, based on the error message:
max err: 0.0001745224, number of errors: 1760, 0.02872243% wrong values
I suspect this is catastrophic cancellation. The CPU can produce more accurate results because its FPU computes internally with 64/80-bit precision.
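To illustrate the effect described above, here is a minimal pure-Python sketch. It is not the CK kernel or the CPU reference code; float32 is emulated with `struct`, and the helper names (`f32`, `dot_f32_accum`, `dot_f64_accum`) are hypothetical. It only shows how a small term can be lost when a large term and its negation are accumulated in float32, while a wider accumulator keeps it.

```python
import struct


def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_f32_accum(a, b):
    """Dot product with every product and partial sum rounded to float32."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = f32(acc + f32(x * y))
    return acc


def dot_f64_accum(a, b):
    """Dot product accumulated in float64, rounded to float32 only once."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return f32(acc)


# Assumed toy data: a large term, a small term, then the large term negated.
a = [1.0e8, 1.0, -1.0e8]
b = [1.0, 1.0, 1.0]

print(dot_f32_accum(a, b))  # 0.0: the 1.0 is absorbed by 1e8 in float32
print(dot_f64_accum(a, b))  # 1.0: the wider accumulator keeps it
```

Near 1e8 the spacing between adjacent float32 values is 8, so adding 1.0 changes nothing, and the later subtraction removes everything that is left.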
We can handle it in two ways:
@bartekxk Thanks for the update. For 1, how much slowdown could it cause? For 2, is there a default splitK value that can handle most cases, so that the user does not need to choose the value every time?
I found that our ckProfiler also had mismatches between GPU and CPU when using the command from this issue. I investigated the ckProfiler mismatch and concluded that it could be related to floating-point rounding for large tensors. When the tensors were initialized with integer values, the results matched. The check also passed when the tensor size was reduced.
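A rough sketch of why integer initialization can hide the mismatch, under stated assumptions: this is plain Python with float32 emulated via `struct`, not ckProfiler, and the names (`dot_f32`, `dot_f32_chunked`) and the sizes are made up for illustration. With small integer inputs every product and partial sum is exactly representable in float32, so any accumulation order gives the same result. With fractional inputs each step rounds, so a different order may give a slightly different result.

```python
import random
import struct


def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_f32(a, b):
    """Sequential dot product with float32 rounding at every step."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = f32(acc + f32(x * y))
    return acc


def dot_f32_chunked(a, b, chunks):
    """Same dot product, but reduced per chunk first, then combined.

    This only changes the summation order; it is not the actual GPU kernel.
    """
    size = (len(a) + chunks - 1) // chunks
    acc = 0.0
    for start in range(0, len(a), size):
        partial = dot_f32(a[start:start + size], b[start:start + size])
        acc = f32(acc + partial)
    return acc


rng = random.Random(0)
n = 4096  # assumed reduction length, chosen only for the demo

# Small integers: products are at most 25 in magnitude and sums stay far
# below 2**24, so every float32 operation is exact.
ia = [float(rng.randint(-5, 5)) for _ in range(n)]
ib = [float(rng.randint(-5, 5)) for _ in range(n)]
int_seq = dot_f32(ia, ib)
int_chunked = dot_f32_chunked(ia, ib, 8)
print("integer init, difference:", int_seq - int_chunked)

# Fractional values: each step may round, so the order can matter.
fa = [f32(rng.uniform(-1.0, 1.0)) for _ in range(n)]
fb = [f32(rng.uniform(-1.0, 1.0)) for _ in range(n)]
frac_seq = dot_f32(fa, fb)
frac_chunked = dot_f32_chunked(fa, fb, 8)
print("fractional init, difference:", frac_seq - frac_chunked)
```

A shorter reduction means fewer rounding steps, which is consistent with the check passing once the tensor size was reduced.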
We found numerical errors in the convolution backward data kernel when running tests with the detectron2 data. I put the data and the error log file here. The log contains one of the MIOpen driver commands that failed with a numerical error. "conv124_dy.bin" is the output tensor data and "conv124_w.bin" is the weight tensor data.