cornell-brg / hb-pytorch

Repo to hold HammerBlade PyTorch port. Based on PyTorch v1.4.0

Floating point comparison #67

Open yodada opened 4 years ago

yodada commented 4 years ago

❓ This issue is meant for a discussion on how we should handle floating point comparison.

A few Hypothesis / random tests, like sum_22 and mean_hypothesis_3d, may fail. The reason is as follows:

Not sure what the best way to handle this problem is ... We could:

  1. Let Hypothesis generate only small inputs; this is also what upstream does
  2. Increase the tolerance (a combined sketch of both options is below)
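
For concreteness, here is a minimal sketch combining both options: bound the magnitude of the floats Hypothesis generates and compare against the CPU result with an explicit tolerance. The tolerance values and the `.hammerblade()` transfer are placeholders for illustration, not the repo's actual settings.

```python
import torch
from hypothesis import given
import hypothesis.strategies as st

# Placeholder tolerances -- not the repo's actual settings.
ATOL = 1e-5
RTOL = 1e-4

@given(st.lists(st.floats(min_value=-1e3, max_value=1e3,
                          allow_nan=False, allow_infinity=False),
                min_size=1, max_size=64))
def test_sum_matches_cpu(data):
    x = torch.tensor(data, dtype=torch.float32)
    # assumed device transfer for illustration
    h = x.hammerblade()
    assert torch.allclose(x.sum(), h.sum().cpu(), rtol=RTOL, atol=ATOL)
```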
vb000 commented 4 years ago

My view is that we should increase the tolerance and have a centralized value for it. Having a centralized value enables us to gauge where we stand in terms of accuracy. Later, when we swap the FPU with hardfloat, we could reduce the tolerance.

That's in cosimulation. But I couldn't figure out why we are seeing mismatches in emulation!
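
One way to centralize that tolerance (module and constant names here are hypothetical) is a single helper that every test funnels its comparisons through, so tightening the bound later is a one-line change:

```python
# hypothetical central module, e.g. tests/hb_tolerance.py
import torch

HB_RTOL = 1e-4   # placeholder values; tighten once the hardfloat FPU lands
HB_ATOL = 1e-6

def assert_hb_close(cpu_result, hb_result):
    """Compare a HammerBlade result against the CPU reference
    using the single, centrally defined tolerance."""
    assert torch.allclose(cpu_result, hb_result.cpu(),
                          rtol=HB_RTOL, atol=HB_ATOL)
```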

yodada commented 4 years ago

Emm ... it was closed automatically by the PR, but I think this is still a valid issue.

sampsyo commented 4 years ago

IMO, the wiliness of floating point means that there is no silver bullet here. The variation between FPUs is actually the least of our worries: different implementations (i.e., algorithms or variations on algorithms) mean that the results can differ arbitrarily for a given input. Without painstakingly analyzing individual algorithms, there's no way to put a bound on how big the difference can be (neither a relative nor an absolute bound).

Just for fun, here's someone else discovering the difficulty of combining PBT and FP.

I think the practical solution, then, is probably just to widen the tolerance until it works. A slightly fancier solution would be to average over multiple runs. That is, Hypothesis already chooses inputs randomly, so just choose inputs many times and take the average difference. A little statistical analysis could then give you a confidence bound: for example, 99.99% confidence that the average difference is within a given bound. A rough sketch of that idea follows.
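
A rough sketch of the averaging idea, assuming a `.hammerblade()` transfer and using a simple normal approximation for the confidence bound (the helper names are made up):

```python
import math
import torch

def mean_abs_error(kernel_cpu, kernel_hb, make_input, trials=1000):
    """Estimate the average |cpu - hb| difference over many random inputs,
    plus an upper confidence bound on that average (normal approximation)."""
    diffs = []
    for _ in range(trials):
        x = make_input()
        # the .hammerblade() transfer is an assumption for illustration
        hb = kernel_hb(x.hammerblade()).cpu()
        diffs.append((kernel_cpu(x) - hb).abs().mean().item())
    d = torch.tensor(diffs)
    mean, std = d.mean().item(), d.std().item()
    # z ~ 3.72 gives a one-sided 99.99% bound under the normal approximation
    bound = mean + 3.72 * std / math.sqrt(trials)
    return mean, bound
```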

cbatten commented 4 years ago

Can we create integer tensors instead of FP tensors? If so, maybe we can template all kernels so they work for both integer and FP types and test both. The integer results should match x86 exactly (assuming no overflow), and we can use a loose bound for FP. The integer tests will make sure the overall logic of our kernels is sound.
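
Something along these lines, sketched with hypothetical helper names and an assumed `.hammerblade()` transfer: the integer path demands exact equality, the float path only a loose tolerance.

```python
import torch

def check_kernel(kernel_cpu, kernel_hb, shape=(16, 16), rtol=1e-3, atol=1e-5):
    """Run the same kernel on integer and float tensors: integers must
    match exactly, floats only within a loose bound."""
    # Integer path: results should match x86 exactly (barring overflow).
    xi = torch.randint(-100, 100, shape, dtype=torch.int32)
    assert torch.equal(kernel_cpu(xi), kernel_hb(xi.hammerblade()).cpu())

    # Float path: allow a loose tolerance.
    xf = torch.rand(shape, dtype=torch.float32)
    assert torch.allclose(kernel_cpu(xf), kernel_hb(xf.hammerblade()).cpu(),
                          rtol=rtol, atol=atol)
```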