clMathLibraries / clSPARSE

a software library containing Sparse functions written in OpenCL
Apache License 2.0

BLAS2 floating point summation accuracy #109

Closed jlgreathouse closed 9 years ago

jlgreathouse commented 9 years ago

For a while now, we've been mulling over how to validate results in test-blas2. Ideally, we would be able to compare the results of a GPU-based SpM-DV calculation against a known-good CPU implementation. However, we're using floating point and comparing floating point values is hard.

This comes from two problems (which are related):

  1. Floating point addition is not associative. In SpM-DV, all of the mat*vec results for a row must be summed. The order of this summation is different between CPU algorithms and our GPU algorithms (both in csrmv_general and csrmv_adaptive). As such, they accumulate different rounding errors during the summation process. This results in a divergence between the two values, meaning that we cannot (and should not) compare the results of these algorithms bit-for-bit.
  2. More problematic, floating point summation (especially when both positive and negative values are allowed) is notoriously error-prone. Catastrophic cancellation can completely destroy our ability to obtain useful answers. Even when this does not happen, the error bound grows linearly with the number of additions. This means that a matrix with very long rows will have loose error bounds in its results, making it difficult to automatically say whether two results are comparable. (A small illustration of both points follows this list.)
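
To make both points concrete, here is a tiny standalone C++ illustration (not library code, just a demonstration) of non-associativity and of how the error of a long naive single-precision sum differs from the same sum accumulated in double:

```cpp
#include <cstdio>

int main()
{
    // Point 1: the same three values summed in two orders give different answers.
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    std::printf("(a + b) + c = %.1f\n", (a + b) + c); // 1.0
    std::printf("a + (b + c) = %.1f\n", a + (b + c)); // 0.0 -- the 1.0 is absorbed by -1.0e8

    // Point 2: accumulating the same float inputs in float vs. double.
    const int N = 100000;
    float  fsum = 0.0f;
    double dsum = 0.0;
    for (int i = 0; i < N; ++i)
    {
        fsum += 0.1f;
        dsum += 0.1f;   // identical inputs, higher-precision accumulator
    }
    std::printf("float accumulator:  %f\n", fsum);
    std::printf("double accumulator: %f\n", dsum);
    return 0;
}
```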

The first problem means that we cannot directly compare (bit-for-bit) the output of a CPU-based SpM-DV and a GPU-based SpM-DV. That floating point results can't be compared bit-for-bit is fairly common knowledge, and we currently compare against a range of values.

Originally, we hard-coded this to say e.g. "if in single precision, the values must be within 1e-5 of each other". I recently changed this to instead say that "in single, the values must be within 0.1% of one another". We shouldn't be comparing against a static absolute value, because a gigantic number (e.g. 1.0e26 in single precision) has no other representable value within 1e-5 of it -- the check degenerates into a direct comparison that is doomed to fail. However, even the (extremely loose) relative error bounds I've included fail for many of the matrices we're operating on.
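
For reference, the relative check described above amounts to something like this (hypothetical helper, not the exact test-blas2 code):

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: true if x and y agree to within 'rel_tol' (0.001 == 0.1%).
// An absolute floor is still needed so values very near zero don't fail spuriously.
bool nearly_equal(float x, float y, float rel_tol = 1e-3f, float abs_floor = 1e-30f)
{
    float diff  = std::fabs(x - y);
    float scale = std::max(std::fabs(x), std::fabs(y));
    return diff <= std::max(rel_tol * scale, abs_floor);
}
```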

That is due to the second issue: floating point summation has linearly-increasing error bounds and suffers from catastrophic cancellation effects that can cause large values (which are eventually subtracted out) to destroy values that, in the end, are significant. This is true even in our CPU-based algorithm. This means that our CPU and GPU results may be different and they may both be wrong.

For example, the matrix dc2 from the UFL sparse matrix collection has a couple of extremely long rows (e.g. row 91, which has 47,193 non-zero values). A simple CPU-based SpMV algorithm (where the input vector entirely contains 1s) that is calculated in single precision says that the output for row 91 is roughly -1.921. csrmv_general calculates the answer as roughly -1.857, a noticeable percentage off. csrmv_adaptive returns roughly -1.013. Much worse!

However, if you perform this calculation in higher (double, quad) precision, the answer is roughly -1.005. Both the single-precision CPU and GPU results are wrong -- and the GPU-based results are closer to the correct answer. The different calculation orders between these algorithms have all resulted in different errors -- and it's not always the case that the GPU algorithms are closer.

This leads to the problems and questions I want to raise.

  1. I believe we should ensure that our CPU-based comparison point is computed accurately by calculating at a higher precision. For example, when comparing against single precision results from the GPU, we should perform CPU-based SpMV in double precision and cast the final answer to a float (a rough sketch follows this list). I think that a multi- or arbitrary-precision library is overkill here, but this is probably a point for discussion. However, it's vital that our CPU baseline not be so erroneous that it makes comparisons impossible.
  2. How should we handle this floating point summation problem in the GPU kernels? Should we handle it there? If so, how much performance are we willing to lose (if any) in order to obtain more accurate results? Should these more accurate calculations always be enabled, or should there be an accuracy-mode and speed-mode? If so, which should be the default?
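
As a rough sketch of what item 1 could look like (illustrative only; the function name, CSR layout, and variable names are mine, not test-blas2 code), the float reference would accumulate each row in double and round only once at the end:

```cpp
#include <vector>

// Reference CSR SpMV for float data: accumulate each row in double,
// cast to float once per row. (Illustrative sketch, not library code.)
void csrmv_reference(const std::vector<int>&   row_ptr,
                     const std::vector<int>&   col_idx,
                     const std::vector<float>& vals,
                     const std::vector<float>& x,
                     std::vector<float>&       y)
{
    for (size_t row = 0; row + 1 < row_ptr.size(); ++row)
    {
        double acc = 0.0;  // higher-precision accumulator
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            acc += static_cast<double>(vals[j]) * static_cast<double>(x[col_idx[j]]);
        y[row] = static_cast<float>(acc);  // single rounding at the end
    }
}
```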

I believe this issue is larger than just the test case. It's also a question about how accurate our SpM-DV should be.

jlgreathouse commented 9 years ago

I've been working on code for this for the last few days, and it's almost done. I started the Issue partially as a centralized point of discussion for the eventual changes.

With respect to the first recommendation (computing more accurately on the CPU), I've made some simple changes to calculate the comparison point for single precision SpM-DV using double precision on the CPU. Since I do my development on Linux, I'm linking in libquadmath to use 128-bit precision intermediate numbers when computing the comparison point for double precision SpM-DV.

This works for me, but I'm guessing that libquadmath is not the right way to go about doing 128-bit FP in Windows.
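
For context, the quadmath path I'm describing looks roughly like this (GCC/Linux only; a simplified sketch, not the exact test code -- and exactly the portability problem, since MSVC has no __float128 equivalent):

```cpp
// GCC-specific: __float128 gives 128-bit intermediates for the double-precision
// baseline. Link with -lquadmath.
#include <quadmath.h>

double sum_row_quad(const double* vals, const double* x_gathered, int n)
{
    __float128 acc = 0.0Q;
    for (int i = 0; i < n; ++i)
        acc += (__float128)vals[i] * (__float128)x_gathered[i];
    return (double)acc;  // round once at the end
}
```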

Recommendations?

jlgreathouse commented 9 years ago

As for the second set of questions, I'll start by saying that I've been adding support for compensated summation to both csrmv_general and csrmv_adaptive -- in particular, the algorithm presented in this paper from Yamanaka, Ogita, Rump, and Oishi.
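
The building block underneath these compensated schemes is an error-free transformation of a single addition (Knuth's TwoSum). A simplified scalar sketch (not the actual kernel code, which has to do this in parallel) is below; the loop is essentially the serial Sum2 algorithm from the Ogita/Rump line of work:

```cpp
// Error-free transformation: s + err == a + b exactly (Knuth's TwoSum).
inline void two_sum(double a, double b, double& s, double& err)
{
    s = a + b;
    double v = s - a;
    err = (a - (s - v)) + (b - v);
}

double compensated_sum(const double* vals, int n)
{
    double sum = 0.0, compensation = 0.0;
    for (int i = 0; i < n; ++i)
    {
        double err;
        two_sum(sum, vals[i], sum, err); // exact rounding error of this addition
        compensation += err;             // carry the lost low-order bits along
    }
    return sum + compensation;           // fold the correction back in once
}
```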

When combined with the above method of making the CPU computations more accurate, we get answers that are within 0.001% in single precision, and 1e-10% in double precision (there are still rounding errors due to multiplication, etc. so we still shouldn't expect bitwise-exact answers).

However, performing compensated summations is more computationally intensive than a simple summation. On my test platform (with a FirePro W9100), this increased accuracy causes roughly a 10% slowdown on average. This is the reason for all of my questions about how we should present this more accurate method to users. In addition, I believe it will make our double precision results noticeably slower on GPUs with worse DPFP performance, as it makes the kernels much more computation-heavy.

In my opinion, this method of more precise calculation should be an option -- I think a lot of our users will want the ability to get tighter error bounds. However, because of the performance loss, I'm building this to be a kernel-compile-time option (e.g. -DEXTENDED_PRECISION).
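
To give a flavour of what I mean by a kernel-compile-time option (illustrative sketch only; names and plumbing are assumptions, not the final code), the accuracy mode would be chosen when the kernel is built rather than at every launch:

```cpp
#include <CL/cl.h>

// Illustrative only: inside the OpenCL kernel source, the two summation paths
// would be guarded with #ifdef EXTENDED_PRECISION / #else / #endif, and the
// define is injected here at program build time.
cl_int build_csrmv_program(cl_program program, cl_device_id device,
                           bool use_extended_precision)
{
    const char* options = use_extended_precision ? "-DEXTENDED_PRECISION" : "";
    return clBuildProgram(program, 1, &device, options, nullptr, nullptr);
}
```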

Should this mode be the default, meaning users would need to request a fast-calculation mode if they're OK with more naive summation techniques?

There are also some ways to get middle-of-the-road results -- more accurate than naive summation, at only ~3% performance loss. Should that be a third mode, or should we stick with just two?

kvaragan commented 9 years ago

I understand the floating point accuracy issues. Did we compare the CPU implementation against the GPU implementation for integer data types?

jlgreathouse commented 9 years ago

No, we do not. The test application as written only tests floats and doubles. We'll need to make some (minor) modifications to csrmv_adaptive to handle integer and long data types if this is a goal.

jlgreathouse commented 9 years ago

I'm nearly finished adding compensated summation into both csrmv_general and csrmv_adaptive. With this mechanism in place, the double precision output in ./test-blas2 is bit-for-bit identical to a CPU calculation performed in quad precision and cast down to double (0 ULPs).
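
For anyone who wants to reproduce the 0-ULP claim, the distance in ULPs between two doubles can be measured with a small helper along these lines (hypothetical helper, not necessarily what test-blas2 will ship; it assumes finite values of the same sign):

```cpp
#include <cstdint>
#include <cstring>

// ULP distance between two finite doubles of the same sign.
// 0 means the values are bit-for-bit identical.
int64_t ulp_distance(double a, double b)
{
    int64_t ia, ib;
    std::memcpy(&ia, &a, sizeof a);
    std::memcpy(&ib, &b, sizeof b);
    return ia > ib ? ia - ib : ib - ia;
}
```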

There is one wrinkle that I would like to bring up, however. The AMD OpenCL implementation does not support single precision denormals on current AMD GPUs; instead, it flushes them to zero.

As such, the error in single precision SpMV calculations can't be eliminated, because the compensated summation methods presented in the literature assume no underflow. Flushing denorms to zero causes a not-insignificant amount of new underflow in some of the test matrices. Those matrices, then, will produce results that differ from what is calculated on the CPU. This is not because the CPU algorithm is wrong, per se, but because the CPU and the GPU are using different floating point rules.
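
A minimal CPU-side illustration of the underflow problem (assuming default CPU denormal handling; on the GPU the product below would simply be flushed to zero, taking exactly the low-order information we were trying to preserve with it):

```cpp
#include <cstdio>

int main()
{
    // 1.0e-40f lies below the smallest normal float (~1.18e-38) but above the
    // smallest denormal (~1.4e-45), so it is representable only as a denormal.
    float tiny = 1.0e-20f;
    float prod = tiny * tiny;   // ~1.0e-40: a denormal on the CPU

    // Under flush-to-zero (the GPU behaviour described above), 'prod' would be
    // exactly 0.0f, and any compensation term this small is silently lost.
    std::printf("prod = %g (nonzero here; 0 under flush-to-zero)\n", prod);
    return 0;
}
```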

Thoughts on how to best handle this? I'm tempted to say that test-blas2 should use rough error bounds for single precision (the answers are usually still close) while using much stricter bounds on double precision.

Another, perhaps hackier, possibility is to set Beta (the -b command line parameter) very high (10000?) -- this often prevents denormalized results on the matrices I'm testing, resulting in accurate results.

kknox commented 9 years ago

Hi Joe~ Thank you so much for creating this discussion and writing your thoughts down; I appreciate that. You have a lot of information and brainstorming written above, and I'm coming to the conversation a bit late. Let me add a few of my thoughts on your comments:

One extra wet idea to throw against the wall:

jlgreathouse commented 9 years ago

Hi Kent,

Thanks for the feedback. I'll try to push out my changes tomorrow in order to get some more specific comments.

I agree that libquadmath is not the correct solution in general (it helped during development, though). As such, I tried using the compensated summation method linked above on the CPU as well. This allows us to calculate the double precision SpMV baseline as if it were computed in quad precision and then cast to double. I've verified that the answers are bitwise identical between this method and libquadmath. It doesn't hurt performance too much and removes the extra library dependency.

As for precision/speed and which should be the default, we should probably have a face-to-face discussion once you're back. Some of the folks I've talked to indicated they might like accuracy first and foremost, and only when doing something akin to -ffast-math should we return answers with higher error bounds. However, it might behoove us to instead put our best performance face forward, as per your recommendation.

Along the lines of your final idea: maybe we could just force all data being loaded from .mtx files (for test-blas2, not for anything else) to be a positive number. I don't think any values being read from .mtx files will come in as denorms, so this would prevent us from running into denorm verification problems for single precision.

jlgreathouse commented 9 years ago

I have a working version of my changes pushed into my fork in ad7144c, 774d357, 3ad5527, and 7670486.

I don't think this mechanism is completely integrated (for instance, I hard-code the fact that we're using extended precision). In addition, it currently causes a 10-15% slowdown. As such, I haven't issued a pull request. However, this may work for the tests you want to run, @kvaragan.