HPCE / hpce-2016-cw5


Difference between CPU and GPU in floating point operations? #10


awai54st commented 7 years ago

Hi,

I'm currently working on the Julia set. I notice that when I use the CPU as the device the results are all correct, but when I try the GPU implementation some results are (slightly) wrong. I did a bit of research and noticed some possible differences between CPUs and GPUs in handling floating point operations. Could that be the reason for this observation?

If that's the reason, would we be penalised for loss in accuracy?

Thank you

Best, awai54st

m8pple commented 7 years ago

Good question - I'm pleased someone asked early :)

Yes, there is generally a potential for problems when floating-point is involved, particularly between different architectures. The exact calculations are defined by the IEEE standard (which both GPUs and CPUs support), so it should be possible to get bit-exact results. However, it is kind of difficult.

So the question becomes - which one is correct, the GPU or the CPU, or is either correct?

One thing you've (presumably) had to do to go to the GPU is to rewrite the complex maths out as explicit operations, so it only requires multiplications and additions. In principle that is exact, and mandated by the IEEE standard, and by default the GPU should be doing IEEE single precision: https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clBuildProgram.html

This is true if the GPU supports denormal numbers, and I can assert that the AWS GPU does (I have checked those properties). However, this may not be true of your personal GPU - for example, my desktop machine doesn't.
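If you want to check your own device, something along these lines should work (just a sketch, with error handling omitted - `device` is assumed to be whichever `cl_device_id` you have already selected):

```cpp
#include <CL/cl.h>
#include <cstdio>

// Sketch only: query the single-precision FP capabilities the device claims.
void report_single_fp_config(cl_device_id device)
{
    cl_device_fp_config cfg = 0;
    clGetDeviceInfo(device, CL_DEVICE_SINGLE_FP_CONFIG, sizeof(cfg), &cfg, NULL);

    // CL_FP_DENORM set => the device claims denormal support in single precision.
    std::printf("denormals        : %s\n", (cfg & CL_FP_DENORM) ? "yes" : "no");
    std::printf("round-to-nearest : %s\n", (cfg & CL_FP_ROUND_TO_NEAREST) ? "yes" : "no");
}
```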

So one thing you might like to try is to swap between the GPU and a software OpenCL provider - do you get the same results on both? And does the software provider match the pure software version? A useful thing to do is to enable -ffloat-store in the software version to ensure all results are definitely, definitely single-precision.

So, I can actually assert that your program can work (I happened to have a GPU instance open, and ran it, then fiddled with it briefly), but you may need to force the NVidia OpenCL provider to do your bidding - in particular, look at the documentation for clBuildProgram, and see what you could try. Bear in mind that the NVidia drivers may be buggy (optimising when they shouldn't), but at least they are consistently buggy.

Also, keep doing what you are doing - check carefully across many parameter sizes on the actual target platform.

(Also, I think there is a very subtle but unrelated bug in your kernel - when calculating the kernel dx and dy steps, it isn't quite the same as the software version. Look at the constants)

awai54st commented 7 years ago

Thank you for your reply. I will try something with my Nvidia GPU. I'll let everyone know if I make any progress. (Apologies for the messy indentation in my code :) )

jeremych1000 commented 7 years ago

@m8pple What about errors in the CPU-only implementation? I've tried running julia with TBB (no OpenCL) at a scale of 1500, and in roughly 1 in 10 runs there was a 1-character difference in the log of 1.6 million characters, so the output was not correct. What can I attribute this to?

m8pple commented 7 years ago

One possibility is that things were optimised differently - the TBB version will look different to the compiler, so it may do slightly different optimisations. But again, in principle the same sequence of operations should result in exactly the same bit-level output (assuming IEEE compliant floating-point).

I'd suggest looking into the methods used to enforce exact IEEE compliance, e.g. -ffloat-store is one for g++, or /fp:strict for devstudio.

Things that might change the result are operations that don't do exactly the same thing. Be careful about optimisations that are only mathematically true - for example, is it true that:

abs(z) > 2

is always the same as:

z.re^2 + z.im^2 > 4

essentially, is it always true that sqrtf(x)>2 iff x>4 ?

If you were to choose, say x=(2+2^-i), does it hold for all integer i?

Also, there are multiple formulas for complex multiplication (though only one very simple one). Do they all give the same result?
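To make that concrete, here is an illustrative sketch (not anything from the coursework code) of two pairs of formulations that are mathematically equal but need not be bit-identical in single precision:

```cpp
#include <cmath>

struct cpx { float re, im; };

// Escape test: the same condition written with and without the square root.
// Mathematically identical, but the roundings differ near the boundary.
bool escaped_abs(cpx z) { return std::sqrt(z.re*z.re + z.im*z.im) > 2.0f; }
bool escaped_sq (cpx z) { return z.re*z.re + z.im*z.im > 4.0f; }

// Complex multiply: the "very simple" four-multiplication form...
cpx mul_simple(cpx a, cpx b)
{
    return { a.re*b.re - a.im*b.im,
             a.re*b.im + a.im*b.re };
}

// ...and a three-multiplication (Gauss/Karatsuba-style) rearrangement.
// Algebraically the same, but the intermediate roundings differ, so the
// last bit of the result can differ.
cpx mul_gauss(cpx a, cpx b)
{
    float k1 = b.re * (a.re + a.im);
    float k2 = a.re * (b.im - b.re);
    float k3 = a.im * (b.re + b.im);
    return { k1 - k3, k1 + k2 };
}
```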

Again, as you are in the vanguard here, you might want to think about tightening up the bit-level specification in your favour. I'm the client, but you can still help me refine the spec if it seems impossible/difficult to implement.

liafalk commented 7 years ago

@m8pple I also have a similar error, but for some reason I get correct outputs for the software platform (AMD), but incorrect for high scale inputs when using an AWS GPU instance.

I tried adding different build options (-cl-single-precision-constant, -cl-denorms-are-zero) when building the OpenCL kernel, even disabling all optimizations (-cl-opt-disable), which corrected the problem for low scale inputs (128-2048), but I still get errors with something like 4096.

I don't think it's my OpenCL code, as it functions correctly with the software platform. Do you have any suggestions on how to improve the accuracy, or any advice on what to look for in this case?

I've heard that Nvidia's floating point sqrt and division are not IEEE compliant, could this be the problem?

Thanks!

Szypicyn commented 7 years ago

I found an interesting presentation which says that:

"Math library functions not guaranteed to be identical" - but then in the paper linked below it is said:

The math library includes all the math functions listed in the C99 standard [3] plus some additional useful functions. These functions have been tuned for a reasonable compromise between performance and accuracy.

Which means that all standard math functions follow the same C99 spec on CPU and GPU - that excludes sines and cosines etc., which we are not using anyway.

Then in this paper it is said:

Even in the strict world of IEEE 754 operations, minor details such as organization of parentheses or thread counts can affect the final result. Take this into account when doing comparisons between implementations.

In the third reference of that paper you can find:

EXAMPLE 5 Rearrangement for floating-point expressions is often restricted because of limitations in precision as well as range. The implementation cannot generally apply the mathematical associative rules for addition or multiplication, nor the distributive rule, because of roundoff error, even in the absence of overflow and underflow. Likewise, implementations cannot generally replace decimal constants in order to rearrange expressions. In the following fragment, rearrangements suggested by mathematical rules for real numbers are often not valid (see F.8).

```c
double x, y, z;
/* ... */
x = (x * y) * z;  // not equivalent to x *= y * z;
z = (x - y) + y;  // not equivalent to z = x;
z = x + x * y;    // not equivalent to z = x * (1.0 + y);
y = x / 5.0;      // not equivalent to y = x * 0.2;
```

Which means that the slightest of changes can change your output.

So I am in the same situation as @indigoDot - the flags seem to solve the problem for inputs up to 4096.

martin-gantier commented 7 years ago

Hi, to add to this thread, I also tried to implement Julia on the GPU, following the steps of coursework 3. However, as soon as I implement a function void kernel_xy(...) (so not OpenCL yet) and replace the complex structures with handmade ones (i.e. working directly on the real and imaginary parts instead of std::complex), I get output errors for large problem sizes (>5000), depending on the "seed" c.

I managed to track the issue down to the value of my calculated abs(z) being (sometimes, not always) different from the original one. For example, for z = a+ib:

z = (1.09500003, 1.47744346)

1.83898449 : absolute value, handmade calculation
1.83898461 : absolute value of z, calculated by the original function

This small difference is critical only when the value is close to 2, which explains why I (and all of us?) get errors for large problem sizes and not every time. I guess this comes from floats not being rounded the same way, or something similar, during one of the calculations made to get abs(z).

So I tried reimplementing the standard complex library (found here: https://gcc.gnu.org/onlinedocs/gcc-4.6.3/libstdc++/api/a00812_source.html ) to work inside my kernel, i.e. copying/rewriting this:

```cpp
// 26.2.7/3 abs(__z): Returns the magnitude of __z.
template<typename _Tp>
  inline _Tp
  __complex_abs(const complex<_Tp>& __z)
  {
      _Tp __x = __z.real();
      _Tp __y = __z.imag();
      const _Tp __s = std::max(abs(__x), abs(__y));
      if (__s == _Tp()) // well ...
          return __s;
      __x /= __s;
      __y /= __s;
      return __s * sqrt(__x * __x + __y * __y);
  }
```
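For comparison, a "handmade" magnitude of the kind I described above would be just the obvious formula (illustrative only, not my exact code):

```cpp
#include <cmath>

// Illustrative only: the straightforward single-precision magnitude.
// It goes through several intermediate roundings and does not rescale by
// max(|re|, |im|) the way the libstdc++ version does, so it need not match
// that version bit-for-bit - which matters when |z| is close to 2.
float abs_naive(float re, float im)
{
    return std::sqrt(re * re + im * im);
}
```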

Passing -ffloat-store and turning off optimisations did not resolve it, and neither did changing the order of operations or adding intermediate steps.

The question is: did I miss something, or do you also have errors on large inputs even while staying with a classic C++ implementation and just changing the data structures?

Edit : format code thx to @jeremych1000

jeremych1000 commented 7 years ago

@martin-gantier Side note, use ``` code here ``` to format code.

```cpp
// 26.2.7/3 abs(__z): Returns the magnitude of __z.
template<typename _Tp>
  inline _Tp
  __complex_abs(const complex<_Tp>& __z)
  {
      _Tp __x = __z.real();
      _Tp __y = __z.imag();
      const _Tp __s = std::max(abs(__x), abs(__y));
      if (__s == _Tp()) // well ...
          return __s;
      __x /= __s;
      __y /= __s;
      return __s * sqrt(__x * __x + __y * __y);
  }
```

liafalk commented 7 years ago

Hey all, I think I solved the problem with my code.

If you look at section 7.4 of the OpenCL spec, it states that single-precision floating-point sqrt and division are only required to have a relative error within 2.5-3 ulp (units in the last place). This means they are not IEEE 754 compliant, since IEEE 754 requires the relative error to be within 0.5 ulp (i.e. correct rounding). The GNU/GCC implementation of sqrt is within 0.5 ulp, however, and so is compliant.
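In ulp terms (spelling out the bounds quoted above): IEEE 754 requires a correctly rounded result,

$$|\widehat{\sqrt{x}} - \sqrt{x}| \le \tfrac{1}{2}\,\mathrm{ulp}(\sqrt{x}),$$

whereas the OpenCL single-precision bound allows up to $3\,\mathrm{ulp}$ for sqrt, so the returned float can be a couple of representable values away from the exact result - enough to flip a comparison against 2 right at the boundary.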

While there is a compiler option to correct this "-cl-fp32-correctly-rounded-divide-sqrt", the GPU used on the AWS instance doesn't support this (you'll notice CL_DEVICE_SINGLE_FP_CONFIG returns 63). However if you check 7.4, it also mentions that for doubles, the sqrt is correctly rounded. By calculating the sqrt as a double, and then converting it back to a float, it seems to return the correct output, at least for the values I've tested with so far (up to 16000).

EDIT: I mentioned it in passing, but I should emphasize that floating point division also has an accuracy penalty, and I would recommend doing any such divisions in your host code and passing the results into the kernel, or using the same procedure as for the sqrt (computing in double and casting back to float). EDIT2: It seems like the "-cl-opt-disable" flag is also necessary.
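For illustration, the kernel-side part of the workaround looks roughly like this (a sketch with made-up names, not my actual kernel):

```c
// Sketch only - identifiers are illustrative, not from my real kernel.
// The fp64 extension has to be enabled before 'double' can be used.
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

// Magnitude helper: the double-precision sqrt is correctly rounded per the
// spec, so do the sqrt in double and narrow the result back to float,
// instead of relying on the (up to 3 ulp) single-precision sqrt.
float mag_via_double(float re, float im)
{
    return (float)sqrt((double)(re * re + im * im));
}
```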

I'm not sure if I should actually be writing this as I'm not sure if this was part of the "challenge", but...

pierrezaza commented 7 years ago

Thanks @indigoDot for your tip. I found out yesterday about -cl-fp32-correctly-rounded-divide-sqrt but never got it to work on AWS (or my Mac) so I didn't post anything about it... Now I've tried implementing your solution and still have two problems:

  1. AWS doesn't recognise the -cl-opt-disable compiler flag
  2. If I remove -cl-opt-disable from the makefile, I then get this error: `error: must specify '#pragma OPENCL EXTENSION cl_khr_fp64: enable' before using 'double'`

Szypicyn commented 7 years ago

You're not meant to put these flags in the makefile. These are OpenCL flags, so you have to put them in the code next to where you build the kernel.

https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clBuildProgram.html

For the second one, you are given the answer: copy and paste that pragma at the top of your kernel.
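Roughly like this, if you're using the plain C API (a sketch with made-up names - if you're using the C++ wrapper, the options string goes into program.build() instead):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Sketch only: 'program' and 'device' are the cl_program and cl_device_id
 * your host code already created. The OpenCL flags are passed here as
 * build options - the makefile never sees them. */
cl_int build_with_options(cl_program program, cl_device_id device)
{
    const char *options = "-cl-opt-disable -cl-single-precision-constant";
    cl_int err = clBuildProgram(program, 1, &device, options, NULL, NULL);
    if (err != CL_SUCCESS) {
        char log[16384];
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              sizeof(log), log, NULL);
        fprintf(stderr, "build failed:\n%s\n", log);
    }
    return err;
}
```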

m8pple commented 7 years ago

Ack - I thought I'd posted a commit on Sat which answered some of these questions, and also added reference inputs. It looks like I pushed to the wrong origin (which is why I'm usually lenient on good-faith mis-pushes, it's quite easy to do...).

This wasn't supposed to be such a big stumbling block - I wanted people to be able to get it working in OpenCL while still being aware of the floating-point issues.

I'm going to send out an email stating that:

Sorry about the lack of response, I hadn't noticed the activity on this issue.

jeremych1000 commented 7 years ago

@m8pple Will you still push the reference inputs as a new commit?

liafalk commented 7 years ago

@m8pple Is there a possibility of a small extension? I know more than a few groups spent a lot of time trying to get the Julia result to be accurate...

m8pple commented 7 years ago

@jeremych1000 Yes, I've pushed the reference results just now.

@indigoDot I'm very reluctant to allow extensions, as this will interfere with CW6, and past experience has been that any deadline extensions annoy more people than they please. It's within my gift to give class-wide extensions, but

AFAICT the best solution I have is what I'm doing - so I take on the responsibility of dealing with the different error possibilities, while people don't have to do any extra work (though may have wasted time trying to deal with it).

jeremych1000 commented 7 years ago

@m8pple Side note, but do you have any better solutions for git merging? I spent an hour installing kdiff3, configuring it for git mergetool, then merging manually (so accepting all your changes, but merging my user_* changes as well, for every change). It's all good now, but it seemed quite a long-winded way?

m8pple commented 7 years ago

While I use git/github extensively for research projects, most of the time there isn't more than one person active in a particular area, and people tend to work in their own branches. We don't exactly follow best practices (academics :P), so there tend to be big-bang merges of branches. When I handle that, I often let all the merge conflicts get through to disk, then work through the conflicting files on disk. With a reasonable set of unit tests it is possible to work through them quite quickly and reliably.

For frequent-small-merge work-flows, I generally use TortoiseGIT, which comes with a merge program called TortoiseGitMerge that fires up if you want it to. I think it is probably similar to kdiff3.

To be honest, I don't really know much more than: