fetch_and_add operation for tbb::atomic<double>

JunShern commented 6 years ago

I have a simple loop which accumulates the result from each loop into a variable, something like:

int acc = 0;
for (...) { acc += something; }

Now I want to parallelize the loop, so I make the accumulation variable atomic, like so:

tbb::atomic<int> acc = 0;
parfor (...) { acc += something; }

Which works fine, as we have seen in CW4. However, when I try to do the same thing for an atomic<double> accumulator:

tbb::atomic<double> acc = 0;
parfor (...) { acc += something; }

The compilation fails with complaints about "no operator '+=' for atomic<double>". I looked it up and found this StackOverflow issue which mentions that indeed, atomic<double> does not support certain operations such as ++, +=, etc.

I tried getting around the problem by writing out the += equivalent in two steps

tbb::atomic<double> acc = 0;
parfor (...) { 
double tmp = acc;
acc = tmp + something; 
}

Which compiles, but fails the test for correctness. Note that the sequential version of this exact same thing does pass correctness.

So I am wondering

Why does this sort of fetch and add (+=) operation work sequentially but not parallel-y, even though the accumulator is atomic?
Is there any way I can get around this? (I also tried using other atomic types which do support the fetch and add, such as int_64 to replace double, but expectedly this fails correctness even for the sequential version).

Thanks in advance.

m8pple commented 6 years ago

For the fetch_and_add operation the processor needs to lock that address, and has to hold that lock for the entire time that the add is going on. The longer that the lock lasts, the less scalable the whole system is. In the case of floating-point maths, that presents a particular problem, as:

Double-precision hardware operators have relatively high latency compared to integer operations.
Double-precision hardware operators are relatively large (transistor count) compared to integer.
The double-precision ALUs are usually off in their own specialised area, and comparatively "far" from the integer side of the chip.

In order to make the latency as low as possible, and keep the lock for as short a period as possible, it is feasible to insert integer operators like + directly into the memory path, as it only adds a bit to latency, and doesn't require much area. Doing the same with floating-point is much more difficult, and probably requires deeper pipelines and more hardware.

It could be done if there was enough demand, but usually there isn't. One of the problems faced with atomic reductions on doubles is that the order of execution matters -- so an atomic add over doubles is guaranteeing non-deterministic results, which most people try to avoid.

If you really want atomic adds on doubles, one way is to use a mutex (not recommended):

std::mutex mutex;
double x=0.0;
...
parfor(...){
    ...
    {
        std::lock<std::mutex> lock;   // Take the lock for this scope
        x += something;       // Performed under the lock
    }   // Lock released at end of scope
    ...
}

But this will likely have terrible performance unless the increment represents only a tiny amount of the loop body.

Another option is to use a compare and swap loop:

double INVALID=nan("");
tbb::atomic<double> x=0.0;
parfor(...){
   ...
   double val=INVALID;
   while(val==INVALID){
       val = x.fetch_and_store(INVALID);
   }
   val += something;
   val = x.fetch_and_store(val);
   assert(val==INVALID); // should always be true
   ...
}

This is probably faster than the mutex, but will still not scale well.

However, summing up over a range is a problem that occurs a lot, so you could try a tried and tested solution if your problem fits.

JunShern commented 6 years ago

Yes! Your tried and tested solution (tbb::parallel_reduce) worked. Thank you!

HPCE / hpce-2017-cw5

fetch_and_add operation for tbb::atomic<double> #30