Closed JunShern closed 6 years ago
For the fetch_and_add
operation the processor needs to lock that address, and has to hold
that lock for the entire time that the add
is going on. The longer that the lock lasts, the
less scalable the whole system is. In the case of floating-point maths, that presents a particular
problem, as:
In order to make the latency as low as possible, and keep the lock for as short a period as possible,
it is feasible to insert integer operators like +
directly into the memory path, as it only adds a bit
to latency, and doesn't require much area. Doing the same with floating-point is much more
difficult, and probably requires deeper pipelines and more hardware.
It could be done if there was enough demand, but usually there isn't. One of the problems faced with atomic reductions on doubles is that the order of execution matters -- so an atomic add over doubles is guaranteeing non-deterministic results, which most people try to avoid.
If you really want atomic adds on doubles, one way is to use a mutex (not recommended):
std::mutex mutex;
double x=0.0;
...
parfor(...){
...
{
std::lock<std::mutex> lock; // Take the lock for this scope
x += something; // Performed under the lock
} // Lock released at end of scope
...
}
But this will likely have terrible performance unless the increment represents only a tiny amount of the loop body.
Another option is to use a compare and swap loop:
double INVALID=nan("");
tbb::atomic<double> x=0.0;
parfor(...){
...
double val=INVALID;
while(val==INVALID){
val = x.fetch_and_store(INVALID);
}
val += something;
val = x.fetch_and_store(val);
assert(val==INVALID); // should always be true
...
}
This is probably faster than the mutex, but will still not scale well.
However, summing up over a range is a problem that occurs a lot, so you could try a tried and tested solution if your problem fits.
Yes! Your tried and tested solution (tbb::parallel_reduce) worked. Thank you!
I have a simple loop which accumulates the result from each loop into a variable, something like:
Now I want to parallelize the loop, so I make the accumulation variable atomic, like so:
Which works fine, as we have seen in CW4. However, when I try to do the same thing for an
atomic<double>
accumulator:The compilation fails with complaints about
"no operator '+=' for atomic<double>"
. I looked it up and found this StackOverflow issue which mentions that indeed,atomic<double>
does not support certain operations such as ++, +=, etc.I tried getting around the problem by writing out the += equivalent in two steps
Which compiles, but fails the test for correctness. Note that the sequential version of this exact same thing does pass correctness.
So I am wondering
Why does this sort of fetch and add (+=) operation work sequentially but not parallel-y, even though the accumulator is atomic?
Is there any way I can get around this? (I also tried using other atomic types which do support the fetch and add, such as
int_64
to replacedouble
, but expectedly this fails correctness even for the sequential version).Thanks in advance.