AccelerateHS / accelerate

Embedded language for high-performance array computations
https://www.acceleratehs.org

[QUESTION/DOCUMENTATION] Document about reproducibility of computations #545

Open necessarily-equal opened 6 days ago

necessarily-equal commented 6 days ago

Hi! I have a question: I'm considering using accelerate for scientific computation that we wish to be reproducible.

The goal is to be able to run the same computation again on a different setup and get the same result, e.g. with CUDA on GPU or on X86 CPU.

I found in https://github.com/AccelerateHS/accelerate/blob/master/accelerate.cabal that it is possible to eliminate a large source of platform/optimisation-dependent behaviour with -fno-fast-math. That's great. I also saw a -fno-fast-permute-const flag; I'm not sure what it does, but I'm curious about it.

Would you say disabling fast-math is enough to get IEEE 754 floating-point behaviour with a deterministic order of operations, so that the calculation result is the same no matter the platform?

(I can make a PR afterwards to try to summarise this discussion in the README or wherever applicable.)

So:

  1. Can I get reproducible floating point results?
  2. (additionally) What's the effect of -fno-fast-permute-const?

Thanks!

tomsmeding commented 6 days ago

The effect of -fno-fast-permute-const is to disable an assumption that accelerate makes about code using permute const. See the documentation of permute. Normally const is not an allowable combination function for permute anyway (as it is non-commutative), so we assign it special behaviour: permute will not actually evaluate the const operation; it simply assumes that the index mapping function is injective and writes the permuted values directly to memory. -fno-fast-permute-const disables this behaviour, which means that results are still nondeterministic (when the index mapping is not injective), but the implementation uses synchronisation so that at least multiple (compound) values do not each get half-written to the destination.
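For illustration, a quick sketch of mine (not code from the library) of permute const under that injective assumption, written against the current API where the index mapping returns Exp (Maybe sh'):

```haskell
import Data.Array.Accelerate as A
import qualified Prelude as P

-- Reversing a vector with permute const: the index mapping i -> n-1-i is
-- injective, so with fast-permute-const enabled each value is written
-- straight to its destination and `const` is never evaluated.
reverse' :: Elt e => Acc (Vector e) -> Acc (Vector e)
reverse' xs = permute P.const defaults dest xs
  where
    n        = A.length xs
    defaults = fill (index1 n) undef   -- never read: the mapping is total and injective
    dest ix  = Just_ (index1 (n - 1 - unindex1 ix))
```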

Regarding reproducible floating-point results: -fno-fast-math indeed improves things, but parallel algorithms, especially those on GPU, are sometimes non-deterministic by design, and their precise evaluation order depends on device hardware parameters chosen to optimise performance. In particular, folds and scans will select block sizes that depend on the GPU hardware, and these block sizes will probably differ from the ones chosen on a CPU. Hence the evaluation order may differ there.
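A concrete (if artificial) illustration of why evaluation order matters for floats: Double addition is not associative, so grouping the same numbers differently already changes the rounded result:

```
ghci> (0.1 + 0.2) + 0.3 :: Double
0.6000000000000001
ghci> 0.1 + (0.2 + 0.3) :: Double
0.6
```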

Additionally, the computation model for permute is inherently non-deterministic if the index mapping function is not injective; wanting that to be deterministic puts a huge limit on permute performance.
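As a sketch of what a non-injective permute looks like (essentially the classic histogram example, again against the Maybe-returning index-mapping API): many source indices map to the same bin, so the combination function runs in whatever order the backend schedules the writes. With Int (+) that order is invisible; with Float (+) it would affect the rounding.

```haskell
import Data.Array.Accelerate as A

-- A 10-bin histogram: a permute whose index mapping is not injective.
-- Assumes the input values lie in [0,9].
histogram :: Acc (Vector Int) -> Acc (Vector Int)
histogram xs = permute (+) zeros dest ones
  where
    zeros   = fill (index1 10) 0
    ones    = fill (shape xs) 1
    dest ix = Just_ (index1 (xs A.! ix))
```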

Wanting precise IEEE754 floating-point behaviour with deterministic evaluation order is largely incompatible with efficient parallel evaluation. I expect that you won't find other parallel array languages that 1. ensure deterministic evaluation order and 2. are actually efficient.

necessarily-equal commented 6 days ago

Hi Tom, thanks for your write-up. I'm not too surprised by the behaviour you describe for permute; it makes sense given the performance goals.

If I understood correctly, as long as the indexing function is injective, permute is reproducible. I think I do not need to worry about fusion when I have a sequence of several permutes, since "The permute operation will always be evaluated; it can not be fused into a later step."

So for fold, I assume that in the worst case I can do the sum two-by-two, with intermediate vectors 64->32->16->...->1, to enforce reproducibility. This would mean O(log2(length)) permute steps, which is acceptable in my case.
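Concretely, something like this sketch is what I have in mind (using generate rather than permute, and assuming a power-of-two length):

```haskell
import Data.Array.Accelerate as A

-- Deterministic pairwise sum: repeatedly add elements 2i and 2i+1, halving
-- the vector until one element remains.  The grouping is fixed by the data
-- dependencies, so it cannot depend on backend-chosen block sizes.
pairwiseSum :: Acc (Vector Double) -> Acc (Scalar Double)
pairwiseSum xs = unit (final A.! index1 0)
  where
    final = awhile (\v -> unit (A.length v A.> 1)) halve xs
    halve v =
      let n = A.length v `A.div` 2
      in  generate (index1 n) (\ix ->
            let i = unindex1 ix
            in  v A.! index1 (2 * i) + v A.! index1 (2 * i + 1))
```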

I'm also wondering about the behaviour of stencils. Since I pass an explicit function (Exp), can I assume that this is reproducible as well? Do I need to worry about fusion?


tomsmeding commented 6 days ago

I am wondering about your specific use-case, and whether Accelerate is the right tool here. :)

Perfect numerical reproducibility across platforms (and even across releases) is simply not a design goal of Accelerate. You may be able to identify a sublanguage of Accelerate that is perfectly reproducible, but depending on exactly what that sublanguage is, we may not be able to give guarantees about future releases or backends.

One sublanguage that is safe from such "imperfections" is one that avoids floats entirely and works with fixed-point numbers -- would that work for you? That gives you a slowdown, clearly (especially on GPUs, which have few integer compute units), but you don't seem terribly concerned about slowdowns.
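For instance, a rough sketch of that idea, specialised to a reproducible sum (the scale factor and helper names here are illustrative, not an Accelerate API):

```haskell
import Data.Array.Accelerate as A
import Data.Int (Int64)
import qualified Prelude as P

-- Scale Doubles to Int64 "ticks" on the host; integer addition is exact and
-- associative, so the device-side sum gives the same answer regardless of
-- evaluation order.  Beware of overflow: the scaled sum must fit in Int64.

scaleFactor :: Double
scaleFactor = 2 P.^ (32 :: Int)      -- 32 fractional bits, an arbitrary choice

toTicks :: Double -> Int64
toTicks x = P.round (x * scaleFactor)

fromTicks :: Int64 -> Double
fromTicks t = P.fromIntegral t / scaleFactor

fixedSum :: Acc (Vector Int64) -> Acc (Scalar Int64)
fixedSum = A.sum                     -- exact, order-independent
```

Here toTicks/fromTicks would run on the host when building and reading back the arrays (e.g. around fromList and the backend's run), not inside the Acc computation.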

The main surprising thing about fusion is that it may duplicate computation; this happens in some cases when a "producer" (e.g. generate, map) is fused into a backpermute or a stencil operation. (Possibly more.) I don't think duplication of computation really makes things non-reproducible; I think (!) you don't need to worry about fusion in your case.
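To make that concrete, a hypothetical example (the names are mine): if the map below is fused into the stencil, the mapped function may be re-evaluated for each neighbouring read; wrapping the intermediate in compute should force it to be materialised instead.

```haskell
import Data.Array.Accelerate as A

-- If `A.map expensive xs` is fused into the stencil, `expensive` may be
-- recomputed once per neighbouring read (up to three times per output
-- element here).  `compute` materialises the intermediate array instead.
blur3 :: Acc (Vector Double) -> Acc (Vector Double)
blur3 xs = stencil avg clamp (compute (A.map expensive xs))
  where
    expensive x = A.sin x * A.cos x      -- stand-in for a costly Exp function
    avg :: Stencil3 Double -> Exp Double
    avg (l, c, r) = (l + c + r) / 3
```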

Another point: differing hardware makes differing guarantees about numerical correctness. For example, NVIDIA writes that trigonometric functions may not return exactly the same results as CPU functions do due to differing algorithms, and numerical properties may also depend on exactly which instructions a particular computation is lowered to (e.g. whether an FMA is used).

I do think the most robust way to get what you describe is to avoid floating-point numbers entirely, if possible. Any approach using floats will require in-depth checking of the entire stack, from high-level language to hardware, to see whether any of the components may introduce incompatibilities. If you're okay with doing that, then that's on you -- but I wouldn't recommend it.