BlockchainCommons / Research

Blockchain Commons Research papers

Replace RandomSampler for multi-part URs fountain codes #121

Closed. eliasnaur closed this issue 5 months ago

eliasnaur commented 1 year ago

The UR encoder and decoder have to agree on the degree and selection of part indices for any given fragment, so the pseudo-random algorithm has to be followed exactly by every implementation. However: RandomSampler uses doubles for its computations; floating point computations are notoriously hard to make deterministic across platforms, and they're slow on embedded devices. RandomSampler even has a comment:

        while !S.isEmpty {
            // can only happen through numeric instability
            probs[S.removeLast()] = 1
        }

which doesn't instill confidence in the complete reproducibility of the resulting degree selection.

I suggest one or more of the following:

[*]: Note that chooseDegree re-creates the RandomSampler at every use, thus defeating the O(1) advantage.

wolfmcnally commented 1 year ago

Thank you for your comments! You're absolutely right that I was instantiating RandomSampler far more than was necessary. While I don't think this actually broke O(1) complexity, it did make each iteration take far longer than necessary. So I have fixed that in this commit.
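As an illustration of the shape of that fix, here is a minimal Swift sketch (DegreeChooser, randomUnit, and the 1/i weights are stand-ins invented for this sketch, not the actual URKit code or commit): the O(n) setup runs once, and every subsequent degree draw reuses it.

    struct DegreeChooser {
        private let weights: [Double]  // O(n) setup, performed once
        private let total: Double

        init(seqLen: Int) {
            // Stand-in degree weights proportional to 1/1, 1/2, ..., 1/seqLen;
            // the real distribution is defined by the UR spec.
            let w = (1...seqLen).map { 1.0 / Double($0) }
            self.weights = w
            self.total = w.reduce(0, +)
        }

        // randomUnit returns a uniform value in [0, 1).
        func chooseDegree(randomUnit: () -> Double) -> Int {
            // Each draw reuses the precomputed state; nothing is rebuilt.
            // (RandomSampler's alias tables make each draw O(1); the linear
            // scan here just keeps the sketch short.)
            var target = randomUnit() * total
            for (i, w) in weights.enumerated() {
                target -= w
                if target < 0 { return i + 1 }
            }
            return weights.count
        }
    }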

As a user of a UR implementation, it's up to you to decide how large your fragments are (and therefore how many of them there will be for a given message size), and fountain codes are only used when you generate fragments where seqNum >= seqLen. The initial sequence of fragments produced in 0 ..< seqLen are never produced using fountain codes, so you could just repeat the initial sequence over and over, although you would give up the benefits of fountain codes in that case.
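To make that rule concrete, here is a minimal Swift sketch, following the 0-based seqNum framing above; the PRNG (SplitMix64 here) and the degree/index choices are stand-ins only, not the spec-defined algorithm that real implementations must reproduce exactly:

    // A stand-in PRNG (SplitMix64) for the sketch; the spec defines its own
    // seeded generator, which every implementation must follow bit-for-bit.
    struct SketchRNG {
        var state: UInt64
        mutating func next() -> UInt64 {
            state &+= 0x9E3779B97F4A7C15
            var z = state
            z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
            z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
            return z ^ (z >> 31)
        }
    }

    func fragmentIndexes(seqNum: Int, seqLen: Int, checksum: UInt32) -> Set<Int> {
        if seqNum < seqLen {
            // Initial sequence (0 ..< seqLen): each part carries exactly one
            // pure fragment, so no fountain code and no PRNG is involved.
            return [seqNum]
        }
        // seqNum >= seqLen: the part mixes a pseudo-randomly chosen subset of
        // fragments, derived deterministically from (seqNum, checksum).
        var rng = SketchRNG(state: UInt64(seqNum) << 32 | UInt64(checksum))
        let degree = 1 + Int(rng.next() % UInt64(seqLen))  // stand-in degree choice
        var indexes = Set<Int>()
        while indexes.count < degree {
            indexes.insert(Int(rng.next() % UInt64(seqLen)))
        }
        return indexes
    }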

The RandomSampler algorithm is incorporated into the spec by reference, and was ported from a well-established C implementation based on a well-understood and documented algorithm, and URLs to these are in the code. Furthermore, IEEE-754 floating point values are a standard, and all operations on them should yield bit-for-bit equal results across platforms; I would be interested to hear if you know of important examples where this is not the case, or important platforms that do not use IEEE-754.

In any case, I am not a numerical methods expert, so I don't consider myself qualified to attempt a replacement that avoids the use of floating point types while preserving backward compatibility, which at this point would be an absolute requirement, as numerous third parties are now adopting URs.

ChristopherA commented 1 year ago

I had a discussion with a designer of a low-power secure device, and the key question was whether such a device would ever display a QR. It doesn't seem likely short-term (they will more likely use the MPU, which supports IEEE-754), but long-term they may need to. For instance, someone may want to drive an e-paper display like https://www.ynvisible.com for trusted output.

I don't think we have the resources to redo the fountain code without at least double-width integers, which are also often not available on these processors, so it might have to be 16-bit. Not sure if that is possible.

eliasnaur commented 1 year ago

> Thank you for your comments! You're absolutely right that I was instantiating RandomSampler far more than was necessary. While I don't think this actually broke O(1) complexity, it did make each iteration take far longer than necessary. So I have fixed that in this commit.

Thanks!

> As a user of a UR implementation, it's up to you to decide how large your fragments are (and therefore how many of them there will be for a given message size), and fountain codes are only used when you generate fragments where seqNum >= seqLen. The initial sequence of fragments produced in 0 ..< seqLen are never produced using fountain codes, so you could just repeat the initial sequence over and over, although you would give up the benefits of fountain codes in that case.

The UR receiver cannot require senders not to send fountain-encoded fragments, so compatibility would be improved if certain data types were guaranteed not to use fountain encoding. For example, I believe BlueWallet doesn't support fountain-encoded fragments.

However, simplifying the standard would be a better approach.

> The RandomSampler algorithm is incorporated into the spec by reference, and was ported from a well-established C implementation based on a well-understood and documented algorithm, and URLs to these are in the code. Furthermore, IEEE-754 floating point values are a standard, and all operations on them should yield bit-for-bit equal results across platforms; I would be interested to hear if you know of important examples where this is not the case, or important platforms that do not use IEEE-754.

From https://stackoverflow.com/questions/21212326/floating-point-arithmetic-and-reproducibility/21212366#21212366:

> There are issues with reproducibility of even elementary floating-point operations in high-level languages, but they are usually controllable with various platform-specific operations such as setting compiler switches, using custom code to set floating-point controls and modes, or, if necessary, writing essential operations in assembly. As developed in comments, the specific problem you encountered may be that different C implementations use different precision to evaluate intermediate floating-point expressions. Often this can be controlled by compiler switches or by including casts and assignments in the expressions to require rounding to the nominal type (thus discarding excess precision).

One compiler optimization that breaks reproducibility of elementary operations on floating point values is the fused multiply-add (FMA) operation supported by most modern processors. For example,

    var a, b, c, d float32
    // ... (b, c, and d assigned elsewhere) ...
    a = b*c + d // may be contracted to a single FMA instruction by the compiler

may produce different results depending on whether the multiplication and addition are implemented as one instruction or as two separate instructions. In particular, the fused operation may be more precise.
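To make the effect concrete, here is a generic Swift demonstration (not UR code) where the fused and unfused computations of the same expression disagree; addingProduct is the standard library's fused multiply-add:

    let b = 1.0 + 0x1p-30             // 1 + 2^-30, exactly representable
    let c = 1.0 - 0x1p-30             // 1 - 2^-30, exactly representable
    // The exact product b*c is 1 - 2^-60, which needs more than a Double's
    // 53 significand bits, so a separate multiply rounds it to exactly 1.0.
    let separate = b * c - 1.0               // rounds the product first
    let fused = (-1.0).addingProduct(b, c)   // single rounding at the end
    print(separate)                   // 0.0
    print(fused)                      // -8.673617379884035e-19 (= -2^-60)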

In summary, perhaps all implementations of RandomSampler are correct today, especially if you're careful with compiler flags or pragmas that disable the reproducibility-breaking behaviour. But what about tomorrow, or about ports to other languages?

> In any case, I am not a numerical methods expert, so I don't consider myself qualified to attempt a replacement that avoids the use of floating point types while preserving backward compatibility, which at this point would be an absolute requirement, as numerous third parties are now adopting URs.

Yes, a deployed standard is indeed unfortunate. But so are subtle floating point differences in the fragment sequences.

eliasnaur commented 1 year ago

> I had a discussion with a designer of a low-power secure device, and the key question was whether such a device would ever display a QR. It doesn't seem likely short-term (they will more likely use the MPU, which supports IEEE-754), but long-term they may need to. For instance, someone may want to drive an e-paper display like https://www.ynvisible.com for trusted output.

> I don't think we have the resources to redo the fountain code without at least double-width integers, which are also often not available on these processors, so it might have to be 16-bit. Not sure if that is possible.

I think the reproducibility (and performance) gains of an implementation in double-width integers (32-bit?) are enough to consider it.

ChristopherA commented 1 year ago

Do you want to present the problem at the next Gordian Developer meeting (June 7th), so that we can seek guidance from others on priority and the best ways to achieve this?

eliasnaur commented 1 year ago

Sure. I've posted in the Signal group about the link/dates for the meeting.

shannona commented 1 year ago

I've got you on the schedule for tomorrow to let people know about the cool features of your engraver and to talk about the reproducibility/simplification issues you raise here, @eliasnaur.

jeandudey commented 8 months ago

> RandomSampler uses doubles for its computations; floating point computations are notoriously hard to make deterministic across platforms, and they're slow on embedded devices. RandomSampler even has a comment

Just to shed some light here: I think this is a problem with the design of the code, but it is deterministic. The bit representation of a double per the IEEE 754 standard is always the same on compliant platforms. Where a double isn't deterministic is when the order of the operations changes, as in a multithreaded algorithm, but the random sampler algorithm by design runs in a single thread.
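As a generic illustration of that order-dependence (plain Swift, unrelated to the UR code): IEEE 754 addition is fully deterministic for a fixed evaluation order, but it is not associative, so reordering the operations changes the result:

    let a = 1.0e16
    let b = -1.0e16
    let c = 1.0
    // Same operands, different association:
    print((a + b) + c)   // 1.0
    print(a + (b + c))   // 0.0, because b + c rounds back to -1.0e16
                         // (the spacing between Doubles near 1e16 is 2.0)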

> One compiler optimization that breaks reproducibility of elementary operations on floating point values is the fused multiply-add (FMA) operation supported by most modern processors. For example,

The fused-multiply-add optimization doesn't apply to the RandomSampler algorithm though.

So it's impossible for a mismatch to occur between two devices talking UR.

eliasnaur commented 8 months ago

>> RandomSampler uses doubles for its computations; floating point computations are notoriously hard to make deterministic across platforms, and they're slow on embedded devices. RandomSampler even has a comment

> Just to shed some light here: I think this is a problem with the design of the code, but it is deterministic. The bit representation of a double per the IEEE 754 standard is always the same on compliant platforms. Where a double isn't deterministic is when the order of the operations changes, as in a multithreaded algorithm, but the random sampler algorithm by design runs in a single thread.

>> One compiler optimization that breaks reproducibility of elementary operations on floating point values is the fused multiply-add (FMA) operation supported by most modern processors. For example,

> The fused-multiply-add optimization doesn't apply to the RandomSampler algorithm though.

How do you know? A compiler is free to automatically optimize floating point operations with FMA operations.

There are other sources of CPU-specific differences. One that comes to mind is that Intel processors can perform floating point operations in 80-bit extended precision, which increases precision but gives different results than CPUs without extended-precision doubles.

eliasnaur commented 8 months ago

Floating point determinism is a common problem for networked games that assume deterministic simulation of game state. See for example https://gafferongames.com/post/floating_point_determinism/. It discusses FMA, SSE, 80-bit precision, compiler flags for enabling strict IEEE determinism, and more. Networked games sometimes restrict players to a single architecture (say, Intel) to ensure determinism.

jeandudey commented 8 months ago

> How do you know? A compiler is free to automatically optimize floating point operations with FMA operations.

Yes, but it can't with the random sampler code, as there's no room for that optimization there; the compiler is not able to reduce any of the expressions in the random sampler into something equivalent that FMA instructions can take advantage of.

> One that comes to mind is that Intel processors can perform floating point operations in 80-bit

That can be disabled through compiler flags, at least for C and C++ on both GCC and Clang (-fexcess-precision=standard), and the places where it can't be disabled (for example, libm) aren't used by the RandomSampler code; it uses only basic arithmetic.

> It discusses FMA, SSE, 80-bit precision, compiler flags for enabling strict IEEE determinism, and more.

There's the compiler flag for the excess precision, and the random sampler code can and should be compiled without SIMD optimizations, even though I don't think they apply to that code (so it would still be fine to use optimizations, because they don't apply). They might apply to the map operation, which could be parallelized: probs.map { $0 * Double(n) / sum }. But I'm not yet aware of a fused multiply-divide operation for the inner expression of the map on any ISA, and even if the outer expression is parallelized it should still yield the same result, as long as the results are stored in order.
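For what it's worth, here is a minimal Swift sketch of that normalization as an explicit serial loop (same probs, n, and sum as in the expression above), which pins down the per-element evaluation order instead of leaving it to the map implementation:

    func normalize(_ probs: [Double], n: Int, sum: Double) -> [Double] {
        var result: [Double] = []
        result.reserveCapacity(probs.count)
        for p in probs {
            // One multiply then one divide per element, in array order;
            // there is no multiply-add pattern for FMA contraction to target.
            result.append(p * Double(n) / sum)
        }
        return result
    }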

eliasnaur commented 8 months ago

>> How do you know? A compiler is free to automatically optimize floating point operations with FMA operations.

> Yes, but it can't with the random sampler code, as there's no room for that optimization there; the compiler is not able to reduce any of the expressions in the random sampler into something equivalent that FMA instructions can take advantage of.

>> One that comes to mind is that Intel processors can perform floating point operations in 80-bit

> That can be disabled through compiler flags, at least for C and C++ on both GCC and Clang (-fexcess-precision=standard), and the places where it can't be disabled (for example, libm) aren't used by the RandomSampler code; it uses only basic arithmetic.

>> It discusses FMA, SSE, 80-bit precision, compiler flags for enabling strict IEEE determinism, and more.

> There's the compiler flag for the excess precision, and the random sampler code can and should be compiled without SIMD optimizations, even though I don't think they apply to that code (so it would still be fine to use optimizations, because they don't apply). They might apply to the map operation, which could be parallelized: probs.map { $0 * Double(n) / sum }. But I'm not yet aware of a fused multiply-divide operation for the inner expression of the map on any ISA, and even if the outer expression is parallelized it should still yield the same result, as long as the results are stored in order.

What about other compilers, or compilers for other languages?

To be clear, I'm not claiming that you can't achieve determinism, or even that there is a determinism problem with the current implementations. I'm saying that avoiding floating point types in a standard completely sidesteps all these problems.

My uneducated hope is that the algorithm can be implemented using only integers, perhaps by limiting the range of inputs to reasonable values.
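Purely as a hypothetical sketch of that direction (invented names, not compatible with the deployed spec, and any real replacement would itself have to be standardized bit-for-bit): degree weights proportional to 1/i can be approximated in fixed point, after which selection needs only integer arithmetic.

    struct IntegerDegreeChooser {
        private let cumulative: [UInt64]
        private let total: UInt64

        init(seqLen: Int, scale: UInt64 = 1 << 32) {
            // Fixed-point weights ~ 1/i: exact integer division of a large
            // scale, so every platform computes identical tables.
            var running: UInt64 = 0
            var cum: [UInt64] = []
            cum.reserveCapacity(seqLen)
            for i in 1...UInt64(seqLen) {
                running += scale / i
                cum.append(running)
            }
            self.cumulative = cum
            self.total = running
        }

        // rng returns uniform 64-bit values. Note: the modulo reduction below
        // is slightly biased and is itself something a real spec would have
        // to define exactly (e.g. via rejection sampling).
        func chooseDegree(_ rng: () -> UInt64) -> Int {
            let r = rng() % total
            // Binary search for the first cumulative weight exceeding r.
            var lo = 0, hi = cumulative.count - 1
            while lo < hi {
                let mid = (lo + hi) / 2
                if cumulative[mid] > r { hi = mid } else { lo = mid + 1 }
            }
            return lo + 1
        }
    }

Whether a scheme like this could be standardized cleanly, or retrofitted alongside the deployed floating point algorithm, is exactly the open question in this thread.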

wolfmcnally commented 5 months ago

I'm open to suggestions, but I think our solution will work for all the compilers and languages of which I am aware, and will require more number theoretic knowledge to replace than I currently possess.

For more information about this algorithm in context, please refer to our new Multipart UR (MUR) Implementation Guide.