Data integers - Githubissues

danakj commented 1 year ago

"Why does SIMD seem safer than regular integer math?"

The answer to this at first may have seemed to be "there's no Undefined Behaviour on overflow." However unsigned integers also define overflow behaviour (to wrap) yet contribute to security vulns used in-the-wild on a regular basis. So what is going on there, or is it an illusion?

Integer overflow leads to security vulnerabilities when those integers are used for control: especially for indexing into a container, but also for any logical operations. In an application dominated by logic all the integers can be safe numerics and it makes sense. Now your software can avoid a class of bugs that lead to security vulns.

But in a large-data application, there is additional costs that are not warranted. For instance if the pixels in a texture overflow, this changes what you see on your screen, but it does not lead to a memory safety bug or an invalid program state/weird machine.

Rust integers can be made to trap on overflow, but the default is to not. This is the one place in the Rust language where safety is not the default. I suspect it is because of the additional cost this puts on large data. For a security-critical application, enabling integer trapping will be critical to user safety in Rust, just as with using safe trapping numerics in C++. But in both cases, we would need a way to express that some integers are not like the others. That some integers are data, are not meant for control, and we want well-defined and cheap overflow behaviour.

There's two main directions that we can go from here.

Using wrapping math explicitly

The first path is that data should always be manipulated through .wrapping_add(), .wrapping_mul() etc. Using the + and * operators is logically incorrect for them because we want them to wrap on overflow. This expresses the intention directly in the syntax (good!) and it eliminates any potential runtime overheads for them (good!), but it does require a lot more writing. This type of decision to wrap is more rare in other places than in large data.

This also means you can choose .saturating_add() etc if you'd prefer to saturate instead of wrap. That introduces extra logic though and the assumption here is that we don't care about producing invalid values here in the presence of bugs and/or malicious inputs and weird machines as the invalid values don't further contribute to the take over of our machine and our data.

Putting this into the type system

We could introduce a data integer type that is separate from regular integers. Such a type should really show up only as values in data structures where they are eventually consumed as an output but not as a control, where getting the output wrong may be a bug but one we'd rather hear about in a bug report than a crash report. Where getting the output wrong won't lead to a CVE and stolen user data or identity theft.

To have cheap overflow behaviour implies that the data integers are unsigned in C++ since overflow on signed values causes Undefined Behaviour in a C++ compiler. Then overflow is defined to wrap.

To have such integers in a program, we would require guardrails so that they don't end up being used to index into an array, or application control. These data integers should really only appear in datasets, like textures or measurements or the like. Things used in graphics or scientific or financial computing. For the Web, this would mostly come up in the graphics case.

Developers should not need to think hard about which types to use, it should be easy to get things right. If it's easy to get things wrong, then this is the wrong direction to go in.

So how do we make this easy to use right and hard to use wrong through guardrails:

Allow conversion from primitives and control (normal) integers. Becoming data where we don't care about overflow is a safe direction to go.
Don't allow conversion into control (normal) integers except through explicit unsafe methods, for a hypothetical d32 something like this: d32::to_unsigned(UnsafeFnMarker) -> u32.
Provide conversion to primitives in the same way as control, in order to use the data. Sadly this allows roundtripping through primitives to bounce back and forth, which is a big problem. But primitive integers should be only be used to interact with APIs that require them, they should never show up in the middle of an application otherwise using safer numerics.

What we want to avoid is this:

void using_data(d32* data, d32 i) {
  data[i] *= 2;  // OOB can happen here if `i` can end up with unintended values.
}

If all containers are subspace types, then the index operations take usize and d32 would not convert into it. Great!

But when working with other container types (let's ignore native pointers, they are doom and you should use span in C++ if not using Slice/SliceMut) the index operations take size_t. And the conversion from d32 to size_t looks exactly like the one from i32:

void using_data(d32* data, d32 i) {
  data[size_t{i}] *= 2;  // OOB can happen here if `i` can end up with unintended values.
}

The way to protect security-critical software here is:

Use a subspace container type (or some container that works with usize), or
Prevent unintended values in i by preventing overflow.

It's important to best support working inside a codebase with std containers and size_t indexing operations, as otherwise incremental adoption provides very weak security benefits. So it seems like we need to rely on i to be valid (not overflowed).

Until a world where all types work with usize for indexing, it seems that "Using wrapping math explicitly" is preferable.

danakj commented 1 year ago

I should add that we have a library type OverflowInteger which provides arithmetic operations and just keeps going in the case of overflow, allowing you to check for overflow at the end of a sequence of operations.

It would be pretty trivial to make a WrappingInteger that does the same but uses wrapping_add() etc, in which case you get cheap but not-always-valid outputs and the conversions of this to actual integer types can be provided through its API.

So nothing stops this from existing in user code too, but maybe it would be nice to provide similar to OverflowInteger, and maybe a SaturatingInteger too. But doing that with a loud and annoying type is different than d32 or so which looks like it should fill in for "an integer" and can be used in control too.

danakj commented 1 year ago

After some discussions and thinking about this, I believe that another type here is not in line with the project principles. Code that does wants wrapping adds should do wrapping_add(). It was pointed out to me that of course even in pixels, the values can be used for control such as branching on 0 or specific alpha etc.

OverflowInteger is a particularly special case as it allows sum() and product() on an integer iterator without panicking. Until there's sufficient motivation, there's no clear cause to pull WrappingInteger or SaturatingInteger into the library too. As these are different.

Principles include "Memory safety should be the default. You don't pay (in bugs and CVEs) for premature optimization you didn't need. Leaving this path should require more typing, and a clear signal what is happening." A competing integer type would not provide the clarity required for leaving safety. Integer unsafety leads to memory safety bugs, as thoroughly demonstrated by in-the-wild exploit chains, so it is in scope for this.

chromium / subspace

Data integers #300

Using wrapping math explicitly

Putting this into the type system