ByteArray maths operations

ajwfrost commented 2 years ago

Manipulating ByteArray data

I thought I'd seen a request about this but can't find it, so creating a new issue here.... there was a comment about the manipulation of ByteArray data which is currently a bit slow/manual and uses a lot of the IDataInput/IDataOutput functions to get at the values in the appropriate format. We noticed recently the use of this in Starling e.g. in VertexData, a matrix multiplication is applied to data via 2x readFloat requests, then the multiplication is done using a Matrix object with all the values being converted into Numbers, then the values are written back over the original ones via writeFloat.

The plan here would be to provide operations that can be applied over the whole of a ByteArray object without having to go through loops/type coercion/etc within ActionScript, so e.g. that operation could be re-written as something like

ByteArrayOperations.multiply(targetRawData /* i.e. the input ByteArray */, matrix /* we can work out the type */, ByteArrayOperations.ARRAY_OF_FLOATS /* tells us how to treat the data */);

Under the hood we can then use quicker pointer arithmetic (or even SIMD operations) to accelerate this.

The goal would then be to provide the ability to do thing like:

add/subtract/multiply/divide etc some byte array data by a constant value such as a number or (where it makes sense) a matrix
add/subtract/multiply/divide etc two byte arrays either one manipulating the other, or to return a new byte array result
resampling of data, for example to only take every other sample or similar (could be handy for audio with stereo etc)
other operations perhaps such as FFT .. or tell us what you want!

In all of these, we'd need to be told what format to treat the underlying byte array data, hence the enum value there of ARRAY_OF_FLOATS or similar: so if you did ByteArrayOperations.add(data, 1, ARRAY_OF_BYTES) then we'd add 1 to every byte, but if it was ByteArrayOperations.add(data, 1, ARRAY_OF_SHORTS) then we'd add 1 to every other byte...

I would be interested in any comments or thoughts on this, any requirements etc (and particularly @PrimaryFeather if you're able to comment on what might be useful for accelerating Starling it would be good, I'm wondering if we can set up a way for this to be a SWC library that can be used by any Flash/AIR runtime for backwards compatibility, and then just if it's running on an appropriate version of AIR we'll then be able to replace the AS3 implementation by accelerated C++.....)

thanks

PrimaryFeather commented 2 years ago

Hey Andrew, such an addition would be amazing! I don't remember how often I begged Adobe to make such a change, but at that time their resources were already too limited.

Starling's performance would profit immensely from specialized ByteArray manipulations. In a typical Starling render loop, each vertex position is multiplied with the current modelview-projection matrix and then stored in a big byte array (the batch). I don't have Scout installed at the moment, but if I remember correctly, this easily took up 25% of a frame's render time in common situations.

In source code, that's happening here.

In Starling, vertex data is stored in a custom format. For example, the default format of a vertex is position:float2, texCoords:float2, color:bytes4. So the first two floats store the position, followed by the texture coordinates (also floats), and then 4 bytes that contain the RGBA channels. This means that a method that applies a matrix operation to a ByteArray would require not only starting index, count and data format, but also some kind of stride parameter that allows to skip a certain number of bytes.

There are probably other methods to look into as well (what comes to mind now is pre-multiplying alpha values in byte4-color data), but that copy-and-transform method is definitely the most relevant one. If I can get Scout running again, I can have a look at where the other priorities are. (You can also do this yourself by starting the benchmark in the Starling demo and then having a look at what's coming up in Scout in regards to ByteArray and/or VertexData.)

Feel free to reach out to me any time if you need more feedback or want me to try out something! I'm really thrilled to hear that you're considering enhancements like these. :smile:

ajwfrost commented 2 years ago

Thanks Daniel - yes that's the bit of code that I'd spotted that I wondered about! I hadn't twigged about the fact the operations aren't done on closely packed sequences of values, so a stride parameter is a very good idea.

We'll have a bit of a play... :-)

thanks

ajwfrost commented 2 years ago

@PrimaryFeather fyi I also just noticed this snippet which is using a lot of time on one of our simple test cases:

_rawData.position = pos;
oldColor = switchEndian(_rawData.readUnsignedInt());
newColor = value ? premultiplyAlpha(oldColor) : unmultiplyAlpha(oldColor);

_rawData.position = pos;
_rawData.writeUnsignedInt(switchEndian(newColor));

pos += _vertexSize;

and the obvious question for me here is why you're doing 2x 'switchEndian' calls, why don't you change the endianness of the byte array?

But anyway -> we can potentially add functionality like "premultiplyAlpha" with formats such as "RGBA32", "ARGB32", and "BGRA32" which should help!

VertexData.copyTo is taking a bit of time for me too but that may be because of the fakeness of the test case...

PrimaryFeather commented 2 years ago

@ajwfrost

and the obvious question for me here is why you're doing 2x 'switchEndian' calls, why don't you change the endianness of the byte array?

I actually stumbled upon this myself yesterday, and I'm actually not sure why I did that. In the VertexData constructor, I'm manually ensuring that the ByteArray uses little endian. Could it be that stage3d expects all data that way? E.g. when reading floats? Remember, those ByteArrays contain mixed data, i.e. also coordinates or any other values needed by a vertex or fragment program.

But anyway -> we can potentially add functionality like "premultiplyAlpha" with formats such as "RGBA32", "ARGB32", and "BGRA32" which should help!

Yeah, that's exactly what I was hoping for – fantastic! 😁

2jfw commented 2 years ago

@PrimaryFeather: So, Starling 2.8 ... ? 👀

raresn commented 2 years ago

You guys have no idea how excited we are to hear these things!!!

PrimaryFeather commented 2 years ago

@PrimaryFeather: So, Starling 2.8 ... ? 👀

Haha, I guess a better opportunity for v2.8 is unlikely to show up, right? 😂

ajwfrost commented 2 years ago

I'm 99% sure that you could do something like

data.endian = "littleEndian";
data.pos = 0;
data.writeUnsignedInt(0x12345678);
data.endian = "bigEndian";
data.pos = 0;
data.readUnsignedInt(); // returns 0x78563412

i.e. the endian-ness is just used at the point of pulling data into/out of the array.

Actually looking at the code, this does seem to be the way it works, and it seems like littleEndian will be quicker normally (i.e. when you're on a little-endian machine, like most of them are now!) https://github.com/adobe/avmplus/blob/master/core/DataIO.cpp#L60 https://github.com/adobe/avmplus/blob/master/core/DataIO.h#L93

Interesting that the documentation for vertex buffer uploads says "The ByteArray object must use the little endian format." - but this isn't checked in the code as far as I can see, and I actually think the array should be in the machine-native format for it to work. Or the format that the graphical subsystem expects, so yes, basically little endian everywhere I guess!

Anyway -> let's see where we get to with some of these sorts of things, because it's equally wasteful to switch the byte ordering even in C++ if all we're doing is reading a value, operating on it in a per-channel method, and writing it back: better to change that per-channel premultiply operation so it works with the various pixel formats! and particularly if we can do this across multiple bits of data at once i.e. via SIMD acceleration....

PrimaryFeather commented 2 years ago

i.e. the endian-ness is just used at the point of pulling data into/out of the array.

Aaah, gosh, I actually wasn't aware of that. 🙈 So this part could definitely be simplified. 🙉

(... and we could speed up things at the same time if ByteArray.writeUnsignedInt and friends had an optional endianness parameter instead. But that's another discussion. :wink:)

In any case, the switching was done only for color values (since the uint color values people are used to in the display list is written 0xAARRGGBB); otherwise, I'm not checking or converting bytes for their endianness.

Thanks a lot for the additional information, Andrew!

raresn commented 2 years ago

Hey, @PrimaryFeather any development here? I think it would be super cool. Thanks!

PrimaryFeather commented 2 years ago

Thanks a lot for the reminder, @raresn! I just integrated the optimizations regarding the endianness-conversion and pushed the changes to the repository. Regarding the potential optimizations around the ByteArray API, I'll have a look at those when the new methods are available.

Cheers! 😄

raresn commented 2 years ago

@PrimaryFeather - got an idea on what we should expect so far performance wise?

PrimaryFeather commented 2 years ago

To be honest, I didn't have the time to make specific tests, but my guess is that the speedup won't be very big in most situations. It might make a bigger difference in scenes were a big chunk of the display tree have an alpha value smaller than one, because the batching code for that situation now leads to significantly fewer function calls.

So, don't expect any miracles – but enjoy a tiny little bit more performance in specific situations. 😄

airsdk / Adobe-Runtime-Support

ByteArray maths operations #1773

Manipulating ByteArray data