Open ajwfrost opened 2 years ago
Hey Andrew, such an addition would be amazing! I don't remember how often I begged Adobe to make such a change, but at that time their resources were already too limited.
Starling's performance would profit immensely from specialized ByteArray manipulations. In a typical Starling render loop, each vertex position is multiplied with the current modelview-projection matrix and then stored in a big byte array (the batch). I don't have Scout installed at the moment, but if I remember correctly, this easily took up 25% of a frame's render time in common situations.
In source code, that's happening here.
In Starling, vertex data is stored in a custom format. For example, the default format of a vertex is position:float2, texCoords:float2, color:bytes4
. So the first two floats store the position, followed by the texture coordinates (also floats), and then 4 bytes that contain the RGBA channels. This means that a method that applies a matrix operation to a ByteArray would require not only starting index, count and data format, but also some kind of stride
parameter that allows to skip a certain number of bytes.
There are probably other methods to look into as well (what comes to mind now is pre-multiplying alpha values in byte4
-color data), but that copy-and-transform method is definitely the most relevant one. If I can get Scout running again, I can have a look at where the other priorities are. (You can also do this yourself by starting the benchmark in the Starling demo and then having a look at what's coming up in Scout in regards to ByteArray and/or VertexData.)
Feel free to reach out to me any time if you need more feedback or want me to try out something! I'm really thrilled to hear that you're considering enhancements like these. :smile:
Thanks Daniel - yes that's the bit of code that I'd spotted that I wondered about! I hadn't twigged about the fact the operations aren't done on closely packed sequences of values, so a stride
parameter is a very good idea.
We'll have a bit of a play... :-)
thanks
@PrimaryFeather fyi I also just noticed this snippet which is using a lot of time on one of our simple test cases:
_rawData.position = pos;
oldColor = switchEndian(_rawData.readUnsignedInt());
newColor = value ? premultiplyAlpha(oldColor) : unmultiplyAlpha(oldColor);
_rawData.position = pos;
_rawData.writeUnsignedInt(switchEndian(newColor));
pos += _vertexSize;
and the obvious question for me here is why you're doing 2x 'switchEndian' calls, why don't you change the endianness of the byte array?
But anyway -> we can potentially add functionality like "premultiplyAlpha" with formats such as "RGBA32", "ARGB32", and "BGRA32" which should help!
VertexData.copyTo
is taking a bit of time for me too but that may be because of the fakeness of the test case...
@ajwfrost
and the obvious question for me here is why you're doing 2x 'switchEndian' calls, why don't you change the endianness of the byte array?
I actually stumbled upon this myself yesterday, and I'm actually not sure why I did that. In the VertexData constructor, I'm manually ensuring that the ByteArray uses little endian. Could it be that stage3d expects all data that way? E.g. when reading floats? Remember, those ByteArrays contain mixed data, i.e. also coordinates or any other values needed by a vertex or fragment program.
But anyway -> we can potentially add functionality like "premultiplyAlpha" with formats such as "RGBA32", "ARGB32", and "BGRA32" which should help!
Yeah, that's exactly what I was hoping for β fantastic! π
@PrimaryFeather: So, Starling 2.8 ... ? π
You guys have no idea how excited we are to hear these things!!!
@PrimaryFeather: So, Starling 2.8 ... ? π
Haha, I guess a better opportunity for v2.8 is unlikely to show up, right? π
I'm 99% sure that you could do something like
data.endian = "littleEndian";
data.pos = 0;
data.writeUnsignedInt(0x12345678);
data.endian = "bigEndian";
data.pos = 0;
data.readUnsignedInt(); // returns 0x78563412
i.e. the endian-ness is just used at the point of pulling data into/out of the array.
Actually looking at the code, this does seem to be the way it works, and it seems like littleEndian will be quicker normally (i.e. when you're on a little-endian machine, like most of them are now!) https://github.com/adobe/avmplus/blob/master/core/DataIO.cpp#L60 https://github.com/adobe/avmplus/blob/master/core/DataIO.h#L93
Interesting that the documentation for vertex buffer uploads says "The ByteArray object must use the little endian format." - but this isn't checked in the code as far as I can see, and I actually think the array should be in the machine-native format for it to work. Or the format that the graphical subsystem expects, so yes, basically little endian everywhere I guess!
Anyway -> let's see where we get to with some of these sorts of things, because it's equally wasteful to switch the byte ordering even in C++ if all we're doing is reading a value, operating on it in a per-channel method, and writing it back: better to change that per-channel premultiply operation so it works with the various pixel formats! and particularly if we can do this across multiple bits of data at once i.e. via SIMD acceleration....
i.e. the endian-ness is just used at the point of pulling data into/out of the array.
Aaah, gosh, I actually wasn't aware of that. π So this part could definitely be simplified. π
(... and we could speed up things at the same time if ByteArray.writeUnsignedInt
and friends had an optional endianness
parameter instead. But that's another discussion. :wink:)
In any case, the switching was done only for color values (since the uint
color values people are used to in the display list is written 0xAARRGGBB
); otherwise, I'm not checking or converting bytes for their endianness.
Thanks a lot for the additional information, Andrew!
Hey, @PrimaryFeather any development here? I think it would be super cool. Thanks!
Thanks a lot for the reminder, @raresn! I just integrated the optimizations regarding the endianness-conversion and pushed the changes to the repository. Regarding the potential optimizations around the ByteArray API, I'll have a look at those when the new methods are available.
Cheers! π
@PrimaryFeather - got an idea on what we should expect so far performance wise?
To be honest, I didn't have the time to make specific tests, but my guess is that the speedup won't be very big in most situations. It might make a bigger difference in scenes were a big chunk of the display tree have an alpha value smaller than one, because the batching code for that situation now leads to significantly fewer function calls.
So, don't expect any miracles β but enjoy a tiny little bit more performance in specific situations. π
Manipulating ByteArray data
I thought I'd seen a request about this but can't find it, so creating a new issue here.... there was a comment about the manipulation of ByteArray data which is currently a bit slow/manual and uses a lot of the
IDataInput
/IDataOutput
functions to get at the values in the appropriate format. We noticed recently the use of this in Starling e.g. in VertexData, a matrix multiplication is applied to data via 2xreadFloat
requests, then the multiplication is done using aMatrix
object with all the values being converted into Numbers, then the values are written back over the original ones viawriteFloat
.The plan here would be to provide operations that can be applied over the whole of a ByteArray object without having to go through loops/type coercion/etc within ActionScript, so e.g. that operation could be re-written as something like
Under the hood we can then use quicker pointer arithmetic (or even SIMD operations) to accelerate this.
The goal would then be to provide the ability to do thing like:
In all of these, we'd need to be told what format to treat the underlying byte array data, hence the enum value there of
ARRAY_OF_FLOATS
or similar: so if you didByteArrayOperations.add(data, 1, ARRAY_OF_BYTES)
then we'd add 1 to every byte, but if it wasByteArrayOperations.add(data, 1, ARRAY_OF_SHORTS)
then we'd add 1 to every other byte...I would be interested in any comments or thoughts on this, any requirements etc (and particularly @PrimaryFeather if you're able to comment on what might be useful for accelerating Starling it would be good, I'm wondering if we can set up a way for this to be a SWC library that can be used by any Flash/AIR runtime for backwards compatibility, and then just if it's running on an appropriate version of AIR we'll then be able to replace the AS3 implementation by accelerated C++.....)
thanks