NOTE: Tested on M1 Max.
I tested performance recently for transform * transform (where GlobalTransform3A is GlobalTransform with Vec3A for translation and scale components):
```text
GlobalTransform   - mul_transforms: 2000000, t: 33ms, muls/s:  59585212
GlobalTransform3A - mul_transforms: 2000000, t: 24ms, muls/s:  81130010
Affine3A          - mul_transforms: 2000000, t: 16ms, muls/s: 124499416
Mat4              - mul_transforms: 2000000, t: 29ms, muls/s:  67183140
```
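For reference, a minimal sketch of the kind of micro-benchmark these numbers resemble (not the exact harness used; assumes glam as a dependency):

```rust
// Hypothetical micro-benchmark sketch: time N back-to-back Affine3A multiplies
// and report muls/s. The results above came from a different harness.
use glam::{Affine3A, Quat, Vec3};
use std::time::Instant;

fn main() {
    const N: u32 = 2_000_000;
    let step = Affine3A::from_scale_rotation_translation(
        Vec3::ONE,
        Quat::from_rotation_y(0.5),
        Vec3::new(1.0, 2.0, 3.0),
    );
    let mut acc = Affine3A::IDENTITY;
    let start = Instant::now();
    for _ in 0..N {
        // black_box keeps the compiler from folding the loop away.
        acc = std::hint::black_box(acc * step);
    }
    let t = start.elapsed();
    println!(
        "Affine3A - mul_transforms: {}, t: {}ms, muls/s: {}",
        N,
        t.as_millis(),
        (N as f64 / t.as_secs_f64()) as u64
    );
}
```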
Discussion with someone in the game engine industry suggests that a separate representation, like the one we use for (Global)Transform, is the right source of truth for editing. I don't know how propagation is done, though: whether it uses the separate representation or goes via Affine3A-style matrices. I'm investigating what is done there and why.
I did note that Affine3A produces slightly different results than GlobalTransform3A due to floating point precision, though I don't know which would be more accurate.
It seems that the separate representation, as we have in Transform and GlobalTransform, is only intended to be used for local transforms. The local transform is converted to an affine matrix, the local matrices are propagated to create local-to-world matrices, and these are used for rendering. So the separate representation in Transform should be a source of truth, but GlobalTransform does not need one because it is not edited in that way, only calculated. The affine matrix can be calculated on the fly from the separate representation in Transform, and GlobalTransform can store the matrix. Also, if a hierarchy of local transforms applies non-uniform scale, rotate, non-uniform scale, rotate, the composition contains a shear, which cannot be represented by separate translation, rotation, and scale components (see the sketch below). That makes me think the way we propagate right now using GlobalTransform is wrong.
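To make the shear point concrete, here is a small sketch (assuming glam as a dependency, with a made-up two-level hierarchy of rotations and a non-uniform scale): the propagated linear part cannot be reproduced by a translation/rotation/scale decomposition.

```rust
use glam::{Affine3A, Vec3};
use std::f32::consts::FRAC_PI_4;

fn main() {
    // Hypothetical hierarchy: rotate 45 degrees, then non-uniform scale, then rotate 45 degrees.
    let world = Affine3A::from_rotation_z(FRAC_PI_4)
        * Affine3A::from_scale(Vec3::new(2.0, 1.0, 1.0))
        * Affine3A::from_rotation_z(FRAC_PI_4);

    // Decomposing into translation/rotation/scale and rebuilding drops the shear,
    // so the rebuilt matrix no longer matches the propagated one.
    let (scale, rotation, translation) = world.to_scale_rotation_translation();
    let rebuilt = Affine3A::from_scale_rotation_translation(scale, rotation, translation);

    println!("propagated: {:?}", world);
    println!("rebuilt:    {:?}", rebuilt);
}
```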
IMO, converting to Affine3A, multiplying, and back is likely going to perform worse.
Transform propagation is currently bottlenecked by the storage format and access patterns, so changing the type used here is likely to yield negligible performance wins.
However, it's not the only location where SIMD operations might be beneficial. For example, frustum culling relies on GlobalTransform's position and orientation and sees slight wins from using Vec3A over Vec3, but it incurs a conversion cost to align the Vec3. If Transform and GlobalTransform used Vec3A, that conversion could be removed. It should be noted that conversion in the opposite direction is likely a no-op, as individual components are already 4-aligned in Vec3A. Other general operations on both Transform and GlobalTransform may be faster too.
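For illustration, a tiny sketch of the conversion asymmetry being described (assuming glam as a dependency):

```rust
use glam::{Vec3, Vec3A};

fn main() {
    let v = Vec3::new(1.0, 2.0, 3.0);

    // Vec3 -> Vec3A: the unaligned floats have to be loaded into a
    // 16-byte-aligned SIMD-friendly value, so this direction has a cost.
    let a = Vec3A::from(v);

    // Vec3A -> Vec3: just copies the x/y/z fields back out, essentially free.
    let back = Vec3::from(a);

    assert_eq!(v, back);
    println!("{:?} -> {:?}", a, back);
}
```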
I am saying that GlobalTransform should internally be an Affine3A. There would not be a conversion from Affine3A back to any separate representation. It’s one-way.
That might save some time during extract since Affine3A -> Mat4 would be a fairly trivial transformation. Though arguably we would see worse cache performance as Affine3A is 64 bytes over GlobalTransform(3A)'s current 48. Worth testing either way.
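A quick sketch of the sizes involved and of how cheap the Affine3A -> Mat4 step is (the Trs structs below are hypothetical stand-ins, not Bevy's types; sizes assume glam's default SIMD features):

```rust
use glam::{Affine3A, Mat4, Quat, Vec3, Vec3A};
use std::mem::size_of;

// Hypothetical stand-ins for the layouts being compared.
#[allow(dead_code)]
struct Trs {
    translation: Vec3,
    rotation: Quat, // 16-byte aligned with default features, so the struct pads to 48
    scale: Vec3,
}

#[allow(dead_code)]
struct Trs3A {
    translation: Vec3A,
    rotation: Quat,
    scale: Vec3A,
}

fn main() {
    println!("Trs:      {} bytes", size_of::<Trs>());      // 48
    println!("Trs3A:    {} bytes", size_of::<Trs3A>());    // 48
    println!("Affine3A: {} bytes", size_of::<Affine3A>()); // 64
    println!("Mat4:     {} bytes", size_of::<Mat4>());     // 64

    // The extract-time conversion amounts to appending the (0, 0, 0, 1) row.
    let affine = Affine3A::from_translation(Vec3::new(1.0, 2.0, 3.0));
    let mat = Mat4::from(affine);
    println!("{:?}", mat);
}
```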
@cart will need to weigh in on where we land with this so you avoid unnecessary work. Our current GlobalTransform propagation does not support shears. I think non-uniform scale propagation will have unexpected results in cases that would produce a shear if propagated with an Affine3A, and that would differ from other engines. I want to prove this by trying it in other engines, to use as an argument for why GlobalTransform should be an Affine3A internally. I also want to understand what use cases may be broken by such a change if GlobalTransform didn't expose ways of getting the scale and rotation. Translation is easy to get.
I think the only place where an Affine3A to Mat4 conversion would be needed is for the view projection matrix. The view matrix, its inverse, and all model matrices and their inverses can remain Affine3A. But I know how that part works in the renderer, so I feel I can handle it, as I did for the instance buffer mesh change.
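A rough sketch of what that boundary could look like (camera values are made up; assuming glam as a dependency):

```rust
use glam::{Affine3A, Mat4, Quat, Vec3};

fn main() {
    // Hypothetical camera transform; model/view math can stay in Affine3A.
    let camera = Affine3A::from_rotation_translation(
        Quat::from_rotation_y(0.3),
        Vec3::new(0.0, 1.0, 5.0),
    );
    let view = camera.inverse(); // still an Affine3A

    // The perspective projection is where a full Mat4 becomes unavoidable.
    let projection = Mat4::perspective_rh(std::f32::consts::FRAC_PI_4, 16.0 / 9.0, 0.1, 1000.0);
    let view_proj = projection * Mat4::from(view); // the only Affine3A -> Mat4 conversion

    println!("{:?}", view_proj);
}
```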
For this PR I guess all of this with Affine3A is out of scope. But perhaps it’s worth making a decision first, @cart .
Whether or not we adopt Affine3A for GlobalTransform and whether or not we use Vec3A in Transform are definitely two separate conversations. @superdump the Affine3A conversation should happen in a different thread.
These do actually relate. I misread Vec3A in GlobalTransform as Vec3A in Transform. Ignore this while I rethink.
It complicates the public interface of what amounts to the most common component (Transform) by introducing a "second" vec type. Most people won't need to care, but converting to and from Vec3 and Vec3A is something I generally don't want Bevy users to worry about. "Vector performance" is an implementation detail that users of Bevy APIs should not ever need to think about.
This leaves us with a number of paths:
- `const_vec3a`: weird (can macros have export aliases?). We would need to evaluate how weird this would be for users reading glam docs / find all of the APIs this would make "weird".

Given that (1) and (3) both look roughly the same from a user-facing perspective, I don't feel much pressure to make the switch hastily.
Fun fact: I actually helped break us in to this problem :smile: https://github.com/bitshifter/glam-rs/issues/36
There seems to be a lot of sentiment in favor of a math lib that exports a bunch of different representations for a given type, then letting consumers assign type aliases to their preferred variants. So (3) isn't completely crazy.
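A sketch of what such an alias could look like on the consumer side (hypothetical, not something glam or Bevy ships):

```rust
// Hypothetical crate-level alias: users write `Vec3`, the crate picks the
// SIMD-aligned representation underneath.
pub type Vec3 = glam::Vec3A;

fn main() {
    let v: Vec3 = Vec3::new(1.0, 2.0, 3.0);

    // The catch with aliasing: this `Vec3` is not glam's `Vec3`, so mixing the
    // two still needs explicit conversions and can confuse readers of glam's docs.
    let plain: glam::Vec3 = v.into();
    println!("{:?} {:?}", v, plain);
}
```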
`global_transform.scale()` vs `transform.scale`). `transform.scale != global_transform.scale()` due to floating point math.

This is a much more "disruptive" and fundamental change, which means we should prioritize making a call sooner rather than later. That being said, we've already had a lot of back and forth on the transform API design (this would be the fourth major version of the Transform API). I don't want to do this a fifth time. Anyone advocating for Transform changes should take the entire history of this conversation into account, explicitly address the complicated mix of performance, capability, and UX constraints, and see what "authorities" on the matter are doing (survey the "engine transform" space, look at feedback on these designs from people who know what they're talking about, etc, etc).
Ahhh wait, it's not a different conversation. Vec3A in Transform isn't what this issue is about (it's about Vec3A in GlobalTransform). Lemme quickly re-evaluate my response.
Ok, sorry for the confusion. It seems like this issue is actually advocating for a change to both GlobalTransform and Transform. So actually I think my thoughts are generally still accurate. The only thing in my previous messages that changes in this new context is that we do need to talk about Affine3A in GlobalTransform in this thread, as it is an alternative to this design.
I also think aliasing `Vec3A` -> `Vec3` is probably not a good idea, as people will expect it to be the same as glam's default `Vec3`.

I'd definitely advocate for heavier use of `Vec3A` internally, but given the public interface between crates, that might not be too feasible.
I don’t like (3) as I think it will cause unexpected problems.
I’m more advocating for using optimal representations internally, based on how they are used in algorithms, while presenting an ergonomic API. I want users to not have to think about it, and I want algorithms that take significant time to execute to not be taking that time because we have to convert things on the fly.
Maybe there’s some middle ground of using Vec3 for editing values which happens more rarely or in an amortised low significance fashion but Vec3A internally for the things done en masse every frame. Or we use a bit more time and RAM to do a single conversion from Vec3 to Vec3A at some point before doing the heavy algorithms using the optimised format.
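As an illustration of that middle ground (names here are made up; assuming glam as a dependency), the conversion can be paid once before the per-frame heavy loop:

```rust
use glam::{Affine3A, Vec3, Vec3A};

// Hypothetical heavy pass: widen the user-facing Vec3 data to Vec3A once,
// then keep the hot loop entirely on the aligned type.
fn heavy_pass(points: &[Vec3], world_from_local: Affine3A) -> Vec3A {
    let aligned: Vec<Vec3A> = points.iter().copied().map(Vec3A::from).collect();
    aligned
        .iter()
        .fold(Vec3A::ZERO, |acc, &p| acc + world_from_local.transform_point3a(p))
}

fn main() {
    let points = vec![Vec3::ONE, Vec3::new(1.0, 2.0, 3.0)];
    let sum = heavy_pass(&points, Affine3A::from_translation(Vec3::X));
    println!("{:?}", sum);
}
```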
I want to see that it makes a significant difference in performance before adding complexity let alone possibly reducing ergonomics. I investigated Affine3A and Vec3A to see if it would make a significant impact. In the focused benchmarks it does, but it didn’t really impact what I was testing with transform propagation. It seemed like it may be relevant for animation but it seemed like the vast majority of time was being spent in the hierarchy traversal. As such, it felt to me like this should be parked until it’s understood that it has a demonstrable and practical time-saving impact in a stress test that is verging on unreasonable but still feasible.
I continued discussing and investigating Affine3A and 4x3 matrix use right now as I thought it was valuable to understand the pros and cons of going down the route of using it internally for GlobalTransform so that if/when the transform * transform part of transform propagation or some other algorithm indicates that significant time could be saved, we’d be immediately ready with the information to take a decision.
I have input about how Unity works - mapping it to bevy terms, local Transforms are separate translation, rotation, and non-uniform scale, its compute_matrix() gives a 4x3 and GlobalTransform is a 4x3 matrix. I don’t know how they handle use cases like ‘I want to spawn this thing with the same translation and orientation as that thing’, say a water jet emitter from a water pistol. I feel like that should be similarly solvable in both cases.
In both cases the translation, scale, and rotation are calculated from floating point maths propagating local transforms through the hierarchy, so there will be error due to precision limitations. I was surprised to learn that there were more operations involved in the quaternion multiplications than in the Affine3A multiplications. I wonder which retains more accuracy; it’s not obvious to me. Also, as regards root entities and transform scale != affine scale, I wonder what the error is there in practice and whether it is ever significant enough to care about. If it’s accurate to 8 significant figures or something (8 was fairly arbitrary), is that enough?
I agree that a more thorough investigation is needed. My information-gathering has been primarily optimisation motivated and ‘see what those advocating 4x3 actually do with it outside of rendering’.
Finally, I don’t remember if I tried Affine3A with check_visibility, a system that takes quite a bit of time at the moment.
Is there any reason to keep this open after https://github.com/bevyengine/bevy/pull/4379?
**What problem does this solve or what need does it fill?**

Make operations on translation and scale faster with SIMD operations.

**What solution would you like?**

Change translation and scale to `Vec3A`. Update all APIs to work with `Vec3A` instead of `Vec3`.

**What alternative(s) have you considered?**

Leave them as `Vec3`, use scalar operations on them.

**Additional context**

When not using `scalar-math`, on platforms where SIMD registers are available (WASM, x86), glam's Quat already has an align of 16, which means both Transform and GlobalTransform have 8 bytes of padding. Might as well use `Vec3A` to accelerate operations on the type. You can check this with the following on an x86 machine: