bevyengine / bevy

A refreshingly simple data-driven game engine built in Rust
https://bevyengine.org
Apache License 2.0

Batched ECS Query #1990

Open farnoy opened 3 years ago

farnoy commented 3 years ago

What problem does this solve or what need does it fill?

More optimizations would be possible if systems could iterate over queries in a batched, packed way rather than one item at a time. A clear example is SIMD.

What solution would you like?

I'd like to be able to use queries in the following way:

world
    .query_filtered::<Batched<(&mut Position, &Velocity)>, Without<FrozenInPlace>>()
    .par_for_each_mut(
        &task_pool,
        4,
        |(mut positions, velocities): (impl DerefMut<&mut [Position]>, impl Deref<&[Velocity]>)| {
            use core::intrinsics::{assume, likely};
            if unsafe { likely(positions.len() == 4) } {
                unsafe { assume(velocities.len() == 4); } // hopefully done by Bevy
                // some batch fast path
            } else {
                for (mut pos, vel) in positions.iter_mut().zip(velocities.iter()) {
                    // fallback scalar slow path
                }
            }
        },
    );

I'm using Deref & DerefMut informally in this example. I imagine it would be some wrapper like Res, but I don't want to suggest any names.

What alternative(s) have you considered?

Manual buffering to reconstruct this information, but it's hard to implement, especially for parallel iteration.

Additional context

For densely stored archetypes, I would like it to return as many spatially contiguous items as it is able to, up to the batch_size setting of course. If the total number of compatible rows for a batched query equals batch_size but is spread across different archetypes/tables, and is therefore not contiguous, I would prefer to get N separate batches each smaller than batch_size. That is what the &[Component] API would require anyway.

It would also be great if Bevy could prove to the compiler that all of these component-slices are of the same length. That would perhaps improve optimization.
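For illustration (plain slices, not an existing Bevy API), this is the kind of equal-length guarantee meant here -- re-slicing both sides to a common length is what lets the compiler drop bounds checks and vectorize:

fn integrate(positions: &mut [f32], velocities: &[f32], t: f32) {
    // Re-slicing to a shared length proves every index below is in bounds,
    // which lets the autovectorizer turn the loop into packed SIMD.
    let n = positions.len().min(velocities.len());
    let (positions, velocities) = (&mut positions[..n], &velocities[..n]);
    for i in 0..n {
        positions[i] += t * velocities[i];
    }
}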

alice-i-cecile commented 2 years ago

This is compelling, and I could see this having sizable performance impacts. Implementation will be fairly complex, but I don't think this will be a controversial feature.

InBetweenNames commented 2 years ago

#1822

InBetweenNames commented 2 years ago

So, thinking about this a little more: to truly get the benefits of an SoA representation for calculations such as position = velocity*time, we'd actually need each element of these vectors to be stored as separate components, e.g., PositionX, PositionY, PositionZ... maybe Bundles could make that a bit more convenient. Otherwise, they get laid out as [x, y, z, x, y, z, ...], which would make things complicated from a batching perspective. You'd need to do some kind of rearrangement to get the data into a form amenable to vectorization, and then rearrange it back when you're done.

Ideally, you'd have the X components stored as [x,x,x,...], Y as [y,y,y...]... and so on. This makes the vectorization straightforward, because you simply load chunks at the native vector width (as OP suggested), and do the math in each SIMD lane.

For example, for position = velocity*time, you could load 8 entities' PositionX components in one go (for e.g., AVX):

Px = [x0, x1, ..., x7]
Py = [y0, y1, ..., y7]
Pz = [z0, z1, ..., z7]

t is a scalar.

Velocity (Vx, Vy, Vz) follows the same form as position.

Now:

Px' = Px + t*Vx
Py' = Py + t*Vy
Pz' = Pz + t*Vz

These are straightforward assembly instructions and, as a bonus, the data for Px', Py', Pz' is already in the correct form to be written back to the storage as-is.
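As a concrete sketch of that layout (independent of Bevy, with hypothetical PositionsSoA/VelocitiesSoA containers), the update becomes three tight loops over contiguous f32 columns that the compiler can map onto packed instructions:

struct PositionsSoA { x: Vec<f32>, y: Vec<f32>, z: Vec<f32> }
struct VelocitiesSoA { x: Vec<f32>, y: Vec<f32>, z: Vec<f32> }

fn integrate(p: &mut PositionsSoA, v: &VelocitiesSoA, t: f32) {
    // Each loop walks one contiguous column, so e.g. eight f32s at a time can
    // be loaded, multiplied by the scalar t, added, and stored back with AVX.
    for (px, vx) in p.x.iter_mut().zip(&v.x) { *px += t * vx; }
    for (py, vy) in p.y.iter_mut().zip(&v.y) { *py += t * vy; }
    for (pz, vz) in p.z.iter_mut().zip(&v.z) { *pz += t * vz; }
}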

References:

https://www.intel.com/content/www/us/en/developer/articles/technical/memory-layout-transformations.html (AoS and SoA in general)
https://www.intel.com/content/www/us/en/developer/articles/technical/how-to-manipulate-data-structure-to-optimize-memory-use-on-32-bit-intel-architecture.html (this specific technique)

InBetweenNames commented 2 years ago

Seems I just can't help myself! Digging around the code a bit: to facilitate this kind of change, we might need to ensure the allocations for the tables are correctly aligned for SIMD, if this isn't already done. (I'm not sure what the Rust default is, but if it's just using the system malloc, then you usually get 16-byte alignment for free.) Normally this kind of change is pretty simple to implement, and due to the way the ECS is stored, even if you didn't use SIMD, you'd only incur an overhead of a few bytes to achieve the desired "overalignment". For example, AVX2 requires a 32-byte alignment.
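As a rough illustration of the "overalignment" being described (a hypothetical helper, not how Bevy currently allocates), std::alloc lets you raise the alignment of a column's backing buffer independently of the element type:

use std::alloc::{alloc, dealloc, Layout};

fn alloc_overaligned_column(len: usize) -> (*mut f32, Layout) {
    // Request 32-byte alignment (AVX2) for a column of f32s, even though f32
    // itself only needs 4-byte alignment; the cost is a few padding bytes.
    let layout = Layout::array::<f32>(len).unwrap().align_to(32).unwrap();
    let ptr = unsafe { alloc(layout) } as *mut f32;
    (ptr, layout)
}

// Later, the buffer is freed with the same layout:
// unsafe { dealloc(ptr as *mut u8, layout) };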

The next observation I've made is that if the alignment is good, then this could probably be implemented without too much trouble (of course now that I've said that... :) ). It only really makes sense for Components using Dense storage. I would imagine the Sparse storage implementation would just have an "adapter" to allow it to work, and possibly kick out a warning that what you're doing doesn't really make sense. Or, just panic, because that would be a leaky abstraction? Not sure what the Bevy policy is on that one. Maybe there's some compile-time magic that could make it so doing a "batched" query over a Sparsely backed component would fail to compile. The only reason you'd use this API would be for performance, so this is unlikely to affect a beginner.

It seems, due to the nice design of the Bevy ECS, that most of the work could be accomplished directly within the Query code. From what I can tell, aside from the alignment concerns I brought up earlier, no changes need to be made to the fundamental parts of the ECS to support this.

Regarding the Query API additions, there is a question of what the best way to expose the batches might be. Ideally, since the data is already aligned, we could return a "Batch" type that simply references that data, e.g., Batch<f32, 8>. This type would, provided certain constraints are met, be zero-copy convertible to some kind of SIMD type that would support your vector operations. (I'm not sure what the preferred way of doing SIMD is in Rust, but in C++ land, the xsimd library is really nice to work with and is general over different vector sizes.) For a start, we might consider just returning glam vectors (which unfortunately have the vector size hardcoded, so would only work on batch sizes of 2 or 4). Again, these are just ideas.

For types that don't have a "direct" SIMD representation, we could leave it to the user to define their own rearrangement.

Example usage, adapted from the example in the OP:

// These are just marker types for dimensions
struct X;
struct Y;
struct Z;

#[derive(Component)]
struct Position<Dimension>(f32, PhantomData<Dimension>);

#[derive(Component)]
struct Velocity<Dimension>(f32, PhantomData<Dimension>);

// For convenience, something like:
// type PositionBundle = (Position<X>, Position<Y>, Position<Z>);

fn update_positions<Dimension>(ents: Query<(&mut Position<Dimension>, &Velocity<Dimension>)>, time: Res<Time>)
{
    // This "8" for the batch size should be provided at compile time, maybe with a bevy_simd::NATIVE_F32_WIDTH constant?
    ents.for_each_mut_batch::<8>(|mut positions, velocities| {
        if unsafe { likely(positions.len() == 8) } { // positions.len() == velocities.len()

            // I'm being vague with the "simd" type here --
            // the only thing that is important is the support for + and scalar multiplication in this case.

            // This updates the positions in place. The "Batch" type maps the result back to the storage,
            // and ideally *is* the storage.
            positions = (positions.to_simd() + time.delta_seconds() * velocities.to_simd()).from_simd();

            // Note that .to_simd() would only be available for "SIMD" types, e.g., i32, f32... if you had a Vec3 in your batch,
            // you'd need to do this conversion yourself. This is the 'rearrangement' I mentioned earlier. And you'd
            // need to 'rearrange' this data back, too.

        } else {

            // Necessary, because there's no guarantee that the query results will be
            // a multiple of the batch size. I think this could happen if there's a query filter, or
            // if we're at the end of the iteration and the number of remaining elements isn't a multiple
            // of the batch size.
            // This is a topic that could be explored a bit more later.
            for (mut pos, vel) in positions.iter_mut().zip(velocities.iter()) {
                pos.0 += time.delta_seconds() * vel.0; // just scalar code
            }
        }
    });
}

// In main
app.add_system(update_positions::<X>);
app.add_system(update_positions::<Y>);
app.add_system(update_positions::<Z>);

The reason in this example each coordinate is processed separately is that if you process them together, you're going to incur a ton of cache misses from the data access pattern. It's better to "keep on going" down each coordinate than switch from X to Y to Z, which live in different storages and could be quite far in memory from one another.

Obviously this is a lot more complicated than the simple cases we're used to... but again, this is for cases where performance is a concern and we need to bring out the big guns. People should ideally prefer to not write things this way by default. The sample code only really makes sense when you start considering large numbers of entities, etc, that are being iterated over. Iteration could be combined with something like rayon even. I could see simulations being a particularly good use case for that.

On a final note, I would recommend allowing users to specify the batch size they want to work with, as sometimes you get some benefits by using a multiple of the native vector width (e.g., using a batch size of 16 instead of 8 for AVX) due to pipelining.

Sorry about the ramble -- I just find this really interesting, and I think Bevy is in a unique position to capitalize on SIMD this way.

EDIT: one last thing, notice that these are now 3 separate systems in the example. So even without something like rayon, there is opportunity to parallelize these systems as they work on completely different data.

InBetweenNames commented 2 years ago

Just curious, is there interest in a prototype implementation of this? Or benchmarks to show potential speedup under (good) circumstances? I believe I could mock up a benchmark with the existing Bevy. I'd be willing to take this on as a fun little project.

alice-i-cecile commented 2 years ago

I'd be interested in seeing this! I'd try to steer clear of splitting Transform in the proof of concept, just to avoid derailing.

InBetweenNames commented 2 years ago

Agreed! I'll keep the PoC focused on just a hypothetical custom "Position" and "Vector" component :)

InBetweenNames commented 2 years ago

@alice-i-cecile Not quite ready to share my code yet, but my initial benchmarks are promising:

[benchmark results chart]

"naive" is the normal Vec3 way of doing position + time*velocity (e.g., AoS representation) "naive_affine" (should read "native_aligned") is using Vec3A, otherwise the code is the same. The "batched" benchmark is where it gets interesting!

(BTW, this was done using the excellent criterion crate!)

The X axis represents the number of entities in the system; I went up in powers of 2. If there are some ideal numbers to reach here, please let me know. The Y axis represents how long it takes to do one iteration of computing position += time*velocity for each World.

What I did was "fake" the approach using the existing Bevy ecosystem by having each entity in the "batched" version manage 4 "virtual entities", each with their own position and velocity. Therefore, for the batched version, the true effective entity count was the same, even though the number of entities reported by Bevy was 1/4 of that in the "naive" benchmarks. The batched version was implemented using Vec4, which maps to vector instructions on my platform (x86_64).
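For reference, the "virtual entity" trick described above can be sketched roughly like this (component and system names are made up; Vec4 is glam's SIMD-backed type re-exported by Bevy):

use bevy::prelude::*;

// One Bevy entity carries the data of four logical entities, one per SIMD lane.
#[derive(Component)]
struct Position4 { x: Vec4, y: Vec4, z: Vec4 }

#[derive(Component)]
struct Velocity4 { x: Vec4, y: Vec4, z: Vec4 }

fn integrate(mut query: Query<(&mut Position4, &Velocity4)>, time: Res<Time>) {
    let t = Vec4::splat(time.delta_seconds());
    query.for_each_mut(|(mut p, v)| {
        // Each statement advances one coordinate of four logical entities at once.
        p.x += t * v.x;
        p.y += t * v.y;
        p.z += t * v.z;
    });
}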

Now, I fully expect this to improve further when I add in AVX and use packed_simd or some other crate to do SIMD properly. Still, I wanted to share these early results in the hopes this sparks some discussion. The benchmark code shouldn't be far off from "release" and I'll post it when it's ready. (Also, this was without -ffast-math enabled... which will yield even more significant differences!)

alice-i-cecile commented 2 years ago

Those are some promising results! Doesn't the A in Vec3A mean "aligned", not "affine" though? :)

InBetweenNames commented 2 years ago

Err yes... my bad! I'll edit the post (very sleepy right now).

InBetweenNames commented 2 years ago

While I'm working on some more in-depth benchmarks, I found this excellent post from the nalgebra project about different data layouts and their realization in Rust. The post is a few years old and I'm sure the state of the art has advanced a bit since then in how SIMD is actually realized in Rust, but the general concepts all still apply. There's some potential big wins on the table :)

InBetweenNames commented 2 years ago

Wow, so here's a somewhat unexpected result:

[benchmark chart]

I implemented a rough AoSoA representation for comparison (improvements noted and needed -- it currently relies on the struct layout placing the Vec4s sensibly). I know the SoA benchmarks could be improved, but I wasn't expecting a substantially better result from that alone. It could simply be that the SoA benchmark has a performance bug, which would explain the gap. The aosoa_sse4 benchmark uses the same "virtual entity" model that the soa benchmarks use. Also, I included a "suboptimal" iteration pattern for SoA to show the difference between iterating entirely through X, Y and Z in order compared to iterating through them interleaved. Not a huge difference.

I think to make this benchmark conclusive, I'm going to have to use the packed_simd crate to rule out everything else. I may need to switch to using nalgebra as well.

Unfortunately, the implementation of AoSoA into Bevy isn't an obvious thing to me at this time. Whereas SoA is straightforward and scales nicely with vector size, AoSoA kind of permeates the system a bit. Things need to be aware of it. It's probably not a huge deal, but it would definitely need some consideration. Still, I want to include some actual, hard, reproducible data to guide whatever decisions may come out of this.

CptPotato commented 2 years ago

@InBetweenNames These are some interesting results, thanks for investigating 👍

From what I gather the benchmark is very much leaning into the strengths of SIMD. I don't know that much about this topic, so my question is: What about access patterns that are less favorable for SIMD, and are these bad or common enough to cause a problem?

For example, I'm thinking about cases where random access happens, which would require 3x the amount of memory fetches for Vec3s if I understand the data layout correctly? I guess the answer is (as always) it depends™ but maybe you have some insight or numbers on this.

InBetweenNames commented 2 years ago

For example, I'm thinking about cases where random access happens, which would require 3x the amount of memory fetches for Vec3s if I understand the data layout correctly? I guess the answer is (as always) it depends™ but maybe you have some insight or numbers on this.

For the SoA layout, it will indeed incur more cache misses compared to AoSoA or AoS layouts, for the reasons you've mentioned. For AoSoA, this won't be nearly as bad. I'm curious, does random access occur commonly in Bevy systems? I always figured people would use the Sparse layouts for that (which batching wouldn't apply to). I'm fairly new to Rust and Bevy, so I'm still getting a handle on things here.

CptPotato commented 2 years ago

Thanks for clarifying!

I'm curious, does random access occur commonly in Bevy systems?

I don't have enough experience to give an exact answer to this. But I'd expect it to be not that common since it's also harder to use than straight up iteration. Though, some problems just don't fit this linear access pattern.

For example, I'd say anything that traverses hierarchy or some other kind of entity relation (e.g. transform and visibility propagation) ends up performing random* access. At least transform propagation requires access to all transform information (scale, rotation, position). iirc some form of acceleration structure for hierarchies was being discussed but I don't know about the current state or plans for this.

* it's obviously not actually "random" but afaik the data storage does not care about the hierarchy order

InBetweenNames commented 2 years ago

Ah, I hadn't even considered hierarchical relations! Thanks for that :) With that in mind, it may make sense to go for an AoSoA approach in the long run. You gain the benefits of efficient random access and SIMD potential. There are some complications on the implementation side of course... It would be super nice to be able to say:


#[derive(Component)]
#[component_layout_aosoa(<width>)] //width is probably 4 for x86_64 generic, but if you have AVX2 or something, you might go for 8.
//Or, if unspecified, you might let the compiler choose the width.
struct Position
{
   x: f32,
   y: f32,
   z: f32,
}

Which desugars to:


//with appropriate alignment, traits
struct PositionAoSoA4
{
  x: f32x4
  y: f32x4
  z: f32x4
}

impl LayoutAoSoA for PositionAoSoA4
{
    //number of lanes, metadata, etc
}

Later, Query uses this to determine the batch size. Unlike SoA, the batch size would have to be fixed to the AoSoA layout width specified at compile time. I'm not sure if Rust supports generics with lambdas, but if it does, you could actually provide a generic lambda that could be instantiated for any "width" required and have your code be relatively performance-portable:


fn optimal_system(query: Query<...>)
{
 //available only for AoSoA?  
  query.for_each_mut_batched(| batch | { ... }); //batch is generic and will work across a variety of widths
}

//perhaps, a batch might look like this:
enum Batch<T, const N: usize>
{
   //alignment directives, etc
   Partial([T]),  //slice type, size < N
   Full([T; N]),  //annotated with 'likely', either within for_each_mut_batched or somehow on the type.
}
//Now, code that uses Batch does not need to add the likely() check to help the optimizer:

fn optimal_system(query: Query<(&mut Position, &Velocity)>)
{
    query.for_each_mut_batched(|batch| {
        if let Full((mut positions, velocities)) = batch {
            // some batch fast path
            // Bevy added likely and assume FOR you!  Yay!
        } else if let Partial((positions, velocities)) = batch {
            // slow path: positions.len() == velocities.len() < N
            for (mut pos, vel) in positions.iter_mut().zip(velocities.iter()) {
                // fallback scalar slow path
            }
        }
    });
}

Code that "doesn't care" about layout can safely ignore it using regular iterators, etc:


fn straightforward_system(query: Query<...>)
{
  query.for_each(| component | { ... }); // for_each provides the expected semantics independent of the underlying layout
  for component in query.iter() { ... }; // iter also provides expected semantics
}

which automatically pick out the components for particular entities. Random access should be approximately as fast as the AoS case and the AoS "aligned" case (using Vec3A, for example). I realize this might be a pipe dream, but it would be very cool to see :) There are definitely some implementation challenges to solve for this though.

Aside: The astute reader might ask "can we transform the straightforward code into vectorized code?". I expect the answer to this to be "maybe", but it might be hard to do within the confines of the semantics of Rust. Automatic vectorization is possible under ideal circumstances, but usually you need language features to make this work predictably (e.g., ISPC, OpenCL, etc). I expect most production code would use the explicit batch-oriented approach detailed here for these reasons.

I actually had a small bug in my benchmarks today: the iter methods on Queries aren't as performant as the for_each methods. This is actually stated in the docs too -- oops. Here's the latest:

[updated benchmark chart]

I included a "swizzle" benchmark as well to see what the performance difference would be between rearranging contiguous Vec3s into SIMD form and back. Just for comparison.

As I have verified that the generated assembly is indeed using packed SIMD (on applicable benches, though I haven't guaranteed it is 'optimal'), I think the benchmark code is now ready for interested people to play around with. It has some limitations: it is limited to SSE4 due to the vector type in play, and it is not using the Rust equivalent of -ffast-math.

The benchmark code is available here:

https://github.com/InBetweenNames/bevy-simd-bench

InBetweenNames commented 2 years ago

ASIDE:

I was curious why the speedup for the non-AoS layouts wasn't as much as I expected, so I fired up VTune. With 4-wide SIMD, accounting for overhead, I expected somewhere around a 3.5x increase in speed. Instead, we observed only around a 2x increase. I discovered something that might be of interest:

[screenshot: inlined for_each_mut code]

In this screenshot, you're seeing the inlined for_each_mut for the AoSoA benchmark. The code accounts for a small percentage of what's actually going on inside the Bevy processing loop within for_each_mut. It seems there's actually a lot of bookkeeping going on in the background in between the code:

[screenshot: surrounding bookkeeping code]

There's even more outside of that range. This could be a good optimization target for the future :)

InBetweenNames commented 2 years ago

Okay, so it's not all roses as I'd hoped. It looks like more than just the Query types will need to be altered; the Tables themselves might need some notion of batching too. In particular, a query over a tuple (&T, &U, &V) with a batch size of N will need to return something like a (&[T], &[U], &[V]). These would need to be retrieved from the table(s).

Queries have filtering functionality too, which it seems could fragment iteration? Need to investigate the implications of that. If it does indeed cause fragmentation, does it make sense to have such a query and try to accelerate it with SIMD? Should this be excluded? Need some guidance on that.

Ideally, the alignment of the batch is a multiple of the required SIMD alignment. But this may not always be the case. It would be really nice to use the conditions at compile time and provide a suitable type with appropriate alignment guarantees baked into the type system. This would require the Layouts passed to the allocator to provide said guarantees (for the underlying storage of the Tables).

In fact, it might make sense to provide an additional "aligned" method if those alignment guarantees can be given. If there were a struct that was, say, 12 bytes in size, then this method would only be callable if the batch were a multiple of 4 bytes (not that useful). EDIT: was a little tired when I wrote this. The correct minimum viable batch size would have to be 48 bytes in total, or 4 elements (4 * 12 = 48). However, the unaligned batch would be just fine (although one would need to do "extra work" to really benefit from batches in this case).

NOTE: The above case would still require rearrangement to get to a SIMD form. It may make sense to restrict the "aligned" method to SIMD-like types, as there's probably not much benefit to alignment in the above case. However, the benefits of alignment would be much more apparent when working with, say, the SoA form of Position defined earlier in this post.

From what the benchmarks suggest, it may only make sense to implement the "aligned" batch and fail to compile otherwise (the "rearrangement" benchmark performed worse than even AoS). Since this is a performance-centric feature, it would actually be a leaky abstraction to downgrade to a "slow path" without informing the developer anyway.

Next, I'll investigate what changes need to be done to get table storage to allocate to the native SIMD alignment of the worst case for the target platform. This will provide more opportunities for aligned loads and have very little overhead. Hopefully I can isolate that work into a separate PR to prepare for the later changes.

NOTE: if the allocator isn't providing memory with the desired alignment for the tables, there will need to be both a "prologue" and an "epilogue" for each table where a partial batch would be processed. Making the allocator itself give aligned memory to the tables makes only the epilogue necessary. This is outside of "filtering" considerations, however, which will cause more of these partial batches if I'm understanding it right.

InBetweenNames commented 2 years ago

About midway through the prototype implementation. There's a lot to be improved for sure, but it's coming along. I wanted to note here that the change ticks are a little tricky to deal with during batched queries. I'll have to think over a good way to handle that in the near future. A low-bloat solution would be ideal.

In the meantime, I've needed to make a few additions to ThinSlicePtr, QueryState and WorldQuery to get "Read only" access going. The prototype also has the following caveats:

@alice-i-cecile Question: What's a good type to represent a bounded slice with potential overalignment (e.g., a slice of u8 but aligned to 32 bytes)? I'm thinking we should probably define our own type alias to something for now, and once std::simd is stable, switch that alias over to a type using those primitives to provide the desired alignment. This is important down the road because it allows one to know straight away that 1) the data is aligned properly for SIMD and 2) the optimizer should be able to emit aligned moves, etc, due to that.

The murky explorations of the ECS continues...

alice-i-cecile commented 2 years ago

I'm thinking we should probably define our own type alias to something for now, and once std::simd is stable, switch that alias over to a type using those primitives to provide the desired alignment. This is important down the road because it allows one to know straight away that 1) the data is aligned properly for SIMD and 2) the optimizer should be able to emit aligned moves, etc, due to that.

This lines up with my intuition on the matter, but @james7132 or @bitshifter are better people to ask here.

bitshifter commented 2 years ago

I don't really know enough about Bevy internals to answer. If you want to have an aligned array of u8, I don't think std::simd can help with that, as you will end up with an array of std::simd::u8x16, not u8. If you want an aligned struct you can assign into, you can do something like below without needing to wait for std::simd to stabilize.

#[repr(C, align(16))]
struct U8x16([u8; 16]);

and have a Vec of those. Obviously that means the stride will also be 16.

If you are loading bytes into SIMD registers, you should be able to load unaligned data without too much of a penalty, so if that's what you are wanting to do you don't strictly need allocations to be aligned.
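A small usage sketch of that suggestion (hypothetical function name): a Vec of the over-aligned wrapper keeps every element on a 16-byte boundary, at the cost of the 16-byte stride mentioned above:

#[derive(Clone, Copy)]
#[repr(C, align(16))]
struct U8x16([u8; 16]);

fn build_table(rows: usize) -> Vec<U8x16> {
    // The Vec's allocation honors U8x16's 16-byte alignment, and since the
    // struct's size equals its alignment, every element starts on a 16-byte
    // boundary (i.e. the stride is 16).
    vec![U8x16([0u8; 16]); rows]
}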

InBetweenNames commented 2 years ago

Unfortunately, unaligned loads are not portable across architectures. They can also have performance implications even where they are supported, especially if they aren't natively supported and have to be emulated. Otherwise I agree that on Intel at least, they aren't too bad.

With the right design, we can fix this up front for no cost :)

InBetweenNames commented 2 years ago

Made a good amount of progress. There's a few more changes than I originally anticipated but so far, reads are working and Miri isn't complaining. My working batch type definition is:

#[repr(align(16), C)] //FIXME: align needs to be configured properly at compile time
#[derive(Clone, Debug)]
pub struct AlignedArray<T, const N: usize>([T; N]);

#[derive(Clone, Debug)]
pub enum QueryBatch<'a, T, const N: usize> {
    Full(&'a AlignedArray<T, N>), // Full aligned batch of N
    Partial(&'a [T]),             //<= N.  MAY be aligned... under certain conditions
}

(thanks to @bitshifter for the repr(C) pointer with that).

Internally, this enum will actually decompose nicely along the code paths involved, emitting aligned moves, etc., where possible.

I also made an addition to ThinSlicePtr to enable "batched" retrieval of component elements. This is "smart" in that it will attempt to align your batches if possible by first giving you a partial batch. If alignment is not possible, only partial batches will be returned... but honestly, this should probably be a static assert. On that topic, I found a nice little hack to get compile-time static assertions in Rust, which is in the prototype. They can be replaced with debug asserts if desired.
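For reference, the kind of compile-time assertion hack being referred to usually looks something like this on stable Rust (names are made up; the real prototype may differ):

struct AssertValidBatch<const N: usize, const ALIGN: usize>;

impl<const N: usize, const ALIGN: usize> AssertValidBatch<N, ALIGN> {
    // Evaluated only when referenced, so the panic fires at compile time for
    // the offending monomorphization rather than at runtime.
    const OK: () = assert!(
        N > 0 && ALIGN.is_power_of_two(),
        "invalid batch size/alignment"
    );
}

fn for_each_batched<const N: usize, const ALIGN: usize>() {
    let _ = AssertValidBatch::<N, ALIGN>::OK;
    // ... batched iteration would go here ...
}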

Hopefully will make some more progress this week. So far, it's looking very good, and I expect when std::simd gets standardized, we can add the appropriate types to make the interface more ergonomic.

InBetweenNames commented 2 years ago

Another update. Due to the std::simd situation and there being no clear way to query the platform's preferred vector width, I've opted to have tables aligned to a minimum of 16 bytes for maximum portability. This appears to be the best we can do at the moment:


//FIXME: derive this in a maintainable way
#[cfg(all(any(target_feature = "avx"), not(target_feature = "avx512f")))]
pub const MAX_SIMD_ALIGNMENT: usize = 32;

#[cfg(any(target_feature = "avx512f"))]
pub const MAX_SIMD_ALIGNMENT: usize = 64;

//All platforms get 16-byte alignment on tables guaranteed.
#[cfg(not(any(target_feature = "avx", target_feature = "avx512f")))]
pub const MAX_SIMD_ALIGNMENT: usize = 16;

I agree it's not pretty, and it's also not scalable. Is there a better way to do this? Please let me know.

The main reason this constant is even in here is for generating "native" binaries. Since the developer has the option of using whichever batch width they feel is appropriate, it is possible that any supported instruction set gets used on a "native" build (e.g. with -march=native). This means that table iteration logic has to be prepared to meet that worst case, even if that instruction set never gets used. For example, a CPU might support AVX512F, but the developer might intentionally choose a batch type of 8 f32s to avoid the clock speed penalty paid for using the bigger vector widths available (while benefiting from the AVX512 instructions available at that lower width that aren't present in AVX2). In this case, only a 32-byte alignment is required. However, the table would be aligned to 64 bytes on this platform anyway, as this is not something we can determine ahead of time. For simulation code, the bigger batch width might be beneficial, for example.

The good news is that requiring a minimum 16-byte alignment on the table is hardly a problem. First, the alignment only applies to the table itself -- the elements remain exactly as they were -- so there is only an overhead of a few bytes per table. Second, most "C" allocators will give you a 16-byte-aligned chunk of memory anyway, and that's never been a contentious thing (although on my system, the Rust one doesn't). Third, the tables will be quite large relative to the overhead size. (I would actually have chosen MAX_SIMD_ALIGNMENT to be 64 bytes just to have a "one size fits all" solution, but I figured that probably wouldn't get accepted 😆). So, I settled on 16 bytes as a conservative choice.

I have writes working as well in a few small test cases. The change ticks were a little tricky to get right, but fortunately, with the alignment situation now "uniform" across tables, that can be vectorized as well.

Query tuples are next: with MAX_SIMD_ALIGNMENT, now aligned chunks can be safely obtained from multiple tables for iteration.

Also, I'm leaning more towards modifying for_each to take a "fast path" and "scalar path" lambda separately... this makes it easier from a usage perspective and also simplifies the implementation.

The implementation is getting to the point now where it's a matter of "taste" on how it looks, etc. I'll migrate this issue to a PR once it's ready. Lots of "playing around" code to clean up first.

bjorn3 commented 2 years ago

For example, a CPU might support AVX512F, but the developer might intentionally choose a batch type of 8 f32s to avoid the clock speed penalty paid for using the bigger vector widths available (while benefiting from the AVX512 instructions available at that lower width that aren't present in AVX2).

Would disabling avx512f and enabling avx512vl, avx512dw and avx512bw instead work? Or do those have a hard dependency on avx512f?

InBetweenNames commented 2 years ago

Would disabling avx512f and enabling avx512vl, avx512dw and avx512bw instead work? Or do those have a hard dependency on avx512f?

Great question. Actually, the behaviour you're after is entirely in the hands of the developer. I'm unsure if you can just straight up disable AVX-512F without also disabling all the extensions, but I have good news to share. Here's a rustc issue describing that the defaults are sane for Rust. The key is the -mprefer-vector-width=256 option, which achieves the effect that you want (AVX-512 instructions at the lower vector widths). The main difference is that AVX-512F is still enabled with this approach and you can use explicit AVX-512F intrinsics. However, the autovectorizer won't emit any ZMM instructions, if that makes sense. So, there's no risk of "accidentally" emitting the expensive instructions. Win/win!

On that note, in this circumstance, tables would still be 64-byte aligned, in case someone did want to use those explicit intrinsics. I expect that the 64-byte alignment might be beneficial anyway, as the table would start right at the beginning of a cache line. Actually, that may make a good case for increasing the alignment to 64 on x86-64 at least... maybe that will end up in the PR after all 😎

InBetweenNames commented 2 years ago

We may have to stick with just 16-byte-aligned chunks for the initial commit until generic_const_exprs is done. The reason: I'd like to be able to do something like the following:

#[repr(C, align(16))]
#[derive(Clone, Debug)]
pub struct AlignedArray16<T, const N: usize>([T; N]);

#[repr(C, align(32))]
#[derive(Clone, Debug)]
pub struct AlignedArray32<T, const N: usize>([T; N]);

#[repr(C, align(64))]
#[derive(Clone, Debug)]
pub struct AlignedArray64<T, const N: usize>([T; N]);

And select between arrays with some type-level machinery:


pub type AlignedArrayT<T, const N: usize, const ALIGN: usize> = ... //one of the three types above

I.e., AlignedArrayT<T, N, ALIGN> chooses from the three AlignedArray types. This is actually achievable right now. However, what makes it really shine is the magic below:


pub type AlignedArray<T, const N: usize> = AlignedArrayT<T, N,  { gcd::euclid_usize(core::mem::size_of::<[T;N]>(), MAX_SIMD_ALIGNMENT) }>

Now AlignedArray, given T and N, would correctly give you a compile-time guaranteed optimal alignment type for batch iteration. There's a little math behind the gcd part. Essentially, each batch has the size of [T; N], and this formula determines what the alignment of the [T;N] batches will be assuming that the table is aligned to MAX_SIMD_ALIGNMENT. (The proof of correctness is straightforward, but I'll omit it to keep on topic here).

This would be awesome to have... except you need generic_const_exprs for it to work 🤦. I did have a small version of it running using the typenum crate, but I figured that would almost certainly get rejected, especially since it required some patches to typenum to work. Not to mention the risk of increasing compile times due to all the type level machinery going on with that crate. Still was fun to try :)
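For what it's worth, the gcd computation itself can already be written as a plain const fn on stable (a sketch with hypothetical names below); it's only using the result as a const generic argument in the AlignedArrayT alias that needs generic_const_exprs:

const MAX_SIMD_ALIGNMENT: usize = 16;

const fn gcd(mut a: usize, mut b: usize) -> usize {
    while b != 0 {
        let r = a % b;
        a = b;
        b = r;
    }
    a
}

// Guaranteed alignment of each [T; N] batch, assuming the table itself is
// aligned to MAX_SIMD_ALIGNMENT: e.g. for a 12-byte Vec3-like T and N = 4,
// this is gcd(48, 16) == 16.
const fn batch_alignment<T, const N: usize>() -> usize {
    gcd(core::mem::size_of::<[T; N]>(), MAX_SIMD_ALIGNMENT)
}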


One way forward is to add an ALIGN parameter for for_each_batched like:


query.for_each_batched::<4, 32>(scalar_path, vector_path); //Batches are aligned to 32 bytes

and have this parameter be defaulted to 16 (the lowest AlignedArray type defined). This would be checked at compile time for correctness, but isn't the cleanest approach obviously. It does provide a natural upgrade path in the future when the necessary language features are stabilized though. The only risk of setting this parameter lower than it could be would be potentially lower performance on some architectures, but the code would otherwise still be correct. Setting the parameter too high would yield a compile time error. More AlignedArray types could be provided, e.g, AlignedArray4 and so on, but I expect their utility to be super limited and more of an implementation detail than anything. 16 is a sensible default in that, if you're unable to achieve that alignment with the given batch size, you're probably doing something wrong (e.g. a batch of two Vec3s would only be aligned to 8... not that useful, whereas a batch of 4 Vec3s would be aligned to 16 since gcd(48, MAX_SIMD_ALIGNMENT) == 16 which opens more possibilities).

An alternative would be to avoid this ALIGN parameter entirely and just stick with using an AlignedArray16 for the fast path, even if greater alignment is actually the case. This would provide speedups for SSE4 code (some/most instructions require an aligned memory operand). On AVX and higher, this might not be too bad either, since supporting microarchitectures are generally a lot more performant with unaligned loads anyway. I can't comment substantially on anything other than x86 on the CPU side, unfortunately.

EDIT: it looks like the parameter can't be defaulted to 16 as that feature doesn't apply to const generics on functions.

InBetweenNames commented 2 years ago

Just a small update: the code is coming along, but I've found a few ways of making the design a bit nicer, so I've held off on submitting for the moment. I'm also going to see if the batch alignment can be computed with a macro as a workaround for the issue noted above -- although even if it can't, the other workaround should still work fine. For the curious, the code lives in my fork here: https://github.com/InBetweenNames/bevy/tree/batched-queries but don't expect it to compile -- I mainly committed it just so I'd have a backup available :)

EDIT: should have some time in the next 5 days to dedicate a solid day to getting it cleaned up and make a real PR out of it.

InBetweenNames commented 2 years ago

Wow, has it been two weeks already? Time flies. I've been a little busy with a few things and vacation, but I got the prototype actually compiling tonight on my batched-queries branch above. There are a few remaining items before I make this PR:

And some nice to haves for future work:

My latest work has been pushed up in case anyone is interested in trying it out.

InBetweenNames commented 2 years ago

A few bugs to work out, I think. Performance is actually better than expected. In the screenshot below, you can see the "soa" representation simulated previously compared against the actual native SoA representation from the batched-queries branch (labeled simd_batch_4):

[benchmark chart]

It's a little kludgy, but this is how the prototype works so far in practice:

[screenshot: prototype usage in the editor]

As you can see, the desired alignment and batch size are passed in as const generic parameters. There are a few things I'd like to add to make this cleaner -- there's some "bridge" code in there that I don't want to have in the initial PR (conversion of batches of 4 f32s to Vec4s, for example). What I hope this screenshot emphasizes is how the types work out (the editor is VSCode with rust-analyzer). It's very exciting to see it all start coming together like this!

InBetweenNames commented 2 years ago

I wrote up a unit test today to comprehensively test things out, and the results are computed as expected. This caught a few small bugs, but only of the "off by one" variety (nothing fundamental). From a cursory examination of the generated assembly, I can conclude that empty filters really do optimize out entirely (e.g. F = ()). This is really good, and means that no duplication is necessary to achieve good performance in "filterless" cases.

However, there was a major bug caught that accounted for the performance increase: change tick modification was accidentally not being done, due to a lazy iterator not being traversed. Oops. Results of the bugfix are shown below:

[benchmark chart: simulated vs. native batching, with and without change detection]

In this graph, the "simulated" batching is compared against the native batching offered by batched-queries. In the nochange case, change detection is bypassed using the bypass_change_detection method; in the other case, it isn't. Look at the magnitude of the difference! What could account for this? The assembly listings show the following differences:

For native:

Address Source Line Assembly    CPU Time: Total CPU Time: Self
0x1400464e0 254 movaps xmm1, xmmword ptr [rcx+rbx*4]    0.1%    15.794ms
0x1400464e4 254 mulps xmm1, xmm0    1.8%    453.394ms
0x1400464e7 252 mov dword ptr [rsi+rbx*8+0x4], eax  1.6%    403.837ms
0x1400464eb 252 mov dword ptr [rsi+rbx*8+0xc], eax  1.5%    356.310ms
0x1400464ef 252 mov dword ptr [rsi+rbx*8+0x14], eax 0.1%    31.635ms
0x1400464f3 252 mov dword ptr [rsi+rbx*8+0x1c], eax 2.3%    562.202ms
0x1400464f7 254 addps xmm1, xmmword ptr [rdx+rbx*4] 0.9%    220.686ms
0x1400464fb 254 movaps xmmword ptr [rdx+rbx*4], xmm1    1.8%    453.304ms

For simulated:

Address Source Line Assembly    CPU Time: Total CPU Time: Self
0x140045a50 174 mov dword ptr [rdi+rsi*1+0x4], eax  0.5%    0.125s
0x140045a54 174 movaps xmm1, xmmword ptr [rdx+rsi*2]    0.2%    0.047s
0x140045a58 174 mulps xmm1, xmm0    2.7%    0.653s
0x140045a5b 174 addps xmm1, xmmword ptr [rbx+rsi*2] 1.3%    0.311s
0x140045a5f 174 movaps xmmword ptr [rbx+rsi*2], xmm1    4.5%    1.096s

And for native, nochange:

Address Source Line Assembly    CPU Time: Total CPU Time: Self
0x140044650 209 movaps xmm1, xmmword ptr [rcx+rbx*4]    1.5%    0.373s
0x140044654 209 mulps xmm1, xmm0    1.2%    0.296s
0x140044657 209 addps xmm1, xmmword ptr [rdx+rbx*4] 2.4%    0.593s
0x14004465b 209 movaps xmmword ptr [rdx+rbx*4], xmm1    4.7%    1.161s

The only real difference? The tick-updating code. In the "simulated" case, each change tick corresponds to four "virtual entities", whereas with native batching, each entity gets its own change tick. Therefore, the native version is doing "more work". Notice also that ComponentTicks is defined as a pair of u32s. This means that the updated memory is not contiguous, which makes optimization harder:

0x1400464e7 252 mov dword ptr [rsi+rbx*8+0x4], eax  1.6%    403.837ms
0x1400464eb 252 mov dword ptr [rsi+rbx*8+0xc], eax  1.5%    356.310ms
0x1400464ef 252 mov dword ptr [rsi+rbx*8+0x14], eax 0.1%    31.635ms
0x1400464f3 252 mov dword ptr [rsi+rbx*8+0x1c], eax 2.3%    562.202ms

Note the indices here: 0x4, 0xc, 0x14, 0x1c. Sure, this fits in a single cache line, but these are spread out. Not good!

I would really like to support fast mutation code with change ticks. To do that, I'm going to try my hand at refactoring ComponentTicks to be SoA instead of AoS as it is currently. This will make it so vector instructions can be used to update all change ticks in a batch in one go (ideally).
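Schematically (these are sketches, not the actual Bevy types), the change is from interleaved per-row tick pairs to two parallel columns that can be written with vector stores:

// Current layout (AoS): one pair per table row, so consecutive "changed"
// ticks are 8 bytes apart, matching the scattered stores seen above.
struct ComponentTicksAoS {
    added: u32,
    changed: u32,
}

// Proposed layout (SoA): each kind of tick lives in its own contiguous column,
// so a whole batch of ticks can be updated with a single aligned vector store.
struct ColumnTicksSoA {
    added: Vec<u32>,
    changed: Vec<u32>,
}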

Since I refactored component ticks to be in batches of AlignedArray* now, part of this work is already complete. Ordinarily, I would count such a thing as scope creep, but I feel this is fundamental to making this feature worth it to developers. Otherwise, we'd have to advise people doing batched queries to use bypass_change_detection.

alice-i-cecile commented 2 years ago

As an FYI, we've been chewing on the idea of splitting change ticks out into their own phantom components. It would help with alignment, as you've seen here, but also make opting out completely much more feasible.

InBetweenNames commented 2 years ago

So far it's been less work than I imagined -- you're right, they really do work like components. I suspect that after getting that small refactor done, the performance hit will be at the expected level. To facilitate this change, I'm adding a new type, AlignedVec, which is basically a Vec whose internal buffer is aligned to MAX_SIMD_ALIGNMENT. It is implemented in terms of BlobVec for simplicity, and it isn't a full implementation of the Vec API -- just enough to do the job.

I looked at other solutions on crates.io to achieve this and didn't see anything production ready, so hopefully this is OK for now :)

Also, with this new model, check_change_ticks can be vectorized as well! I won't include that in the PR as it's something that could be added later, but it's nice to see things coming together and dovetailing with the existing design decisions.

alice-i-cecile commented 2 years ago

Awesome :D Bias towards small, incremental PRs so they're easy to review!

InBetweenNames commented 2 years ago

Yeah, I may have misspoken on the last point 😆 ... enough changes have been made now that I feel the change detection work should be a separate PR. However, I will still keep the AlignedVec stuff, as it's necessary for keeping the batching invariants for Entities and ComponentTicks, and it's not a lot of code. Given the "fundamental" changes, this will still probably be a bigger PR, as it makes small changes in a number of different places. Maybe we could work together on converting ComponentTicks over to two phantom components after this PR is made?

alice-i-cecile commented 2 years ago

I'm on board to see what that looks like :) Hard to fully assess without seeing the changelist honestly.

InBetweenNames commented 2 years ago

The PR is finally here: https://github.com/bevyengine/bevy/pull/6161

richchurcher commented 21 hours ago

Closed PR #6161 during backlog cleanup. Could still be adopted in 2024, as the PR has been updated several times over the years!

InBetweenNames commented 19 hours ago

Hard to believe it's been so long since I worked on this. I've since moved onto other things but I still think this is worth doing based on the benchmarks I showed previously. My PR is probably way out of date by now and it's probably best to start fresh. Good luck!

richchurcher commented 4 minutes ago

Hard to believe it's been so long since I worked on this. I've since moved onto other things but I still think this is worth doing based on the benchmarks I showed previously. My PR is probably way out of date by now and it's probably best to start fresh. Good luck!

Yes, someone on Discord said much the same (still worth doing) so it's on my list of issues to open! Thanks for the original work, we'll link it as a reference for anyone looking to implement.