bevyengine / bevy

A refreshingly simple data-driven game engine built in Rust
https://bevyengine.org
Apache License 2.0

High performance sorted queries #13464

Open iiYese opened 1 month ago

iiYese commented 1 month ago

What problem does this solve or what need does it fill?

#13417 adds sorting at an iterator level. These sorts:

What solution would you like?

  1. Command to reorder tables.
  2. Rearrange tables for sorted queries by default.
  3. Make queries cache sorts & use change detection to early out of resorting.
  4. Condense sorts to low..=high ranges for cache friendly iteration.
  5. Optionally cache iteration position to allow suspending and resuming iteration between system runs. This is useful for UI scroll areas and rhythm games.
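Point 2 boils down to reordering a table's rows: compute one permutation from the sort key and apply it to every column so rows stay aligned. A plain-Rust sketch with hypothetical names (illustrative SoA reordering; Bevy's actual `Table` storage differs):

```rust
/// Sort two parallel "columns" of a table by the key column, keeping
/// rows aligned. (Illustrative only, not Bevy's Table API.)
fn sort_table_by_key(key: &mut Vec<i32>, other: &mut Vec<&'static str>) {
    // Permutation that sorts the key column.
    let mut perm: Vec<usize> = (0..key.len()).collect();
    perm.sort_by_key(|&i| key[i]);
    // Apply the same permutation to every column.
    *key = perm.iter().map(|&i| key[i]).collect();
    *other = perm.iter().map(|&i| other[i]).collect();
}
```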

Final API should look something like this:

```rust
fn cmp(left: FilteredEntityRef, right: FilteredEntityRef) -> std::cmp::Ordering {
    left.get::<A>().unwrap().cmp(right.get::<A>().unwrap())
}

fn sys(mut query: Query<(&A, &B)>) {
    for (a, b) in query.sort(cmp)
        // Optional call that will prevent tables from being reordered.
        // Useful if you have multiple systems requesting slightly different sorts or
        // for completely opting out of reordering & doing the ordering manually.
        // The user could do multiple stable sorts to get dense iteration across multiple systems.
        .no_reorder()
        .iter()
    {
        // ..
    }
}
```

```rust
#[derive(Component)]
struct Offset(pub f64);

fn sort_by_offset(/* .. */) {
    // ..
}

fn timeline(mut query: Query<(&Offset, &A, &B)>, time: Res<Time>) {
    for (offset, a, b) in query.sort(sort_by_offset)
        // Resume iteration from a previous run & update it for a future run.
        // Assuming the following:
        // - some cached pos: usize
        // - a needle: T
        // - a const window size N: usize
        //
        // seeking will:
        // - Use linear search to look for the needle among the nearest neighbors pos-N..=pos+N.
        // - Fall back to binary search if the range doesn't contain the needle (we've jumped).
        .seek(time.elapsed_seconds_f64(), |entity, needle| { /* .. */ })
        .iter()
        .take(5)
    {
        // ..
    }
}
```
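The seek strategy described in the comments can be sketched in plain Rust. This is a hypothetical helper over an ascending, non-empty key column, not the proposed Bevy API:

```rust
/// Find the index whose key "contains" the needle: first a linear scan
/// of the 2N+1 nearest neighbors around the cached position, then a
/// full binary search if the needle left that window (we've jumped).
/// Assumes `sorted` is non-empty and ascending.
fn seek<const N: usize>(sorted: &[f64], cached_pos: usize, needle: f64) -> usize {
    let lo = cached_pos.saturating_sub(N);
    let hi = (cached_pos + N).min(sorted.len() - 1);
    // Linear search among nearest neighbors: cheap when the cursor
    // only advances a little between runs.
    for i in lo..=hi {
        let next = sorted.get(i + 1).copied().unwrap_or(f64::INFINITY);
        if sorted[i] <= needle && needle < next {
            return i;
        }
    }
    // Fall back to binary search over the whole column.
    sorted.partition_point(|&x| x <= needle).saturating_sub(1)
}
```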

What alternative(s) have you considered?

Additional context

Missing prerequisite features:

Not required but helpful missing feature:

alice-i-cecile commented 1 month ago

This is a follow-up to #1470. The other thing that this opens up is efficiently querying (using dynamic queries only, since enum variants aren't types) for enum variants. If we can store tables in a sorted form, requesting all Status::Dead entities becomes genuinely fast, rather than just sugar for a .filter operation.

Victoronz commented 1 month ago

First of all, thanks for the issue!

I'd like to go through the parts of your suggested solution:

  1. Can be done, see below

  2. This is not doable by default. It would require mutable access to tables, i.e. `&mut Tables`, which is only obtainable by engine code with `&mut World` (see the `bevy::ecs::storage` docs); even with interior mutability, this would still not work. To mutate tables by default, you would need mutable access to tables whenever you can obtain a query. Since systems can have aliasing and disjoint queries, both within themselves and across parallel systems, this is not possible. Sort calls don't appear in function signatures, so the scheduler can't work around them either.

    To mutate a table/world you need an exclusive system or a QueryData impl that has complete, unique access to a table, which leads me to make this replacement suggestion: A set of exclusive systems that expose the same API as the sorts implemented in #13417, without the QueryIter parameter and commands that can run those systems.

    These would no longer use transmute functionality internally, just sort tables directly. Some catches with this are that archetypes can share tables, and we'd need to sort sparse sets as well. This may lead to unexpected behavior on the user side. It'd also be less "table sorts" and more "archetype sorts".

    With my suggestion, a user could manually order archetype sorts to happen in front of or after their systems, or spawn commands in their systems to do that.

  3. For each of these archetypes we'd store some kind of "last_sorted_by key". Together with change detection this would allow exiting early. I don't see the need to have queries themselves cache the sorts with this.

  4. Definitely perf improvements to be had here! However this will be heavily data set dependent, so good benchmarks would be needed.

  5. Functionality-wise, Iterator::nth does this. It may be useful to have some kind of indexing here though; this could be done with dedicated functions on the QuerySorted*Iter structs.
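The change-detection early-out from point 3 could be sketched like this (an assumed design for illustration; `SortCache` and its fields are hypothetical, not existing Bevy API):

```rust
/// Per-archetype "last sorted by" cache: re-sorting is skipped when the
/// same key was used last time and nothing changed since that sort.
/// (Assumed design for illustration.)
struct SortCache {
    last_sorted_by: Option<u64>, // id/hash of the comparator last used
    last_sort_tick: u32,         // change tick when the sort ran
}

impl SortCache {
    fn needs_resort(&self, sort_key: u64, latest_change_tick: u32) -> bool {
        // Resort if the key differs or any component changed since.
        self.last_sorted_by != Some(sort_key) || latest_change_tick > self.last_sort_tick
    }
}
```

A sort would first check `needs_resort`; on a hit it reuses the cached order.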

On the example itself (other than the already presented issues above):

iiYese commented 1 month ago
  2. Yes it is. Flecs already does this. You would not need mutable access to tables; this is why a command is used. See the first point in the additional notes. Essentially the first run of the system will not have dense iteration; it will enqueue the command to reorder tables & the second run will have dense iteration. I don't like the alternate suggestion; it's too clumsy.
  3. You cannot store caches at an archetype level. Archetypes are not enough. If you have archetypes (A, B) & (A, B, C) with tables that look like this:
    
```
// A, B
A(0), B
A(4), B
A(5), B

// A, B, C
A(1), B, C
A(2), B, C
A(3), B, C
```

And you want to sort by `A` you need to first iter `(A, B): 0..=0` then `(A, B, C): 0..=2` then `(A, B): 1..=2`. This is just a simple case with 2 archetypes and 1 system & 1 sort key. Queries are the only place you can cache this information. Caching at a query level also allows you to do things like:
- Sort by `B`
- Stable sort by `A`

If you want different sorts in different systems. You will get dense iteration across `A` & not completely dense but still some dense iteration across `B`.
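The dense iteration order described here can be derived by condensing the sort's (table, row) visit order into contiguous runs. A hypothetical plain-Rust sketch (table 0 = `(A, B)`, table 1 = `(A, B, C)`; names are illustrative, not Bevy API):

```rust
use std::ops::RangeInclusive;

/// Condense a sort's (table, row) visit order into contiguous
/// `(table, low..=high)` runs, so iteration is dense within each run.
/// (Hypothetical helper for illustration.)
fn condense(locations: &[(usize, usize)]) -> Vec<(usize, RangeInclusive<usize>)> {
    let mut runs: Vec<(usize, RangeInclusive<usize>)> = Vec::new();
    for &(table, row) in locations {
        // Extend the current run when this row directly follows it in
        // the same table; otherwise start a new run.
        let extend =
            matches!(runs.last(), Some((t, range)) if *t == table && *range.end() + 1 == row);
        if extend {
            let (_, range) = runs.last_mut().unwrap();
            *range = *range.start()..=row;
        } else {
            runs.push((table, row..=row));
        }
    }
    runs
}
```

For the example above, `condense(&[(0, 0), (1, 0), (1, 1), (1, 2), (0, 1), (0, 2)])` yields `[(0, 0..=0), (1, 0..=2), (0, 1..=2)]`, i.e. exactly the `0..=0`, `0..=2`, `1..=2` ranges.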

4. Benchmarks are irrelevant, as the worst case is simply what we already have now, which is completely random access.

5. `Iterator::nth` [still iterates](https://doc.rust-lang.org/src/core/iter/traits/iterator.rs.html#305); it is not an `O(1)` operation, as this would be.
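To make the cost concrete, a counting iterator shows that the default `Iterator::nth` walks every element rather than jumping:

```rust
// Iterator::nth is not random access: the default implementation
// advances the iterator one element at a time, so nth(k) is O(k).
struct Counting {
    next: usize,
    steps: usize, // how many times next() was called
}

impl Iterator for Counting {
    type Item = usize;
    fn next(&mut self) -> Option<usize> {
        self.steps += 1;
        let n = self.next;
        self.next += 1;
        Some(n)
    }
}

fn main() {
    let mut it = Counting { next: 0, steps: 0 };
    assert_eq!(it.nth(999), Some(999));
    // nth(999) had to call next() 1000 times.
    assert_eq!(it.steps, 1000);
}
```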

> FilteredEntityRef and EntityRef currently act the same in type position, so better use the shorter one.

I don't think you actually understand the difference between these. `EntityRef` gets read access to *all* of an entity's components, meaning you'll block parallelism if you allow this safely. `FilteredEntityRef` acts like `EntityRef` but only has access to a subset of components; in this case it'd be whatever appears in the `D: QueryData` part of the query.

> To sort by EntityRef you need to include it in the original QueryData, lest you alias with disjoint mutable queries.

That is another reason why it has `FilteredEntityRef`. You can transmute this from any `D: QueryData`.
iiYese commented 1 month ago

I've updated the seek example to be more flexible.

Victoronz commented 1 month ago
  1. It seems I misunderstood you on this. What would this command spawning look like? You say that this would use thread-local commands; does this mean we would be using Commands implicitly? Or are we having the users pass it? I still think the explicit sort systems would be useful for users, though separately, since they don't seem to cover your needs here.

  2. In that case, we'd be mutating QueryState, right? Does this fit with current ECS plans? In terms of execution, I would still not want it by default, because this would be hidden behavior and potentially introduce unexpected costs for the user on first iteration. How about a Cached enum as an additional parameter, instead of chained method calls? The user decides explicitly whether the sort is cached, no need for implicit cost other than a single, well-predicted branch. The enum would have two variants, one for cached and uncached behavior each.

  3. I'd still like to see two cases: The new version that uses adjacent runs of data vs the current version, on a dataset with no adjacent items after sort, and a dataset that is already sorted except for one item, i.e. the worst and best case. AFAIK, there are no benchmarks on this yet, so I would not like to be guessing how big of a difference it makes.

  4. I wasn't clear on this, that's what I meant!
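The `Cached` parameter suggested in point 2 might look like this sketch (a proposed shape from this discussion, not an existing Bevy type; `should_queue_reorder` is an illustrative helper):

```rust
/// Whether a sort should queue a table reorder so later runs iterate
/// densely. (Hypothetical type for illustration.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Cached {
    /// Queue a command to reorder tables; later runs iterate densely.
    Cache,
    /// Sort only this call's results; leave table order untouched.
    Uncached,
}

fn should_queue_reorder(cached: Cached) -> bool {
    matches!(cached, Cached::Cache)
}
```

A call would then read `query.sort(cmp, Cached::Uncached)` instead of chaining `.no_reorder()`.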

On FilteredEntityRef, if that's how it works, I didn't actually understand it: this ability needs more docs; neither transmute_lens nor the struct itself says it can be transmuted from any QueryData.

iiYese commented 1 month ago
  1. On main there's already `World::commands`. The problem is that it currently takes `&mut World`, whereas queries get `&World`. Adding a `World::parallel_commands` that takes `&World`, or just making the command queue thread-local, would allow queries to enqueue commands.

  2. There's nothing incompatible about this with future ECS features. Either on QueryState or Query is fine. It will not be hidden behavior, because it will be explicitly stated in the docs & there will be a way to opt out of it. With regard to an additional param, I prefer well-designed defaults with ways to opt out instead of noisy configuration.

  3. This is not guessing. I can confidently assert the difference this would make. You do not need this to be implemented to observe what is well-established hardware behavior. It is the reason bevy and other archetypal ECSs tend to be orders of magnitude faster at iteration than things like sparse set ECSs. If you really want, you can play around with flecs, which already has this implementation.

Victoronz commented 1 month ago
  1. Taking a broader view, are there any other things that need caching on the Query level? To me this sounds like a problem that isn't unique to sorting queries. Would it make sense to make use of some shared infrastructure here?

    I think we disagree here. Of course this will depend on implementation, but as per my current understanding: my issue with this is that sort itself would queue the caching, and no_reorder would unqueue that caching within the same method chain. Not wanting to cache sorts will introduce that configuration noise you speak of into an iter chain that one would like to for loop over. How about we have some call that takes Query mutably and mutates some queue_sort_cache setting, causing subsequent sorts to follow that behavior? (Sorts would use an already present cache regardless of this setting.) If you sort the same query twice in a different way in the same system, the "no-reorder noise" can be placed before the iteration loop, and if a user still wants a sort to be cached afterwards, they can queue the command manually.

    This would also mean we could introduce some global setting that enables or disables this behavior entirely. We would still need a default, but this would be cleaner IMO.

  2. I know that it is faster and can be by a magnitude, and I guess this wouldn't be a blocker, but I would like to be able to show by how much!

Victoronz commented 1 month ago

3./4./5. Most improvements to QuerySortedIter should also be applied to QueryManyIter and its future offshoot types. QuerySortedIter is basically a special case of the more general QueryManyIter.

iiYese commented 1 month ago

> Taking a broader view, are there any other things that need caching on the Query level? To me this sounds like a problem that isn't unique to sorting queries. Would it make sense to make use of some shared infrastructure here?

With queries-as-entities, queries can have row polymorphism like everything else that's at an ECS level. This means caching is no longer limited to either:

Query entities can have different components for whatever caches they need. Aside from sort caches they will need other cache structures as we add new features. Two things that come to mind:

I would not worry about queries as entities as they're one of the prerequisite/accompanying features for relations and the sort cache can be refactored when they land.

Victoronz commented 1 month ago

Another question I have about the final API example: currently, a user calls iter, then sort; here this is reversed. When sort also lands for iter_many, how would this work? The user can no longer specify a subset of a query to sort. I.e. if a user wishes to sort 3 out of 1000 entities, then they do not wish to pay for a full query sort in advance. Additionally, since the actual table is only sorted after the first run, the query.sort call would have to return a sorted Vec of query results to iterate over. Entity is not a simple index into this, and its matching item would instead have to be found in a more expensive manner.

iiYese commented 1 month ago

> Currently, a user calls iter, then sort, here this is reversed. When sort also lands for iter_many, how would this work?

This will either remove the existing API or exist alongside it.

> The user can no longer specify a subset of a query to sort. I.e. if a user wishes to sort 3 out of 1000 entities, then they do not wish to pay for a full query sort in advance.

This sounds like an esoteric misfeature. How would a user know exactly which subset of items they would want sorted in the first place? There are zero guarantees about iteration order.

> Additionally, since the actual table is only sorted after the first run, the query.sort call would have to return a sorted Vec of query results to iterate over. Entity is not a simple index into this, and its matching item would instead have to be found in a more expensive manner.

There should be no single use vectors. QueryState/Query wherever the cache ends up gets 2 new fields:

If .seek is not used the sort_cursor is reset to (0, 0).

Victoronz commented 1 month ago

> The user can no longer specify a subset of a query to sort. I.e. if a user wishes to sort 3 out of 1000 entities, then they do not wish to pay for a full query sort in advance.

> This sounds like an esoteric misfeature. How would a user know exactly which subset of items they would want sorted in the first place? There are zero guarantees about iteration order.

E.g. if we want to sort the children of an entity: the Children or some other component stores a list of entities, which we retrieve by iter_many, then sort. I do not find this esoteric at all. This would only be relevant if this is intended to replace the existing API.

iiYese commented 1 month ago

That's not a subset of entities in a query; that's mapping an entity list. Which is at best niche because

Adamkob12 commented 1 month ago

> Essentially the first run of the system will not have dense iteration, it will enqueue the command to reorder tables & the second run will have dense iteration.

When should the command be executed? If it's executed right away and reorders the Table, then it's very likely that the components will be mutated before the second sort is needed, requiring another table reordering just in time for the sorted query.

The logical conclusion is that the reordering of the table must happen right before the system that requires the sort runs (or after the last mutable access, but it's impossible to know when that would be).

After thinking about it, I think there are 3 options that allow for dense, sorted iteration:

  1. Designate certain archetypes as "hot", meaning their tables would have to sort themselves after each mutation (only one sort per archetype).
  2. Let the user call a command to sort archetypes manually (a bit of a footgun).
  3. Reorder the table right at the .sort() call, and then iterate densely (this needs a special type of Query, because no one must access the table at the time of the sort).

iiYese commented 1 month ago

> When should the command be executed?

In the next sync point. So basically the first run would be without dense iteration, the second run will need to rebuild the sort cache, and every run after that will be dense (until there's a mutation, which will start this whole process again).