bevyengine / bevy

A refreshingly simple data-driven game engine built in Rust
https://bevyengine.org
Apache License 2.0
36.36k stars 3.59k forks

entity_ref `get_component_with_type` is slow #2495

Closed tower120 closed 2 years ago

tower120 commented 3 years ago

What problem does this solve or what need does it fill?

I'm trying to get component data from EntityIds (Vec<Entity>) with world.get(entity)... I'm experiencing an "unexpected" performance drop (compared to Vec<Rc>).

Looking at the code of World::get<T: Component>(&self, entity: Entity) ... there is so much indirection: getting the component index from a hashmap, then bursting through 3 indirections to get the component "Column", plus one branch on the storage type.

What solution would you like?

Is it possible to somehow... prefetch/cache that archetype table_row/column?

That's just too much... That's waaaay slower than just cache misses with Rc.

Additional context

I use Vec<Entity> not that often... But sometimes you just have to use it ....

TheRawMeatball commented 3 years ago

AFAIK, Queries do this sort of caching :)

tower120 commented 3 years ago

@TheRawMeatball Nice! Now if only there were a "queryable" collection of entities....

tower120 commented 3 years ago

@TheRawMeatball Or did you mean world.query(...).get(entity)?

TheRawMeatball commented 3 years ago

The second, but not quite like that - you need to construct the query with the components you're interested in first, and store it alongside the list. It'll do some of those indirections at build time, and then you can use get to go fast :)

tower120 commented 3 years ago

So... From a system query it should already work?

// For example
fn my_system(datas: Query<&Data>) {
    for entity in entities {            // entities: Vec<Entity>
        let data = datas.get(entity);
    }
}
cart commented 3 years ago

Looking up an Entity's component vs looking up a value in a Vec will never be "identical", because we will always need indirection to determine (1) whether or not the entity has that component and (2) where that component exists. Vec lookup is the "upper limit" that ECS impls strive for (but will never reach ... it's a logical guarantee).

That being said, it should still be "approximately" the same cost as Vec lookups if you cache things like TypeId -> dense_component_index lookups, which we already do for Queries. And iteration will be very close to Vec iteration, because these lookup costs only need to happen once instead of per-entity.
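The caching cart describes can be sketched with a std-only toy (the names `Columns`, `index_of`, `get_slow`/`get_fast` are illustrative, not Bevy's actual types): pay the TypeId hash lookup once, then reuse the plain dense index.

```rust
use std::any::TypeId;
use std::collections::HashMap;

// Toy model of caching a TypeId -> dense column index lookup.
struct Columns {
    index_of: HashMap<TypeId, usize>, // TypeId -> dense column index
    columns: Vec<Vec<u32>>,           // dense component storage
}

impl Columns {
    // Uncached path: a hash lookup on every access.
    fn get_slow(&self, ty: TypeId, row: usize) -> Option<u32> {
        let col = *self.index_of.get(&ty)?;
        self.columns[col].get(row).copied()
    }

    // Cached path: resolve the hash lookup once up front...
    fn resolve(&self, ty: TypeId) -> Option<usize> {
        self.index_of.get(&ty).copied()
    }

    // ...then every subsequent access is a plain double index.
    fn get_fast(&self, col: usize, row: usize) -> Option<u32> {
        self.columns[col].get(row).copied()
    }
}
```

This is the "amortizes to zero" point: the expensive resolution happens once per query, not once per entity.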

So... From a system query it should already work?

Yup. QueryState, which is used by system Queries and returned by world.query(...), already caches everything that it (reasonably) can.

tower120 commented 3 years ago

But.... It looks like get_unchecked_manual, called by query.get(), reconstructs the Fetch and Filter state on each call (I'm not sure which exact implementation is called, but it looks like there are some branches in Fetch::init). Clearly, there is no such overhead for iteration...

P.S. Are component pointers stable for alive entities?

cart commented 3 years ago

We don't reconstruct the Fetch and Filter state in get_unchecked_manual; we pass the cached state into the Fetch and Filter instances. Fetch::init does do the work of looking up pointers to storages, but these cannot be assumed to be stable. Any time a new component or entity is added (or removed), that could trigger a reallocation of the array, which would invalidate the pointer. Caching that pointer would be very unsafe.

cart commented 3 years ago

For a given world borrow and set of entity lookups using that borrow, we can assume it is stable because nothing can add/remove entities or components during that time. But we can't store that in QueryState, which exists outside of a given World borrow.

We might be able to cache the Fetch and Filter instances inside of System Queries though (because components cannot be added / removed in a way that invalidates these pointers when the system is running). I can't assert that is valid off the top of my head though.

tower120 commented 3 years ago

I see, thanks. I think I need to benchmark this more thoroughly... At first glance it looks like accessing a resource is way faster than accessing a component. Will return with results.

cart commented 3 years ago

At first glance it looks like accessing a resource is way faster than accessing a component.

This makes sense because we can skip the "entity location lookup" and move right to the "component storage lookup". However in practice they should still be pretty similar. Keep in mind that we're talking about very small / inexpensive operations here. Things like "an extra array lookup" will register on benchmarks, even though that is an extremely cheap operation in the grand scheme of things.
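The "skip the entity location lookup" point can be illustrated with a std-only toy (the `ToyWorld` type and field names are made up for this sketch, not Bevy's storage): a resource is effectively one global slot, while a component access resolves the entity's location first.

```rust
// Illustrative toy: resource access vs component access.
struct ToyWorld {
    resource: u32,         // one global slot: direct access
    locations: Vec<usize>, // entity id -> row in the component storage
    storage: Vec<u32>,     // component column
}

// Resource lookup: no entity location step at all.
fn get_resource(world: &ToyWorld) -> u32 {
    world.resource
}

// Component lookup: the extra "entity location lookup", then the storage.
fn get_component(world: &ToyWorld, entity: usize) -> u32 {
    let row = world.locations[entity];
    world.storage[row]
}
```

Both paths are cheap; the component path just carries one extra array hop, which is exactly the kind of tiny cost that shows up on microbenchmarks.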

cart commented 3 years ago

Definitely let us know if you find any optimization potential! We're already very competitive on ECS benchmarks, but faster is better!

tower120 commented 3 years ago

Reporting in. In-system Query::get is 5-6 times slower than Arc.

https://github.com/tower120/bevy/commit/33a73fa51abf7ea44439eb8f99893b8630d967a3

entity                  time:   [136.58 us 137.07 us 137.57 us]
                        change: [-4.5581% -3.5928% -2.7019%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

box unique              time:   [20.592 us 20.669 us 20.752 us]
                        change: [+1.2907% +1.7769% +2.3090%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild

arc mutex               time:   [166.39 us 166.87 us 167.42 us]
                        change: [+0.1934% +0.6981% +1.2190%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

arc unsafe              time:   [25.175 us 25.265 us 25.365 us]
                        change: [-1.1272% -0.5776% -0.0209%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

If you want, I can make PR with that benchmark.

FeldrinH commented 3 years ago

Query::get performance also has a big impact on bevy_rapier performance, because it stores the rigid bodies, colliders, and other related data directly in Bevy ECS and does a lot of random access with Query::get when processing physics.

FeldrinH commented 3 years ago

According to one of the rapier contributors on their discord, bevy_rapier is about twice as slow as rapier with built-in data structures. (Haven't seen any benchmarks for this though)

tower120 commented 3 years ago

I also benchmarked query.for_each vs vec.for_each. https://github.com/tower120/bevy/commit/4d97e6fba47d9238e418ae087c8c2c1ab1963e9c

It appears that Vec traversal is 3-4 times faster than the query...

entity iter             time:   [190.90 us 191.37 us 191.85 us]
                        change: [-2.9586% -2.6283% -2.2918%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

vec iter                time:   [59.864 us 59.994 us 60.179 us]
                        change: [-2.4671% -1.8886% -1.3690%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  6 (6.00%) high severe

So it may make sense to pack a few lightweight systems into one heavier one... Frankly, I did not expect that difference...

tower120 commented 3 years ago

There is a branch inside archetype_fetch - match self.storage_type - and it runs inside each iteration... Maybe the API could be changed somehow to have separate loops for each storage_type, so that less work is left inside the tight loop...

UPDATE Pardon, that's for the non-dense path only.
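The idea of hoisting that per-iteration branch can be sketched with a std-only toy (the `Storage` enum and functions are illustrative, not Bevy's actual archetype_fetch): match once on the storage type, then run a tight loop per variant.

```rust
// Illustrative storage-type branch, modeled after "match self.storage_type".
enum Storage {
    Dense(Vec<u32>),
    Sparse(Vec<Option<u32>>),
}

// Branch inside the loop: the match runs once per element.
fn sum_branch_inside(storage: &Storage, len: usize) -> u32 {
    let mut total = 0;
    for i in 0..len {
        total += match storage {
            Storage::Dense(v) => v[i],
            Storage::Sparse(v) => v[i].unwrap_or(0),
        };
    }
    total
}

// Branch hoisted: match once, then a tight branch-free loop per variant.
fn sum_branch_hoisted(storage: &Storage) -> u32 {
    match storage {
        Storage::Dense(v) => v.iter().sum(),
        Storage::Sparse(v) => v.iter().flatten().sum(),
    }
}
```

In practice the optimizer can often hoist such a branch itself, so whether this wins anything would need benchmarking.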

FeldrinH commented 3 years ago

I also benchmarked query.for_each vs vec.for_each.

Your query iteration benchmark uses a parallel SystemStage for running the query. I suspect a lot of the time is spent on the scheduling and caching done by that, rather than the iteration itself.

tower120 commented 3 years ago

@FeldrinH Thanks! I'll try changing it now.

tower120 commented 3 years ago

@FeldrinH Changed to SystemStage::single_threaded(); the difference is marginal, if any.

cart commented 3 years ago

Just took a look. Let's start with:

entity_iter

First: there is a methodology issue. This is an apples-to-oranges comparison because you're running the entire scheduling infrastructure for the Query benchmark and comparing that to a single raw vec iterator. The scheduling infrastructure is cheap to run, but it will still register in a small way for benchmarks like this.

Additionally, even after accounting for that, this is still apples to oranges because the Vec benchmark doesn't do change detection. Functionally, change detection performs the same as iterating and writing to another component (because it needs to fetch the tracking state and write to it when Mut<Point> is mutated).

After disabling change detection (currently a manual process of commenting out the relevant code, but will ultimately be configurable per-component once we switch to #[derive(Component)]) I get:

(image: benchmark results)

Then, after removing the Schedule/Stage/System in favor of iterating a raw Query, I get:

(image: benchmark results)

After righting the scales, we are actually faster than iterating a raw Vec (by an extremely small margin). Despite the fact that we provide way more functionality than a simple Vec, we still come out on top because the extra work required to accomplish that functionality amortizes to zero.

The test:

fn entity_iter_raw_query_bench(criterion: &mut Criterion) {
    // Setup world
    let mut world = World::default();
    let mut query = world.query::<&mut Point>();

    // Setup test entities
    for i in 0..ENTITIES_COUNT {
        world.spawn().insert(Point {
            x: i as u32,
            y: i as u32,
        });
    }

    // Run systems
    criterion.bench_function("entity iter", |b| {
        b.iter(|| {
            query.for_each_mut(&mut world, |mut point| {
                point.x += 1;
                point.y += 1;
            })
        })
    });
}

Additionally it seems like there is a general misunderstanding about how relevant numbers like 2x are in this context. Thanks to the cache friendly nature of both implementations, we are talking about extremely small timescales here. The second you start doing anything meaningful inside these iterators, that work will dwarf the iteration costs. We get a ~30% increase in cost, just by throwing in some more ops:

    criterion.bench_function("entity iter", |b| {
        b.iter(|| {
            query.for_each_mut(&mut world, |mut point| {
                point.x += (10 - point.y).pow(4);
                point.y += 1;
            })
        })
    });

(image: benchmark results)

Doing any amount of work on these scales registers in a big way.

From the perspective of a game engine, the benefits of change detection far outweigh the 2x cost (by enabling us to optimize / skip expensive operations each frame and opening up extremely useful coding patterns). The cost won't meaningfully register for anything but the most resource constrained benchmarks and in the very near future you will be able to easily opt out for those scenarios.

Let's segue into the next benchmark:

entity_get

After accounting for change detection, here are the results:

(image: benchmark results)

box_unique is ~4.4x faster than entity. First, even if this were an apples-to-apples comparison, these numbers would be acceptable because, as we've previously covered, we're working on very small timescales here. We are also quite competitive when compared to other ECS implementations. So why isn't box_unique a fair comparison to ECS get(Entity) operations?

Let's compare the inner loops of the two benchmarks:

This amounts to the totals:

And that doesn't even account for the extra branching to ensure Entity and sparse set lookups are valid. In total, the 4.4x number makes perfect sense!

If you don't think too hard about it, it sure seems like we're doing a lot of unnecessary work. We should just use Boxes! What were we thinking?

Let's do some small (zero cost) refactors so we can start building our Box ECS:

Now let's pretend we want Entity(0) to have components [A, B] and Entity(1) to have [B]. How will we record this?

Here is the simplest path forward:

Use an option (or null pointer) to indicate whether a specific item in the array exists (ex: Vec<Option<T>> or [MaybeUninit<Box<T>>; BOXES_COUNT]). This is known as a "Sparse Array". But now we've created a couple of problems relative to PointBoxes:

Ex: The specs ECS "vec storage" is basically a sparse array. I had benchmarks handy that compare its entity lookups to Bevy's

(image: benchmark results)

Funnily enough, specs is about 4x faster than bevy_system, which makes sense given what we just talked about / the performance of your box benchmark!

However, sparse sets have significant downsides:

  1. If you want to do multiple component lookups for an entity, you need to branch for each component type.
  2. Iterating sparse sets (especially when iterating the entities from multiple components at the same time) is expensive. Way more branching and hopping around in a cache unfriendly-way. This is slow (when compared to linear dense vec iterators).
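The sparse-array storage described above can be sketched in a few lines of std-only Rust (the `SparseColumn` type is illustrative, and entities are indexed directly by id for brevity). Note how downside 1 shows up: every component lookup carries its own existence branch.

```rust
// Minimal "sparse array" column: one Vec<Option<T>> per component type,
// indexed directly by entity id.
struct SparseColumn<T> {
    slots: Vec<Option<T>>,
}

impl<T> SparseColumn<T> {
    fn new() -> Self {
        SparseColumn { slots: Vec::new() }
    }

    fn insert(&mut self, entity: usize, value: T) {
        if self.slots.len() <= entity {
            self.slots.resize_with(entity + 1, || None);
        }
        self.slots[entity] = Some(value);
    }

    // Downside 1 in action: each per-component lookup is its own branch.
    fn get(&self, entity: usize) -> Option<&T> {
        self.slots.get(entity)?.as_ref()
    }
}
```

With Entity(0) holding [A, B] and Entity(1) holding only [B], fetching both components for an entity means two independent branchy lookups, one per column.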

The solution to this is an "archetypal ECS".

Entities belong to a specific archetype, with per-archetype storages that are densely packed and perfectly aligned.

This adds the indirection of finding the archetype, but it means:

  1. You only need to find the archetype once. After that, you already have the component storage locations and you don't need any more branching to check if the storages contain the components. This means that the more components you add to query.get(entity), the faster it gets relative to sparse sets.
  2. Iterating can be a cache friendly linear scan. No need to hop around sparse sets!
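Both points can be sketched with a std-only toy (the `Archetype`/`Location` types and two hard-coded columns are illustrative, not Bevy's storage): entities in one archetype share a row index across densely packed, perfectly aligned columns.

```rust
// Sketch of archetypal storage: dense columns sharing one row index.
struct Archetype {
    xs: Vec<f32>, // column for component X
    ys: Vec<f32>, // column for component Y
}

struct Location {
    archetype: usize,
    row: usize,
}

// Point 1: pay the archetype lookup once; after that, every column access
// for this entity is a branch-free direct index.
fn get_xy(archetypes: &[Archetype], loc: &Location) -> (f32, f32) {
    let arch = &archetypes[loc.archetype];
    (arch.xs[loc.row], arch.ys[loc.row])
}

// Point 2: iteration is a cache-friendly linear scan over dense columns.
fn sum_x(archetypes: &[Archetype]) -> f32 {
    archetypes.iter().flat_map(|a| a.xs.iter()).sum()
}
```

The more components a lookup touches, the better this amortizes, since the single archetype resolution covers all of them.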

Archetypal and Sparse Set ECS are the two most popular ECS implementations. They are a result of the best minds in the business putting their heads together and making informed compromises. They each have their tradeoffs, but they provide functionality that a simple Vec<T> cannot. It is literally a limitation in the laws of physics. You cannot have the flexibility of ECS without more branching and indirection. If you want the performance of an array, you are stuck with the rigidity of an array. If you think you can do better ... awesome! Just be aware that this is an extremely competitive space and that we are extremely competitive in this space (https://github.com/rust-gamedev/ecs_bench_suite). Don't expect to find anything but small / marginal wins (or wins that come at the cost of other benchmarks / features).

mirkoRainer commented 3 years ago

Just took a look. Let's start with:...

@cart is this write up in a dev blog somewhere? I learned so much just reading it.

cart commented 3 years ago

@cart is this write up in a dev blog somewhere? I learned so much just reading it.

Not currently. I might ultimately consolidate this and other things like the Bevy ECS V2 description (which is basically a blog post by itself). But that takes time and I've got renderer code to write right now :smile:

tower120 commented 3 years ago

Thank you for the explanation.

Previously, I worked with a custom ECS... As far as I remember, it had almost the same performance as Vec for component iteration, and was 1.5-2.5x slower (depending on memory fragmentation, I guess) on component get-by-entity vs unique_ptr access. Surely it was not as feature-rich as bevy_ecs. I describe it here just in case it could help enhance bevy_ecs.

I don't understand every bit of the bevy_ecs architecture, so I can't say where exactly the difference is. So I'd better describe how that ECS worked; let's call it CUSTOM_ECS. The main difference looks to be in how the entity itself is treated...

CUSTOM_ECS is an archetype-based ECS; systems run one-by-one in a manually defined sequence. We used multithreading only as a parallel for-each inside a system. All emplaces/erases were scheduled and executed at the end of the system. (Actually, maybe postponing emplaces was unnecessary...)

            Archetype A+B+C

                Entity 0    Entity 1    ...     Entity N
Component A     [..........................................]    <- std::vector<A>, technically implemented as std::vector<uchar>
Component B     [..........................................]
Component C     [..........................................]

Entity          Rc<EntityImpl>

Emplace

As simple as vector.push_back

Erase

The component column entry at the back is moved into the place of the erased one; all component vectors pop_back(). The moved-in EntityImpl is updated, and the erased EntityImpl is zeroed (marked as null).
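The erase strategy described here is a classic "swap-remove": move the last element into the erased slot and shrink by one. Rust's Vec has this built in; a minimal sketch (the `erase_row` wrapper is illustrative):

```rust
// O(1) erase of a row from a dense column via swap-remove.
// Note: the element that was at the back changes its index, which is
// exactly why the moved-in entity's EntityImpl must be updated.
fn erase_row(column: &mut Vec<u32>, row: usize) -> u32 {
    column.swap_remove(row)
}
```

A real ECS would apply this to every column of the archetype at the same row, keeping the columns aligned.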

struct World{
    std::vector<Archetype> archetypes;
    std::vector<ComponentData> components;
};
struct EntityImpl{
    uint archetype_id;
    uint entity_index;
};

Entity

Entity is a shared_ptr<EntityImpl> (well, actually an in-house implementation without the weak-reference "overhead"). Since an entity's components can be moved in memory only between system runs, it is safe to update EntityImpl whenever the entity is moved due to an erase or archetype change.

Traverse through a required set of components (only have/don't-have filters were supported).

Find the archetypes that satisfy the condition, and run through the required component vectors. Basically:

for (auto& archetype : archetypes)
for (auto& component : archetype.components()[component_id])    // but for several components at once
    closure(component)

Get component by entity.

Now the interesting part. Each component has an associated table mapping archetype_id to a component-row pointer. It is updated whenever the archetypes change. (We actually never delete empty archetypes automatically.)

// Instead of std::vector, everything worked with plain arrays T*; vector here for clarity.

using Components = std::vector<uchar>;
struct ComponentData{
    // Even if the table grows big, it'll still be small enough to fit in L1 cache. The table has a high
    // chance of remaining in cache if we get entities often (like in a loop).
    vector<Components*> components_pointers;    // index = archetype_id. I think 4000 archetypes is a reasonable max.
};

using Entity = shared_ptr<EntityImpl>;
Component& get_component(ComponentData& component_data, const Entity& entity_){
    const EntityImpl& entity = *entity_;
    auto&/*vector<Component>*/ components_row = component_data.components_pointers[entity.archetype_id]; // table fetch + reinterpret_cast (not shown here).
    return components_row[entity.entity_index];     // as you can see, "only" 3 indirections
}

ComponentData& components_A = world.GetComponentData<ComponentA>();      // this does a hash-map lookup to associate the component "type_id" with the world's ComponentData
for (const Entity& entity : ...){
    components_A.get_component(entity);
}

This is the "unsafe" version - meaning you know the entity's components exist.

cart commented 3 years ago

I see! Thanks for sharing. That looks pretty similar to how our archetypal storage works (and the general call pattern). But there are some cool ideas here that we aren't currently using! The main differences:

  1. Rather than looking up Archetype->Table->Column (we currently use a sparse set with the component id to find the column), it looks up the archetype table row inside the "component metadata". This skips an indirection. We've discussed doing this at various points in the past, but haven't gotten around to it yet. It gets a little bit more complicated for us because multiple archetypes can point to the same table, which means there's a little bit more "fixup" work required when resizing table columns (but that's doable). This is definitely something worth thinking more about + benchmarking. I know @BoxyUwU's ECS did something like this (but maybe with bitsets in the component metadata?) and I think @SanderMertens does something like this in flecs? It also would help cut down on sparse set usage, which per-table can get pretty memory hungry when it isn't paged.
  2. The archetype location is "cached" inside of the Entity(Rc<EntityImpl>), and there appears to be no "entity id value" decoupled from Archetype/Location. This enables skipping the "pointer offset" when looking up entity locations, but it is still a random pointer access to find the location (and therefore indirection). Of course, an entity cannot actually be identified by those two values (because they can change across updates), so I'm assuming the "pointer value" is treated as the "entity value decoupled from archetype location". I have some questions / comments:
    • How does world.despawn(entity) work in relationship to Entity(Rc<EntityImpl>)? In Rust, the location wouldn't be freed up until the final Rc is dropped. How does your implementation ensure that an Entity stored in a list floating around somewhere isn't pointing to a previously "despawned" entity? Is it a weak pointer that needs to be "upgraded" to a strong pointer?
    • If "entity ids decoupled from archetypes" are "pointer values", then spawning new entities would involve a bunch of small individual allocations of Entity(Rc<EntityImpl>) or synchronized access to some Vec<EntityImpl>. If done in parallel, we need some synchronized storage of these pointers to ensure that we update locations in the event that they change.
tower120 commented 3 years ago

  1. Whenever you do that - my advice: use a pointer to the row start, not just the row's index, if possible.
  2. An Entity points to a column in a table. It consists of an archetype index + column index (or address offset, if you like). An archetype is a table for us.

... and there appears to be no "entity id value" decoupled from Archetype/Location.

What do you mean?

Of course, an entity cannot actually be identified by those two values (because they can change across updates)

Those two values change during an update too - that's the whole idea of keeping them in the heap.

How does world.despawn(entity) work in relationship to Entity(Rc<EntityImpl>)? In Rust, the location wouldn't be freed up until the final Rc is dropped. How does your implementation ensure that an Entity stored in a list floating around somewhere isn't pointing to a previously "despawned" entity? Is it a weak pointer that needs to be "upgraded" to a strong pointer?

As soon as an entity is despawned, the value of EntityImpl is zeroed, so one can check whether it is still valid, since the list stores a shared_ptr (a pointer into the heap), not the actual value. There were no weak pointers. Technically, the Entity stays alive while there is at least one copy of it... The very data in the heap is overwritten (with zeros, or with the new location if the entity moved across archetype tables).

// between system runs!
void despawn(shared_ptr<EntityImpl> entity){
    // do actual components erase from storage
    entity->archetype_id = 0;
    entity->entity_index  = 0;
}
bool is_null(shared_ptr<EntityImpl> entity){
   return  entity->archetype_id == 0;
}

IOW, we do not prolong the components' lifetime with Rc - we prolong the Entity's lifetime in the heap with Rc.

If "entity ids decoupled from archetypes" are "pointer values", then spawning new entities would involve a bunch of small individual allocations of Entity(Rc<EntityImpl>) or synchronized access to some Vec<EntityImpl>. If done in parallel, we need some synchronized storage of these pointers to ensure that we update locations in the event that they change.

I don't understand what you mean by "pointer values"; if that means a form of smart pointer passed by value, like shared_ptr, then yes.

Allocations were surprisingly fast... And when allocated one-by-one, they are dense in memory too...

All structural changes to the archetype tables were done outside system runs: they were scheduled, then executed. (I don't know how bevy_ecs Commands work exactly, but it looks similar.) Systems ran sequentially (not in parallel as in Bevy), so parallelism is not an issue for table alteration.


The initial reason for keeping the entity_location data in the heap was to get rid of the entity_id -> actual_entity_location table. It looked like there was no benefit to the LUT's complexity, and it is essentially the same as heap-allocated entity_location data.
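The despawn-by-zeroing scheme described in this comment can be rendered as a small std-only Rust sketch (names are illustrative; single-threaded Rc/Cell stand in for the in-house shared_ptr, and archetype_id 0 is reserved as the "null" marker, as in the C++ above):

```rust
use std::cell::Cell;
use std::rc::Rc;

// Rc keeps the EntityImpl slot alive in the heap; despawn zeroes it in
// place so stale handles stored elsewhere can detect the death.
struct EntityImpl {
    archetype_id: Cell<usize>,
    entity_index: Cell<usize>,
}

type Entity = Rc<EntityImpl>;

// Run between systems only, per the original design.
fn despawn(entity: &Entity) {
    // ...the actual component erase from storage would happen here...
    entity.archetype_id.set(0);
    entity.entity_index.set(0);
}

fn is_null(entity: &Entity) -> bool {
    entity.archetype_id.get() == 0
}
```

This answers cart's question: the Rc does not keep the components alive, only the location record, and "despawned" is observable through any surviving handle.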

tower120 commented 3 years ago

I think I can make a smaller version of that ECS, if that helps...

It was a derivative of https://github.com/tower120/ecs_mt . But that version is outdated, and it does not actually use Rc: EntityImpl/entity_data is just stored in a non-movable memory area, so instead of Rc, a pointer to entity_data is used in entity: https://github.com/tower120/ecs_mt/blob/master/include/tower120/ecs/entity.hpp So, in that version, there is no way to tell whether an entity is alive or not. It also looks like it lacks the archetype-to-component-row LUT optimization.

alice-i-cecile commented 2 years ago

This discussion has been educational and useful, but I'm going to close this out as it's not particularly actionable. If there's specific take-aways or action items, new issues are very welcome.