AFAIK, `Query`s do this sort of caching :)
@TheRawMeatball Nice! Now if only there were a "queryable" collection of entities....
@TheRawMeatball Or did you mean `world.query(...).get(entity)`?
The second, but not quite like that - you need to construct the query with the components you're interested in first, and store it alongside the list. It'll do some of those indirections at build time, and then you can use get to go fast :)
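For example, something like this sketch (assuming the bevy 0.5-era `World::query` / `QueryState::get` API; exact paths and signatures may differ between versions):

```rust
use bevy::prelude::*;

struct Data(u32);

fn lookup_all(world: &mut World, entities: &[Entity]) {
    // construct the QueryState once - this is where the caching happens...
    let mut query = world.query::<&Data>();
    // ...then `get` is fast for each entity in the stored list
    for &entity in entities {
        if let Ok(_data) = query.get(world, entity) {
            // use the component here
        }
    }
}
```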
So... from a system query it should already work?
```rust
// For example
fn my_system(datas: Query<&Data>) {
    for &entity in &entities { // entities: Vec<Entity>
        let data = datas.get(entity);
    }
}
```
Looking up an Entity's component vs looking up a value in a Vec will never be "identical" because we will always need indirection to determine (1) whether or not the entity has that component and (2) where that component exists. Vec lookup is the "upper limit" that ECS impls strive for (but will never reach ... it's a logical guarantee).
That being said it should still be "approximately" the same cost as Vec lookups if you cache things like TypeId -> dense_component_index lookups, which we already do for Queries. And iteration will be very close to Vec iteration because these lookup costs only need to happen once instead of per-entity.
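To make those two steps concrete, here is a conceptual sketch (my own illustration, not bevy's actual code):

```rust
// Why Entity -> component lookup can never match vec[i]: we must first
// resolve (1) whether/where the entity's data exists, then (2) fetch it.
struct Location {
    archetype: usize,
    row: usize,
}

struct Ecs {
    locations: Vec<Option<Location>>, // indexed by entity id
    columns: Vec<Vec<f32>>,           // per-archetype dense component columns
}

impl Ecs {
    fn get(&self, entity: usize) -> Option<&f32> {
        let loc = self.locations.get(entity)?.as_ref()?; // (1) existence + location
        self.columns.get(loc.archetype)?.get(loc.row)    // (2) storage fetch
    }
}
```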
> So... from a system query it should already work?
Yup. `QueryState`, which is used by system `Query`s and returned by `world.query(...)`, already caches everything that it (reasonably) can.
But.... it looks like `get_unchecked_manual` (called by `query.get()`) reconstructs Fetch and Filter state on each call. (I'm not sure which exact implementation is called, but it looks like there are some branches in `Fetch::init`.) Clearly, there is no such overhead for iteration...
P.S. Are component pointers stable for alive entities?
We don't reconstruct Fetch and Filter state in `get_unchecked_manual`. We pass the cached state into Fetch and Filter instances. `Fetch::init` does do the work of looking up pointers to storages, but these cannot be assumed to be stable. Any time a new component or entity is added (or removed), that could trigger a reallocation of the array, which would invalidate the pointer. Caching that pointer would be very unsafe.
For a given world borrow and set of entity lookups using that borrow, we can assume it is stable because nothing can add/remove entities or components during that time. But we can't store that in QueryState, which exists outside of a given World borrow.
We might be able to cache the Fetch and Filter instances inside of System Queries though (because components cannot be added / removed in a way that invalidates these pointers when the system is running). I can't assert that is valid off the top of my head though.
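As a standalone illustration of that hazard (not bevy code):

```rust
fn main() {
    // Why caching a raw pointer into component storage is unsound:
    // any later insertion may reallocate the backing array.
    let mut storage: Vec<u32> = Vec::with_capacity(1);
    storage.push(1);
    let cached: *const u32 = &storage[0]; // the "cached" pointer
    storage.push(2); // may reallocate and move the data
    // unsafe { *cached } // would now read freed memory: undefined behavior
    let _ = cached;
}
```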
I see, thanks. I think I need to benchmark this more thoroughly... At first glance, it looks like accessing a resource is way faster than accessing a component. Will return with results.
> At first glance, it looks like accessing a resource is way faster than accessing a component.
This makes sense because we can skip the "entity location lookup" and move right to the "component storage lookup". However in practice they should still be pretty similar. Keep in mind that we're talking about very small / inexpensive operations here. Things like "an extra array lookup" will register on benchmarks, even though that is an extremely cheap operation in the grand scheme of things.
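Conceptually (a sketch, not bevy's actual storage code), the resource path simply has one fewer step:

```rust
// A resource is a singleton per type: with a cached per-type index,
// lookup goes straight to the value with no entity-location step.
struct Resources {
    values: Vec<f32>, // index = cached resource index per type
}

impl Resources {
    fn get(&self, resource_index: usize) -> Option<&f32> {
        self.values.get(resource_index) // a single lookup
    }
}
```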
Definitely let us know if you find any optimization potential! We're already very competitive on ECS benchmarks, but faster is better!
Reporting in.
In-system `Query::get` is 5-6 times slower than `Arc`.
https://github.com/tower120/bevy/commit/33a73fa51abf7ea44439eb8f99893b8630d967a3
```
entity      time:   [136.58 us 137.07 us 137.57 us]
            change: [-4.5581% -3.5928% -2.7019%] (p = 0.00 < 0.05)
            Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

box unique  time:   [20.592 us 20.669 us 20.752 us]
            change: [+1.2907% +1.7769% +2.3090%] (p = 0.00 < 0.05)
            Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild

arc mutex   time:   [166.39 us 166.87 us 167.42 us]
            change: [+0.1934% +0.6981% +1.2190%] (p = 0.01 < 0.05)
            Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

arc unsafe  time:   [25.175 us 25.265 us 25.365 us]
            change: [-1.1272% -0.5776% -0.0209%] (p = 0.04 < 0.05)
            Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
```
If you want, I can make a PR with that benchmark.
`Query::get` performance also has a big impact on bevy_rapier performance, because it stores the rigid bodies, colliders and other related data directly in the bevy ecs and does a lot of random access with `Query::get` when processing physics.

According to one of the rapier contributors on their discord, bevy_rapier is about twice as slow as rapier with its built-in data structures. (I haven't seen any benchmarks for this though.)
I also benchmarked `query.for_each` vs `vec.for_each`.
https://github.com/tower120/bevy/commit/4d97e6fba47d9238e418ae087c8c2c1ab1963e9c
It appears that `Vec` traversal is 3-4 times faster than query iteration...
```
entity iter  time:   [190.90 us 191.37 us 191.85 us]
             change: [-2.9586% -2.6283% -2.2918%] (p = 0.00 < 0.05)
             Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

vec iter     time:   [59.864 us 59.994 us 60.179 us]
             change: [-2.4671% -1.8886% -1.3690%] (p = 0.00 < 0.05)
             Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  6 (6.00%) high severe
```
So it may make sense to pack a few lightweight systems into one heavier one... Frankly, I did not expect that difference...

There is a branch inside `archetype_fetch` - `match self.storage_type` - and it sits inside each iteration... Maybe the API could be changed somehow to have separate loops for each `storage_type`, so less work is left to do inside the tight loop...

UPDATE: Pardon, that's for the non-dense case only.
> I also benchmarked `query.for_each` vs `vec.for_each`.
Your query iteration benchmark uses a parallel SystemStage for running the query. I suspect a lot of the time is spent on the scheduling and caching done by that, rather than the iteration itself.
@FeldrinH Thanks! I'll try changing it now.
@FeldrinH Changed it to `SystemStage::single_threaded()`; the difference is marginal, if any.
Just took a look. Let's start with:
First: there is a methodology issue. This is an apples-to-oranges comparison because you're running the entire scheduling infrastructure for the Query benchmark and comparing that to a single raw vec iterator. The scheduling infrastructure is cheap to run, but it will still register in a small way for benchmarks like this.
Additionally, even after accounting for that, this is still apples to oranges because the Vec benchmark doesn't do change detection. Functionally, change detection performs the same as iterating and writing to another component (because it needs to fetch the tracking state and write to it when `Mut<Point>` is mutated).
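Conceptually, it behaves something like this (my sketch, not bevy's actual implementation):

```rust
use std::ops::{Deref, DerefMut};

// Every mutable access also writes per-component tracking state, which is
// why it costs roughly as much as writing to a second component.
struct Mut<'a, T> {
    value: &'a mut T,
    changed_tick: &'a mut u32, // tracking state stored alongside the component
    system_tick: u32,
}

impl<'a, T> Deref for Mut<'a, T> {
    type Target = T;
    fn deref(&self) -> &T {
        self.value
    }
}

impl<'a, T> DerefMut for Mut<'a, T> {
    fn deref_mut(&mut self) -> &mut T {
        *self.changed_tick = self.system_tick; // the extra write per mutation
        self.value
    }
}
```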
After disabling change detection (currently a manual process of commenting out the relevant code, but it will ultimately be configurable per-component once we switch to `#[derive(Component)]`) I get:
Then, after removing the Schedule/Stage/System in favor of iterating a raw Query, I get:
After righting the scales, we are actually faster than iterating a raw Vec (by an extremely small margin). Despite the fact that we provide way more functionality than a simple Vec, we still come out on top because the extra work required to accomplish that functionality amortizes to zero.
The test:
```rust
fn entity_iter_raw_query_bench(criterion: &mut Criterion) {
    // Setup world
    let mut world = World::default();
    let mut query = world.query::<&mut Point>();

    // Setup test entities
    for i in 0..ENTITIES_COUNT {
        world.spawn().insert(Point {
            x: i as u32,
            y: i as u32,
        });
    }

    // Run systems
    criterion.bench_function("entity iter", |b| {
        b.iter(|| {
            query.for_each_mut(&mut world, |mut point| {
                point.x += 1;
                point.y += 1;
            })
        })
    });
}
```
Additionally, it seems like there is a general misunderstanding about how relevant numbers like 2x are in this context. Thanks to the cache-friendly nature of both implementations, we are talking about extremely small timescales here. The second you start doing anything meaningful inside these iterators, that work will dwarf the iteration costs. We get a ~30% increase in cost just by throwing in some more ops:
```rust
criterion.bench_function("entity iter", |b| {
    b.iter(|| {
        query.for_each_mut(&mut world, |mut point| {
            point.x += (10 - point.y).pow(4);
            point.y += 1;
        })
    })
});
```
Doing any amount of work on these scales registers in a big way.
From the perspective of a game engine, the benefits of change detection far outweigh the 2x cost (by enabling us to optimize / skip expensive operations each frame and opening up extremely useful coding patterns). The cost won't meaningfully register for anything but the most resource constrained benchmarks and in the very near future you will be able to easily opt out for those scenarios.
Let's segue into the next benchmark:
After accounting for change detection, here are the results:
`box_unique` is ~4.4x faster than `entity`. First, even if this were an apples-to-apples comparison, these numbers would be acceptable because, as we've previously covered, we're working on very small timescales here. We are also quite competitive when compared to other ECS implementations. So why isn't box_unique a fair comparison to ECS get(Entity) operations?
Let's compare the inner loops of the two benchmarks:
This amounts to the totals:
And that doesn't even account for the extra branching to ensure Entity and sparse set lookups are valid. In total, the 4.4x number makes perfect sense!
If you don't think too hard about it, it sure seems like we're doing a lot of unnecessary work. We should just use Boxes! What were we thinking?
Let's do some small (zero-cost) refactors so we can start building our Box ECS:
- `Components<T> { components: [Box<T>; BOXES_COUNT] }` (in practice this would actually be a `Vec<T>` for faster iteration)

Now let's pretend we want Entity(0) to have components [A, B] and Entity(1) to have [B]. How will we record this?
Here is the simplest path forward:

- Use an option (or null pointer) to indicate that a specific item in the array exists (ex: `Vec<Option<T>>` or `[MaybeUninit<Box<T>>; BOXES_COUNT]`). This is known as a "Sparse Array". But now we've created a couple of problems relative to PointBoxes: `Component<T>` arrays must be sized to accommodate the entire entity space. They are no longer "densely packed". If Component A has 10 entities and Component B has 2 entities, both arrays need to be length 10.
- We can get the density back by packing components into a `Vec<T>`, then having another `Vec<Option<usize>>` array to determine whether or not the entity has component T and where it is located in the dense Vec. This is called a "Sparse Set" and is a popular ECS variant.

Ex: The specs ECS "vec storage" is basically a sparse array. I had benchmarks handy that compare its entity lookups to Bevy's:
Funnily enough, specs is about 4x faster than bevy_system, which makes sense given what we just talked about / the performance of your box benchmark!
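To make the sparse-set idea concrete, a minimal sketch (my illustration, not specs' or bevy's actual code):

```rust
// Dense component storage plus a sparse per-entity index array.
struct SparseSet<T> {
    dense: Vec<T>,              // densely packed components, fast to iterate
    sparse: Vec<Option<usize>>, // indexed by entity id -> index into `dense`
}

impl<T> SparseSet<T> {
    fn new(entity_capacity: usize) -> Self {
        Self {
            dense: Vec::new(),
            sparse: vec![None; entity_capacity],
        }
    }

    fn insert(&mut self, entity: usize, value: T) {
        self.sparse[entity] = Some(self.dense.len());
        self.dense.push(value);
    }

    fn get(&self, entity: usize) -> Option<&T> {
        let index = *self.sparse.get(entity)?.as_ref()?; // sparse lookup
        self.dense.get(index)                            // dense fetch
    }
}
```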
However sparse sets have significant downsides:
The solution to this is "archetypal ecs".
Entities belong to a specific archetype, with per-archetype storages that are densely packed and perfectly aligned.
This adds the indirection of finding the archetype, but it means:
- ... the more components you fetch in `query.get(entity)`, the faster it gets relative to sparse sets.

Archetypal and Sparse Set ECS are the two most popular ECS implementations. They are a result of the best minds in the business putting their heads together and making informed compromises. They each have their tradeoffs, but they provide functionality that a simple `Vec<T>` cannot. It is literally a limitation in the laws of physics. You cannot have the flexibility of ECS without more branching and indirection. If you want the performance of an array, you are stuck with the rigidity of an array. If you think you can do better ... awesome! Just be aware that this is an extremely competitive space and that we are extremely competitive in this space (https://github.com/rust-gamedev/ecs_bench_suite). Don't expect to find anything but small / marginal wins (or wins that come at the cost of other benchmarks / features).
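To make the archetypal layout concrete, a rough sketch (again my illustration, not bevy's actual code):

```rust
// Entities in one archetype share densely packed, aligned columns, so a
// single location lookup serves every component in the query - which is
// why fetching more components amortizes better than per-component
// sparse-set lookups.
struct Archetype {
    positions: Vec<f32>,  // column for component A
    velocities: Vec<f32>, // column for component B (same row order)
}

struct Location {
    archetype: usize,
    row: usize,
}

struct ArchetypalWorld {
    archetypes: Vec<Archetype>,
    locations: Vec<Option<Location>>, // indexed by entity id
}

impl ArchetypalWorld {
    // one indirection to find the archetype, then direct row indexing
    fn get_pair(&self, entity: usize) -> Option<(&f32, &f32)> {
        let loc = self.locations.get(entity)?.as_ref()?;
        let archetype = self.archetypes.get(loc.archetype)?;
        Some((
            archetype.positions.get(loc.row)?,
            archetype.velocities.get(loc.row)?,
        ))
    }
}
```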
> Just took a look. Let's start with: .....
@cart is this write up in a dev blog somewhere? I learned so much just reading it.
> @cart is this write up in a dev blog somewhere? I learned so much just reading it.
Not currently. I might ultimately consolidate this and other things like the Bevy ECS V2 description (which is basically a blog post by itself). But that takes time and I've got renderer code to write right now :smile:
Thank you for the explanation.
Previously, I worked with one custom ECS... As far as I remember, it had almost the same performance as vec for component iteration, and was 1.5-2.5x slower (depending on memory fragmentation, I guess) on getting a component by entity vs `unique_ptr` access.
Surely it was not as feature-rich as bevy-ecs. I describe it just in case it could help enhance bevy-ecs.
I don't understand every bit of the bevy-ecs architecture, so I can't say where exactly the difference is. So I'd better describe how that ECS worked; let's call it CUSTOM_ECS. It looks like the main difference is in how the entity is treated...
CUSTOM_ECS is an archetype-based ecs; systems run one-by-one in a manually defined sequence. We used multithreading only as a parallel-foreach inside a system. All emplaces/erases were scheduled and executed at the end of the system. (Actually, maybe the emplace postponing was unnecessary...)
```
Archetype A+B+C
              | Entity 0 | Entity 1 | ... | Entity N |
  Component A |    .     |    .     | ... |    .     |  <- std::vector<A>, technically implemented as std::vector<uchar>
  Component B |    .     |    .     | ... |    .     |
  Component C |    .     |    .     | ... |    .     |
  Entity      |    .     |    .     | ... |    .     |  <- Rc<EntityImpl>
```
Emplace: as simple as `vector.push_back`.

Erase: the components column from the back is moved into the place of the erased one, and all components `pop_back()`. The moved-in EntityImpl is updated; the erased EntityImpl is zeroed (marked as null).
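In Rust terms, that erase is essentially a `swap_remove` per column (a sketch with hypothetical names):

```rust
// Swap-remove keeps each component column dense: the back element moves
// into the erased slot, then the column shrinks by one.
fn erase_row(column: &mut Vec<f32>, row: usize) {
    column.swap_remove(row);
    // the entity that was moved into `row` must have its EntityImpl
    // (archetype_id, entity_index) updated to point here; the erased
    // entity's EntityImpl is zeroed (marked as null)
}
```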
```cpp
struct World{
    std::vector<Archetype> archetypes;
    std::vector<ComponentData> components;
};

struct EntityImpl{
    uint archetype_id;
    uint entity_index;
};
```
Entity is a `shared_ptr<EntityImpl>` (well, actually an in-house implementation without the weak-reference "overhead"). Since an entity's components can be moved in memory only between system runs, it is safe to update EntityImpl whenever the entity is moved due to an erase or archetype change.
To iterate, find the archetypes that satisfy the condition, and run through the required component vectors. Basically:
```cpp
for (auto& archetype : archetypes)
    for (auto& component : archetype.components()[component_id]) // but for several components at once
        closure(component);
```
Now the interesting part. Each component has an associated table mapping archetype_id to a components-row pointer. It is updated whenever the archetypes change. (We actually never delete empty archetypes automatically.)
```cpp
// Instead of std::vector, everything actually worked with plain arrays (T*); vector is used here for clarity.
using Components = std::vector<uchar>;

struct ComponentData{
    // Even if the table gets big, it'll still be small enough to fit in the L1 cache.
    // The table has a high chance of remaining in cache if we get entities often (like in a loop).
    std::vector<Components*> components_pointers; // index = archetype_id. I think 4000 archetypes is a reasonable max.
};
```
```cpp
using Entity = std::shared_ptr<EntityImpl>;

Component& get_component(ComponentData& component_data, const Entity& entity_){
    const EntityImpl& entity = *entity_;
    auto&/*std::vector<Component>*/ components_row = component_data.components_pointers[entity.archetype_id]; // table fetch + reinterpret_cast (not shown here)
    return components_row[entity.entity_index]; // as you can see, "only" 3 indirections
}
```
```cpp
ComponentData& components_A = world.GetComponentData<ComponentA>(); // this does a hash-map lookup to associate the Component "type_id" with the world's ComponentData

for (const Entity& entity : /* ... */){
    get_component(components_A, entity);
}
```
This is the "unsafe" version - meaning you already know the entity's components exist.
I see! Thanks for sharing. That looks pretty similar to how our archetypal storage works (and the general call pattern). But there are some cool ideas here that we aren't currently using! The main differences:

- Entities are `Entity(Rc<EntityImpl>)`, and there appears to be no "entity id value" decoupled from Archetype/Location. This enables skipping the "pointer offset" when looking up entity locations, but it is still a random pointer access to find the location (and therefore indirection). Of course, an entity cannot actually be identified by those two values (because they can change across updates), so I'm assuming the "pointer value" is treated as the "entity value decoupled from archetype location".

I have some questions / comments:

- How does `world.despawn(entity)` work in relationship to `Entity(Rc<EntityImpl>)`? In rust, the location wouldn't be freed up until the final Rc is dropped. How does your implementation ensure that an Entity stored in a list floating around somewhere isn't pointing to a previously "despawned" entity? Is it a weak pointer that needs to be "upgraded" to a strong pointer?
- If "entity ids decoupled from archetypes" are "pointer values", then spawning new entities would involve a bunch of small individual allocations of `Entity(Rc<EntityImpl>)` or synchronized access to some `Vec<EntityImpl>`. If done in parallel, we need some synchronized storage of these pointers to ensure that we update locations in the event that they change.

> ... and there appears to be no "entity id value" decoupled from Archetype/Location.

What do you mean?

> Of course, an entity cannot actually be identified by those two values (because they can change across updates)

Those two values change during the update too - that's the whole idea of keeping them in the heap.
> How does `world.despawn(entity)` work in relationship to `Entity(Rc<EntityImpl>)`? In rust, the location wouldn't be freed up until the final Rc is dropped. How does your implementation ensure that an Entity stored in a list floating around somewhere isn't pointing to a previously "despawned" entity? Is it a weak pointer that needs to be "upgraded" to a strong pointer?

As soon as the entity is despawned, the value of EntityImpl is zeroed. So one can check whether it is still valid, since the list stores a `shared_ptr` (a pointer into the heap), not the actual value. There were no weak pointers. Technically the Entity stays alive while there is at least one copy of it... but the data in the heap is changed (to zeros, or to the new location if the entity moved across archetype tables).
```cpp
// between system runs!
void despawn(shared_ptr<EntityImpl> entity){
    // do the actual components erase from storage
    entity->archetype_id = 0;
    entity->entity_index = 0;
}

bool is_null(shared_ptr<EntityImpl> entity){
    return entity->archetype_id == 0;
}
```
IOW, we do not prolong the components' lifetime with the Rc. We prolong the Entity's lifetime in the heap with the Rc.
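In Rust terms, the scheme is roughly this (my sketch, not the original C++):

```rust
use std::cell::Cell;
use std::rc::Rc;

// The Rc keeps only the small location record alive - never the components.
#[derive(Clone, Copy)]
struct Location {
    archetype_id: u32,
    entity_index: u32,
}

type Entity = Rc<Cell<Location>>;

fn despawn(entity: &Entity) {
    // the components themselves are erased from storage elsewhere
    // (between system runs); here we only mark the shared handle as dead
    entity.set(Location { archetype_id: 0, entity_index: 0 });
}

fn is_null(entity: &Entity) -> bool {
    entity.get().archetype_id == 0
}
```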
> If "entity ids decoupled from archetypes" are "pointer values", then spawning new entities would involve a bunch of small individual allocations of `Entity(Rc<EntityImpl>)` or synchronized access to some `Vec<EntityImpl>`. If done in parallel, we need some synchronized storage of these pointers to ensure that we update locations in the event that they change.

I don't understand what you mean by "pointer values"; if that means a form of smart pointer passed by value, like `shared_ptr`, then yes.
Allocations were surprisingly fast... And when allocated one-by-one, they end up dense in memory too...

All structural changes to the archetype tables were done outside of system runs: they were scheduled, then executed. (I don't know exactly how bevy_ecs Commands work, but it looks similar to that.) Systems ran sequentially (not in parallel as in bevy), so parallelism was not a concern for table alteration.
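A sketch of that "record now, apply between systems" pattern (conceptually similar to bevy's Commands, but all names here are made up):

```rust
// Structural changes are recorded during a system run and applied only
// between runs, when nothing is iterating the tables.
enum Change {
    Despawn(u32),
    // Spawn / Insert / Remove variants would live here too
}

#[derive(Default)]
struct ChangeQueue(Vec<Change>);

impl ChangeQueue {
    // callable freely while a system runs: only records the intent
    fn despawn(&mut self, entity: u32) {
        self.0.push(Change::Despawn(entity));
    }

    // called between system runs, when the tables are safe to mutate
    fn apply(&mut self) {
        for change in self.0.drain(..) {
            match change {
                Change::Despawn(_entity) => {
                    // swap-remove the entity's components here and fix up
                    // the moved-in EntityImpl, as described above
                }
            }
        }
    }
}
```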
The initial reason for keeping the entity-location data in the heap was to get rid of the entity_id -> actual_entity_location table. It looked like there was no benefit to the LUT complexity, since it is essentially the same as heap-allocated entity-location data.
I think I can make a smaller version of that ECS, if that helps...
It was a derivative of https://github.com/tower120/ecs_mt . But that version is outdated and does not actually use Rc. `EntityImpl`/`entity_data` is just stored in a non-movable memory area, so instead of an Rc, a pointer to `entity_data` is used in `entity`:

https://github.com/tower120/ecs_mt/blob/master/include/tower120/ecs/entity.hpp

So, in that version, there is no way to tell whether an entity is alive or not. It also looks like it does not have the archetype-to-components-row LUT optimization.
This discussion has been educational and useful, but I'm going to close this out as it's not particularly actionable. If there are specific take-aways or action items, new issues are very welcome.
What problem does this solve or what need does it fill?

I'm trying to get component data for EntityIds (`Vec<Entity>`) with `world.get(entity)`.... I'm experiencing some "unexpected" performance drop (in comparison to `Vec<Rc>`). Looking at the code of `World::get<T: Component>(&self, entity: Entity)`... there is so much indirection... Getting the component index from a hashmap, then bursting through 3 indirections to get the component "Column", and one branch between storage types.

What solution would you like?

Is it possible to somehow .... prefetch/cache that archetype table_row/column? That's just too much... That's waaaay slower than just cache misses with `Rc`.

Additional context

I use `Vec<Entity>` not that often... But sometimes you just have to use it ....