SanderMertens / flecs

A fast entity component system (ECS) for C & C++
https://www.flecs.dev
Other
6.47k stars 454 forks source link

Memory usage & performance when using entity names and parenting #941

Closed jeanlemotan closed 1 year ago

jeanlemotan commented 1 year ago

When using entity names and parenting, memory usage increases a lot and performance when iterating with a query decreases significantly.

To repro, I used the queries_hierarchy example, modified to create 100K entities and time the execution of the query (not the construction, just the execution).

Results:

  1. No names, no parents (no child_of call). Memory Total: 92MB Memory per entity: 0.94K Queries per second: ~1500

  2. Names, no parents Memory: 280MB Memory per entity: 2.87K Queries per second: ~1500

  3. No names, parents Memory: 693MB Memory per entity: 7.1K Queries per second: ~42

  4. Names & Parents Memory: 2700MB Memory per entity: 27.64K Queries per second: ~42

So using parents adds a significant memory overhead, much more than names. Names add a lot as well. Performance takes a significant hit when using parenting, around 3% of the performance of no parents.

The case with parents and names is particularly bad - with around 28K per entity.

Any idea what's going on? Any way this can be improved?

Here is the code, as modified in the main.cpp if the queries_hierarchy example. You can simply paste it in that file and run it to test it. There are 2 bool at the beginning of the main function to control names and parents.

#include <chrono>
#include <hierarchy.h>
#include <iostream>

struct Position {
    double x, y;
};

// Tags for local/world position
struct Local { };
struct World { };

int main(int, char* []) {
    flecs::world ecs;

    bool use_names = true;
    bool use_parents = true;

    // Create a hierarchy. For an explanation see the entities/hierarchy example
    char name[32];
    for (int i = 0; i < 100000; i++)
    {
        sprintf(name, "sun_%d", i);
        auto sun = ecs.entity(use_names ? name : nullptr)
            .add<Position, World>()
            .set<Position, Local>({ 1, 1 });

        sprintf(name, "mercury_%d", i);
        auto mercury = ecs.entity(use_names ? name : nullptr)
            .add<Position, World>()
            .set<Position, Local>({ 1, 1 });
        if (use_parents) {
            mercury.child_of(sun);
        }

        sprintf(name, "venus_%d", i);
        auto venus = ecs.entity(use_names ? name : nullptr)
            .add<Position, World>()
            .set<Position, Local>({ 2, 2 });
        if (use_parents) {
            venus.child_of(sun);
        }

        sprintf(name, "earth_%d", i);
        auto earth = ecs.entity(use_names ? name : nullptr)
            .add<Position, World>()
            .set<Position, Local>({ 3, 3 });
        if (use_parents) {
            earth.child_of(sun);
        }

        sprintf(name, "moon_%d", i);
        auto moon = ecs.entity(use_names ? name : nullptr)
            .add<Position, World>()
            .set<Position, Local>({ 0.1, 0.1 });
        if (use_parents) {
            moon.child_of(earth);
        }
    }

    // Create a hierarchical query to compute the global position from the
    // local position and the parent position.
    auto q = ecs.query_builder<const Position, const Position, Position>()
        // Modify terms from template to make sure the query selects the
        // local, world and parent position components.
        .term_at(1).second<Local>()
        .term_at(2).second<World>()
        .term_at(3).second<World>()

        // Extend the 2nd query argument to select it from the parent
        .term_at(2)
        // Get from the parent, in breadth-first order (cascade)
            .parent().cascade()
        // Make term component optional so we also match the root (sun)
            .optional()
        // Finalize the query
        .build();

    auto start = std::chrono::system_clock::now();

    // Do the transform
    q.iter([](flecs::iter& it,
        const Position* p, const Position* p_parent, Position* p_out)
        {
            for (auto i : it) {
                p_out[i].x = p[i].x;
                p_out[i].y = p[i].y;
                if (p_parent) {
                    p_out[i].x += p_parent->x;
                    p_out[i].y += p_parent->y;
                }
            }
        });

    auto duration = std::chrono::system_clock::now() - start;

    // Print world positions
    //ecs.each([](flecs::entity e, flecs::pair<Position, World> p) {
    //    std::cout << e.name() << ": {" << p->x << ", " << p->y << "}\n";
    //    });

    std::cout << 1.0 / std::chrono::duration_cast<std::chrono::duration<float>>(duration).count() << std::endl;
}
SanderMertens commented 1 year ago

Note that you're creating 500.000 entities, not 100.000 ;)

The RAM overhead is expected, as parenting splits up entities across tables. Names are stored in a hashmap to enable fast lookups, which also takes up space.

The query performance should be somewhat slower, but not as much as you're seeing. Try adding this to the query builder:

.instanced()

You can read more about that here: https://www.flecs.dev/flecs/md_docs_Queries.html#autotoc_md225

The tl;dr is that without instancing, if you're using iter, each entity will be iterated by itself because the query contains shared fields (from the parent). With instancing enabled entities are iterated in bulk.

I should probably update the example with this 🤔 I could use each here, which is instanced by default.

jeanlemotan commented 1 year ago

My bad, it's 500K entities indeed. I tried with instanced now and it does make a difference when there is no parenting. For the parenting case, almost no diff. Here are the new timings and mem reports, adjusted for the real 500K entities:

  1. No names, no parents Memory Total: ~100MB Memory per entity: 0.2K Queries per second: ~3300 (up from 1500)

  2. Names, no parents Memory: 280MB Memory per entity: 0.56K Queries per second: ~3300 (up from 1500)

  3. No names, parents Memory: 670MB Memory per entity: 1.34K Queries per second: ~46 (up from 42)

  4. Names & Parents Memory: 2700MB Memory per entity: 5.4K Queries per second: ~44 (up from 42)

And the code:

#include <chrono>
#include <hierarchy.h>
#include <iostream>

struct Position {
    double x, y;
};

// Tags for local/world position
struct Local { };
struct World { };

int main(int, char* []) {
    flecs::world ecs;

    constexpr bool use_names = true;
    constexpr bool use_parents = true;

    // Create a hierarchy. For an explanation see the entities/hierarchy example
    char name[32];
    for (int i = 0; i < 100000; i++)
    {
        sprintf(name, "sun_%d", i);
        auto sun = ecs.entity(use_names ? name : nullptr)
            .add<Position, World>()
            .set<Position, Local>({ 1, 1 });

        sprintf(name, "mercury_%d", i);
        auto mercury = ecs.entity(use_names ? name : nullptr)
            .add<Position, World>()
            .set<Position, Local>({ 1, 1 });
        if (use_parents) {
            mercury.child_of(sun);
        }

        sprintf(name, "venus_%d", i);
        auto venus = ecs.entity(use_names ? name : nullptr)
            .add<Position, World>()
            .set<Position, Local>({ 2, 2 });
        if (use_parents) {
            venus.child_of(sun);
        }

        sprintf(name, "earth_%d", i);
        auto earth = ecs.entity(use_names ? name : nullptr)
            .add<Position, World>()
            .set<Position, Local>({ 3, 3 });
        if (use_parents) {
            earth.child_of(sun);
        }

        sprintf(name, "moon_%d", i);
        auto moon = ecs.entity(use_names ? name : nullptr)
            .add<Position, World>()
            .set<Position, Local>({ 0.1, 0.1 });
        if (use_parents) {
            moon.child_of(earth);
        }
    }

    // Create a hierarchical query to compute the global position from the
    // local position and the parent position.
    auto q = ecs.query_builder<const Position, const Position, Position>()
        // Modify terms from template to make sure the query selects the
        // local, world and parent position components.
        .instanced()
        .term_at(1).second<Local>().self()
        .term_at(2).second<World>()
        .term_at(3).second<World>().self()

        // Extend the 2nd query argument to select it from the parent
        .term_at(2)
        // Get from the parent, in breadth-first order (cascade)
            .parent().cascade()
        // Make term component optional so we also match the root (sun)
            .optional()
        // Finalize the query
        .build();

    const auto start = std::chrono::system_clock::now();

    constexpr int iterationCount = 100;
    for (size_t i = 0; i < iterationCount; i++)
    {
        // Do the transform
        q.iter([](flecs::iter& it, const Position* p, const Position* p_parent, Position* p_out)
            {
                if (p_parent) {
                    if (it.is_self(2)) {
                        for (auto i : it) {
                            p_out[i].x = p[i].x + p_parent[i].x;
                            p_out[i].y = p[i].y + p_parent[i].y;
                        }
                    }
                    else {
                        const double px = p_parent->x;
                        const double py = p_parent->y;
                        for (auto i : it) {
                            p_out[i].x = p[i].x + px;
                            p_out[i].y = p[i].y + py;
                        }
                    }
                }
                else {
                    for (auto i : it) {
                        p_out[i].x = p[i].x;
                        p_out[i].y = p[i].y;
                    }
                }
            });

    }
    const auto duration = (std::chrono::system_clock::now() - start) / iterationCount;
    std::cout << 1.0 / std::chrono::duration_cast<std::chrono::duration<float>>(duration).count() << std::endl;
}
SanderMertens commented 1 year ago

I ran the test locally and confirmed the numbers. A few observations:

I did measure a few things that I cannot explain (removing cascade improves performance, but it shouldn't make a difference). There may be some things I can do to improve.

Another thing you can do to significantly improve performance of systems that match with large numbers of hierarchies/other relationships is to use query groups. Query groups (see the queries/group_by_* examples) let you group tables together, and select the tables in advance to query.

This would for example let you group tables by world cell, and only iterate the ones that are close to the camera. This is something I do in the Flecs hub renderer module: https://github.com/flecs-hub/flecs-systems-sokol/blob/master/src/modules/geometry/geometry.c#L891

There are a few storage improvements on the roadmap that could improve RAM performance, stay tuned 😊 Those improvements can also help with query performance, especially for queries that aren't interested in hierarchies and just want to iterate a plain list of components.

jeanlemotan commented 1 year ago

Thanks a lot for the investigation. The problem is that in my use-case (gaming) it's more common to have deep hierarchies rather than wide ones, mostly due to skeletal nodes. So it's very common to have lots of nodes with only one child. In such cases, even the classis OOP Node* approach yields better results, performance and memory wise.

I'd really like to keep using flecs as I love the API and the features it provides, and now also the support :) So ATM I'm rethinking to structure the data with a custom relationship component so that I can avoid the child_of call and fragment the tables. Is there any way to iterate efficiently or sort to that I can process the parents before the children? Any other suggestions?

To give you an idea of the kind of data we're dealing with, there are usually 15-30K nodes in a scene, and each of these has 2-3-10 child nodes (in a deep hierarchy) with some (characters) having up to 130-150 nodes in a biped skeleton (so also a deep hierarchy). A level has a total of 100-300K entities, sometimes more. The level can be partitioned/grouped in the game, but the tooling has to be able to efficiently query and iterate all the nodes for various operations in realtime (so in under single digit milliseconds). Then there's the memory problem, which easily goes in the hundreds of megabytes of overhead.

We're thinking to replace a classic Node* approach with flecs btw, and apart for this issue, everything else is faster, easier, better.

Thanks again for the effort!

SanderMertens commented 1 year ago

Ahh right that makes sense! For skeletal nodes (especially if you have lots of them) the overhead of relationships is probably not worth it, I think I would use a more dedicated solution for that (like a component with a vector of nodes).

I think I'll be able to bring down the RAM footprint, possibly significantly, but almost certainly not below a simple OOP-style hierarchy. That is because of a few things:

Relationships work well when:

Your use case is kind of the opposite of that 😅 You're probably not interested in querying a specific part of a rig, skeletal hierarchies are static, and like you mentioned they're deep and not very wide.

This is an interesting use case, I've been thinking for a while on how to support static hierarchies like the one you describe, which wouldn't come with the overhead of separate tables/search indices etc. Storage wise that would be very similar to how you'd solve it if you didn't have relationships, which would be a component with child data.

Given what I know, I think that's how I would implement it :) You should still be able to use hierarchies/relationships for the top-level objects, but yeah for the skeletal nodes something else probably works better.

jeanlemotan commented 1 year ago

Thanks a lot for all the info, it clarifies everything. My use-case doesn't match, indeed. I'll look for alternative representations and hopefully still use flecs for the main structure. Thanks a lot!

SanderMertens commented 1 year ago

Quick update- this issue got me thinking about a feature that lets you "flatten" a static hierarchy, which would be able to combine entities for different parents in the same table. I just got a prototype of it working, lmk if you'd be interested in trying it out :)

jeanlemotan commented 1 year ago

Sure, send it over, I'll try it in my sample.

SanderMertens commented 1 year ago

The feature is available on this branch: https://github.com/SanderMertens/flecs/tree/flatten_hierarchy

I still need to do a bit more testing with it but it seems to work reasonably well so far. What it does is it turns a regular tree into a fixed tree, which combines entities from different parents into the same table. To turn a tree into a fixed tree, call:

ecs_make_fixed(world, ecs_pair(EcsChildOf, parent_of_tree));

Systems should still work as usual. There are a few limitations for now:

Let me know if this helps! I plan on merging it in the next few days.

SanderMertens commented 1 year ago

The feature just merged: https://github.com/SanderMertens/flecs/blob/master/include/flecs.h#L2942

alexconstant9108 commented 1 year ago

Hello I am having a similar use case (but using is_a as opposed to child_of) to @jeanlemotan

So I've added 50k entities (some have been specified as flecs::Final; no names just world.entity() to keep things as lean as possible) to an empty world and then created 75k .is_a() ( flecs::IsA) relationships between some of those entities.

Just adding the 75k relationships takes ~250ms on a beefy Ryzen 9 @4.2 GHz CPU. Is this to be expected?

Also I have to frequently destroy the world and start with a new one where the destruction of the current one takes more than 100ms. Is there a faster way to destruct a filled-in world (no special cleanup required as long as there are no memory leaks of course)?

@SanderMertens any tips on how to speed things up? I was hoping for something like 2-3ms (not 200-300ms) for adding all those relationships :)

@jeanlemotan have you come up with a more efficient representation for those hiearchies?

SanderMertens commented 1 year ago

@alexconstant9108 Make sure that:

Can you provide an example code snippet that shows what takes 250ms? For reference, I just created this quick example which finishes in ~5ms on an M1 laptop:

    ecs_time_t t;
    ecs_time_measure(&t);
    ecs_entity_t base = ecs_new_id(ecs);
    for (int i = 0; i < 75000; i ++) {
        ecs_entity_t e = ecs_new_id(ecs);
        ecs_add_pair(ecs, e, EcsIsA, base);
    }
    printf("t = %f\n", ecs_time_measure(&t));

It is possible that if you're creating many (thousands) different relationship pairs that this takes some time because of internal administration that's created. This administration is mainly for the benefit of being able to query relationships (though IsA has a bunch of other features too). If you don't need specific features provided by relationships, you may be able to get better performance from just using a component with an entity value in it.

Additionally, if this is code that initializes a world, make sure to run it outside of a system. By default ECS operations inside systems are inserted as commands, which has additional overhead.

Shutting down a world in 100ms is not unusual, world creation/deletion is not expected to be fast.

alexconstant9108 commented 1 year ago

Thanks for the tips!

I will do some more tests and tweaks and report back here.

alexconstant9108 commented 1 year ago

@SanderMertens

Okay, so having specified -O3, -lto, FLECS_NDEBUG Here are the results for 2 loops where loop 1, creates 50k entities and stores them into myMap; loop 2 iterates over 75k pairs and creates an is_a relationship. Also I am copying myMap[entityIdIntA] and myMap[entityIdIntB] into temporaries in loop 2 before the below line just to make sure that it's not the hashmap access that's slowing things down and also that the compiler won't optimize away the whole loop

Total measurements where the only difference is the following line being commented out or not in the second loop in micro seconds 9928 microseconds == 9.9ms when commented out. 125605 microseconds == 125ms when NOT commented out.

myMap[entityIdIntA].is_a(myMap[entityIdIntB]);

Just to mention: using flecs v3.2.4 Also, I am not using Systems at all, I only have -DFLECS_CUSTOM_BUILD -DFLECS_CPP

Additional question regarding Shutting down a world in 100ms is not unusual, world creation/deletion is not expected to be fast. Can I reuse a world somehow as a performance improvement? E.g. clear all previous entities and relationships and insert the new one in the old world? Would that be faster tham deleting the old world and creating a new one from scratch?

SanderMertens commented 1 year ago

@alexconstant9108 can you share the code you're benchmarking with? There is a range of things that could be contributing to the 125ms, if you show me exactly what you're doing I might be able to explain what's happening / suggest improvements. I'm not sure what you're doing in loop 2- you have 50k entities but 75k pairs? Showing the code would help a lot.

E.g. clear all previous entities and relationships and insert the new one in the old world? Would that be faster tham deleting the old world and creating a new one from scratch?

Yup it would be. Usually the quickest way to delete all entities is to use the delete_with operation, if entities have a common component. If they're all in the same subtree, you can delete all entities by deleting the parent.

alexconstant9108 commented 1 year ago

@SanderMertens here is a contrived example that should reproduce the issue. You'll have to uncomment the full contents of both arrays (note that that'll take a while to compile with full optimizations) for the full test or store the contents into files and read from those.

https://gist.github.com/alexconstant9108/bfee13d3d226634ed40e1beaf3206a7d

Ooops, there is a copy paste error: change the declaration of arrayB to std::vector<std::pair<std::uint16_t, std::uint16_t>> arrayB

copygirl commented 1 year ago

Ooops, there is a copy paste error: change the declaration of arrayB to

You can edit Gists. (It keeps the edit history. It's like a tiny git repository.)

alexconstant9108 commented 1 year ago

Ooops, there is a copy paste error: change the declaration of arrayB to

You can edit Gists. (It keeps the edit history. It's like a tiny git repository.)

Generally yes, but this gist in particular is huge because @SanderMertens wanted a fully reproducible example

SanderMertens commented 1 year ago

Looks like you have some issues in the code, when I run it (in debug mode) I get an assert:

fatal: id_record.c: 182: assert: !is_final cannot inherit from final entity (CONSTRAINT_VIOLATED)

Looks like this is because you're using explicit ids, and some of them overlap with Flecs ids. Trying to workaround it..

alexconstant9108 commented 1 year ago

I thought I had already filtered out the few errors in the IDs where the RHS Id had been declared as final. Can you tell me which IDs violate the constraint?

Looks like this is because you're using explicit ids, and some of them overlap with Flecs ids. Trying to workaround it..

What do you mean? I intentionally do NOT specify my integer IDs as the argument to .entity() since that was messing with flecs' internal IDs. I use just auto currentEntity = world.entity();

SanderMertens commented 1 year ago

It asserts on the first one. You should be able to reproduce by running the code in debug mode.

alexconstant9108 commented 1 year ago

It asserts on the first one. You should be able to reproduce by running the code in debug mode.

That's weird. Have you fixed the typo in the arrayB declaration? It should be: std::vector<std::pair<std::uint16_t, std::uint16_t>> arrayB

SanderMertens commented 1 year ago

The problem was that you did

entityMap[pair.first].is_a(pair.second);

instead of

entityMap[pair.first].is_a(entityMap[pair.second]);

EDIT- nvm, fixing the typo fixed it.

SanderMertens commented 1 year ago

Took 5 minutes to compile... and got another assert (same one).

Is there another way you could demonstrate the problem? This code is pretty difficult to iterate on.

SanderMertens commented 1 year ago

Bit easier to work with, and I think this may demonstrate the issue you're seeing:

#include "flecs.h"

#define TARGET_COUNT (75000)
#define COUNT (50000)

int main(int argc, char *argv[]) {
    flecs::world world;

    world.import<flecs::monitor>();

    std::vector<flecs::entity> entities;
    std::vector<flecs::entity> targets;

    for (int i = 0; i < COUNT; i ++) {
        entities.push_back(world.entity());
    }
    for (int i = 0; i < TARGET_COUNT; i ++) {
        targets.push_back(world.entity());
    }

    ecs_time_t t = {0};
    ecs_time_measure(&t);
    for (auto e : targets) {
        int32_t index = rand() % COUNT;
        entities[index].is_a(e);
    }
    printf("time = %f\n", ecs_time_measure(&t));
}

This code for me takes ~110ms to run, which is expected. The reason is that I'm creating 75k unique pairs, which in the storage also creates about as many tables. There's also a bit of additional logic that happens for IsA relationships, to support things like event propagation and component overriding.

I don't think this could get much faster than this. If you wanted to do 75k operations in 5ms, you'd have ~60ns per operation. That's usually enough for regular component/add remove operations, but not when the storage also has to create the unique pairs & storage tables.

alexconstant9108 commented 1 year ago

Perhaps, I should rather use a simple tag (e.g. IsALite) instead of isA (because in my case I don't actually need to support things like event propagation and component overriding)? OTOH I actually need the transitive property of the isA relationship though: if A IsALite B and B IsALite C, then A IsALite C, so I should be able to ask flecs if A IsALite C and get true as result.

alexconstant9108 commented 1 year ago

Took 5 minutes to compile... and got another assert (same one).

Is there another way you could demonstrate the problem? This code is pretty difficult to iterate on.

Yeah, that's why I made so many typos. It's really hard to edit / compile such big inline defined arrays without the whole IDE crashing. You could instead put the contents of each array into two separate files and load them from there on program start

copygirl commented 1 year ago

You can define your own relationship that is transitive.