Doraku / DefaultEcs

Entity Component System framework aiming for syntax and usage simplicity with maximum performance for game development.
MIT No Attribution

Current state of multi-threading... #145

Open nrader95 opened 2 years ago

nrader95 commented 2 years ago

When multi-threading with the default runner, there seems to be a big overhead for me. When I profiled with the Unity profiler, system updates could sometimes take twice as long to execute for no apparent reason, and the extra time showed up in the middle of various methods, both mine and the framework's (including entity.Get<>, for example). I tried to write my own IParallelRunner based on System.Threading.Tasks.Task and the result was the same. So if the degree of parallelism is low, multi-threading can actually result in a loss of performance.

Then I tried to use raw Threads from System.Threading instead of Tasks, and the overhead was finally gone. So using Tasks seems to be a bad choice in at least some cases.
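(For illustration, a bare-bones thread-based runner along those lines can be sketched as below. This is a rough sketch, not the exact runner used in these tests, and it assumes DefaultEcs's IParallelRunner/IParallelRunnable shape: a DegreeOfParallelism property, Run(IParallelRunnable), and worker indices running from 0 to maxIndex inclusive with the calling thread taking one chunk itself.)

```csharp
using System.Threading;
using DefaultEcs.Threading;

// Naive runner sketch: spawns fresh threads on every Run call. Simple, but
// thread creation itself has a cost, so a real implementation would reuse
// worker threads instead of recreating them each frame.
public sealed class NaiveThreadRunner : IParallelRunner
{
    public NaiveThreadRunner(int degreeOfParallelism)
        => DegreeOfParallelism = degreeOfParallelism;

    public int DegreeOfParallelism { get; }

    public void Run(IParallelRunnable runnable)
    {
        int maxIndex = DegreeOfParallelism - 1;
        Thread[] workers = new Thread[maxIndex];

        // indices 0..maxIndex-1 are processed on dedicated threads
        for (int i = 0; i < maxIndex; ++i)
        {
            int index = i;
            workers[i] = new Thread(() => runnable.Run(index, maxIndex));
            workers[i].Start();
        }

        // the calling (main) thread processes the last chunk itself
        runnable.Run(maxIndex, maxIndex);

        // wait for all workers before returning control to the system
        foreach (Thread worker in workers)
        {
            worker.Join();
        }
    }

    public void Dispose() { }
}
```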

Even if the overhead I'm observing has something to do with synchronization, the current multi-threading limitations of the framework already prohibit changing entity composition from worker threads. There is still the scenario where two or more threads read a component and then write back into it, but that's not always the case.

So I think you should consider changing the default runner, or maybe adding an alternative fast & unsafe one for systems whose writes are already safe.

Doraku commented 2 years ago

Eh, really? new Task(Update, index, TaskCreationOptions.LongRunning) from DefaultParallelRunner is supposed to create dedicated threads on the dotnet framework implementation, but I guess this is not true for the Mono implementation used in Unity. I'll see if I can change the implementation for targets above netstandard1.1 (netstandard1.1 does not have Thread, hence why I stuck with Task).
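(For reference, the difference being discussed is roughly the following. Illustration only, not the actual DefaultParallelRunner code.)

```csharp
using System.Threading;
using System.Threading.Tasks;

static class DedicatedWorkerSketch
{
    static void Work(object index)
    {
        // the system update for this runner index would go here
    }

    public static void Spawn()
    {
        // LongRunning is only a *hint* to the scheduler: the default .NET
        // scheduler typically backs it with a dedicated thread, but another
        // scheduler/runtime is free to ignore it and queue the work on the
        // thread pool instead.
        Task viaTask = new Task(Work, 0, TaskCreationOptions.LongRunning);
        viaTask.Start();

        // A raw Thread is always a dedicated OS thread, whatever the runtime.
        Thread viaThread = new Thread(() => Work(1)) { IsBackground = true };
        viaThread.Start();
    }
}
```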

nrader95 commented 2 years ago

Hmm, perhaps I was a bit too quick to say Threads are good.

It seems my updates on threads were not time-synced and thus accessed the data at slightly different times. When I finally rewrote my thread-based runner to get rid of this small time desync, surprise, the overheads are back. And the main thread, which happened to run its chunk of data right after the worker threads, showed the overhead on the first few entities that overlapped in time with the worker threads' processing, but not after that.

Doraku commented 2 years ago

Keep in mind that if the number of entities is too low, using multithreading can be detrimental because of the overhead of splitting/synchronization, hence the minEntityCountByRunnerIndex parameter of the system constructors. Since this number depends on your actual system processing and the degree of parallelism of your runner, I didn't put any default, but maybe I should put 100 or something just in case.
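(For readers landing here from search, usage looks roughly like the sketch below, assuming the AEntitySetSystem<T> constructor overload that takes a World, an IParallelRunner and the minEntityCountByRunnerIndex mentioned above. The component types and the value of 100 are made up for illustration.)

```csharp
using DefaultEcs;
using DefaultEcs.System;
using DefaultEcs.Threading;

public struct Position { public System.Numerics.Vector3 Value; }
public struct Velocity { public System.Numerics.Vector3 Value; }

// Sketch: a system that only falls back on its runner once enough entities
// are present; below the threshold it runs single-threaded even though a
// runner was provided.
[With(typeof(Position), typeof(Velocity))]
public sealed class MoveSystem : AEntitySetSystem<float>
{
    // 100 is an arbitrary per-runner-index threshold chosen for the example.
    public MoveSystem(World world, IParallelRunner runner)
        : base(world, runner, minEntityCountByRunnerIndex: 100)
    { }

    protected override void Update(float elapsedTime, in Entity entity)
    {
        ref Position position = ref entity.Get<Position>();
        position.Value += entity.Get<Velocity>().Value * elapsedTime;
    }
}
```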

Doraku commented 2 years ago

I pushed the change to use thread just in case, can you check if it changes anything?

nrader95 commented 2 years ago

Tried it just now, no luck, so it seems to really be linked to operations running at the same time.

And another weird thing: the overhead gets bigger the more cores I add.

UPD: and I observe it in Unity on both Linux and Windows.

Doraku commented 2 years ago

I will probably roll it back then. How many entities do you have for your tests?

nrader95 commented 2 years ago

Yeah, it should be rolled back.

The only idea I have at the moment is that the deep profiler is messing with me, but I would need to write some heavy-lifting test code to see the difference in performance without it.

nrader95 commented 2 years ago

Okay, so I tried to measure performance without using Unity deep profiling, and it seems the deep profiler really was the culprit behind the weird data (this time I compared the time between PreUpdate and PostUpdate).

So here are the results I obtained with different runners on the same dataset: 30 units (root entities), with the system updating the local positions of their child entities using quaternions and Vector3s (I didn't count how many children per unit, but it always stays the same for a given unit).

DefaultParallelRunner (your usual Task version) indeed has trouble with performance; on this dataset and entity count it is basically the same as single-threaded, or worse. My dumb task runner, which basically comes down to "start the tasks, process my own chunk of the runnable on the main thread, then wait for the tasks in case they still haven't finished", showed the same results, so Task really does have its problems on Unity.

My thread runner showed a performance improvement on most frames, although not to the degree I was hoping for (around 2x on most frames; the problem is I specified 4 cores for the runner), so there is still room for improvement I guess.

What surprised me is that your Thread version of DefaultParallelRunner didn't fare better than its Task version, so it seems there are some other overheads inside DefaultParallelRunner.

nrader95 commented 2 years ago

By the way, another thing worth considering: currently, the entity count threshold for triggering runner usage with minEntityCountByRunnerIndex scales with the CPU core count. That means you can't predict at compile time the maximum number of entities that will still run single-threaded before the runner kicks in, since users have different CPUs with different numbers of cores.
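(To make the point concrete: assuming the effective check is "use the runner once the entity count reaches minEntityCountByRunnerIndex * DegreeOfParallelism", as the parameter name and the discussion suggest, although the exact comparison inside DefaultEcs may differ, the single-threaded cutoff moves with the hardware.)

```csharp
// Rough illustration of how the cutoff shifts with core count under the
// assumption stated above; not DefaultEcs code.
static class ThresholdExample
{
    public static void Print()
    {
        const int minEntityCountByRunnerIndex = 100;

        foreach (int cores in new[] { 4, 8, 16 })
        {
            int cutoff = minEntityCountByRunnerIndex * cores;
            System.Console.WriteLine(
                $"{cores} runner indices -> multithreading kicks in around {cutoff} entities");
        }
    }
}
```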

Doraku commented 2 years ago

> What surprised me is that your Thread version of DefaultParallelRunner didn't fare better than its Task version, so it seems there are some other overheads inside DefaultParallelRunner.

Not sure if it's related, but maybe Unity doesn't like the ManualResetEventSlim used internally for synchronization. I know that at the beginning I was using ThreadLocal<>, which had terrible performance on mobile.

> By the way, another thing worth considering: currently, the entity count threshold for triggering runner usage with minEntityCountByRunnerIndex scales with the CPU core count. That means you can't predict at compile time the maximum number of entities that will still run single-threaded before the runner kicks in, since users have different CPUs with different numbers of cores.

It is kind of hard to find a middle ground on this. The more cores you have, the more you pay for synchronisation, but also the more you can process in parallel, ending in a net gain past a certain number of entities. I chose (from gut feeling and nothing else, so I may be completely wrong!) a per-core threshold, on the assumption that the minimum number of entities to be processed per core for multithreading to pay off depends less on the actual number of cores than a threshold on the total number of entities would. Add to that that every system is different, with its own threshold.

nrader95 commented 2 years ago

> By the way, another thing worth considering: currently, the entity count threshold for triggering runner usage with minEntityCountByRunnerIndex scales with the CPU core count. That means you can't predict at compile time the maximum number of entities that will still run single-threaded before the runner kicks in, since users have different CPUs with different numbers of cores.

> It is kind of hard to find a middle ground on this. The more cores you have, the more you pay for synchronisation, but also the more you can process in parallel, ending in a net gain past a certain number of entities. I chose (from gut feeling and nothing else, so I may be completely wrong!) a per-core threshold, on the assumption that the minimum number of entities to be processed per core for multithreading to pay off depends less on the actual number of cores than a threshold on the total number of entities would. Add to that that every system is different, with its own threshold.

Maybe, though currently I'm using a system which triggers the runner after a fixed number of entities in the filter, and it seems at least no worse.

I've been thinking though: since we can't determine the overall net gain from threading with all cores, because it's a balance between the combined sync time and the actual gain, maybe it's worth trying a non-fixed number of threads. That way, you can specify how many entities a thread has to take off the main thread's workload to justify the sync cost of adding one more thread.

Something like this: we set it to 100 entities, for example, so when you have 200 entities you use 2 threads (main one included), for 300 three, and so on until you hit the thread cap specified by the runner's DegreeOfParallelism. This way the performance gains will always be bigger than the sync cost, assuming the system is configured properly.
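(A rough sketch of that idea, with a hypothetical helper name and the 100-entity figure purely as an example.)

```csharp
using System;

// Scale the number of runner indices with the entity count instead of always
// using the full DegreeOfParallelism. Names and numbers are illustrative.
static class ProportionalThreadCount
{
    public static int Get(int entityCount, int degreeOfParallelism, int entitiesPerThread = 100)
    {
        // 0..199 entities -> 1 index (main thread only is enough until a second
        // worker can take at least entitiesPerThread off its plate),
        // 200..299 -> 2, 300..399 -> 3, ... capped at DegreeOfParallelism.
        return Math.Min(degreeOfParallelism, Math.Max(1, entityCount / entitiesPerThread));
    }
}
```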

hhyyrylainen commented 10 months ago

I just ran into this situation when trying to enable threading. I had just finished converting basically all of my game logic to systems and entities. I then turned on threading for all systems that didn't do operations only allowed on the main thread, and to my surprise I lost about 50% of the performance. Searching a bit, I found this issue. It turns out the documentation doesn't mention minEntityCountByRunnerIndex at all (on the page where I read about multithreading: https://github.com/Doraku/DefaultEcs/blob/master/README.md#threading), which I think is pretty much key to solving my issue. So at the very least it would be great if the main readme mentioned the threading caveats and the one constructor parameter that should be used to tweak things.

> Something like this: we set it to 100 entities, for example, so when you have 200 entities you use 2 threads (main one included), for 300 three, and so on until you hit the thread cap specified by the runner's DegreeOfParallelism. This way the performance gains will always be bigger than the sync cost, assuming the system is configured properly.

Yes, this would be awesome. This is how I did threading before switching over to DefaultEcs, and I think it had a lot more consistency and ease of use. Now I need to do something like estimating the number of entities that use each system, pick the number of threads to report to DefaultEcs based on that, and then also use minEntityCountByRunnerIndex to limit systems that have a lot fewer entities to process than other systems. So I basically need to work in reverse to nudge things so that a reasonable number of entities to process per thread is selected. The good news is that, based on my initial testing, it is possible to get more performance from multithreading than the thread-delegation overhead costs, but it requires quite a bit of manual tweaking.