All8Up opened this issue 2 years ago
The purpose of this style of iteration is for use with something like Rayon or other parallel execution solutions.
At least when using Rayon, I think providing an implementation of IndexedParallelIterator would be preferable due to the usability and integration with Rayon's facilities that it provides. This can be done safely using exactly the same borrowing checks this crate already performs, and it pushes the synchronization concerns into Rayon's join machinery. It can also provide efficiency similar to the second style proposed here via an implementation of Producer::fold_with.
If you are interested in that, I did such an implementation for rs-ecs which should be straightforward to port back into hecs, even though I am not sure whether integration with a specific threading system like Rayon is in scope, considering the comment on e.g. QueryBorrow::iter_batched.
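For readers unfamiliar with Rayon's plumbing, a minimal sketch of what such an implementation could look like over a plain slice follows. The `Columns` wrapper and `ColumnsProducer` here are hypothetical stand-ins; in hecs, archetype columns would play this role.

```rust
use rayon::iter::plumbing::{bridge, Consumer, Producer, ProducerCallback, UnindexedConsumer};
use rayon::prelude::*;

// Hypothetical wrapper over contiguous component data.
struct Columns<'a, T>(&'a [T]);

impl<'a, T: Sync> ParallelIterator for Columns<'a, T> {
    type Item = &'a T;

    fn drive_unindexed<C>(self, consumer: C) -> C::Result
    where
        C: UnindexedConsumer<Self::Item>,
    {
        bridge(self, consumer)
    }

    fn opt_len(&self) -> Option<usize> {
        Some(self.0.len())
    }
}

impl<'a, T: Sync> IndexedParallelIterator for Columns<'a, T> {
    fn len(&self) -> usize {
        self.0.len()
    }

    fn drive<C: Consumer<Self::Item>>(self, consumer: C) -> C::Result {
        bridge(self, consumer)
    }

    fn with_producer<CB: ProducerCallback<Self::Item>>(self, callback: CB) -> CB::Output {
        callback.callback(ColumnsProducer(self.0))
    }
}

// The Producer is what Rayon's join machinery splits recursively; overriding
// its provided `fold_with` method is where a sequential fast path would go.
struct ColumnsProducer<'a, T>(&'a [T]);

impl<'a, T: Sync> Producer for ColumnsProducer<'a, T> {
    type Item = &'a T;
    type IntoIter = std::slice::Iter<'a, T>;

    fn into_iter(self) -> Self::IntoIter {
        self.0.iter()
    }

    fn split_at(self, index: usize) -> (Self, Self) {
        let (left, right) = self.0.split_at(index);
        (ColumnsProducer(left), ColumnsProducer(right))
    }
}
```

With that in place, users would get `.zip()`, `.enumerate()`, and the rest of Rayon's indexed combinators for free.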
I don't disagree that if I were doing this for Rayon, that would likely be the best solution. But I don't use Rayon; I only pointed it out as an example of where this could be used, either as a generic way to integrate or as an example of how to accomplish the task. Specifically, while there is nothing wrong with Rayon, it is not built with the same requirements in mind. Generally speaking it is a fork/join model of threading, whereas an ECS will most likely use a pre-generated execution graph. The primary difference is in the scoping: Rayon would be "borrow -> iterate -> release -> |likely worker stall| -> borrow -> iterate -> release", versus the graph, which is "borrow everything once -> iterate many things using job completions to scope borrows -> release everything". The greatly reduced overhead and fewer thread stalls are a pretty massive gain in comparison.
I'd be willing to help with such an implementation for Rayon, but this solution will work with any threading library, which is the primary goal.
hecs already supports parallel iteration; see QueryBorrow::iter_batched. The implementation is simple, safe, and efficient. Do you have requirements this does not satisfy?
Unless I tried to use it completely wrong, the batched iterator will only work in a fork/join model of execution (like Rayon parallel for loops, or spawning an async job and then awaiting it), which is a step up from serial execution but really not particularly good for an ECS execution graph. Legion was doing the closest to what I've been doing, but even that didn't seem to be doing the full monty. So, consider the case of batched iter, assuming this is how it is intended to work (?):
```rust
let mut borrow = world.query::<(&u32, &mut u64)>();
let mut batched = borrow.iter_batched(100);
// ... likely use rayon or whatever to issue a job per batch ...
// ... wait for everything to complete so you can safely drop the iterator and the borrow ...
```
That seemed to be the intended pattern, since you must maintain the initial borrow and can't clone the batched iter into threads; of course the iterators are internally not thread safe, so even if you could put them into a shared reference they'd probably not make it past a single iteration before they crashed. For the occasional system which needs parallel execution that probably works just fine, but assume you have > 100 systems like my test bed and 20,000 entities; going parallel is critical to maintain interactive simulation rates.
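For concreteness, this is roughly the shape of that fork/join pattern as a sketch, using `std::thread::scope` as the scoping mechanism and placeholder `u32`/`u64` components (a real application would use a thread pool rather than spawning a thread per batch):

```rust
use hecs::World;

fn heavy_system(world: &World) {
    let mut borrow = world.query::<(&u32, &mut u64)>();
    std::thread::scope(|s| {
        // One job per batch; each batch borrows from `borrow`.
        for batch in borrow.iter_batched(100) {
            s.spawn(move || {
                for (_id, (a, b)) in batch {
                    *b += u64::from(*a); // placeholder per-entity work
                }
            });
        }
    });
    // The scope joins every job here, so only now can the iterator and the
    // borrow be dropped safely.
}
```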
So, with the non-binding iterator which I posted, you can write a function like the following. When you submit the schedule, the only thing that has to be created is one opaque iterator per system that executes in parallel; then hand the function and a clone of those iterators to 1..n threads. Because no borrows on the world are taken to create the function, there is no interaction with the main thread required from there on out. The only thing the main thread does after that is talk to thread-dumb OSes, like a certain Apple OS that gets pissy if any thread other than main talks to it.
```rust
// Just one ref to non mutable world and a clone of all the prepared iterators.
// This assumes you have more than one compatible system which is also capable of
// being executed in parallel. Almost all of my systems are, or will soon be, both.
fn execute(&self, world: &World, iterators: Vec<...>) { ... }
```
While this is just a silly example of what a simple parallel system executor would look like, this is not actually that far from the real thing that I use, conceptually. The key is that there is only the one world borrow in the function the threads call, rather than a QueryBorrow which has to be maintained externally. Extending the above to deal with incompatible systems is simply a matter of passing in a Vec<Vec<...>> of those iterators instead.
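To make the shape of this concrete, here is a minimal self-contained sketch of the late-binding idea. The stand-in `World`, the `ParIter` type, and its shared-cursor scheme are all hypothetical, standing in for the proposed API rather than hecs itself:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Stand-in for hecs::World; only the control flow matters here.
struct World;

// Hypothetical late-binding iterator: cheap to clone, holds no borrow on the
// world, only a shared cursor that workers use to claim batches.
#[derive(Clone)]
struct ParIter {
    cursor: Arc<AtomicUsize>,
    batches: usize,
}

impl ParIter {
    fn claim(&self) -> Option<usize> {
        let i = self.cursor.fetch_add(1, Ordering::Relaxed);
        (i < self.batches).then_some(i)
    }
}

// Every worker calls this with the same &World and its own clone of the
// prepared iterators; no QueryBorrow is ever held on the main thread.
fn execute(world: &World, iterators: Vec<ParIter>) {
    for it in &iterators {
        while let Some(batch) = it.claim() {
            // ... run the system body for `batch` against `world` ...
            let _ = (world, batch);
        }
    }
}

fn main() {
    let world = World;
    let prepared = vec![ParIter { cursor: Arc::new(AtomicUsize::new(0)), batches: 8 }];
    let world_ref = &world;
    std::thread::scope(|s| {
        for _ in 0..4 {
            let iters = prepared.clone();
            s.spawn(move || execute(world_ref, iters));
        }
    });
}
```

The point of the shared cursor is that the workers synchronize among themselves; the main thread only clones the prepared iterators and hands them out.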
Hope this makes some sense. I could probably build a diagram that would make it much more understandable. Of course, if I'm doing something stupid and the current batched iter can do the same thing, crow will be my main course for a week. :) Though, I'm fairly confident that I can't do this without maintaining the QueryBorrows out in the main thread with the existing API, which would break the separation of the generated graph being built once and then just cloned into the threads repeatedly as needed.
Since parallel_query is unsafe, I think it would be good to give a much more concise description of how the proposed API can be used safely. "Manually ensure that your execution graph does not contain conflicts." would not be convincing for me personally, as the usability of that, together with the flexibility of composing systems, does not look like a helpful design IMHO.
Though, I'm fairly confident that I can't do this without maintaining the QueryBorrows out in the main thread
Note that World: Sync, i.e. multiple threads can concurrently construct QueryBorrow instances against a single world. They will not be synchronized, but conflicts will be prevented. This should allow dispatching multiple systems in parallel, each of which can dispatch multiple batches of query results into parallel jobs.
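For illustration, a minimal sketch of that pattern, with hypothetical Position/Velocity components and `std::thread::scope` bounding the borrows:

```rust
use hecs::World;

struct Position(f32);
struct Velocity(f32);

fn run_two_systems(world: &World) {
    // World: Sync, so each thread may take its own QueryBorrow; hecs's
    // runtime borrow checks will panic on conflicting access rather than
    // permit a data race.
    std::thread::scope(|s| {
        s.spawn(|| {
            for (_id, pos) in &mut world.query::<&Position>() {
                let _ = pos.0; // read-only system body
            }
        });
        s.spawn(|| {
            for (_id, vel) in &mut world.query::<&mut Velocity>() {
                vel.0 *= 0.99; // writing a different component is fine
            }
        });
    });
}
```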
much more concise description of how the proposed API can be used safely
Absolutely! I was trying to get feedback on the implementation to make sure it wasn't doing anything outstandingly stupid before going into that, and of course to see if it would be an acceptable thing at all. I'll outline the way it would be used; it will likely include a bit of how my abstraction on top of hecs works, but it won't be too different from a raw hecs approach.
QueryBorrow instances against a single world. They will not be synchronized
Yup, that's how I was doing a few things initially. Unfortunately it is exactly the "not synchronized" bit which prevented it from being used when I needed to actually parallelize a single system that was simply too heavy to execute single-threaded in effective time.
I'll get an outline of something put together probably tonight or tomorrow, likely tomorrow as it's getting late here.
I admit I have difficulty following the above example. A concise, complete, and concrete API proposal and a minimal illustration of its use would be helpful. Note that the bar for inclusion of APIs that are complex or unsafe is high, and multiplicatively so for APIs that are both.
unfortunately it is exactly the "not synchronized" bit which prevented it from being used
The intention is that, if you need this level of concurrency, you can introduce your own synchronization at a higher level, using whatever mechanism is appropriate.
I admit I have difficulty following the above example. A concise, complete, and concrete API proposal and a minimal illustration of its use would be helpful. Note that the bar for inclusion of APIs that are complex or unsafe is high, and multiplicatively so for APIs that are both.
Actually, in thinking about how to present it, I might have come up with a useful example. I'm going to convert the ffa_simulation example to run in parallel, using Rayon scopes as the thread domain "scoping". While the simulation is simple, it presents two separate gotchas that have to be corrected in order to run in parallel. On the other hand, the O(n^2) search and, say, 1000 entities should give it enough weight to show the effect of the parallelization. I figure if I point out the steps that go into how things are converted, and how/why the Rust borrow rules are being satisfied in the thread execution domain, that will make follow-up explanations easier to present. It should be helpful as a starting point to address your concerns about how it would be used.
I'll also explain the key to making a schedule in there. The three systems implemented are a viable way to explain the topological sort, which is how you turn a collection of systems into a safe execution graph. Having only 3 systems is beneficial for a manual walk-through.
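As a rough illustration of the idea (not the PR's ScheduleBuilder): the sketch below greedily groups hypothetical system descriptions into stages, using read/write component sets as the borrow-rule constraints. This greedy staging is a simpler stand-in for the topological sort described above; all names here are made up for the example.

```rust
use std::collections::HashSet;

// Hypothetical description of one system's data access; components are
// identified by name only to keep the sketch self-contained.
struct SystemDesc {
    name: &'static str,
    reads: HashSet<&'static str>,
    writes: HashSet<&'static str>,
}

// Mirror Rust's borrow rules: a write conflicts with any other access.
fn conflicts(a: &SystemDesc, b: &SystemDesc) -> bool {
    a.writes.iter().any(|c| b.reads.contains(c) || b.writes.contains(c))
        || b.writes.iter().any(|c| a.reads.contains(c))
}

// Greedily place each system into the first stage it does not conflict with;
// stages run in order, and systems within a stage may run in parallel.
fn build_schedule(systems: Vec<SystemDesc>) -> Vec<Vec<SystemDesc>> {
    let mut stages: Vec<Vec<SystemDesc>> = Vec::new();
    for sys in systems {
        match stages
            .iter()
            .position(|stage| stage.iter().all(|s| !conflicts(s, &sys)))
        {
            Some(i) => stages[i].push(sys),
            None => stages.push(vec![sys]),
        }
    }
    stages
}

fn main() {
    let stages = build_schedule(vec![
        SystemDesc { name: "integrate", reads: ["Velocity"].into(), writes: ["Position"].into() },
        SystemDesc { name: "collide", reads: ["Position"].into(), writes: ["Health"].into() },
        SystemDesc { name: "regen", reads: HashSet::new(), writes: ["Health"].into() },
    ]);
    for (i, stage) in stages.iter().enumerate() {
        let names: Vec<_> = stage.iter().map(|s| s.name).collect();
        println!("stage {i}: {names:?}");
    }
}
```

A real scheduler would additionally honor explicit before/after ordering between systems; the borrow-rule constraints above are the part that mirrors hecs's existing dynamic checks.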
I'll finish it up in the morning and update the PR. Also, I'm going to go ahead and remove the iterator-style implementation; it's just too horrible to look at any longer... :)
I admit I have difficulty following the above example.
Well, if you check the PR now, I've converted the FFA example over to a full-on example of parallel scheduling with what you might call a "threaded hecs starter kit". It's now a MOAE (mother of all examples) simply because I didn't feel like half-assing the setup constantly. It's still semi-half-assed in that it doesn't provide all the protections against giving it invalid component configurations, doesn't wrap up shared data access, and of course it uses a work-stealing job system rather than a parallel graph executor, meaning it "could" still use the batch iterator since it has to dynamically generate the jobs and they are all scoped.
If you go down to the ScheduleBuilder, you can see how relatively simple the scheduling is. It's all about the data, and duplicating the Rust borrow rules as the constraints to the topological sort. Given the simplistic nature of the FFA example, this is so overkill as an example that it's not even funny. I might do something a bit more with it to justify the monstrosity for fun, but right now I figure that as far as an example goes, it covers the bases... :)
Note that the bar for inclusion of APIs that are complex or unsafe is high
I wanted to reply specifically to this part, given I missed it the first time. After removing the implementation which attempted to mirror the existing iterator patterns, this is actually a very simple implementation and API; the "example" is now the bulk of the code, consisting of about 350 lines of "how to thread hecs", since there are many different bits involved in using hecs itself safely with threads on top of the new API function. The only reason I marked it unsafe is the lack of borrow checks on the columns; otherwise there is no actual unsafe code behind the API at this point.
I'm hoping that when the company acquisition finishes I can show my personal implementation of all this and how the API removes the user-side potential for errors. Till then the big example is the best I can do; it's probably representative of what I started with several years ago before I got all the pieces in place.
I have posted an initial PR #257 which contains two alternative implementations of parallel iteration. One uses a pattern similar to existing queries which return an iterator; the other is a late-binding style where workers each call into the world, providing the query and a function, with a piece of opaque data shared between the workers to keep things synchronized. I highly suggest the second variation, even though it is a different API style, because it removes considerable amounts of code complexity and also quite a few safety issues. Though, it is important to point out that it is still unsafe, as it performs zero borrow checking due to the nature of partitioning slices of the archetypes over the threads, which the current borrow checks can't handle to my knowledge.
The purpose of this style of iteration is for use with something like Rayon or other parallel execution solutions. Why would you bother? When using the ECS pattern you almost always end up, eventually, with one or two heavyweight systems, either because they are complicated/expensive per entity or because there are just tons of entities. So, while running multiple queries in parallel provides great benefits, eventually these bottleneck systems occur and slow things down. If you can run the system itself in parallel, you can break these bottlenecks, which is the purpose here.
I posted some numbers in the PR showing a relatively complex AI sim running with just per-system parallelism versus adding parallel execution to the two primary bottleneck systems. It ends up about a 300% gain for 20,000 entities. If I finish parallelizing the last two primary bottleneck systems it will likely be closer to a 600-700% gain on that particular machine.