Parallel bulk loading - Githubissues

earthnuker commented 1 year ago

Hello,

for a project of mine, a long distance route plotter for the game Elite: Dangerous, I'm using an R-Tree as my main data structure for nearest-neighbor queries, my data set contains about 87 million 3D-Points with some metadata attached, and constructing an R-Tree with the following parameters:

pub struct LargeNodeParameters;
impl RTreeParams for LargeNodeParameters {
    const MIN_SIZE: usize = 20;
    const MAX_SIZE: usize = 40;
    const REINSERTION_COUNT: usize = Self::MAX_SIZE - Self::MIN_SIZE - 1;
    type DefaultInsertionStrategy = RStarInsertionStrategy;
}

takes around 30 seconds on my machine, query performance is excellent at ~23M queries per second distributed across 32 threads

i have done a quick skim over the bulk-loading code and I've also skimmed the paper it is based on bit it goes a bit over my head. my main question is: would it be feasible to parallelize the bulk loading using something like rayon to improve performance?

I've experimented with splitting my data up and constructing multiple trees i can query in parallel but didn't have too much success with that.

best regards,

Earthnuker

Stoeoef commented 1 year ago

Parallel bulk loading is something I've actually implemented a long while ago. I then discarded the changes as I thought that the break-even point (number of elements where parallel load is faster than sequential load) was too high and that the maintenance burden wasn't worth the additional code.

Reading through my old code, I've written that the break-even point would be about 40000 elements... I guess I never considered back then that people would use rtrees with millions of elements. Or maybe this is just a typo.

In any case, the parallel bulk load got removed in commit eae7354bfbf7d4276f66237304259d2fd544d60e .

I have reverted that commit and have fixed all compilation issues. Would you be interested in testing the implementation? If this feature benefits your use case, it may be worthwhile to keep it.

For testing, checkout this branch:

https://github.com/Stoeoef/rstar/tree/bulk-load-parallel

Then, enable the "threadpool" feature and use RTree::bulk_load_parallel (or RTree::bulk_load_with_params_parallel(...))

Note that there's still some work to do, even if the parallel algorithm works just fine - e.g. checking the test coverage of the parallel algorithms and probably renaming the feature to something more descriptive. I can look into that if the changes look promising.

Performance wise, the threading-split is fairly coarse grained - only the root node is loaded in parallel. However, since the root will have a lot of children with the config you are using, you may still see some good speed up.

adamreichold commented 1 year ago

I know its much to ask for, but I think it would be nice if this used Rayon's thread pool to better integrate into code bases already using it for data parallel processing.

I am not sure, but the algorithm might even be amenable to using Rayon's join primivite to avoid the overhead of spawning heap-allocated tasks there decreasing the break-even point.

Stoeoef commented 1 year ago

I'm not bound to the threadpool library - I think I've only chosen it as it was more lightweight. Using rayon should be fine.

I also tried using join back then but without too much success - from memory, work items will spawn new work items in a very irregular fashion which made this really slow.

I think a real good speed-up could be gained by performing selection in parallel. I believe that that is, with some margin, the part where the implementation spends most time in. Currently, this is done using https://doc.rust-lang.org/stable/std/primitive.slice.html#method.select_nth_unstable .

I'm not aware of any library that implements a parallel implementation for that unfortunately.

Stoeoef commented 1 year ago

FWIW, there's a rayon issue about par_select:

https://github.com/rayon-rs/rayon/issues/838

earthnuker commented 1 year ago

just gave the parallel-bulk-loading branch a test drive and:

Serial bulk loading:
[0:00:00.347353] INFO | ed_lrr.route:src\route.rs:570 | Loading [C:/Users/Earthnuker/ed_lrr/data/stars]
[0:00:00.963467] INFO | ed_lrr.route:src\route.rs:575 | 92162363 Systems loaded in 616.6ms, 149.5M systems/s
[0:00:40.203789] INFO | ed_lrr.route:src\route.rs:583 | R*-Tree built in 39.07s
[0:00:40.203789] INFO | ed_lrr.route:src\route.rs:585 | Total time: 39.86s

Parallel bulk loading:
[0:00:00.204990] INFO | ed_lrr.route:src\route.rs:570 | Loading [C:/Users/Earthnuker/ed_lrr/data/stars]
[0:00:00.551782] INFO | ed_lrr.route:src\route.rs:575 | 92162363 Systems loaded in 346.7ms, 265.8M systems/s
[0:00:11.420669] INFO | ed_lrr.route:src\route.rs:583 | R*-Tree built in 10.87s
[0:00:11.421669] INFO | ed_lrr.route:src\route.rs:585 | Total time: 11.22s

gives quite a nice speedup even if it didn't even get close to saturating all 32 threads of my CPU

Stoeoef commented 1 year ago

Thanks for testing this! Good to see that this still gives a nice speed up. I think that's enough justification to warrant a change, even if the algorithm isn't ideal yet.

For me, the remaining TODOs would be:

[ ] Open a PR
[ ] Check test coverage of the parallel bulk load. I think tests might be duplicated on that test branch?
[ ] Rename the feature - my suggestion would be bulk_load_parallel in accordance with the method name.
[ ] Add / revisit appropriate documentation in lib.rs
[ ] Add / revisit appropriate documentation for bulk_load_parallel and bulk_load_with_params_parallel
[ ] Update Readme
[ ] Consider moving to a different library than threadpool. IMO, switching to rayon only makes sense if we plan to use a rayon feature (like join). I would suggest to postpone this switch to some later point - it's a non-breaking change after all. In the meantime, threadpool seems to be a smaller (faster compile times) option. No strong point for me though. Switch from threadool to rayon_core

@adamreichold : Is there anything else you would consider to be necessary? @Earthnuker : Would you be interested in finishing up these steps? I'm happy to do the review and assist with any issues.

adamreichold commented 1 year ago

IMO, switching to rayon only makes sense if we plan to use a rayon feature (like join).

In my opinion, the most important part of using Rayon would be to use same thread pool as other parts of a program that already uses Rayon. Of course, if we just want to spawn tasks onto Rayon's thread pool, a dependency on rayon-core would suffice with its spawn function just replacing the calls to ThreadPool::execute (thereby using the "current" thread pool which the user can change by "installing" their workload in any thread pool they want).

I think this change would be minimal but I could certainly try to provide it as a follow-up PR since I asked for it if there is no reason against it.

@adamreichold : Is there anything else you would consider to be necessary?

For such algorithms, I do like property-based tests of the sequential versus the parallel implementation. I am not sure though how powerful such tests would be as I suspect some non-deterministic ordering of the results of the parallel variant?

Stoeoef commented 1 year ago

In my opinion, the most important part of using Rayon would be to use same thread pool as other parts of a program that already uses Rayon.

Ah, that's a good argument for using rayon - color me convinced! I agree, the change will be fairly minimal.

For such algorithms, I do like property-based tests of the sequential versus the parallel implementation. I am not sure though how powerful such tests would be as I suspect some non-deterministic ordering of the results of the parallel variant?

I think property test should still be doable. The "only" difference between the outcomes of the parallel and the sequential solution should be the order of the tree's root nodes. That should hardly be noticeable. Alternatively we could consider to "simulate" the thread pool by executing tasks sequentially rather than on the thread pool - this will make the tests totally deterministic.

For this ticket I would suggest to adapt test_bulk_load_with_different_sizes - it currently is already very similar to a prop test and can easily be extended to also include the parallel version.

georust / rstar

Parallel bulk loading #107