anivegesana commented 1 year ago

This PR uses a simplistic backport of the atomic wait primitive from C++20. It adds an enabled variable to the scheduler which determines whether the threads are paused or are able to run jobs. The fork-join scheduler defaults to being paused and turns on whenever pardo or parfor are called. At the completion of all pardo and parfor calls, the scheduler is automatically paused to prevent busy waiting in the case that there are no jobs to run.

This code has not yet been tested for efficiency. Help would be appreciated.

Fixes #33

DanielLiamAnderson commented 1 year ago

Thanks for putting this together. I tried running the benchmarks, but it seems to run into an issue where sometimes (but not consistently), the program fails to terminate. In particular, when I build and run bench_examples, the benchmarks all complete, but then the program stalls forever without terminating. This suggests that the scheduler perhaps is getting stuck during termination. (see image. Program is stuck not terminating at the end)

Assuming this can be fixed, there's still some other things that I might be concerned about:

Performance

I haven't managed to check the performance yet because of the aforementioned issue, but I suspect that the change may introduce a performance penalty. Specifically, doing an atomic increment/decrement on every pardo could introduce a lot of contention (https://github.com/cmuparlay/parlaylib/pull/34/files#diff-6eda068e7db43c78bd6d29aeab46d53e15f28d2d2f672f422f12e0f0ee5ad0a6R333)
I'm not sure how efficient condition variables are, but I believe that modern std::atomic::wait for C++20 is typically implemented using OS-specific primitives like futexes to make it more efficient than a condition variable.

Style

I'm not a fan of inlining the std::atomic child class. I'd at least put this in a separate file (especially since the type would be replaced when we upgrade to C++20 anyway).

anivegesana commented 1 year ago

Hi Daniel. Sorry for the confusion. I marked this PR as a draft because I had a working version that compiles with C++20, but was not able to figure out how to get it to work in C++17. You can test it by changing the 17 in this line to a 20.

https://github.com/anivegesana/parlaylib/blob/master/CMakeLists.txt#L53

I primarily need help in creating a small lightweight backport for the atomic_wait feature. The following repo works, but it isn't exactly lightweight.

https://github.com/ogiroux/atomic_wait

As for speed, a PARLAY_INTERACTIVE macro can be added that controls whether or not the increment/decrement happens. This way, if std::cin is not used, then Parlay can just be compiled without this macro and keep the original performance. If std::cin is used, then Parlay must compiled with -DPARLAY_INTERACTIVE and every pardo/parfor will check to see if the scheduler is off and turn it on if it isn't on already, as well as turn it off if there are no tasks left.

If you guys are planning on updating to C++20 in the future, then maybe we can just wait on this PR and can remove the ugly hacks that I added trying to get it to work on C++17. We found a workaround and don't need this feature merged. I made this PR use code from the repo above and it passes the tests now.

mkdir build
cd build
cmake .. "-DCMAKE_CXX_FLAGS=-g -DPARLAY_INTERACTIVE=ON"
make

DanielLiamAnderson commented 1 year ago

I think we should actually try to improve the scheduler's performance in this regard. I don't think having a macro that slows the library down is a great solution. We can probably make it fast and keep the benefits with some more work.

shwestrick commented 1 year ago

If it's helpful, both OpenCilk and Rayon appear to be doing something that puts workers to sleep. Presumably they've already gone through the work of verifying that these tricks don't impact the performance of the scheduler too much, so it could be worth trying to adapt one of these for Parlay:

DanielLiamAnderson commented 1 year ago

Thanks for the pointers! This is the kind of thing we'd like to implement too.

anivegesana commented 1 year ago

Hmm. They seem to do almost the same sort of thing I did. I notice some differences in the memory order of the atomic operations, so it is possible that that is what is causing the inefficiencies. I will look into it.

DanielLiamAnderson commented 1 year ago

I think a big difference is that your code is all-or-nothing. If there's any parallel jobs, then all workers are woken up. But if there are just a small number of jobs, then ideally only a small number of workers will be awake.

DanielLiamAnderson commented 1 year ago

For some concrete data, here is the performance impact of the current change. See attachment. Overall, about 20-35% slower on our main benchmark set.

PR34.zip

anivegesana commented 1 year ago

Hey Daniel,

I updated the code to use a different memory order that seems more reasonable, but I still need to run it on the benchmark set.

codecov-commenter commented 1 year ago

Codecov Report

Merging #34 (0843bf2) into master (6e12b06) will increase coverage by 0.04%. The diff coverage is 100.00%.

:mega: This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@            Coverage Diff             @@
##           master      #34      +/-   ##
==========================================
+ Coverage   96.62%   96.66%   +0.04%     
==========================================
  Files          50       51       +1     
  Lines        3374     3421      +47     
==========================================
+ Hits         3260     3307      +47     
  Misses        114      114

Impacted Files	Coverage Δ
include/parlay/internal/atomic_wait.h	`100.00% <100.00%> (ø)`
include/parlay/scheduler.h	`98.65% <100.00%> (+0.13%)`	:arrow_up:

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

DanielLiamAnderson commented 1 year ago

I've implemented a similar solution in #52 that solves the same problem but with better performance on the benchmarks. Thanks for the idea and getting this started!

cmuparlay / parlaylib

Remove busy waiting in homegrown scheduler when no jobs are spawned #34

Performance

Style

Codecov Report