Closed anivegesana closed 1 year ago
Thanks for putting this together. I tried running the benchmarks, but it seems to run into an issue where sometimes (but not consistently), the program fails to terminate. In particular, when I build and run bench_examples
, the benchmarks all complete, but then the program stalls forever without terminating. This suggests that the scheduler perhaps is getting stuck during termination. (see image. Program is stuck not terminating at the end)
Assuming this can be fixed, there's still some other things that I might be concerned about:
std::atomic::wait
for C++20 is typically implemented using OS-specific primitives like futexes to make it more efficient than a condition variable.Hi Daniel. Sorry for the confusion. I marked this PR as a draft because I had a working version that compiles with C++20, but was not able to figure out how to get it to work in C++17. You can test it by changing the 17 in this line to a 20.
https://github.com/anivegesana/parlaylib/blob/master/CMakeLists.txt#L53
I primarily need help in creating a small lightweight backport for the atomic_wait feature. The following repo works, but it isn't exactly lightweight.
https://github.com/ogiroux/atomic_wait
As for speed, a PARLAY_INTERACTIVE
macro can be added that controls whether or not the increment/decrement happens. This way, if std::cin
is not used, then Parlay can just be compiled without this macro and keep the original performance. If std::cin
is used, then Parlay must compiled with -DPARLAY_INTERACTIVE
and every pardo
/parfor
will check to see if the scheduler is off and turn it on if it isn't on already, as well as turn it off if there are no tasks left.
If you guys are planning on updating to C++20 in the future, then maybe we can just wait on this PR and can remove the ugly hacks that I added trying to get it to work on C++17. We found a workaround and don't need this feature merged. I made this PR use code from the repo above and it passes the tests now.
mkdir build
cd build
cmake .. "-DCMAKE_CXX_FLAGS=-g -DPARLAY_INTERACTIVE=ON"
make
I think we should actually try to improve the scheduler's performance in this regard. I don't think having a macro that slows the library down is a great solution. We can probably make it fast and keep the benefits with some more work.
If it's helpful, both OpenCilk and Rayon appear to be doing something that puts workers to sleep. Presumably they've already gone through the work of verifying that these tricks don't impact the performance of the scheduler too much, so it could be worth trying to adapt one of these for Parlay:
Thanks for the pointers! This is the kind of thing we'd like to implement too.
Hmm. They seem to do almost the same sort of thing I did. I notice some differences in the memory order of the atomic operations, so it is possible that that is what is causing the inefficiencies. I will look into it.
I think a big difference is that your code is all-or-nothing. If there's any parallel jobs, then all workers are woken up. But if there are just a small number of jobs, then ideally only a small number of workers will be awake.
For some concrete data, here is the performance impact of the current change. See attachment. Overall, about 20-35% slower on our main benchmark set.
Hey Daniel,
I updated the code to use a different memory order that seems more reasonable, but I still need to run it on the benchmark set.
Merging #34 (0843bf2) into master (6e12b06) will increase coverage by
0.04%
. The diff coverage is100.00%
.
:mega: This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more
@@ Coverage Diff @@
## master #34 +/- ##
==========================================
+ Coverage 96.62% 96.66% +0.04%
==========================================
Files 50 51 +1
Lines 3374 3421 +47
==========================================
+ Hits 3260 3307 +47
Misses 114 114
Impacted Files | Coverage Δ | |
---|---|---|
include/parlay/internal/atomic_wait.h | 100.00% <100.00%> (ø) |
|
include/parlay/scheduler.h | 98.65% <100.00%> (+0.13%) |
:arrow_up: |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
I've implemented a similar solution in #52 that solves the same problem but with better performance on the benchmarks. Thanks for the idea and getting this started!
This PR uses a simplistic backport of the atomic wait primitive from C++20. It adds an
enabled
variable to the scheduler which determines whether the threads are paused or are able to run jobs. The fork-join scheduler defaults to being paused and turns on wheneverpardo
orparfor
are called. At the completion of allpardo
andparfor
calls, the scheduler is automatically paused to prevent busy waiting in the case that there are no jobs to run.This code has not yet been tested for efficiency. Help would be appreciated.
Fixes #33