Closed e4lam closed 4 years ago
I can look into that, yeah. Any guiding points though? What classes to use, some small examples, etc?
Hmm, the way you go about doing tasks is a bit different though in TBB. A low-level task scheduler example can be found here: https://software.intel.com/en-us/node/506102 But most people use the higher-level constructs like this instead: https://software.intel.com/en-us/node/506057
Or use the graph parallelism instead: https://software.intel.com/en-us/node/517341
Yeah, higher-level constructs or graph parallelism are not something we can properly add to the benchmark results (since higher-level abstractions differ or don't even exist for all the libraries in the comparison). The low-level task scheduler seems promising, but it's kinda different from regular task schedulers/thread pools anyway. Is there any chance you could send me a gist of how you see the empty-repost benchmark using TBB (you can find it at https://github.com/dkormalev/asynqro/blob/develop/benchmarks/empty-repost/asynqro.cpp ) and I'll take it from there?
Actually, I assume it is easier to look at the boost.asio one (https://github.com/dkormalev/asynqro/blob/develop/benchmarks/empty-repost/boostasio.cpp). The asynqro benchmark has some extra stuff being benchmarked (with/without futures usage, task types and so on).
@e4lam ok, seems like it was pretty easy. I'm still in the process (only 2 of 4 benchmarks written), but TBB shows really good results with a few quirks (I assume they are by design). Stay tuned: I need to finish the code and then run these benchmarks (it takes a couple of days to go through the whole suite), so there's a good chance I'll have results sometime next week :)
Ah, I was just getting back to this! I'll post a simple example anyway. I got sidetracked playing with vcpkg to load the TBB dependency, etc. :)
As requested, here's a simple example for empty-repost.
The thing to note here is that no one really uses TBB in this way in the first place, because different tasks would usually be reposted, but I converted your benchmark as best I could. It might be more interesting to see more practical use cases benchmarked, because I imagine that different schedulers will work better/worse with different parallel algorithms.
#include <tbb/tbb.h>
#include <tbb/task.h>
#include <tbb/enumerable_thread_specific.h>
#include <tbb/global_control.h>
#include <chrono>
#include <iostream>
#include <thread>
#include <sstream>
#ifndef CONCURRENCY
# define CONCURRENCY 4
#endif
#ifndef JOBS_COUNT
# define JOBS_COUNT 1000000
#endif
#define DEBUG_CHECK 1
#if DEBUG_CHECK
tbb::enumerable_thread_specific<size_t> theCounters;
#endif
struct RepostJob : public tbb::task
{
    volatile size_t counter = 0;
    long long int begin_count;

    RepostJob()
    {
        begin_count = std::chrono::high_resolution_clock::now().time_since_epoch().count();
    }

    tbb::task*
    execute() override
    {
        if (++counter < JOBS_COUNT) {
            // Re-run this same task object instead of allocating a new one
            this->increment_ref_count();
            this->recycle_as_safe_continuation();
        } else {
            long long int end_count = std::chrono::high_resolution_clock::now().time_since_epoch().count();
            // output as one string to avoid intermixing output
            std::stringstream s;
            s << "reposted " << counter << " in " << (double)(end_count - begin_count) / (double)1000000
              << " ms" << std::endl;
            std::cout << s.str();
#if DEBUG_CHECK
            theCounters.local() += counter;
#endif
        }
        return nullptr;
    }
};

int main(int, const char* [])
{
    std::cout << "Benchmark job repost (empty): " << CONCURRENCY << "/" << JOBS_COUNT << std::endl;
    {
        tbb::global_control c(tbb::global_control::max_allowed_parallelism, CONCURRENCY);
        std::cout << "***TBB thread pool***" << std::endl;
        // Spawn jobs via parallel_for with simple_partitioner so that it's more
        // likely to run each job in its own thread.
        using Range = tbb::blocked_range<int>;
        tbb::parallel_for(Range(0, CONCURRENCY), [](const Range&) {
            RepostJob& job = *new (tbb::task::allocate_root()) RepostJob;
            tbb::task::spawn_root_and_wait(job);
        }, tbb::simple_partitioner());
#if DEBUG_CHECK
        // Check to see that we actually ran the tasks on separate threads
        std::cout << "DEBUG_CHECK\n";
        for (size_t c : theCounters)
            std::cout << " " << c << std::endl;
#endif
    }
    return 0;
}
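A side note on the timing arithmetic in the example above: dividing the raw tick count by 1000000 yields milliseconds only if high_resolution_clock ticks in nanoseconds, which is common but not guaranteed by the standard. A minimal sketch of a period-independent conversion follows; the helper names to_milliseconds and time_ms are hypothetical and not part of the benchmark or of TBB.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: convert any std::chrono duration to fractional
// milliseconds without assuming the clock's tick period.
template <class Rep, class Period>
double to_milliseconds(std::chrono::duration<Rep, Period> d)
{
    return std::chrono::duration<double, std::milli>(d).count();
}

// Hypothetical helper: run a callable once and return the elapsed
// wall time in milliseconds, measured with steady_clock.
template <class F>
double time_ms(F&& f)
{
    const auto begin = std::chrono::steady_clock::now();
    std::forward<F>(f)();
    return to_milliseconds(std::chrono::steady_clock::now() - begin);
}
```

In the job above, this would amount to storing a time_point in the constructor and printing to_milliseconds(now - begin) at the end, rather than storing raw counts.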
"The thing to note here is that no one really uses TBB in this way" - yes, exactly. This is not a benchmark that shows how fast this or that scheduler works. It is more about overheads and scheduling efficiency. Thank you, @e4lam, for the example. I didn't take it as is, but it gave me a better understanding of TBB, and I will add TBB to the benchmarks section shortly. I didn't use continuations and parallel_for, to keep it as close to the other systems as I could (asynqro and QtConcurrent provide something similar to parallel_for, for example, and asynqro provides very expressive continuations). I know that there is a huge temptation to modify the benchmark to be fully idiomatic and follow best practices, but that can lead to the point where the results are simply incomparable. In fact, all these benchmarks were derived from the boostasio vs. threadpoolcpp benchmarks I saw in the threadpoolcpp repo, so asynqro and qtconcurrent were added to fit their style as closely as possible. I did the same thing with TBB. I don't intend to fully compare different schedulers; I'd say that is almost impossible. The main point of these benchmarks is to show where the problems with asynqro can be and how much overhead it has in comparison with other schedulers.
@dkormalev Sure thing. So what did you do in particular for this benchmark that was different from the way I approached it?
It is mostly about using enqueue instead of both the recycling and parallel_for. According to the documentation it should work roughly the same as in the other systems' benchmarks. Again, the idea of the benchmark is not to test how fast we can repost the same task. It is mostly about how fast we can repost a lot of tasks concurrently. Ideally they should all be different, of course, but for the sake of simplicity we just use the same class and send it again and again (in fact we wouldn't need the class at all for any system except TBB, but for the sake of a unified benchmark it is used here).
The motivation for work-stealing schedulers (e.g. Intel TBB, Microsoft PPL, Apple GCD) is to not use a shared queue. The shared queue in TBB that enqueue() uses is there for certain patterns, but the use cases for it in practice are relatively rare. I presume that your enqueue is only for the initial tasks and that the repost does regular continuation spawning? In real life, it's important to spawn tasks in parallel rather than serially for performance.
Not all of the benchmarks do the "continuation" (i.e. send tasks from tasks). But if you truly insist I can add it as an extra row to the systems table. The reasoning behind that is that I do it for asynqro as well: there is a line for "fair" shared scheduling (Intensive tasks), and there is ThreadBound, which works similarly to tbb::spawn (except for the stealing part, due to the nature and binding contract of ThreadBound). I would like to reiterate: I have no intention to benchmark whatever one could call real-life cases, simply because there is no such thing as a common case for a generic library. This discussion shows that very well, actually :) You are trying to insist on using high-level abstractions such as parallel_for or continuations, which I could pretty easily do with QtConcurrent or with asynqro. It is harder with asio; I'm not an expert in it, but I believe there should be a way. It is impossible with threadpoolcpp. I don't do that simply because my intention is not to create the fastest ever thread pool or task scheduler (those honors go to threadpoolcpp, nobody can beat it as far as I know, and TBB also lags behind it a lot). My intention is to create a fast enough task scheduler with a user-friendly API and little boilerplate. By fast enough I mean that it should be comparable to big players such as asio, TBB or qtconcurrent, and consistently so across all of them. It should also be predictable, with known bottlenecks (that is what the empty-repost benchmark is for, for example, so users know what to beware of).
Sorry, perhaps when I think about benchmarks, I think about comparing speed, hence my comments about performance. I'm actually more focused on low-level details than high-level abstractions. I think the point I was trying to make (poorly) is that if you're calling "enqueue" in TBB, my understanding is that the operation posts to a shared queue, which then requires contention to pop work off that queue, compared to posting to a thread-local queue.
So now that you mention thread-pool-cpp, when I look at benchmarks/empty-repost/threadpoolcpp.cpp, it seems to me that the post(*this) call is actually posting to a per-thread queue instead. So the thread-pool-cpp benchmark seems to be doing something more like what my example was doing? This is just based on my cursory reading of the code in https://github.com/inkooboo/thread-pool-cpp/tree/master/include/thread_pool though. That project doesn't look like it has much in the way of docs(?).
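To make the shared-queue versus per-thread-queue distinction concrete, here is a deliberately simplified sketch using only the standard library. It is not how TBB or thread-pool-cpp are implemented; SharedQueue and PerWorkerQueues are hypothetical names, and real work-stealing schedulers use more refined deques and stealing policies. The only point is where the lock contention lands: one mutex for everyone, or one mutex per worker that other workers touch only when they run dry.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

using Job = std::function<void()>;

// One queue, one lock: every post and every pop from any thread
// contends on the same mutex.
class SharedQueue {
public:
    void post(Job job)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        jobs_.push_back(std::move(job));
    }

    bool try_pop(Job& out)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (jobs_.empty())
            return false;
        out = std::move(jobs_.front());
        jobs_.pop_front();
        return true;
    }

private:
    std::mutex mutex_;
    std::deque<Job> jobs_;
};

// One queue per worker: a worker normally posts to and pops from its own
// slot, and only looks at other slots ("steals") when its own is empty.
class PerWorkerQueues {
public:
    explicit PerWorkerQueues(std::size_t workers) : slots_(workers) {}

    void post(std::size_t worker, Job job) { slots_[worker].post(std::move(job)); }

    bool try_pop(std::size_t worker, Job& out)
    {
        if (slots_[worker].try_pop(out))
            return true;
        // Own queue is empty: try the other workers' queues in turn.
        for (std::size_t i = 1; i < slots_.size(); ++i) {
            if (slots_[(worker + i) % slots_.size()].try_pop(out))
                return true;
        }
        return false;
    }

private:
    std::vector<SharedQueue> slots_;
};
```

With the second layout, a job that reposts itself from worker N keeps hitting slot N only, which is roughly the behaviour being contrasted with enqueue() in this thread.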
As I already mentioned, I'm adding both enqueue and spawn benchmarks for TBB, as you requested. As for thread-pool-cpp, it is more of a toy project where inkooboo tested some ideas. It is in my benchmarks to show what the bottom line can be in terms of task scheduling (which comes at a great cost in the features it provides). It has only one mode of adding jobs, and you are right: they are added directly to the thread queues. But it is blazing fast, that is for sure.
https://github.com/dkormalev/asynqro/commit/e7361de043bccc9192a52b694ff0baf72c4cbcee - and here goes the commit with both enqueue and spawn benchmarks added to the grid :) Thank you, @e4lam, for the great idea and the help with understanding TBB.
Thanks for adding these benchmarks!
Interesting numbers. In the empty-repost benchmarks, I didn't realize that the reposted tasks were actually copied? Avoiding that copy was something I took great pains to figure out how to do in TBB for the benchmark.
Copying is the main idea of this benchmark; it would be unfair to remove it from the TBB benchmark only. A repost represents an entirely new task; giving it the same body is just for source code brevity, that's all. The idea of the repost benchmark is "how fast can one add new tasks from multiple threads". It is simplified by using the same instructions in the task body, but, again, that is the same for all of the systems being benchmarked.
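The "repost as a copy" idea can be sketched without any scheduler at all. The following is a single-threaded stand-in, assuming a plain std::deque in place of a real pool; RepostJob here and run_reposts are hypothetical names and this is not the code from the asynqro benchmarks. Each execution that has not yet reached the limit pushes a fresh copy of the job, so every repost is a new task object rather than a recycled one.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

using Job = std::function<void()>;

// A job that reposts itself by copy: the queue receives a new task object
// carrying the current counter, and the original is simply dropped.
struct RepostJob {
    std::deque<Job>* queue;
    std::size_t counter;
    std::size_t limit;

    void operator()()
    {
        if (++counter < limit)
            queue->push_back(*this); // copy construction = "new task"
    }
};

// Drain the queue on the calling thread and return how many jobs ran.
std::size_t run_reposts(std::size_t limit)
{
    std::deque<Job> queue;
    queue.push_back(RepostJob{&queue, 0, limit});

    std::size_t executed = 0;
    while (!queue.empty()) {
        // Move the job out before running it, since running it may
        // push to the queue and invalidate references into it.
        Job job = std::move(queue.front());
        queue.pop_front();
        job();
        ++executed;
    }
    return executed;
}
```

In a real benchmark the push would go to the scheduler's post/enqueue call from several threads at once, and that per-repost cost is what the numbers discussed here are meant to expose.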
How does this compare with Intel TBB? It might be worthwhile to compare in your benchmark?