STEllAR-GROUP / hpx

The C++ Standard Library for Parallelism and Concurrency
https://hpx.stellar-group.org
Boost Software License 1.0

Not all worker threads are working #963

Closed eschnett closed 10 years ago

eschnett commented 10 years ago

I want to extract the true layout of HPX threads on the cores. For this, I start one async operation per OS thread (hpx::get_os_thread_count), and then wait in these threads until all such threads are running. Since nothing else is going on at the same time, I expect this to work and not to deadlock.

However, this works only in about 50% of the cases; at other times it deadlocks, and e.g. worker thread #2 (hpx::get_worker_thread_num) never joins the waiting threads.

What is going on here? What is thread #2 doing in this case? Why are only 3 threads out of 4 doing any work?

top reports 400% CPU usage.

I start my application via

./bin/block_matrix --hpx:1:pu-offset=8 --hpx:threads=4 --hpx:pu-step=2 --hpx:print-bind --hpx:debug-hpx-log=block_matrix.log
sithhell commented 10 years ago

Does the hello_world quickstart example work for you? It does exactly what you described.

eschnett commented 10 years ago

While looking at the HPX scheduler, I found this comment:

            // we allow the thread on the boundary of the NUMA domain to steal

(local_priority_queue_scheduler.hpp, line 731).

What does "boundary of a NUMA domain" mean? To my knowledge, all cores in a NUMA domain are equivalent; and thus the concept of a "boundary" doesn't make sense. In particular, the "first" core in a NUMA domain (numbered in what way?) cannot access other NUMA domains any quicker than its siblings.

What do I misunderstand?

At run time, I see the following in a debugger:

numa_domain_masks_ = {5592405, 5592405, 5592405, 5592405, 5592405, 5592405}
outside_numa_domain_masks_ = {11184810, 0, 0, 0, 0, 0}

which means that (as the comment indicates) thread 0 can steal from the other NUMA domain, while threads 2 to 5 cannot.

I am also not sure that the NUMA domains are calculated correctly. In this case, hyperthreading is active, and the odd-numbered PUs should simply remain unused (./bin/block_matrix --hpx:1:pu-offset=12 --hpx:threads=6 --hpx:pu-step=2 --hpx:print-bind).

Does HPX use internally the logical or the physical PU numbering scheme of hwloc?

In this case, there are 6 threads active, and there are 6 queues. HPX will thus only test the lower 6 bits of these masks. Given that the masks have an even/odd bit pattern, this means that HPX acts as if the cores were distributed over two NUMA domains. This is not so.
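To make this concrete, here is a small standalone illustration (plain C++, not HPX internals) that tests the lower 6 bits of the two masks quoted above; the alternating bit pattern is what makes the 6 queues look as if they were spread over two NUMA domains:

#include <cstdint>
#include <cstdio>

int main()
{
  const std::uint64_t numa_domain_mask = 5592405;          // == 0x555555
  const std::uint64_t outside_numa_domain_mask = 11184810; // == 0xAAAAAA
  for (int queue = 0; queue < 6; ++queue) {
    const bool inside  = numa_domain_mask & (1ull << queue);
    const bool outside = outside_numa_domain_mask & (1ull << queue);
    std::printf("queue %d: in-domain bit %d, outside-domain bit %d\n",
                queue, (int)inside, (int)outside);
  }
  // Even-numbered queues appear "inside" the domain, odd-numbered queues
  // "outside" -- as if the six cores were split across two NUMA domains.
  return 0;
}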

eschnett commented 10 years ago

Note: I am using the fixing_588 branch.

eschnett commented 10 years ago

Yes, hello world works for me:

make examples
$ ./bin/hello_world -t12
hello world from OS-thread 6 on locality 0
hello world from OS-thread 0 on locality 0
hello world from OS-thread 11 on locality 0
hello world from OS-thread 9 on locality 0
hello world from OS-thread 1 on locality 0
hello world from OS-thread 7 on locality 0
hello world from OS-thread 8 on locality 0
hello world from OS-thread 3 on locality 0
hello world from OS-thread 5 on locality 0
hello world from OS-thread 4 on locality 0
hello world from OS-thread 10 on locality 0
hello world from OS-thread 2 on locality 0
hkaiser commented 10 years ago

While looking at the HPX scheduler, I found this comment: // we allow the thread on the boundary of the NUMA domain to steal

(local_priority_queue_scheduler.hpp, line 731).

What does "boundary of a NUMA domain" mean? To my knowledge, all cores in a NUMA domain are equivalent; and thus the concept of a "boundary" doesn't make sense. In particular, the "first" core in a NUMA domain (numbered in what way?) cannot access other NUMA domains any quicker than its siblings.

What do I misunderstand?

All this code is supposed to do is allow only single cores in a NUMA domain to steal from neighboring NUMA domains, not all cores. It does not matter which cores are designated for stealing from neighboring NUMA domains; we randomly selected the 'boundary' cores for this.

At run time, I see the following in a debugger: numa_domain_masks_ = {5592405, 5592405, 5592405, 5592405, 5592405, 5592405} outside_numa_domain_masks_ = {11184810, 0, 0, 0, 0, 0}

Just for the record, this is equivalent to

numa_domain_masks_ = {0x555555, 0x555555, 0x555555, 0x555555, 0x555555, 0x555555} outside_numa_domain_masks_ = {0xAAAAAA, 0, 0, 0, 0, 0}

which is clearly wrong. I am not sure if this is caused by the recent changes to fix #421 (are you on the master branch with this?). In any case, I'll investigate.

which means that (as the comment indicates) thread 0 can steal from the other NUMA domain, while threads 2 to 5 cannot.

I am also not sure that the NUMA domains are calculated correctly. In this case, hyperthreading is active, and the odd-numbered PUs should simply remain unused (./bin/block_matrix --hpx:1:pu-offset=12 --hpx:threads=6 --hpx:pu-step=2 --hpx:print-bind).

The masks for the NUMA domains are what we receive from HWLOC. What are you trying to achieve with pu-offset=12? How many cores does the system you're testing on have? Please attach the output of lstopo to this ticket.

Does HPX use internally the logical or the physical PU numbering scheme of hwloc?

I'm not sure how that matters in this case.

In this case, there are 6 threads active, and there are 6 queues. HPX will thus only test the lower 6 bits of these masks. Given that the masks have an even/odd bit pattern, this means that HPX acts as if the cores were distributed over two NUMA domains. This is not so.

Sorry, you lost me.

eschnett commented 10 years ago

This is on the fixing_588 branch. I heard in the meantime that there are recent changes on the master branch in this respect.

If only a single core in a NUMA domain can steal from a neighbouring NUMA domain, does this mean that the other cores will then be idle? How would they obtain work if they cannot steal it?

My system has 12 cores with hyperthreading. Setting the pu offset to 12 ensures that, when using 2 localities, the second locality uses the other NUMA domain. However, in this case I am running only with a single locality, so this offset for locality 1 is irrelevant.

PUs can be numbered physically or logically. In the logical numbering, numbers that are close together indicate PUs that are close together; e.g. the 2 PUs making up a core are numbered n and n+1 (with even n). The physical numbering scheme may differ from this, and there may be no "logic" in it at all. A mask of 0x5555 is clearly wrong under logical PU numbering, but under physical PU numbering it may be correct.
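For reference, a minimal hwloc sketch (independent of HPX) that prints the two numbering schemes side by side; the L#/P# notation matches what lstopo and the --hpx:print-bind output later in this thread use:

#include <hwloc.h>
#include <cstdio>

int main()
{
  hwloc_topology_t topo;
  hwloc_topology_init(&topo);
  hwloc_topology_load(topo);

  const int npus = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
  for (int i = 0; i < npus; ++i) {
    // i is the logical index; os_index is the OS (physical) index
    hwloc_obj_t pu = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PU, i);
    std::printf("PU L#%u <-> P#%u\n", pu->logical_index, pu->os_index);
  }

  hwloc_topology_destroy(topo);
  return 0;
}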

sithhell commented 10 years ago

I am absolutely sure that the work stealing inside NUMA domains is not the problem here. Hello world runs, which does exactly the same thing as described in the original issue. That doesn't mean that the calculated NUMA affinity masks are correct though. Investigating.

sithhell commented 10 years ago

To answer your question about the indices used in HPX, there are 3 numberings used inside of HPX:

1) The worker thread numbers (going from 0 to N-1).
2) The logical PU numbers on which hpx::threads::hwloc_topology operates (you can get a mapping from worker thread number to logical PU number with the function affinity_data::get_pu_num(num_worker_thread)).
3) The index of the bit which is set in the masks; this relates to the physical index of the respective PU.

There is no function in HPX to get from the physical index back to the logical one, because it is not really needed; it is only used for binding threads. All other functionality operating on masks should compare the masks themselves, not the physical index.

eschnett commented 10 years ago

The hello world example makes many attempts to schedule a thread onto a given worker thread. There is thus no guarantee that all N threads will be active simultaneously.

In my case, I start N = num_os_threads() threads from main, and then wait_all for all these threads. Nothing else is running at that time. I then use an atomic variable and busy-waiting to ensure that all N threads meet at the same point, i.e. that they are all running simultaneously.

This deadlocks half of the time. I was trying to investigate where the missing thread is, and as it turns out, it is stuck in HPX's scheduler.

Starting the threads:

  int nthreads = hpx::get_os_thread_count();
  auto here = hpx::find_here();
  std::vector<hpx::future<std::pair<int,std::string>>> fs;
  hwloc_count = 0;   // a global atomic counter (declared elsewhere)
  for (int thread=0; thread<nthreads; ++thread) {
    fs.push_back(hpx::async(hwloc_run_action(), here));
  }
  hpx::wait_all(fs);

Busy waiting:

std::pair<int,std::string> hwloc_run()
{
  std::ostringstream os;
  const int thread = hpx::get_worker_thread_num();
  const int nthreads = hpx::get_os_thread_count();
  // Wait until all threads are running to ensure all threads are
  // running simultaneously, and thus are running on different PUs
  assert(hwloc_count < nthreads);
  ++hwloc_count;
  while (hwloc_count < nthreads) {
    // do nothing
  }
  assert(hwloc_count == nthreads);
sithhell commented 10 years ago

I see. Ok. I see how this could trigger the error you are seeing. The problem is indeed our work stealing algorithm. In order to reduce contention, we only let one core in a NUMA domain steal from other NUMA domains. Now, this particular core might end up doing this busy wait. Thus, this NUMA domain will not receive any new work from other NUMA domains. When running on a single NUMA domain, this shouldn't be a problem. The multiple NUMA domain case needs to be investigated though. I am not sure yet what the proper solution here might be. The problem with the wrong NUMA affinities should be solved on the master branch now. Just trying to merge the changes here into fixing_588.

eschnett commented 10 years ago

In this case, I was running on a single NUMA domain. I'll try again once NUMA domains are handled correctly.

Before using busy-waiting, I scheduled some work, about one second of calculations for each thread. My assumption was that this should give the HPX scheduler sufficient time to ensure that all threads have found some work. This didn't happen (for the reason described above). My point here is that one second is a very long time for a CPU, and after this time I would expect threads to have found some work, even if they have to steal it from another NUMA domain. This should hold even if you consider busy-waiting on a manual barrier to be an abuse of HPX.

On another note: I typically run one locality per NUMA domain. If I have a locality crossing multiple NUMA domains, then I would like to ignore NUMA boundaries as much as possible. For example, I am not allocating memory in a NUMA-aware fashion, so the notion of a thread "belonging" to a NUMA domain doesn't make sense. Therefore, I want to disable NUMA awareness in the scheduler. How can I do this?

sithhell commented 10 years ago

Just thought a bit more about the initial issue at hand, and I think it's not solvable: the code relies on an implementation detail of a specific scheduler. There are schedulers which don't do work stealing at all and would make your code hang again.

sithhell commented 10 years ago

On 18.10.2013 17:03, "Erik Schnetter" notifications@github.com wrote:

In this case, I was running on a single NUMA domain. I'll try again once NUMA domains are handled correctly.

Please do. I am pretty confident that it works now with the default scheduler.

Before using busy-waiting, I scheduled some work, about one second of calculations for each thread. My assumption was that this should give the HPX scheduler sufficient time to ensure that all threads have found some work. This didn't happen (for the reason described above). My point here is that one second is a very long time for a CPU, and after this time I would expect threads to have found some work, even if they have to steal it from another NUMA domain. This should hold even if you consider busy-waiting on a manual barrier to be an abuse of HPX.

You certainly have a point here. However, what you describe above is what we consider not enough parallelism, and as such resources are starving. In other words, your granularity is too coarse. It is common practice, and even encouraged, to oversubscribe with as many HPX threads as possible to ensure that all cores have enough work. A good mean runtime for HPX threads is on the order of a few tens of microseconds.
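To illustrate the point about granularity, a hedged sketch (the function name and chunk count are made up for illustration) that splits the work into many short HPX tasks instead of exactly N long-running ones, so the scheduler has something to balance:

#include <hpx/hpx.hpp>
#include <hpx/hpx_main.hpp>

#include <cstddef>
#include <vector>

// A placeholder work item of roughly tens of microseconds
double small_chunk(std::size_t i)
{
  double sum = 0.0;
  for (std::size_t k = 0; k < 10000; ++k)
    sum += static_cast<double>(i + k) * 1.0e-6;
  return sum;
}

int main()
{
  const std::size_t num_chunks = 100000;   // many more tasks than cores

  std::vector<hpx::future<double>> fs;
  fs.reserve(num_chunks);
  for (std::size_t i = 0; i != num_chunks; ++i)
    fs.push_back(hpx::async(&small_chunk, i));

  hpx::wait_all(fs);
  return 0;
}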

On another note: I typically run one locality per NUMA domain. If I have a locality crossing multiple NUMA domains, then I would like to ignore NUMA boundaries as much as possible. For example, I am not allocating memory in a NUMA-aware fashion, so the notion of a thread "belonging" to a NUMA domain doesn't make sense. Therefore, I want to disable NUMA awareness in the scheduler. How can I do this?


eschnett commented 10 years ago

You are saying that having N threads on N cores is not enough parallelism? That is an interesting comment.

I don't see how having more and shorter-running threads (that may still all be located in a single NUMA domain) would help HPX here.

I guess the proof of the pudding is in the eating. In the end, I will look at timers telling me how much time threads are spending waiting in the scheduler vs. running application code.

How can I disable NUMA awareness in the scheduler?

hkaiser commented 10 years ago

Erik, generally you should not try to outsmart the scheduler (you don't try doing this with the Linux scheduler, do you?). Just let the scheduler do its work and be done with it. Moreover, I agree with Thomas: doing busy waiting in an HPX thread is an inherently bad idea, as this blocks the whole core and prevents all other necessary background work from being executed. Also, HPX may select different schedulers than the default one on different platforms (for instance, on the Xeon Phi the static non-work-stealing scheduler seems to work best). So you may get different scheduler characteristics on different systems, which may contradict your own attempts to control things.
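As an aside, if some form of barrier-like wait is unavoidable inside an HPX thread, a sketch of a more cooperative variant (assuming a shared std::atomic<int> counter like the one in the earlier snippet) would yield back to the scheduler inside the loop instead of spinning hot:

#include <hpx/hpx.hpp>

#include <atomic>

std::atomic<int> arrived(0);   // assumed shared counter, as in the earlier snippet

void cooperative_wait(int nthreads)
{
  ++arrived;
  while (arrived.load() < nthreads) {
    // Give the core back to the HPX scheduler so background work and
    // other HPX threads can still run while we wait.
    hpx::this_thread::yield();
  }
}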

I also think that you simply don't want to disable NUMA awareness. We implemented the current behavior for a reason and not out of idle curiosity. If we let all threads in a NUMA domain steal from all other available threads (cores), then the overall contention on the thread queues increases significantly. It also significantly increases cross-NUMA-domain traffic, which additionally limits scalability, etc.

Could you please tell us why you try to do these things, i.e. what you like to achieve?

eschnett commented 10 years ago

At the moment, I want to output the hwloc thread/core bindings. I notice that HPX does not actually do this; when I last looked, there was a separate routine independently interpreting the command line options, and then outputting bindings that depend purely on the command line options. I actually want to see the bindings, as read back from the OS by hwloc.

While implementing this, I found that it is difficult to have all HPX threads do useful work at the same time. This seemed very suspicious to me, and so I investigated.

eschnett commented 10 years ago

Here is a real-world use case:

I have implemented a thread-parallel DGEMM routine in HPX. I now want to benchmark it. To obtain baseline values, I run a known-to-be-efficient DGEMM (e.g. MKL) in two configurations:

It would be nice if it was possible to run the second benchmark with HPX. For convenience, I would really like to run it in the same application. If HPX does not support that, then maybe an OpenMP parallelization would work instead, but this would be rather inconvenient (e.g. need to ensure good thread/core layout manually etc.).

hkaiser commented 10 years ago

It would be nice if it was possible to run the second benchmark with HPX. I would really like to run it in the same application.

Why would HPX not support that? Sorry, I don't understand what you mean.

hkaiser commented 10 years ago

While implementing this, I found that it is difficult to have all HPX threads do useful work at the same time.

How did you figure that out? I have never seen any problems related to keeping all OS-threads busy...

hkaiser commented 10 years ago

I notice that HPX does not actually do this; when I last looked, there was a separate routine independently interpreting the command line options, and then outputting bindings that depend purely on the command line options.

Sure, HPX outputs whatever settings it is going to apply to the OS-threads based on the command line settings. Again, I don't follow. How should it decide what to set if not based on the command line settings?

eschnett commented 10 years ago

Why would HPX not support that? Sorry, I don't understand what you mean.

I mean that HPX (at least on the fixing_588 branch; maybe the mainline is different) does not reliably run N threads simultaneously when using N cores. As described above, it often happens that only N-1 threads are running application code while the Nth OS thread remains in the HPX scheduler, even though there is still 1 application thread waiting to be executed.

How should it decide what to set if not based on the command line settings?

The current state of the fixing_588 branch outputs this with --hpx:print-bind on Stampede when running on a single node with 2 localities with 8 threads each:

*******************************************************************************
locality: 0
   0: PU L#0(P#0), Core L#0(P#0), Socket L#0(P#0), Node L#0(P#0)
   1: PU L#1(P#1), Core L#1(P#1), Socket L#0(P#0), Node L#0(P#0)
   2: PU L#2(P#2), Core L#2(P#2), Socket L#0(P#0), Node L#0(P#0)
   3: PU L#3(P#3), Core L#3(P#3), Socket L#0(P#0), Node L#0(P#0)
   4: PU L#4(P#4), Core L#4(P#4), Socket L#0(P#0), Node L#0(P#0)
   5: PU L#5(P#5), Core L#5(P#5), Socket L#0(P#0), Node L#0(P#0)
   6: PU L#6(P#6), Core L#6(P#6), Socket L#0(P#0), Node L#0(P#0)
   7: PU L#7(P#7), Core L#7(P#7), Socket L#0(P#0), Node L#0(P#0)
*******************************************************************************
locality: 1
   0: PU L#8(P#8), Core L#8(P#0), Socket L#1(P#1), Node L#1(P#1)
   1: PU L#9(P#9), Core L#9(P#1), Socket L#1(P#1), Node L#1(P#1)
   2: PU L#10(P#10), Core L#10(P#2), Socket L#1(P#1), Node L#1(P#1)
   3: PU L#11(P#11), Core L#11(P#3), Socket L#1(P#1), Node L#1(P#1)
   4: PU L#12(P#12), Core L#12(P#4), Socket L#1(P#1), Node L#1(P#1)
   5: PU L#13(P#13), Core L#13(P#5), Socket L#1(P#1), Node L#1(P#1)
   6: PU L#14(P#14), Core L#14(P#6), Socket L#1(P#1), Node L#1(P#1)
   7: PU L#15(P#15), Core L#15(P#7), Socket L#1(P#1), Node L#1(P#1)

This looks perfect! Unfortunately, my benchmark timings are off by a factor of 2. And when I ask hwloc about the actual bindings, I find:

HWLOC information:
    L0 (c438-603.stampede.tacc.utexas.edu#0) T0 (worker-thread#0): PU set L#{0} P#{0}
    L0 (c438-603.stampede.tacc.utexas.edu#0) T3 (worker-thread#3): PU set L#{1} P#{1}
    L0 (c438-603.stampede.tacc.utexas.edu#0) T0 (worker-thread#0): PU set L#{2} P#{2}
    L0 (c438-603.stampede.tacc.utexas.edu#0) T1 (worker-thread#1): PU set L#{3} P#{3}
    L0 (c438-603.stampede.tacc.utexas.edu#0) T6 (worker-thread#6): PU set L#{4} P#{4}
    L0 (c438-603.stampede.tacc.utexas.edu#0) T3 (worker-thread#3): PU set L#{5} P#{5}
    L0 (c438-603.stampede.tacc.utexas.edu#0) T0 (worker-thread#0): PU set L#{6} P#{6}
    L0 (c438-603.stampede.tacc.utexas.edu#0) T2 (worker-thread#2): PU set L#{7} P#{7}
    L1 (c438-603.stampede.tacc.utexas.edu#1) T3 (worker-thread#3): PU set L#{0} P#{0}
    L1 (c438-603.stampede.tacc.utexas.edu#1) T5 (worker-thread#5): PU set L#{1} P#{1}
    L1 (c438-603.stampede.tacc.utexas.edu#1) T3 (worker-thread#3): PU set L#{2} P#{2}
    L1 (c438-603.stampede.tacc.utexas.edu#1) T1 (worker-thread#1): PU set L#{3} P#{3}
    L1 (c438-603.stampede.tacc.utexas.edu#1) T5 (worker-thread#5): PU set L#{4} P#{4}
    L1 (c438-603.stampede.tacc.utexas.edu#1) T6 (worker-thread#6): PU set L#{5} P#{5}
    L1 (c438-603.stampede.tacc.utexas.edu#1) T0 (worker-thread#0): PU set L#{6} P#{6}
    L1 (c438-603.stampede.tacc.utexas.edu#1) T7 (worker-thread#7): PU set L#{7} P#{7}

As we see, HPX allocated both localities to the same set of PUs.

This is not evident from HPX's --hpx:print-bind output. Thus I want a feature to actually query, at run time, what bindings are used, instead of merely outputting what binding should be used. Without my own routine outputting the actual bindings, I would be hard pressed to discover this problem.
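For completeness, a sketch of the kind of check described above (using plain hwloc calls, not an HPX API): called from within a worker thread, it reads back the cpuset the OS actually reports for the calling thread rather than the binding HPX intended to apply:

#include <hwloc.h>

#include <cstdio>
#include <cstdlib>

void print_actual_binding(hwloc_topology_t topo)
{
  hwloc_cpuset_t set = hwloc_bitmap_alloc();
  if (hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_THREAD) == 0) {
    char* str = 0;
    hwloc_bitmap_asprintf(&str, set);   // e.g. "0x00000001"
    std::printf("calling OS thread is bound to cpuset %s\n", str);
    std::free(str);
  }
  hwloc_bitmap_free(set);
}

The topology object would be initialized once with hwloc_topology_init/hwloc_topology_load and shared by all threads.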

sithhell commented 10 years ago

On 19.10.2013 21:51, "Erik Schnetter" notifications@github.com wrote:


This is not evident from HPX's --hpx:print-bind output. Thus I want a feature to actually query, at run time, what bindings are used, instead of merely outputting what binding should be used. Without my own routine outputting the actual bindings, I would be hard pressed to discover this problem.

OK, that is obviously a bug in HPX itself, as print-bind should really output the actual bindings.

hkaiser commented 10 years ago

Why would HPX not support that? Sorry, I don't understand what you mean.

I mean that HPX (at least on the fixing_588 branch; maybe the mainline is different) does not reliably run N threads simultaneously when using N cores. As described above, it often happens that only N-1 threads are running application code while the Nth OS thread remains in the HPX scheduler, even though there is still 1 application thread waiting to be executed.

How do you know that?

hkaiser commented 10 years ago

I'm just going to close this after reading everything again. Apparently, the problem is caused by your code, which busy-waits for the HPX threads to be invoked. If you busy-wait, you prevent the HPX scheduler from doing proper load balancing, so it can't make sure that all threads are executed at the same time on different cores.

Please reopen if my assumption is wrong. In that case, please also add a small use case reproducing your issue. It is way too much effort for me to diagnose these problems using an example consisting of more than 50-100 lines of code.

eschnett commented 10 years ago

The "busy waiting" is a red herring; if you don't like this, let me reproduce this without busy waiting. Find the source code below.

This code schedules N threads (when running on N cores). Each thread runs for 2 seconds, simulating a certain workload. If HPX actually schedules N threads simultaneously, this program finishes in about 2 seconds. However, this code often takes 4 seconds to run, because HPX (sometimes) only has N-1 threads running, while 1 OS thread does not find work for itself, although there is an application thread waiting. I have confirmed this by using gdb to look at the currently running threads, and see that one OS thread remains waiting in the HPX scheduler looking for work, and not finding the application thread waiting to be executed.

(I have also used busy-waiting to demonstrate this issue: I have scheduled N threads, and then use an atomic variable as counter, and have each thread wait busily until this counter reaches N. When this counter reaches N, this indicates that N threads are running concurrently. In many cases, it never reaches N, indicating that HPX schedules less than N concurrent threads. However, since you indicated that busy waiting is an abuse of the HPX scheduler, my example below uses timing to show that not all cores are busy.)

To reproduce this issue use this code:

#include <hpx/hpx.hpp>
#include <hpx/hpx_init.hpp>

#include <iostream>
#include <vector>

// Be busy for two seconds
void justwait(void)
{
  std::cout << "Running on OS thread " << hpx::get_worker_thread_num()
            << std::endl << std::flush;
  auto t0 = hpx::get_system_uptime() / 1.0e+9;
  const double t1 = t0 + 2.0;
  for (;;) {
    const double t = hpx::get_system_uptime() / 1.0e+9;
    if (t >= t1) break;
  }
}
HPX_PLAIN_ACTION(justwait);

int hpx_main(boost::program_options::variables_map&)
{
  auto nthreads = hpx::get_os_thread_count();
  auto here = hpx::find_here();
  std::vector<hpx::future<void>> fs;
  auto t0 = hpx::get_system_uptime() / 1.0e+9;
  for (std::ptrdiff_t t=0; t<nthreads; ++t) {
    fs.push_back(hpx::async(justwait_action(), here));
  }
  hpx::wait_all(fs);
  auto t1 = hpx::get_system_uptime() / 1.0e+9;
  std::cout << "This took " << t1-t0 << " seconds" << std::endl;

  return hpx::finalize();
}

int main(int argc, char** argv)
{
  namespace po = boost::program_options;
  po::options_description desc("Benchmarking options");
  return hpx::init(desc, argc, argv);
}

This code schedules N threads on N cores. Each thread runs for 2 seconds. I expect HPX to be able to execute this set of threads in about 2 seconds. However, in many cases HPX needs 4 seconds to execute these N threads, because it leaves one core idle, forcing another core to execute 2 threads sequentially.

For example, I may receive the output

This took 4.00041 seconds

This is using the fixing_588 branch.

hkaiser commented 10 years ago

One possibility why this is happening could be that, right after the first HPX-thread gets created (HPX-threads are assigned round robin to the cores), it gets stolen by a core of the neighboring NUMA domain before it can be executed on the core it was created on. All other threads get scheduled and executed as required, which leaves the first core without work. However, it might not be allowed to steal from the neighboring NUMA domain where the only waiting thread is scheduled. The remaining HPX-thread gets picked up as soon as a core becomes available.

This situation is a border case for which I wouldn't like to add any special handling to the schedulers. HPX is meant to be used with a large number of HPX-threads (millions), which are preferably very short-lived (minimally a couple of microseconds). Thus the scenario causing this is not a typical use case.

hkaiser commented 10 years ago

Even if we will not work on this in the short term, I wanted to make some unrelated comments regarding your example code.

There is no need to use the program options objects if you don't need them. An HPX application can be as simple as:

#include <hpx/hpx.hpp>
#include <hpx/hpx_main.hpp>
int main(int argc, char** argv)
{ 
    // ...
    return 0;
}

or

#include <hpx/hpx.hpp>
#include <hpx/hpx_init.hpp>
int hpx_main(int argc, char** argv)
{ 
    // ...
    return hpx::finalize();
}
int main(int argc, char** argv)
{ 
    return hpx::init(argc, argv);
}

both are equivalent.

Also, there is no need to create an action if you want to run on one locality only (or you want to specifically run a local function). Both HPX APIs, hpx::async and hpx::apply, work with plain functions or C++ function objects. Thus your code above could be:

#include <hpx/hpx.hpp>
#include <hpx/hpx_init.hpp>

#include <iostream>
#include <vector>

// Be busy for two seconds
void justwait(void)
{
  std::cout << "Running on OS thread " << hpx::get_worker_thread_num()
            << std::endl << std::flush;
  auto t0 = hpx::get_system_uptime() / 1.0e+9;
  const double t1 = t0 + 2.0;
  for (;;) {
    const double t = hpx::get_system_uptime() / 1.0e+9;
    if (t >= t1) break;
  }
}

int hpx_main(boost::program_options::variables_map&)
{
  auto nthreads = hpx::get_os_thread_count();
  std::vector<hpx::future<void>> fs;
  auto t0 = hpx::get_system_uptime() / 1.0e+9;
  for (std::ptrdiff_t t=0; t<nthreads; ++t) {
    fs.push_back(hpx::async(&justwait));
  }
  hpx::wait_all(fs);
  auto t1 = hpx::get_system_uptime() / 1.0e+9;
  std::cout << "This took " << t1-t0 << " seconds" << std::endl;

  return hpx::finalize();
}

int main(int argc, char** argv)
{
  return hpx::init(argc, argv);
}

All of this might not be 100% obvious from the docs.

eschnett commented 10 years ago

(1) This runs on a single NUMA domain, so NUMA stealing should not be an issue here.
(2) The code is a cut-down version of a larger application, thus it may be more complex than necessary. I didn't bother cleaning it up too much... In Cactus (which does its own command-line handling), I am indeed calling hpx::init(argc, argv).
(3) Thanks for the pointer regarding the actions.

hkaiser commented 10 years ago

(1) This runs on a single NUMA domain, so NUMA stealing should not be an issue here.

This was just an assumption. However I'm not able to reproduce this issue when running on one NUMA domain only. For me the issue manifests itself only when I use all cores on a system.