ARM-software / synchronization-benchmarks

Collection of synchronization micro-benchmarks and traces from infrastructure applications

Deadlock if fewer threads (<args.nthrds) started #28

Open mjaggi-cavium opened 6 years ago

mjaggi-cavium commented 6 years ago

This is similar to an earlier issue I posted some time back.

After a run of about 20 minutes, a deadlock is observed when main() fails to start all of the 'n' threads (args.nthrds). All cores on which threads were started are at 100%. The first child thread is waiting for ready_lock, while the others are waiting for sync_lock.

This behaviour is observed when the number of cores is 200+ (4 threads per core).

I am not sure why not all nthrds are starting; it could be an RT throttling issue. Comments/suggestions?
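
For reference, here is a minimal sketch of the kind of spin-wait handshake described above, assuming a counter-style ready_lock and a flag-style sync_lock; it is illustrative only, not the actual lockhammer source:

```c
/* Illustrative sketch only -- not the actual lockhammer code.
 * Each child announces itself via ready_lock and then spins on
 * sync_lock until the main thread releases it. If some children
 * never get scheduled, ready_lock never reaches nthrds and every
 * spinner waits forever: the deadlock reported here. */
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t ready_lock; /* children that reached the barrier */
static _Atomic uint64_t sync_lock;  /* 0 = hold, nonzero = go */

static void child_wait(void)
{
    atomic_fetch_add(&ready_lock, 1);
    while (atomic_load(&sync_lock) == 0)
        ; /* busy spin; under SCHED_FIFO this starves late threads */
}

static void marshal_wait(unsigned long nthrds)
{
    while (atomic_load(&ready_lock) < nthrds)
        ; /* main spins here if any child never starts */
    atomic_store(&sync_lock, 1);
}
```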

geoffreyblake commented 6 years ago

Hi, does the issue still happen with the "-s" flag which disables the SCHED_FIFO setting?

zoybai commented 6 years ago

Hi Manish, which workload did you run for this deadlock case? Thanks

mjaggi-cavium commented 6 years ago

I am running ./runall.sh. The issue is not seen when running a single instance of a workload.

mjaggi-cavium commented 6 years ago

> Hi, does the issue still happen with the "-s" flag which disables the SCHED_FIFO setting?

No. Test completes without hanging.

lucasclucasdo commented 6 years ago

When running with -s what sort of effective parallelism do you see? It should be close to the number of requested cores. If it's significantly lower then the improvement may be due to -s mode not being able to recreate the high contention case and not directly a problem with FIFO mode itself.

mjaggi-cavium commented 6 years ago

> It should be close to the number of requested cores.

Yes.

With -s, I haven't seen the number of threads created be less than nthrds at any time. So the main thread is not starved.

lucasclucasdo commented 6 years ago

The number of threads created will be the same, but "effective parallelism" (output by the tool) tells you how many of the created threads are actually running at the same time. So you could have 200 cores and 200 threads, but if each one runs to completion on one core before the next one starts, you can theoretically have an effective parallelism of only 1 even though thread creation equals requested threads.
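
As an illustration, a high-water-mark metric of this kind can be computed with a shared atomic counter; this is a sketch of the general technique, not necessarily how the tool implements it:

```c
/* Sketch of an effective-parallelism measurement (illustrative only). */
#include <stdatomic.h>

static _Atomic int running;     /* threads currently inside the measured region */
static _Atomic int max_running; /* high-water mark ~ effective parallelism */

static void region_enter(void)
{
    int now  = atomic_fetch_add(&running, 1) + 1;
    int seen = atomic_load(&max_running);
    /* CAS loop so concurrent updates cannot lose a higher value */
    while (now > seen &&
           !atomic_compare_exchange_weak(&max_running, &seen, now))
        ;
}

static void region_exit(void)
{
    atomic_fetch_sub(&running, 1);
}
```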

mjaggi-cavium commented 6 years ago

AFAIK,

lucasclucasdo commented 6 years ago

It's more likely that the child threads get starved, but the scheduler should be waking up cores to steal and run the child threads, since there would otherwise be balance problems (one core with two runnable FIFO processes and another core with nothing). One thing I've been thinking about trying is spawning a bunch of threads to make the balance issue look worse, so the scheduler steps in sooner, and then affining each thread to whichever unloaded core it ends up on first (or exiting if the core it ends up on already has a waiting lockhammer process).
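
For illustration, the "affine threads to wherever they land" part of that idea could look roughly like the sketch below, using standard Linux calls; this is not code from the repository:

```c
/* Sketch: pin the calling thread to whatever core it first landed on. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int pin_to_current_cpu(void)
{
    int cpu = sched_getcpu(); /* core the scheduler chose for this thread */
    if (cpu < 0)
        return -1;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```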

Anyway, that's not relevant to the question I'm asking, which is: does safemode successfully achieve the requested contention level? I'm guessing not, since FIFO mode was added to avoid this exact problem in the first place, which is why I'm asking. In other words, safemode might "solve" the issue you're seeing, but it probably does so by making the test a useless measure of performance in the high-core-count contention case (because it likely fails to achieve that contention). How does the "effective parallelism" metric compare to the requested thread count at the high thread counts where you were previously seeing the scheduling issue?

Edit: slight change, the main thread should be free to run anywhere, not just hw thread 0 (if that's not the case, it's a bug).

lucasclucasdo commented 6 years ago

I created a test branch which sched_yields the thread on core 0 if all child threads are not yet ready. Unfortunately, I cannot replicate this issue on the systems to which I have access, so please try this branch and see if it helps:

https://github.com/codeauroraforum/synchronization-benchmarks/tree/lh-yieldwait
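
For context, a yielding spin-wait of the kind the branch name suggests might look like the sketch below; the actual implementation in the branch may differ in naming and details:

```c
/* Sketch of a yielding spin-wait (illustrative; see the branch for the real code). */
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

static void wait64_yield(_Atomic uint64_t *lock, uint64_t val)
{
    /* Yield between polls so not-yet-started child threads can be
       scheduled instead of being starved by this spinner. */
    while (atomic_load(lock) != val)
        sched_yield();
}
```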

mjaggi-cavium commented 6 years ago

Tried this, and replaced the below as well:

```c
/* Spin until the "marshal" sets the appropriate bit */
wait64_yield(&sync_lock, (nthrds * 2) | 1);
```

I think I missed one point: the affinity of the main thread is all cores, so wherever it is rescheduled, if there is contention, not all threads will start. So I believe we need to put a sched_yield in all the atomic functions.

lucasclucasdo commented 6 years ago

If we yield the other threads then we need to add another sync step without a yield, to make sure everyone has actually both started and is running. E.g., the current scheme is:

  1. Startup threads
  2. Wait for all threads to startup
  3. Threads are FIFO and unyielding so if they've reported started then they must be running still
  4. Send a start signal since we know threads are all started up (because they told us) and currently running (because they must be by definition)

If we yield the startup threads it should be (see the sketch after this list):

  1. Startup threads
  2. Wait with yielding for all threads to startup
  3. Threads have all started up but may not currently be running, due to yielding while startup was ongoing
  4. Wait without yielding for all started up threads to get rescheduled and report back in
  5. Send a start signal since we've confirmed all threads are started up (because they told us) and currently running (because they also told us)
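
A rough sketch of that two-phase handshake, with hypothetical names (this is not the actual patch):

```c
/* Sketch of the two-phase startup barrier described above (hypothetical names). */
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t started;     /* phase 1: children that have started */
static _Atomic int      all_started; /* set by the marshal when started == nthrds */
static _Atomic uint64_t running;     /* phase 2: children confirmed back on-CPU */
static _Atomic int      go;          /* final start signal */

static void child_startup(void)
{
    atomic_fetch_add(&started, 1);
    while (!atomic_load(&all_started))
        sched_yield();             /* phase 1 may yield so late children can start */
    atomic_fetch_add(&running, 1); /* report back in after being rescheduled */
    while (!atomic_load(&go))
        ;                          /* phase 2: no yield, "reported" implies "running" */
}

static void marshal(uint64_t nthrds)
{
    while (atomic_load(&started) < nthrds)
        sched_yield();
    atomic_store(&all_started, 1);
    while (atomic_load(&running) < nthrds)
        ;                          /* wait for every child to be running again */
    atomic_store(&go, 1);
}
```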

That said, I still think this is more of a scheduler balance problem: at high core counts, a single core with an extra runnable but not running process (i.e., the main thread) doesn't look like too bad of a balance problem, so sleeping hardware threads are not woken up to execute the main software thread for a long time, in the hope that one of the many low-utilization hardware threads already running can take care of it quickly (but of course they can't, because they're all running FIFO threads that are busy spinning).