Since Poincaré tracing uses an adaptive integration routine, execution time is not uniform across points.
Chaotic regions in particular take longer to integrate and are typically not evenly distributed across the nptrj range, which leads to a significant load imbalance between threads. The default static scheduling divides the loop into large, equal blocks, one per thread. Dynamic scheduling distributes the work at a finer granularity (one loop iteration at a time) and hands each iteration to whichever thread is free at that moment.
This might cause a minor performance overhead in the edge case of very small nppts combined with a large nptrj, but that should be outweighed by the improved load balance. It improved wall-clock time and CPU utilization in all the examples I tested.
For example, for a simple rotating ellipse this resulted in a speedup of 34s (from 1m25s down to 51s) on 4 threads, and a speedup was measurable even in the "worst case" of nppts = 1.
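
For illustration, here is a minimal, self-contained C/OpenMP sketch of the scheduling change described above. The function `trace_trajectory` and its toy workload are hypothetical stand-ins for the adaptive integration; only the roles of nptrj and nppts correspond to the actual parameters.

```c
#include <omp.h>
#include <stdio.h>

/* Hypothetical stand-in for the adaptive integration of one trajectory:
 * the work per call varies strongly with the starting index, mimicking
 * cheap regular orbits vs. expensive chaotic ones. */
static double trace_trajectory(int itrj, int nppts)
{
    long steps = (long)nppts * (itrj % 7 == 0 ? 200 : 1); /* deliberately uneven cost */
    double x = 0.5 + 0.001 * itrj;                        /* seed in (0, 1) */
    for (long s = 0; s < steps; ++s)
        x = 3.9 * x * (1.0 - x);                          /* dummy work replacing the ODE solver */
    return x;
}

int main(void)
{
    enum { NPTRJ = 64 };        /* number of starting points (trajectories) */
    const int nppts = 200000;   /* Poincare points per trajectory (drives the cost) */
    double result[NPTRJ];

    double t0 = omp_get_wtime();

    /* schedule(dynamic, 1): each thread grabs the next untraced trajectory as
     * soon as it finishes its current one, so the expensive (chaotic) starting
     * points no longer pin a whole contiguous block onto a single thread the
     * way static scheduling does. */
    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < NPTRJ; ++i)
        result[i] = trace_trajectory(i, nppts);

    double sum = 0.0;
    for (int i = 0; i < NPTRJ; ++i)
        sum += result[i];
    printf("traced %d trajectories in %.3f s (checksum %.6f, max %d threads)\n",
           NPTRJ, omp_get_wtime() - t0, sum, omp_get_max_threads());
    return 0;
}
```

Compile with e.g. `gcc -fopenmp`; with schedule(static) the expensive iterations cluster on a few threads, while schedule(dynamic, 1) keeps all threads busy until the last trajectory is traced.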