Use OpenMP parallel-for during layout step

On an Intel i7-4790K (4 physical cores) with a test set of ~1M objects, I get a ~3.5x speedup. On a dual Xeon E5-2643 (8 physical cores), I get a ~5x speedup.
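For readers unfamiliar with the pattern, here is a minimal sketch of an OpenMP parallel-for over a per-body force pass. It is not the code from this change: `Body` and `updateForces` are hypothetical names, and the brute-force O(n^2) repulsion only stands in for the real quad tree traversal so the example stays self-contained. The property that matters is that each iteration writes to its own body only.

```cpp
#include <vector>

// Hypothetical stand-in for a layout body; not the project's actual type.
struct Body {
    double x = 0.0, y = 0.0;   // position, only read during the force pass
    double fx = 0.0, fy = 0.0; // accumulated force, written by one iteration only
};

// Brute-force repulsion, assumed here purely for illustration.
void updateForces(std::vector<Body>& bodies) {
    // A signed loop index keeps the loop valid for older OpenMP versions.
    const long n = static_cast<long>(bodies.size());

    // Iteration i reads every position but writes only bodies[i],
    // so the iterations can run concurrently without locking.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        double fx = 0.0, fy = 0.0;
        for (long j = 0; j < n; ++j) {
            if (i == j) continue;
            const double dx = bodies[i].x - bodies[j].x;
            const double dy = bodies[i].y - bodies[j].y;
            const double distSq = dx * dx + dy * dy + 1e-9; // avoid division by zero
            fx += dx / distSq;
            fy += dy / distSq;
        }
        bodies[i].fx = fx;
        bodies[i].fy = fy;
    }
}
```

With GCC or Clang this needs `-fopenmp` to actually run in parallel; without the flag the pragma is ignored and the loop runs serially with the same result.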

There is still a significant single-threaded portion; I suspect it is the insertion into the quad tree. I haven't checked, but I'm not sure the quad tree is ready for concurrent insertion...

EDIT: Running gprof on a sample run shows that, indeed, 10% of the time is spent in QuadTree::insert (82% is spent in QuadTree::updateBodyForce and 4% in Layout::updateSpringForce).
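As a rough cross-check, the measured speedups can be turned into an implied serial fraction with the Karp-Flatt metric, e = (1/S - 1/p) / (1 - 1/p), where S is the speedup and p the number of cores. This is only a sketch: `serialFraction` is a hypothetical helper, and it assumes one thread per physical core and ignores overheads such as scheduling and memory bandwidth.

```cpp
// Karp-Flatt metric: experimentally determined serial fraction.
// speedup: measured speedup over the single-threaded run.
// cores:   number of cores used (must be greater than 1).
double serialFraction(double speedup, int cores) {
    const double invP = 1.0 / cores;
    return (1.0 / speedup - invP) / (1.0 - invP);
}

// serialFraction(3.5, 4) is about 0.048 (i7-4790K numbers)
// serialFraction(5.0, 8) is about 0.086 (dual E5-2643 numbers)
```

The 8-core estimate lands in the same range as the 10% that gprof attributes to QuadTree::insert, although the two numbers are measured differently, so they should not be expected to match exactly.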

anvaka / ngraph.native

Use OpenMP parallel-for during layout step #2