Closed markdewing closed 2 years ago
On a Xeon Gold 5220 I'm seeing ~4% improvement with one kokkos --serial process. I first saw the opposite effect on Cori, but found a mistake in my testing setup. Once I get consistent results there I'll merge this PR.
On Cori I see ~5% improvement on kokkos --serial when running one process on an otherwise empty node. The 1-thread case is now within 4% of the serial program.
On a fully loaded socket the improvement is smaller but still clearly visible.
The Kokkos serial backend is slower than the plain serial code - see #297 for more details.
One cause is addressed here: the Kokkos backend loops over the entire array (size 1024) rather than over the number of clusters (about 10). This loop also sets ok and newclusId to zero. Both are set again later, so these assignments can be removed.
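
To illustrate the change, here is a minimal sketch in plain C++ (no Kokkos) of the two loop shapes. It is not the code from this PR. The struct `ClusterBuffers`, the `charge` array, and the function names `resetAll` and `resetUsed` are hypothetical and exist only for this example. Only the array size (1024), the typical cluster count (about 10), and the names `ok` and `newclusId` come from the description above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Array capacity quoted in the PR description.
constexpr std::size_t kMaxClusters = 1024;

// Hypothetical per-module scratch buffers; only `ok` and `newclusId`
// are names taken from the PR description.
struct ClusterBuffers {
  std::array<float, kMaxClusters> charge{};
  std::array<std::int32_t, kMaxClusters> ok{};
  std::array<std::int32_t, kMaxClusters> newclusId{};
};

// Before: visits every slot of the array and also zeroes `ok` and
// `newclusId`, although both are overwritten later.
// Returns the number of loop iterations so the cost is easy to compare.
std::size_t resetAll(ClusterBuffers& b) {
  std::size_t visited = 0;
  for (std::size_t i = 0; i < kMaxClusters; ++i) {
    b.charge[i] = 0.0f;
    b.ok[i] = 0;
    b.newclusId[i] = 0;
    ++visited;
  }
  return visited;
}

// After: visits only the slots that are in use and leaves `ok` and
// `newclusId` alone, since they are assigned later anyway.
std::size_t resetUsed(ClusterBuffers& b, std::size_t nClusters) {
  assert(nClusters <= kMaxClusters);
  std::size_t visited = 0;
  for (std::size_t i = 0; i < nClusters; ++i) {
    b.charge[i] = 0.0f;
    ++visited;
  }
  return visited;
}
```

With about 10 clusters, the second form does roughly 10 iterations with one store each instead of 1024 iterations with three stores each. That is the kind of per-call overhead that shows up when the serial backend runs this loop many times.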