Open · ehsantn opened 8 years ago
Does changing back to OpenMP reduce help? Last time I measured the current implementation against OpenMP reduce on some benchmarks, there was no practical difference on multi-core. I'm not sure about Pi, though.
I see the same issue on all the benchmarks I have tested for HPAT. I'm working on testing OpenMP reduce on Cori now. I think we might have thread affinity issues on our machines.
OpenMP reduce seems to perform similarly. I don't know where this performance difference comes from.
Are we going to do anything about this? If OpenMP reduce performs similarly, I don't see an immediate remedy that would help.
I think we need deeper performance analysis (with VTune?) to find out what the problem is.
Seems like there might be a performance issue with the new "manual" reduction method. In HPAT, single node MPI is much faster than OpenMP for most benchmarks (pi is a good example).
I suspect it's because of cache-line ping-ponging between threads (false sharing), since the threads' local results are stored consecutively.
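To make the suspicion concrete, here is a minimal sketch of a "manual" reduction that avoids the problem, next to the built-in OpenMP reduction. This is not the code the compiler generates. The names `PaddedSum`, `kCacheLine`, `pi_manual_reduction` and `pi_omp_reduction` are hypothetical, and the 64-byte cache line is an assumption. Each thread accumulates in a private local variable and writes once into its own cache-line-sized slot, so consecutive slots never share a line.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch still builds without -fopenmp.
static int omp_get_max_threads() { return 1; }
static int omp_get_thread_num() { return 0; }
#endif

// Assumed cache-line size; 64 bytes is common on x86 but not universal.
constexpr std::size_t kCacheLine = 64;

// One accumulator per thread, sized and aligned to a full cache line
// so that neighbouring slots do not share a line.
struct alignas(kCacheLine) PaddedSum {
    double value = 0.0;
};

// Midpoint-rule estimate of pi as the integral of 4/(1+x^2) over [0,1],
// using a hand-written reduction over per-thread slots.
double pi_manual_reduction(long n) {
    assert(n > 0);
    std::vector<PaddedSum> partial(omp_get_max_threads());
    const double h = 1.0 / n;
    #pragma omp parallel
    {
        const int tid = omp_get_thread_num();
        double local = 0.0;  // private accumulator, no shared writes in the loop
        #pragma omp for
        for (long i = 0; i < n; ++i) {
            const double x = (i + 0.5) * h;
            local += 4.0 / (1.0 + x * x);
        }
        partial[tid].value = local;  // single write per thread
    }
    double sum = 0.0;
    for (const PaddedSum& p : partial) sum += p.value;
    return sum * h;
}

// Same computation using the built-in OpenMP reduction clause, for comparison.
double pi_omp_reduction(long n) {
    assert(n > 0);
    const double h = 1.0 / n;
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i) {
        const double x = (i + 0.5) * h;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * h;
}
```

If the generated code instead updates unpadded consecutive `double` slots inside the loop, timing it against this padded variant should show whether false sharing is the cause. If the two perform the same, the problem is probably elsewhere (thread affinity, for example).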