IntelLabs / ParallelAccelerator.jl

The ParallelAccelerator package, part of the High Performance Scripting project at Intel Labs
BSD 2-Clause "Simplified" License

Reduction performance issue #63

Open: ehsantn opened 8 years ago

ehsantn commented 8 years ago

There seems to be a performance issue with the new "manual" reduction method. In HPAT, single-node MPI is much faster than OpenMP for most benchmarks (pi is a good example).

I suspect the cause is cache-line ping-ponging (false sharing) between threads, since the threads' local results are stored consecutively in memory.
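To make the suspected effect concrete: when per-thread partial results sit next to each other in one array, several of them share a cache line, and each write by one thread can invalidate that line in the other cores' caches. Below is a minimal sketch of the usual mitigation, not ParallelAccelerator's actual generated code. It assumes a 64-byte cache line and C++17 (for over-aligned allocation in `std::vector`), and uses a midpoint-rule estimate of pi as a stand-in workload. The names `PaddedSum`, `kCacheLine`, and `pi_padded` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines (common on x86, but not guaranteed).
constexpr std::size_t kCacheLine = 64;

// One partial result per thread. alignas pads each slot to a full cache
// line, so two threads never write to the same line.
struct alignas(kCacheLine) PaddedSum {
    double value = 0.0;
};
static_assert(sizeof(PaddedSum) == kCacheLine, "slot should fill one line");

// Estimate pi by midpoint integration of 4 / (1 + x^2) over [0, 1],
// splitting the iterations across num_threads threads.
double pi_padded(std::size_t steps, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<PaddedSum> partial(num_threads);
    std::vector<std::thread> workers;
    const double h = 1.0 / static_cast<double>(steps);

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&partial, steps, num_threads, h, t] {
            double local = 0.0;  // accumulate privately, not in the shared array
            for (std::size_t i = t; i < steps; i += num_threads) {
                const double x = (static_cast<double>(i) + 0.5) * h;
                local += 4.0 / (1.0 + x * x);
            }
            partial[t].value = local;  // one write per thread to its own line
        });
    }
    for (auto& w : workers) w.join();

    // Final sequential combine of the per-thread results.
    double sum = 0.0;
    for (const auto& p : partial) sum += p.value;
    return sum * h;
}
```

Calling `pi_padded(1000000, 4)` returns an approximation of pi. Replacing `PaddedSum` with a plain `double` and accumulating directly into `partial[t]` inside the loop would reproduce the consecutive-storage layout described above, which gives a simple way to test whether false sharing explains the gap.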

ninegua commented 8 years ago

Does changing back to OpenMP reduce help? The last time I measured the current implementation against OpenMP reduce on some benchmarks, there was no practical difference on multi-core. I'm not sure about Pi, though.
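For comparison, "OpenMP reduce" here presumably means the standard `reduction` clause, where the OpenMP runtime creates a private copy of the accumulator per thread and combines the copies at the end of the loop. The following is a rough sketch of what such a loop might look like for a pi estimate, not the code the package emits. The function name `pi_omp_reduce` is hypothetical, and the midpoint-rule workload is an assumption chosen for illustration.

```cpp
// Estimate pi by midpoint integration of 4 / (1 + x^2) over [0, 1].
// With OpenMP enabled (e.g. -fopenmp on GCC/Clang), each thread gets a
// private copy of `sum` that the runtime combines after the loop.
// Without OpenMP the pragma is ignored and the loop runs serially.
double pi_omp_reduce(long steps) {
    const double h = 1.0 / static_cast<double>(steps);
    double sum = 0.0;

    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < steps; ++i) {
        const double x = (static_cast<double>(i) + 0.5) * h;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * h;
}
```

Timing this against a manual per-thread reduction on the same machine, with the same thread count and pinning, is one way to separate reduction overhead from other effects such as thread placement.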

ehsantn commented 8 years ago

I see the same issue on all the benchmarks I have tested for HPAT. I'm testing OpenMP reduce on Cori now. I think we might have thread-affinity issues on our machines.

ehsantn commented 8 years ago

OpenMP reduce seems to perform similarly. I don't know where this performance difference comes from.

ninegua commented 8 years ago

Are we going to do anything about this? If OpenMP reduce performs similarly, I don't see an immediate remedy that would help.

ehsantn commented 8 years ago

I think we need deeper performance analysis (with VTune?) to find out what the problem is.