Open Mietzsch opened 5 years ago
After browsing through some of the code, I have another major concern: some codes allocate global memory in the critical path, e.g., every timestep. Global memory allocation is costly (orders of magnitude slower than malloc). Has anyone ever tested this on a real HPC system (ideally IB or Cray where global memory is pinned)? How does it compare against the MPI version of the NPB benchmarks? In many places these global data structures can probably be allocated once and reused, so the fix should be easy.
The reason I am concerned is this: if these benchmarks end up in the repo, someone will eventually grab them and use them to compare their approach to DASH. They will not attempt to investigate why the performance of DASH seemingly sucks. We should be careful about putting out benchmarks where we cannot show that we are at least in the same ballpark as MPI. This would come back to haunt us...
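To make the "allocate once and reuse" suggestion concrete, here is a minimal sketch in plain C++. It does not use the DASH API: `GlobalBuffer`, `g_allocation_count`, and both `run_*` functions are hypothetical names invented for illustration, and the `std::vector` inside merely stands in for a collectively allocated global array. The counter only makes the number of allocations visible; it says nothing about the actual cost on a real system.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical counter, for illustration only: how many "global" allocations happened.
int g_allocation_count = 0;

// Hypothetical stand-in for a globally allocated array. In a real code the
// constructor would be the expensive collective allocation; here it only
// allocates local memory and bumps the counter.
struct GlobalBuffer {
    std::vector<double> data;

    explicit GlobalBuffer(std::size_t n) : data(n) {
        assert(n > 0);
        ++g_allocation_count;
    }
};

// Pattern the comment warns about: a fresh buffer in every timestep,
// so the allocation sits in the critical path.
double run_allocating_each_step(std::size_t n, int steps) {
    double sum = 0.0;
    for (int t = 0; t < steps; ++t) {
        GlobalBuffer buf(n);  // allocated (and freed) once per timestep
        buf.data[0] = static_cast<double>(t);
        sum += buf.data[0];
    }
    return sum;
}

// Suggested fix: hoist the allocation out of the loop and reuse the buffer.
double run_reusing_buffer(std::size_t n, int steps) {
    GlobalBuffer buf(n);  // allocated once, before the time loop
    double sum = 0.0;
    for (int t = 0; t < steps; ++t) {
        buf.data[0] = static_cast<double>(t);  // overwrite contents in place
        sum += buf.data[0];
    }
    return sum;
}
```

Both functions compute the same result, but for `steps` timesteps the first performs `steps` allocations and the second performs one. Whether hoisting is this simple in each kernel depends on whether the buffer size changes between timesteps, which would need to be checked per benchmark.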
No, I did not test this on a real HPC system. Unfortunately, I'm working on different projects now and don't have the time to test and rework the global data structures. If anybody wants to go ahead and do it, you're more than welcome.
This is an implementation of the five original NPB kernels using DASH. We use several aspects of DASH, including DASH algorithms, CSR patterns and async_copy.