ParRes / Kernels

This is a set of simple programs that can be used to explore the features of a parallel platform.
https://groups.google.com/forum/#!forum/parallel-research-kernels

add shmem4py #618

Closed: jeffhammond closed this 1 year ago

jeffhammond commented 1 year ago

nstream and transpose work. that's enough for now.

Signed-off-by: Jeff Hammond jeff.science@gmail.com

dalcinl commented 1 year ago

Would it make sense to add explicit shmem.free(obj) calls for all symmetric objects near the end of the main functions? Memory deallocation is a collective operation, and you cannot in general trust Python's garbage collector, running independently on each image, to deallocate shmem memory in a coordinated way. I'm still scratching my head about how to make this whole thing nicer and more Pythonic.
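
A minimal sketch of what that could look like in one of the main functions, assuming the shmem4py-style API (shmem.zeros, shmem.full, shmem.free) used by the kernels in this PR; sizes and names here are illustrative only:

from shmem4py import shmem

def main():
    n = 1_000_000

    # symmetric allocations are collective: every PE participates
    A = shmem.zeros(n, dtype='d')
    B = shmem.full(n, 2.0, dtype='d')

    # ... kernel iterations and validation ...

    # explicit, collective deallocation at the end of main, instead of
    # relying on each PE's garbage collector to run at the same point
    shmem.free(A)
    shmem.free(B)

if __name__ == '__main__':
    main()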

jeffhammond commented 1 year ago

I don't know how anything works in the Python garbage collector, but I can imagine marking SHMEM objects as eligible for deletion and then running garbage collection during collective operations, perhaps only the memory-management ones.

Of course, in the PRK cases it doesn't matter, because the programs are so simple and they all do the deallocation in the epilogue anyway. The primary reason I explicitly deallocate is to trigger bugs, not to prevent memory leaks.
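
To make that idea concrete, here is a purely hypothetical deferred-free pattern (not anything shmem4py provides): local deletion only queues the object, and the actual collective shmem.free happens at a point where all PEs are known to synchronize, in the same order on every PE.

from shmem4py import shmem

_pending = []

def deferred_free(obj):
    # the object is dead locally, but free() is collective,
    # so only queue it instead of freeing right away
    _pending.append(obj)

def drain_pending_frees():
    # call this at a known collective point (barrier, collective
    # allocation, ...); every PE drains the queue in the same order
    shmem.barrier_all()
    while _pending:
        shmem.free(_pending.pop(0))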

dalcinl commented 1 year ago

My only concern so far is that the NSTREAM test creates a lot of temporary arrays, so it is not really like the typical stream test in C or Fortran:

In [1]: import numpy
In [2]: import numexpr
In [3]: n = 100_000_000
In [4]: A = numpy.full(n, 2.0)
In [5]: B = numpy.full(n, 2.0)
In [6]: %timeit A + 3.0*B
143 ms ± 66.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: %timeit numexpr.evaluate('A + 3.0*B')
33.6 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

EDIT: Ignore the benchmark above. I forgot numexpr may be using threads.

dalcinl commented 1 year ago

Nonetheless, even if the memory allocations are mostly free (because of the NumPy allocator cache), Python/NumPy will not operate on more than two streams at a time, since there is no fused multiply-add operator in Python.
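
For reference, the usual NumPy workaround for the nstream triad A += B + scalar*C is to spell it out as pairwise in-place operations with a preallocated scratch array; a sketch, which removes the temporaries but still makes three passes over the data instead of one fused loop:

import numpy

n = 100_000_000
scalar = 3.0
A = numpy.zeros(n)
B = numpy.full(n, 2.0)
C = numpy.full(n, 2.0)
scratch = numpy.empty_like(C)

# naive form, allocates two temporaries per iteration:
#   A += B + scalar*C

# pairwise in-place form: no temporaries, but still multiple sweeps
numpy.add(A, B, out=A)                  # A += B
numpy.multiply(C, scalar, out=scratch)  # scratch = scalar*C
numpy.add(A, scratch, out=A)            # A += scratch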

jeffhammond commented 1 year ago

Yeah, this is one of the purposes of the PRK: to show where languages/runtimes are defective from a performance standpoint. I am sure I can hack around this in NumPy, and I guess Numba solves it, but if the most straightforward implementation is worse than Fortran, then people deserve to know that.
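
As a hedged illustration of the Numba route (not part of this PR), a fused kernel would look roughly like this; the @njit(parallel=True) loop performs the whole triad in a single pass, the way the C and Fortran nstream kernels do:

import numpy
from numba import njit, prange

@njit(parallel=True, fastmath=True)
def nstream(A, B, C, scalar):
    # one fused, parallel pass over all three arrays
    for i in prange(A.shape[0]):
        A[i] += B[i] + scalar * C[i]

n = 100_000_000
A = numpy.zeros(n)
B = numpy.full(n, 2.0)
C = numpy.full(n, 2.0)
nstream(A, B, C, 3.0)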