[Performance]: investigate performance issues on ARM systems

mppf commented 1 month ago

Summary of Problem

Description: Two publications report Chapel performance problems on ARM systems relative to other programming models. They are:

"Performance Portability of the Chapel Language on Heterogeneous Architectures". Josh Milthorpe (Oak Ridge National Laboratory, Australian National University), Xianghao Wang (Australian National University), Ahmad Azizi (Australian National University). Heterogeneity in Computing Workshop (HCW). See also the related presentation: https://chapel-lang.org/ChapelCon/2024/milthorpe.pdf .
Diehl, P., Morris, M., Brandt, S.R., Gupta, N., Kaiser, H. (2024). Benchmarking the Parallel 1D Heat Equation Solver in Chapel, Charm++, C++, HPX, Go, Julia, Python, Rust, Swift, and Java. In: Zeinalipour, D., et al. Euro-Par 2023: Parallel Processing Workshops. Euro-Par 2023. Lecture Notes in Computer Science, vol 14352. Springer, Cham. Available at https://arxiv.org/abs/2307.01117

Is this issue currently blocking your progress? No

Steps to Reproduce

See the 1st paper for details on how to obtain miniBUDE. That paper used ThunderX2 processors.

See the 2nd paper for details on how to find the source code for the heat diffusion simulation. The 2nd paper used the Ookami cluster, which uses A64FX processors.

bradcray commented 1 month ago

IIRC, the second paper was written before we supported Qthreads for ARM, and that was believed to be the major cause of the performance issue. I recall us discussing re-running the experiments after https://github.com/chapel-lang/chapel/pull/23163 went in, but can't recall offhand whether we did that or not. @jeremiah-corrado or @ronawho might remember better.

I'm not as familiar with the first paper, so don't have any information there offhand.

jeremiah-corrado commented 1 month ago

> @jeremiah-corrado might remember better

I don't recall re-running those experiments.

ronawho commented 1 month ago

I didn't run either of these.

> The 2nd paper use the Ookami cluster. Both of these results are on systems with ThunderX2 processors.

Note that Ookami is A64FX, not ThunderX2. There may be a few ThunderX2 nodes on the system, but it looks like the paper's results were gathered on A64FX.

damianmoz commented 1 month ago

Out of interest, with respect to the 2nd paper: (a) what was the speed once the latest Qthreads was supported, and (b) what was their complaint about how classes worked differently from C++ or Java? Thanks.

Please let me know if this belongs on Discourse. In asking (a), I was assuming you had access to this machine. Apologies if that presumption was mistaken.

bradcray commented 1 month ago

@damianmoz: For (a), I don't believe we ever ran their code on Ookami after the Qthreads update. I'm reasonably confident that we did reproduce the relatively poor performance on local ARM systems prior to the paper being submitted, and attributed it to Qthreads. I'm also fairly confident we did general experiments to show that using Qthreads on such processors improved performance, though I'm not sure whether we ran these specific experiments or other ones. One other wildcard here, which I'd forgotten about until Elliot pointed out the use of A64FX, is that (IIRC) vectorization is pretty crucial on that chip, so that could be another source of overhead relative to conventional models. We haven't done any recent experiments with A64FX, so I don't know how we're doing there w.r.t. vectorization, given all the recent work to integrate better with LLVM and keep up with newer LLVM versions.

For (b), you're referring to this excerpt from the paper:

> Often, knowledge of one programming language can mislead a programmer into thinking they understand another. The Chapel “class” for example did not behave exactly as we expected from C++ and Java.

I don't have any recollection that we ever heard an explanation of this behavior difference that they're referring to, and am not finding any reference to it in our email exchanges on the paper. When I sent them comments on a pre-print of the paper, I asked for more information, but didn't receive a response.

damianmoz commented 1 month ago

Thanks heaps for the very detailed reply. A shame about (b). I think issues that relate to the ease (or otherwise) of porting code to Chapel are really important. While of less importance, an answer to (a) at some stage might be useful. The LOC comparison they raised, where Chapel was way superior to everything else, is pretty much what all of us have seen.

chapel-lang / chapel

[Performance]: investigate performance issues on ARM systems #26020
