AMReX-Astro / Microphysics

common astrophysical microphysics routines with interfaces for the different AMReX codes
https://amrex-astro.github.io/Microphysics
Other
34 stars 33 forks source link

try to make a ROCm big kernel reproducer #1477

Closed zingale closed 3 weeks ago

zingale commented 7 months ago

ROCm seems to have trouble with large kernels, leading to memory issues. We can try to create a reproducer using test_react, starting with a small net and make bigger and bigger nets (via pynucastro) until we find a size that breaks things. We might also be able to strip out neutrinos, the EOS, and other bits.

BenWibking commented 7 months ago

Just for future reference: this appears to depend on whether we run the primordial_chem network from the application code or standalone (https://github.com/AMReX-Astro/Microphysics/tree/main/unit_test/burn_cell_primordial_chem). The latter does not trigger the memory fault we see when running the network in Quokka.

zingale commented 7 months ago

we should try test_react instead of burn_cell and also make the box bigger. Right now it seems that n_cell = 16, so we should do 32**3 or 64**3

yut23 commented 7 months ago

Turning off force-inlining of all functions in the kernel appears to fix the memory issues in Castro. We could try something like constexpr_for<0, N>([](int) { big_function(); }); to make an arbitrarily-large kernel.

BenWibking commented 7 months ago

@psharda found that turning off force-inlining also fixes the memory issues in Quokka.

BenWibking commented 3 weeks ago

More context on the compiler bug here: https://discourse.llvm.org/t/how-to-verify-correct-regalloc-for-a-kernel/80811

TL;DR the underlying issue is well-understood by the compiler developers, and it is supposed to be fixed by this LLVM PR: https://github.com/llvm/llvm-project/pull/93526

Should we close this issue?

zingale commented 3 weeks ago

Very nice. Does this mean a future ROCm version will have this fix?

BenWibking commented 3 weeks ago

Since the ROCm compiler is derived from the upstream LLVM sources, I think, in principle, yes. No idea when this will be.

zingale commented 3 weeks ago

closing this since it seems to be recognized to be an LLVM bug with a PR fix