Closed: zingale closed this issue 3 weeks ago
Just for future reference: this appears to depend on whether we run the primordial_chem network from the application code or standalone (https://github.com/AMReX-Astro/Microphysics/tree/main/unit_test/burn_cell_primordial_chem). The latter does not trigger the memory fault we see when running the network in Quokka.
We should try `test_react` instead of `burn_cell`, and also make the box bigger. Right now it seems that `n_cell = 16`, so we should do 32**3 or 64**3.
Turning off force-inlining of all functions in the kernel appears to fix the memory issues in Castro. We could try something like `constexpr_for<0, N>([](int) { big_function(); });` to make an arbitrarily large kernel.
@psharda found that turning off force-inlining also fixes the memory issues in Quokka.
More context on the compiler bug here: https://discourse.llvm.org/t/how-to-verify-correct-regalloc-for-a-kernel/80811
TL;DR: the underlying issue is well understood by the compiler developers, and it is expected to be fixed by this LLVM PR: https://github.com/llvm/llvm-project/pull/93526
Should we close this issue?
Very nice. Does this mean a future ROCm version will have this fix?
Since the ROCm compiler is derived from the upstream LLVM sources, I think, in principle, yes. No idea when that will happen, though.
Closing this since it is recognized as an LLVM bug with a PR fix.
ROCm seems to have trouble with large kernels, leading to memory issues. We can try to create a reproducer using `test_react`, starting with a small net and making bigger and bigger nets (via pynucastro) until we find a size that breaks things. We might also be able to strip out the neutrinos, the EOS, and other bits.