Opening this as a draft for reference. I think we should wait for responses from both the Umpire developers (https://github.com/LLNL/Umpire/issues/881) and HPE before deciding if and what workaround to apply.

This typically, but not always, gives reasonable performance after only one warmup iteration, and the warmup iteration isn't ridiculously slow compared to the best case. However, this always allocates at least 2MiB per allocation from Umpire and can end up wasting quite a lot of memory for small tiles. As an example, the `gen_to_std` miniapp can look like this on current master:

and most of the time looks like this on this PR:
The best case doesn't improve, but the worst case and variance significantly improve.
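
To give a rough feel for the memory overhead mentioned above, here is a back-of-the-envelope sketch. It assumes square tiles of `double` and one allocation per tile, which may not match how tiles are actually allocated. The helper names (`tile_bytes`, `reserved_bytes`, `wasted_fraction`) are hypothetical and only for illustration; the only number taken from this PR is the 2MiB minimum.

```cpp
#include <cassert>
#include <cstddef>

// Minimum size of a single allocation, as stated in the PR description.
constexpr std::size_t min_alloc_bytes = 2 * 1024 * 1024;

// Assumption: a tile is a square block of doubles, allocated on its own.
constexpr std::size_t tile_bytes(std::size_t tile_size) {
  return tile_size * tile_size * sizeof(double);
}

// Bytes reserved for a request, modelling only the 2MiB minimum.
constexpr std::size_t reserved_bytes(std::size_t requested) {
  return requested < min_alloc_bytes ? min_alloc_bytes : requested;
}

// Fraction of the reserved memory that the request does not use.
constexpr double wasted_fraction(std::size_t requested) {
  return 1.0 - static_cast<double>(requested) /
                   static_cast<double>(reserved_bytes(requested));
}

// A 256x256 tile needs 512KiB, so three quarters of the 2MiB are unused.
static_assert(tile_bytes(256) == 512 * 1024, "256x256 doubles are 512KiB");
// A 512x512 tile needs exactly 2MiB, so nothing is wasted under this model.
static_assert(tile_bytes(512) == min_alloc_bytes, "512x512 doubles are 2MiB");
```

Under these assumptions, `wasted_fraction(tile_bytes(256))` is 0.75 and `wasted_fraction(tile_bytes(64))` is about 0.98, while tiles of 512x512 and larger waste nothing. So the overhead should mostly matter for tile sizes below 512.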