Workaround for bad GPUDirect performance with unaligned GPU buffers

Opening this as a draft for reference. I think we should wait for responses from both the Umpire developers (https://github.com/LLNL/Umpire/issues/881) and HPE before deciding whether to apply a workaround, and if so which one.

The workaround in this PR typically, but not always, gives reasonable performance after only one warmup iteration, and the warmup iteration itself isn't ridiculously slow compared to the best case. The downside is that every allocation from Umpire is now at least 2MiB, which can end up wasting quite a lot of memory for small tiles.

As an example, the output of the gen_to_std miniapp can look like this on current master:

[0]
[0] 17.3253s 495.804GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[1]
[1] 11.5633s 742.859GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[2]
[2] 2.86979s 2993.22GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[3]
[3] 0.0939851s 91396.8GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[4]
[4] 2.95547s 2906.45GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[5]
[5] 0.0937317s 91643.8GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[6]
[6] 0.0919855s 93383.6GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[7]
[7] 0.0930948s 92270.8GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[8]
[8] 0.0933742s 91994.7GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[9]
[9] 0.0922234s 93142.6GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU

and most of the time looks like this on this PR:

[0]
[0] 0.318221s 26993.7GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[1]
[1] 0.0949778s 90441.5GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[2]
[2] 0.0906252s 94785.2GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[3]
[3] 0.0963228s 89178.6GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[4]
[4] 0.0931526s 92213.5GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[5]
[5] 0.0924757s 92888.6GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[6]
[6] 0.0923647s 93000.2GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[7]
[7] 0.09494s 90477.5GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[8]
[8] 0.091092s 94299.5GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[9]
[9] 0.091955s 93414.6GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU

The best case doesn't improve (about 0.09s per iteration in both runs), but the worst case (17.3s on master versus 0.32s on this PR) and the variance improve significantly.

eth-cscs/DLA-Future #1143