Realm: Slow data transposes on GPU using HIP

As mentioned in the Legion meeting on 11/06/2024, we observe very slow data copies on AMD GPUs when running HTR++ on Tioga. Two profile logs, produced with Legion commit 023d0c31006f721b46a0d56c8f3ca1cbfb72df76 on Lassen (prof_lassen.log) and on Tioga (prof_tioga.log), are attached. The profiled configuration uses a single GPU on a single node and makes several copies of the same data that change its layout. The logs show that the copies on Tioga are about 10x slower than those on Lassen. As discussed in the Legion meeting, this is most likely due to the absence of the new DMA.

For @seemamirch: the input file needed to reproduce these logs is base.json. The logs are produced on both systems by launching the code with `PROFILE=1 $HTR_DIR/prometeo.sh -i base.json -o .`

@elliottslaughter, can you please add this issue to #1032?
