PermutationForDeposition was initially developed for A100. A few tweaks can be made to improve performance on MI250X, which has a smaller cache but is much less sensitive to atomic add congestion.
Additional background
Test with MI250X
I also ran the same test on A100, forcing it to use the AMD tuning.
Checklist
The proposed changes:
[ ] fix a bug or incorrect behavior in AMReX
[ ] add new capabilities to AMReX
[ ] changes answers in the test suite to more than roundoff level
[ ] are likely to significantly affect the results of downstream AMReX users
[ ] include documentation in the code and/or rst files, if appropriate