Status: Open. zjin-lcf opened this issue 1 year ago.
Thanks for reporting it. Will look into it.
@jatinx Did you have a chance to look into this? Thanks!
Hi @zjin-lcf, looks like the file linked is no longer available.
Could you please provide another example so we can try to reproduce this issue internally?
Sorry, the link is updated.
@zjin-lcf that code has every single thread hammering on a small set of locations in very few cache lines. Code doing that should be expected to be "slow". However, we are introducing a compiler optimization to recognize uniform addresses and reduce memory traffic. That will help this code, but won't help the general case where the uniformity can't be deduced.
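For readers who want to see the access pattern in isolation, here is a minimal host-side sketch in plain C++. It is a CPU analogy only: it is not the linked HIP benchmark and not the compiler optimization mentioned above. The function names `sum_contended` and `sum_privatized` are hypothetical, made up for this illustration. The first has every thread issue all of its updates against one shared atomic; the second accumulates locally and touches the shared location once per thread, which is roughly the kind of traffic reduction one would also attempt by hand in a kernel.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Contended pattern (hypothetical helper): every thread performs every
// update directly on one shared atomic, so all threads compete for the
// same cache line on each iteration.
std::uint64_t sum_contended(unsigned num_threads, std::uint64_t iters_per_thread) {
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&total, iters_per_thread] {
            for (std::uint64_t i = 0; i < iters_per_thread; ++i) {
                total.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return total.load();
}

// Privatized pattern (hypothetical helper): each thread accumulates into a
// thread-local variable and updates the shared atomic exactly once.
std::uint64_t sum_privatized(unsigned num_threads, std::uint64_t iters_per_thread) {
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&total, iters_per_thread] {
            std::uint64_t local = 0;
            for (std::uint64_t i = 0; i < iters_per_thread; ++i) {
                local += 1;
            }
            total.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return total.load();
}
```

Both functions return the same value, `num_threads * iters_per_thread`; only the number of operations hitting the shared location differs. How much that matters on a given GPU or CPU would need to be measured.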
Running a HIP program that calls hipMallocManaged() is very slow on an MI-series GPU. If the HIP program is not written properly, please let me know. Thanks.
https://github.com/zjin-lcf/HeCBench/blob/master/src/atomicIntrinsics-hip/main-um.cu