very slow atomic operations over unified memory

ROCm / HIP

HIP: C++ Heterogeneous-Compute Interface for Portability

https://rocmdocs.amd.com/projects/HIP/

MIT License

very slow atomic operations over unified memory #3177

Open zjin-lcf opened 1 year ago

zjin-lcf commented 1 year ago

Running a HIP program that calls hipMallocManaged() is very slow on an MI-series GPU. If the HIP program is not written properly, please let me know. Thanks.

make source=main-um.cu
hipcc  -std=c++14 -Wall -I../atomicIntrinsics-cuda -O3 -c main-um.cu -o main-um.o
hipcc  -std=c++14 -Wall -I../atomicIntrinsics-cuda -O3 main-um.o -o main
./main 1
PASS
Average kernel execution time: 130903496.000000 (us)

https://github.com/zjin-lcf/HeCBench/blob/master/src/atomicIntrinsics-hip/main-um.cu

jatinx commented 1 year ago

Thanks for reporting it. Will look into it.

ppanchad-amd commented 3 months ago

@jatinx Did you have a chance to look into this? Thanks!

harkgill-amd commented 1 day ago

Hi @zjin-lcf, looks like the file linked is no longer available.

Could you please provide another example so we can try to reproduce this issue internally?

zjin-lcf commented 11 hours ago

Sorry, the link is updated.

b-sumner commented 10 hours ago

@zjin-lcf that code has every single thread hammering on a small set of locations in very few cache lines. Code doing that should be expected to be "slow". However, we are introducing a compiler optimization to recognize uniform addresses and reduce memory traffic. That will help this code, but won't help the general case where the uniformity can't be deduced.
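
To illustrate the contention pattern described above, here is a minimal host-side sketch in plain C++. It is an analogy only: it is not the code from the linked benchmark, it does not use HIP or unified memory, and it is not the compiler optimization mentioned in the comment. The function names `sum_contended` and `sum_aggregated` are hypothetical. The first version has every thread issue one atomic add per element to a single shared location; the second keeps a private partial sum and issues one atomic add per thread.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Contended pattern: every thread performs one atomic add per element,
// and all of them target the same address.
std::uint64_t sum_contended(const std::vector<std::uint32_t>& data,
                            unsigned num_threads) {
    assert(num_threads > 0);
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&data, &total, t, num_threads] {
            // Strided loop: thread t handles elements t, t + num_threads, ...
            for (std::size_t i = t; i < data.size(); i += num_threads) {
                total.fetch_add(data[i], std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}

// Aggregated pattern: each thread accumulates into a private variable and
// touches the shared location exactly once.
std::uint64_t sum_aggregated(const std::vector<std::uint32_t>& data,
                             unsigned num_threads) {
    assert(num_threads > 0);
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&data, &total, t, num_threads] {
            std::uint64_t local = 0;  // private partial sum, no atomics
            for (std::size_t i = t; i < data.size(); i += num_threads) {
                local += data[i];
            }
            total.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}
```

Both functions return the same sum; the difference is the number of atomic operations on the shared location (one per element versus one per thread). On a GPU the corresponding idea would be to combine values within a wavefront or thread block before issuing a single atomic, though whether that is applicable depends on what the kernel is meant to measure.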