ROCm / HIP

HIP: C++ Heterogeneous-Compute Interface for Portability
https://rocmdocs.amd.com/projects/HIP/
MIT License

stream create, copy and destroy example #3470

Open jinz2014 opened 5 months ago

jinz2014 commented 5 months ago

Running the stream create, copy, and destroy example on an AMD GPU takes about 2X-3X longer than on an Nvidia GPU for the following cases. Thanks for your comments and suggestions.

Link: https://github.com/zjin-lcf/HeCBench/tree/master/src/streamCreateCopyDestroy-hip/

MI210

Create+Copy+Synchronize+Destroy time for 1 streams and 5000 buffers and 16 iterations 49.6401 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 5000 buffers and 8 iterations 50.2982 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 5000 buffers and 4 iterations 57.4719 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 5000 buffers and 2 iterations 54.3432 (ms)

https://github.com/zjin-lcf/HeCBench/tree/master/src/streamCreateCopyDestroy-cuda

A100

Create+Copy+Synchronize+Destroy time for 1 streams and 5000 buffers and 16 iterations 23.3694 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 5000 buffers and 8 iterations 23.2853 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 5000 buffers and 4 iterations 23.38 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 5000 buffers and 2 iterations 23.2302 (ms)
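As a quick check of the "2X-3X" figure, the ratios can be computed directly from the MI210 and A100 numbers above. This is only a small arithmetic sketch: the array names and the `slowdown` helper are made up for illustration and are not part of the benchmark.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Times in ms copied from the MI210 and A100 results above
// (5000 buffers; 1, 2, 4, and 8 streams, in that order).
constexpr std::array<double, 4> kMi210Ms = {49.6401, 50.2982, 57.4719, 54.3432};
constexpr std::array<double, 4> kA100Ms  = {23.3694, 23.2853, 23.38, 23.2302};

// Ratio of MI210 time to A100 time for configuration i.
// Hypothetical helper, introduced only for this illustration.
double slowdown(std::size_t i) {
    assert(i < kMi210Ms.size());
    return kMi210Ms[i] / kA100Ms[i];
}
```

Calling `slowdown(0)` through `slowdown(3)` gives ratios between roughly 2.1 and 2.5 for these four configurations.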

shadidashmiz commented 4 months ago

It looks like your test time increases linearly with the number of buffers you allocate, so this does not look like a hipStream issue.

jinz2014 commented 4 months ago

In the test, the two datacenter GPUs are not installed on the same host, so I am not sure whether the different hosts affect the execution time. I therefore ran the CUDA and HIP programs on a desktop computer with both Nvidia and AMD GPUs.

RTX3090

Create+Copy+Synchronize+Destroy time for 1 streams and 1 buffers and 128 iterations 0.0128745 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 1 buffers and 64 iterations 0.00757924 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 1 buffers and 32 iterations 0.0062003 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 1 buffers and 16 iterations 0.0063163 (ms)
Create+Copy+Synchronize+Destroy time for 1 streams and 100 buffers and 64 iterations 0.270007 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 100 buffers and 32 iterations 0.255549 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 100 buffers and 16 iterations 0.27291 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 100 buffers and 8 iterations 0.267216 (ms)
Create+Copy+Synchronize+Destroy time for 1 streams and 1000 buffers and 32 iterations 2.53417 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 1000 buffers and 16 iterations 2.52819 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 1000 buffers and 8 iterations 2.52339 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 1000 buffers and 4 iterations 2.52614 (ms)
Create+Copy+Synchronize+Destroy time for 1 streams and 5000 buffers and 16 iterations 12.7661 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 5000 buffers and 8 iterations 12.7234 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 5000 buffers and 4 iterations 12.7502 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 5000 buffers and 2 iterations 12.7159 (ms)

gfx1030

Create+Copy+Synchronize+Destroy time for 1 streams and 1 buffers and 128 iterations 1.99878 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 1 buffers and 64 iterations 0.574584 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 1 buffers and 32 iterations 0.610492 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 1 buffers and 16 iterations 0.587304 (ms)
Create+Copy+Synchronize+Destroy time for 1 streams and 100 buffers and 64 iterations 1.39792 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 100 buffers and 32 iterations 1.39171 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 100 buffers and 16 iterations 1.41488 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 100 buffers and 8 iterations 1.43967 (ms)
Create+Copy+Synchronize+Destroy time for 1 streams and 1000 buffers and 32 iterations 9.0404 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 1000 buffers and 16 iterations 9.03053 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 1000 buffers and 8 iterations 9.05028 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 1000 buffers and 4 iterations 9.15136 (ms)
Create+Copy+Synchronize+Destroy time for 1 streams and 5000 buffers and 16 iterations 43.0856 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 5000 buffers and 8 iterations 43.0919 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 5000 buffers and 4 iterations 43.1138 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 5000 buffers and 2 iterations 43.166 (ms)
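To see how these desktop results relate to the earlier remark that time grows linearly with the number of buffers, one can compute the marginal cost per extra buffer from the 1-stream rows. The sketch below does only that arithmetic. `Sample`, the two arrays, and `marginal_ms_per_buffer` are illustrative names, not part of the benchmark.

```cpp
#include <cassert>

// One timing sample: buffer count and reported time in ms.
struct Sample {
    int buffers;
    double ms;
};

// 1-stream results copied from the RTX3090 and gfx1030 listings above.
constexpr Sample kRtx3090[] = {{100, 0.270007}, {1000, 2.53417}, {5000, 12.7661}};
constexpr Sample kGfx1030[] = {{100, 1.39792}, {1000, 9.0404}, {5000, 43.0856}};

// Marginal cost in ms per additional buffer between two samples.
// Hypothetical helper, introduced only for this illustration.
double marginal_ms_per_buffer(const Sample& lo, const Sample& hi) {
    assert(hi.buffers > lo.buffers);
    return (hi.ms - lo.ms) / (hi.buffers - lo.buffers);
}
```

For example, `marginal_ms_per_buffer(kGfx1030[1], kGfx1030[2])` comes out near 0.0085 ms per buffer and the same call on `kRtx3090` near 0.0026 ms, a ratio of about 3.3. On each GPU the slope between 100 and 1000 buffers is close to the slope between 1000 and 5000, which is consistent with a cost that scales with the buffer count rather than with the number of streams.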

bdenhollander commented 4 months ago

I profiled your code on Windows on gfx1032. The majority of the time was spent in memcpy rather than in creating and destroying streams, so this code may be more of a host-to-device copy benchmark.

[image: profiler screenshot]

jinz2014 commented 4 months ago

Yes, most of the time is spent on data copies. I updated the summary of the issue.

jinz2014 commented 4 months ago

@bdenhollander Which profiler did you use?

bdenhollander commented 4 months ago

The screenshot is from Visual Studio 2019's built-in profiler.