zjin-lcf closed this issue 1 month ago.
@zjin-lcf Apologies for the lack of response. Can you please test with the latest ROCm 6.0.2 (HIP 6.0.32831)? If the issue is resolved, please close the ticket. Thanks!
To reproduce:
1. Go to https://github.com/zjin-lcf/HeCBench/tree/master/ccs-cuda
2. Run tar -zxf data.tar.gz
3. Go to https://github.com/zjin-lcf/HeCBench/tree/master/ccs-hip
4. Run make run
@zjin-lcf can you compile with -g, run the program under rocgdb, and tell us where it hangs? There are many hardware differences between AMD and NVIDIA GPUs. Also, those links are not working for me.
Sorry, the link is https://github.com/zjin-lcf/HeCBench/tree/master/src/ccs-hip
I compiled the program with -g, and then ran rocgdb:
(gdb) r
Starting program: main -t 0.9 -i ../ccs-cuda/Data_Constant_100_1_bicluster.txt -o ./Output.txt -m 50 -p 1 -g 100.0 -r 100
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Number of rows=100
Number of columns=100
warning: At least one agent is busy (debugging may be enabled by another process)
warning: amd-dbgapi: unable to enable GPU debugging due to a restriction error
Hi @zjin-lcf, this doesn't hang for me on MI100 with ROCm 6.2.1. Which MI card are you seeing issues on? One thing that can cause hangs and other issues on MI100 and MI200 systems is the IOMMU configuration: if IOMMU is enabled in the BIOS, you will need to enable IOMMU passthrough mode with the iommu=pt kernel boot option.
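A quick way to see whether the running kernel was booted with that option is to look at its command line. The snippet below is a minimal sketch, assuming a Linux host where /proc/cmdline is readable; the helper name has_iommu_pt is hypothetical and not part of ROCm or any tool mentioned in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): succeeds if the given kernel
# command line contains iommu=pt as a separate, whole option.
has_iommu_pt() {
    case " $1 " in
        *" iommu=pt "*) return 0 ;;
        *) return 1 ;;
    esac
}

# /proc/cmdline holds the boot command line of the running Linux kernel.
# If it cannot be read, the check simply reports that the option is not set.
if has_iommu_pt "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "iommu=pt is set"
else
    echo "iommu=pt is not set"
fi
```

If the option is missing, how to add it depends on the bootloader. On GRUB-based distributions it is usually appended to the kernel command line in /etc/default/grub, followed by regenerating the GRUB configuration and rebooting.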
Closing this for now as I can't reproduce it. If you're still experiencing this issue and have checked your IOMMU settings, feel free to comment and we can reopen this.
Running the program https://github.com/zjin-lcf/HeCBench/tree/master/ccs-hip hangs.
The environment is ROCm 4.3 on an MI-series GPU.
For reference, the CUDA version is https://github.com/zjin-lcf/HeCBench/tree/master/ccs-cuda. Thanks.