ROCm / HIP

HIP: C++ Heterogeneous-Compute Interface for Portability
https://rocmdocs.amd.com/projects/HIP/
MIT License

program execution hangs #2429

Closed zjin-lcf closed 1 month ago

zjin-lcf commented 2 years ago

Running the program https://github.com/zjin-lcf/HeCBench/tree/master/ccs-hip hangs.

The environment is ROCm 4.3 on an MI-series GPU.

For reference, the CUDA version is https://github.com/zjin-lcf/HeCBench/tree/master/ccs-cuda. Thanks.

ppanchad-amd commented 7 months ago

@zjin-lcf Apologies for the lack of response. Can you please test with latest ROCm 6.0.2 (HIP 6.0.32831)? If resolved, please close ticket. Thanks!

zjin-lcf commented 7 months ago

To reproduce:

1. Go to https://github.com/zjin-lcf/HeCBench/tree/master/ccs-cuda
2. tar -zxf data.tar.gz
3. Go to https://github.com/zjin-lcf/HeCBench/tree/master/ccs-hip
4. make run
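To tell a genuine hang from a slow run, the final `make run` step can be wrapped in a timeout. This is only a minimal sketch, assuming Python 3 is available on the test machine; the helper name `run_with_timeout` and the 300-second limit mentioned below are illustrative choices, not part of HeCBench or ROCm.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it ran longer than timeout_s seconds."""
    try:
        # subprocess.run kills the direct child process and raises
        # TimeoutExpired when the time limit is exceeded.
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None
```

For example, calling `run_with_timeout(["make", "run"], 300)` from the ccs-hip directory returns `None` if the benchmark has not finished after five minutes. Only the direct child (`make`) is killed on timeout, so the benchmark binary it launched may need to be stopped separately.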

b-sumner commented 7 months ago

@zjin-lcf can you compile with -g, crank up rocgdb, and tell us where it hangs? There are many hardware differences between AMD and NVIDIA GPUs. Also, those links are not working for me.

zjin-lcf commented 7 months ago

Sorry, the link is https://github.com/zjin-lcf/HeCBench/tree/master/src/ccs-hip

I compiled the program with -g, and then ran rocgdb:

(gdb) r
Starting program: main -t 0.9 -i ../ccs-cuda/Data_Constant_100_1_bicluster.txt -o ./Output.txt -m 50 -p 1 -g 100.0 -r 100
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Number of rows=100
Number of columns=100
warning: At least one agent is busy (debugging may be enabled by another process)
warning: amd-dbgapi: unable to enable GPU debugging due to a restriction error

schung-amd commented 1 month ago

Hi @zjin-lcf, this doesn't hang for me on MI100 with ROCm 6.2.1. Which MI card are you seeing issues on? One thing that can cause hangs and other issues in MI100 and MI200 systems is IOMMU settings; if you have IOMMU enabled in BIOS, you will need to enable IOMMU passthrough mode with the iommu=pt kernel boot option.
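One way to check whether the `iommu=pt` option mentioned above is active is to look at the running kernel's command line. The following is a sketch under stated assumptions: a Linux host where `/proc/cmdline` is readable. The helper names `has_iommu_passthrough` and `read_kernel_cmdline` are hypothetical and not part of any ROCm tool.

```python
from pathlib import Path


def has_iommu_passthrough(cmdline: str) -> bool:
    """Return True if the kernel command line contains the exact token iommu=pt."""
    # Kernel boot parameters are whitespace-separated tokens, so compare
    # whole tokens rather than substrings (amd_iommu=on must not match).
    return "iommu=pt" in cmdline.split()


def read_kernel_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the boot parameters of the running kernel (Linux only)."""
    return Path(path).read_text().strip()
```

Calling `has_iommu_passthrough(read_kernel_cmdline())` on the affected machine shows whether the option took effect after a reboot. It does not say whether IOMMU is enabled in the BIOS, which still has to be checked separately.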

schung-amd commented 1 month ago

Closing this for now as I can't reproduce it. If you're still experiencing this issue and have checked your IOMMU settings, feel free to comment and we can reopen this.