ROCm / clr

[Issue]: ROCM5.7.3, RCCL2.19.4 GPU kernel can't printf. Hash value collision detected #73

Closed yangyangv8 closed 1 month ago

yangyangv8 commented 7 months ago

Problem Description

In the rccl file prims_simple.h, I added a printf to this kernel function:

    __device__ __forceinline__ void genericOp(
        intptr_t srcIx, intptr_t dstIx, int nelem, bool postOp
      ) {
      constexpr int DirectRecv = /*1 &&*/ Direct && DirectRecv1;
      constexpr int DirectSend = /*1 &&*/ Direct && DirectSend1;
      constexpr int Src = SrcBuf != -1;
      constexpr int Dst = DstBuf != -1;
      nelem = nelem < 0 ? 0 : nelem;
      int sliceSize = stepSize*StepPerSlice;
      sliceSize = max(divUp(nelem, 16*SlicePerChunk)*16, sliceSize/32);
      int slice = 0;
      int offset = 0;
      if(tid == 0) {
        printf("in genericOp \n");
      }

When I run the rccl test with the command ./build/sendrecv_perf -b 8 -e 128M -f 2 -t 1 -g 2, it reports this error:

enquence.cc Current function: ncclLaunchKernel line 1090
:1:rocvirtual.cpp :2945: 74877529363 us: [pid:44406 tid:0x7f26f4922c00] Pcie atomics not enabled, hostcall not supported
:1:rocvirtual.cpp :3280: 74877529375 us: [pid:44406 tid:0x7f26f4922c00] AQL dispatch failed!
yz-adm3: Test NCCL failure /home/yang.yang/yy/work/test-rccl/build/src/hipify/common.cu.cpp:451 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '

After reading the explanation here https://rocm.docs.amd.com/en/latest/about/CHANGELOG.html#non-hostcall-hip-printf, I added the following settings to the RCCL CMakeLists.txt file and to the makefile:

target_compile_options(rccl PRIVATE -mprintf-kind=buffered)

makefiles/common.mk:

    CXXFLAGS := -DCUDA_MAJOR=$(CUDA_MAJOR) -DCUDA_MINOR=$(CUDA_MINOR) -fPIC -fvisibility=hidden \
                -Wall -mprintf-kind=buffered -g -Wno-unused-function -Wno-sign-compare -std=c++11 -Wvla \
                -I $(CUDA_INC) \
                $(CXXFLAGS)

After recompiling RCCL, the test reported this error:

enquence.cc Current function: ncclLaunchKernel line 1090
:1:devhcprintf.cpp :265 : 81559524344 us: [pid:65800 tid:0x7f0d2c53d440] Hash value collision detected, printf buffer ill formed
:1:rocvirtual.cpp :3188: 81559524353 us: [pid:65800 tid:0x7f0d2c53d440] Could not print data from the printf buffer!
:1:rocvirtual.cpp :3280: 81559524355 us: [pid:65800 tid:0x7f0d2c53d440] AQL dispatch failed!
:1:devhcprintf.cpp :265 : 81559524402 us: [pid:65799 tid:0x7ff8fd860440] Hash value collision detected, printf buffer ill formed
:1:rocvirtual.cpp :3188: 81559524410 us: [pid:65799 tid:0x7ff8fd860440] Could not print data from the printf buffer!
:1:rocvirtual.cpp :3280: 81559524416 us: [pid:65799 tid:0x7ff8fd860440] AQL dispatch failed!
[rank0]: RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
[rank1]: RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)

I have set these environment variables:

    export HIP_KERNEL_PRINTF=1
    export HIP_ENABLE_PRINTF=1
    export HCC_ENABLE_PRINTF=1
    export AMD_LOG_LEVEL=1

I am using a Linux server with two GPU cards. Without the printf, the program executes normally. How should I solve this problem?
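The two logs in this report fail at different points: the first is rejected because hostcall is not supported without PCIe atomics, and the second, built with -mprintf-kind=buffered, fails while the runtime decodes the printf buffer. When triaging AMD_LOG_LEVEL=1 output, it may help to tell these apart mechanically. The following is a minimal host-side sketch, not part of ROCm or RCCL: the names PrintfFailure, classifyLogLine and describe are hypothetical, and only the matched substrings are taken from the logs quoted here.

```cpp
#include <string>

// Which device-printf failure a runtime log line points at.
// All names here are hypothetical; they are not ROCm or RCCL APIs.
enum class PrintfFailure { None, HostcallUnavailable, BufferedDecodeError };

// Classify one log line using only the message text quoted in this report.
PrintfFailure classifyLogLine(const std::string& line) {
    // First log: hostcall-based printf was refused (PCIe atomics not enabled).
    if (line.find("hostcall not supported") != std::string::npos)
        return PrintfFailure::HostcallUnavailable;
    // Second log (built with -mprintf-kind=buffered): the printf buffer
    // could not be decoded by the runtime.
    if (line.find("Hash value collision detected") != std::string::npos ||
        line.find("Could not print data from the printf buffer") != std::string::npos)
        return PrintfFailure::BufferedDecodeError;
    return PrintfFailure::None;
}

// Short human-readable label for a classification result.
const char* describe(PrintfFailure f) {
    switch (f) {
        case PrintfFailure::HostcallUnavailable: return "hostcall printf unavailable";
        case PrintfFailure::BufferedDecodeError: return "buffered printf decode error";
        default: return "no printf failure";
    }
}
```

The line "AQL dispatch failed!" appears in both logs, so the sketch deliberately does not use it to distinguish the two cases.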

Operating System

22.04.1 LTS (Jammy Jellyfish)

CPU

12th Gen Intel(R) Core(TM) i7-12700

GPU

AMD Radeon RX 7900 XTX

ROCm Version

ROCm 5.7.0

ROCm Component

HIP, HIPCC, rccl

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

mangupta commented 7 months ago

@yangyangv8: Can you confirm that the test you are running, i.e. "./build/sendrecv_perf -b 8 -e 128M -f 2 -t 1 -g 2", runs fine when you rebuild rccl from source without adding the printf to the kernel?

yangyangv8 commented 7 months ago

@mangupta I have confirmed that the program runs normally without adding printf in the kernel.

yangyangv8 commented 7 months ago

@mangupta Hello, is there any update on this issue?

ppanchad-amd commented 2 months ago

Hi @yangyangv8, created an internal ticket to investigate your issue. Thanks!

sohaibnd commented 1 month ago

Hi @yangyangv8, sorry for the delayed response.

I am closing this issue since it is a duplicate of github.com/ROCm/ROCm/issues/3001 and is being addressed there. Also, note that this issue concerns rccl, so it should ideally have been filed in the rccl repo.