NVIDIA / cuda-samples


nvcc fatal: Option '--ptx (-ptx)' is not allowed when compiling for multiple GPU architectures #289

Open casparvl opened 2 months ago

casparvl commented 2 months ago

I'm building the CUDA samples for multiple architectures, which the documentation says is possible via the SMS option. My build command is:

make -j 72 HOST_COMPILER=g++ SMS='80 86'

I've encountered the issue with both CUDA Samples 11.3 and 12.2. It is present in at least two samples: memMapIPCDrv and ptxjit. The failing line is in the PTX-generation rule of each sample's Makefile. The memMapIPCDrv version reads as follows (with some context); the ptxjit rule is the same apart from the source file name, ptxjit_kernel.cu:

$(PTX_FILE): memMapIpc_kernel.cu
    $(EXEC) $(NVCC) $(INCLUDES) $(ALL_CCFLAGS) $(GENCODE_FLAGS) -o $@ -ptx $<
    $(EXEC) mkdir -p data
    $(EXEC) cp -f $@ ./data
    $(EXEC) mkdir -p ../../../bin/$(TARGET_ARCH)/$(TARGET_OS)/$(BUILD_TYPE)
    $(EXEC) cp -f $@ ../../../bin/$(TARGET_ARCH)/$(TARGET_OS)/$(BUILD_TYPE)
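
To make the failure easier to see, here is a rough shell sketch of what GENCODE_FLAGS presumably expands to for a given SMS value. The helper name `gencode_flags` is hypothetical, and the per-SM `code=sm_XX` entries are an assumption about how the sample Makefiles build the variable; only the final `code=compute_XX` entry for the highest SM is taken from the Makefile excerpt quoted in this issue.

```shell
#!/bin/sh
# Hypothetical helper: approximate the GENCODE_FLAGS value for an SMS list.
# Assumption: one "code=sm_XX" entry per SM, plus a PTX entry for the
# highest SM.
gencode_flags() {
    flags=""
    for sm in $1; do
        flags="$flags -gencode arch=compute_${sm},code=sm_${sm}"
    done
    # Highest SM, chosen numerically.
    highest=$(printf '%s\n' $1 | sort -n | tail -n 1)
    if [ -n "$highest" ]; then
        flags="$flags -gencode arch=compute_${highest},code=compute_${highest}"
    fi
    # Print the result without the leading space.
    printf '%s\n' "${flags# }"
}

gencode_flags "80 86"
```

With SMS='80 86' this yields three `-gencode` options. Since `-ptx` writes a single PTX file, nvcc rejects the combination, which matches the error in the title.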

I believe the fix is to store the gencode flags used for PTX file generation in a separate variable. That is, the block that sets GENCODE_FLAGS should probably read:

# Generate PTX code from the highest SM architecture in $(SMS) to guarantee forward-compatibility
HIGHEST_SM := $(lastword $(sort $(SMS)))
ifneq ($(HIGHEST_SM),)
GENCODE_FLAGS += -gencode arch=compute_$(HIGHEST_SM),code=compute_$(HIGHEST_SM)
GENCODE_FLAGS_HIGHEST_SM = -gencode arch=compute_$(HIGHEST_SM),code=compute_$(HIGHEST_SM)
endif
endif

And then the offending section modified to:

$(PTX_FILE): memMapIpc_kernel.cu
    $(EXEC) $(NVCC) $(INCLUDES) $(ALL_CCFLAGS) $(GENCODE_FLAGS_HIGHEST_SM) -o $@ -ptx $<
    $(EXEC) mkdir -p data
    $(EXEC) cp -f $@ ./data
    $(EXEC) mkdir -p ../../../bin/$(TARGET_ARCH)/$(TARGET_OS)/$(BUILD_TYPE)
    $(EXEC) cp -f $@ ../../../bin/$(TARGET_ARCH)/$(TARGET_OS)/$(BUILD_TYPE)

I can at least confirm that with this diff:

$ cat *.patch
diff -Nru cuda-samples-12.2.orig/Samples/3_CUDA_Features/memMapIPCDrv/Makefile cuda-samples-12.2/Samples/3_CUDA_Features/memMapIPCDrv/Makefile
--- cuda-samples-12.2.orig/Samples/3_CUDA_Features/memMapIPCDrv/Makefile        2024-07-29 12:14:28.538848000 +0200
+++ cuda-samples-12.2/Samples/3_CUDA_Features/memMapIPCDrv/Makefile     2024-07-29 12:17:02.812364739 +0200
@@ -312,6 +312,7 @@
 HIGHEST_SM := $(lastword $(sort $(SMS)))
 ifneq ($(HIGHEST_SM),)
 GENCODE_FLAGS += -gencode arch=compute_$(HIGHEST_SM),code=compute_$(HIGHEST_SM)
+GENCODE_FLAGS_HIGHEST_SM = -gencode arch=compute_$(HIGHEST_SM),code=compute_$(HIGHEST_SM)
 endif
 endif

@@ -394,7 +395,7 @@
 endif

 $(PTX_FILE): memMapIpc_kernel.cu
-       $(EXEC) $(NVCC) $(INCLUDES) $(ALL_CCFLAGS) $(GENCODE_FLAGS) -o $@ -ptx $<
+       $(EXEC) $(NVCC) $(INCLUDES) $(ALL_CCFLAGS) $(GENCODE_FLAGS_HIGHEST_SM) -o $@ -ptx $<
        $(EXEC) mkdir -p data
        $(EXEC) cp -f $@ ./data
        $(EXEC) mkdir -p ../../../bin/$(TARGET_ARCH)/$(TARGET_OS)/$(BUILD_TYPE)
diff -Nru cuda-samples-12.2.orig/Samples/3_CUDA_Features/ptxjit/Makefile cuda-samples-12.2/Samples/3_CUDA_Features/ptxjit/Makefile
--- cuda-samples-12.2.orig/Samples/3_CUDA_Features/ptxjit/Makefile      2024-07-29 12:14:28.546771000 +0200
+++ cuda-samples-12.2/Samples/3_CUDA_Features/ptxjit/Makefile   2024-07-29 12:15:47.089354181 +0200
@@ -306,6 +306,7 @@
 HIGHEST_SM := $(lastword $(sort $(SMS)))
 ifneq ($(HIGHEST_SM),)
 GENCODE_FLAGS += -gencode arch=compute_$(HIGHEST_SM),code=compute_$(HIGHEST_SM)
+GENCODE_FLAGS_HIGHEST_SM = -gencode arch=compute_$(HIGHEST_SM),code=compute_$(HIGHEST_SM)
 endif
 endif

@@ -390,7 +391,7 @@
 endif

 $(PTX_FILE): ptxjit_kernel.cu
-       $(EXEC) $(NVCC) $(INCLUDES) $(ALL_CCFLAGS) $(GENCODE_FLAGS) -o $@ -ptx $<
+       $(EXEC) $(NVCC) $(INCLUDES) $(ALL_CCFLAGS) $(GENCODE_FLAGS_HIGHEST_SM) -o $@ -ptx $<
        $(EXEC) mkdir -p data
        $(EXEC) cp -f $@ ./data
        $(EXEC) mkdir -p ../../../bin/$(TARGET_ARCH)/$(TARGET_OS)/$(BUILD_TYPE)

Applied on top of the CUDA Samples 12.2 sources, this diff builds correctly for multiple architectures. What I'm not 100% sure of is whether the approach makes sense, so I'm hoping someone else can confirm that :)

casparvl commented 2 months ago

To answer my own question: this is not really the solution. I can run it on an H100 (CC 90), which means the forward compatibility is working. But I can't run it on an A100 (CC 80), which was one of my actual targets in SMS:

$ memMapIPCDrv
> findModulePath found file at <./memMapIpc_kernel64.ptx>
> initCUDA loading module: <./memMapIpc_kernel64.ptx>
> findModulePath found file at <./memMapIpc_kernel64.ptx>
> findModulePath found file at <./memMapIpc_kernel64.ptx>
> initCUDA loading module: <./memMapIpc_kernel64.ptx>
> initCUDA loading module: <./memMapIpc_kernel64.ptx>
> findModulePath found file at <./memMapIpc_kernel64.ptx>
> initCUDA loading module: <./memMapIpc_kernel64.ptx>
checkCudaErrors() Driver API error = 0218 "a PTX JIT compilation failed" from file <memMapIpc.cpp>, line 292.
checkCudaErrors() Driver API error = 0218 "a PTX JIT compilation failed" from file <memMapIpc.cpp>, line 292.
checkCudaErrors() Driver API error = 0218 "a PTX JIT compilation failed" from file <memMapIpc.cpp>, line 292.
checkCudaErrors() Driver API error = 0218 "a PTX JIT compilation failed" from file <memMapIpc.cpp>, line 292.
Process 0 failed!

The ptxjit sample fails in the same way. It seems these samples always try to JIT-compile the PTX file.

I'm really not sure what these samples are supposed to do when the CUDA Samples are built for multiple architectures. They always seem to invoke the JIT compiler on the PTX code. The only way I see to get a working example is to generate the PTX for the lowest SM in SMS instead of the highest. That way, the PTX can be JIT-compiled on every architecture passed to SMS, including the lowest one.
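
The reasoning here rests on the PTX compatibility rule: PTX generated for compute_XY can be JIT-compiled on devices with compute capability XY or higher, but not lower. A minimal shell sketch of that rule follows. The helper name `ptx_can_jit` is hypothetical, it assumes the build used SMS='80 86' as in the command at the top, and it ignores architecture-specific targets such as compute_90a.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if PTX built for compute_<ptx_arch> can be
# JIT-compiled on a device with compute capability <device_cc>.
# Both arguments are plain integers such as 80, 86, 90.
ptx_can_jit() {
    ptx_arch=$1
    device_cc=$2
    [ "$device_cc" -ge "$ptx_arch" ]
}

# PTX for compute_86 (the highest entry of SMS='80 86'):
ptx_can_jit 86 90 && echo "H100: JIT ok"
ptx_can_jit 86 80 || echo "A100: JIT fails"
# PTX for compute_80 (the lowest entry) loads on both:
ptx_can_jit 80 80 && echo "A100: JIT ok"
```

This is consistent with the log above: PTX built for the highest SM loads on the H100 but fails on the A100.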

The patch would then be:

diff -Nru cuda-samples-12.2.orig/Samples/3_CUDA_Features/memMapIPCDrv/Makefile cuda-samples-12.2/Samples/3_CUDA_Features/memMapIPCDrv/Makefile
--- cuda-samples-12.2.orig/Samples/3_CUDA_Features/memMapIPCDrv/Makefile        2024-07-29 12:14:28.538848000 +0200
+++ cuda-samples-12.2/Samples/3_CUDA_Features/memMapIPCDrv/Makefile     2024-07-29 13:02:45.134261829 +0200
@@ -313,6 +313,12 @@
 ifneq ($(HIGHEST_SM),)
 GENCODE_FLAGS += -gencode arch=compute_$(HIGHEST_SM),code=compute_$(HIGHEST_SM)
 endif
+
+# Generate the explicit PTX file for the lowest SM architecture in $(SMS), so it works on all SMS listed there
+LOWEST_SM := $(firstword $(sort $(SMS)))
+ifneq ($(LOWEST_SM),)
+GENCODE_FLAGS_LOWEST_SM += -gencode arch=compute_$(LOWEST_SM),code=compute_$(LOWEST_SM)
+endif
 endif

 ifeq ($(TARGET_OS),darwin)
@@ -394,7 +400,7 @@
 endif

 $(PTX_FILE): memMapIpc_kernel.cu
-       $(EXEC) $(NVCC) $(INCLUDES) $(ALL_CCFLAGS) $(GENCODE_FLAGS) -o $@ -ptx $<
+       $(EXEC) $(NVCC) $(INCLUDES) $(ALL_CCFLAGS) $(GENCODE_FLAGS_LOWEST_SM) -o $@ -ptx $<
        $(EXEC) mkdir -p data
        $(EXEC) cp -f $@ ./data
        $(EXEC) mkdir -p ../../../bin/$(TARGET_ARCH)/$(TARGET_OS)/$(BUILD_TYPE)
diff -Nru cuda-samples-12.2.orig/Samples/3_CUDA_Features/ptxjit/Makefile cuda-samples-12.2/Samples/3_CUDA_Features/ptxjit/Makefile
--- cuda-samples-12.2.orig/Samples/3_CUDA_Features/ptxjit/Makefile      2024-07-29 12:14:28.546771000 +0200
+++ cuda-samples-12.2/Samples/3_CUDA_Features/ptxjit/Makefile   2024-07-29 13:02:38.741961008 +0200
@@ -307,6 +307,12 @@
 ifneq ($(HIGHEST_SM),)
 GENCODE_FLAGS += -gencode arch=compute_$(HIGHEST_SM),code=compute_$(HIGHEST_SM)
 endif
+
+# Generate the explicit PTX file for the lowest SM architecture in $(SMS), so it works on all SMS listed there
+LOWEST_SM := $(firstword $(sort $(SMS)))
+ifneq ($(LOWEST_SM),)
+GENCODE_FLAGS_LOWEST_SM += -gencode arch=compute_$(LOWEST_SM),code=compute_$(LOWEST_SM)
+endif
 endif

 ifeq ($(TARGET_OS),darwin)
@@ -390,7 +396,7 @@
 endif

 $(PTX_FILE): ptxjit_kernel.cu
-       $(EXEC) $(NVCC) $(INCLUDES) $(ALL_CCFLAGS) $(GENCODE_FLAGS) -o $@ -ptx $<
+       $(EXEC) $(NVCC) $(INCLUDES) $(ALL_CCFLAGS) $(GENCODE_FLAGS_LOWEST_SM) -o $@ -ptx $<
        $(EXEC) mkdir -p data
        $(EXEC) cp -f $@ ./data
        $(EXEC) mkdir -p ../../../bin/$(TARGET_ARCH)/$(TARGET_OS)/$(BUILD_TYPE)

If I build this for SMS='80 90', it works on both A100 and H100.
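
One caveat about the patch: GNU make's `$(sort ...)` sorts lexically, so for a list that mixes two-digit and three-digit SM values (for example `80 90 100`) `$(firstword $(sort $(SMS)))` would pick `100` rather than `80`. A hedged sketch of a numeric selection in shell, with the hypothetical helper names `lowest_sm` and `highest_sm`:

```shell
#!/bin/sh
# Hypothetical helpers: pick the numerically lowest / highest entry from a
# space-separated SMS list such as "80 90 100".
lowest_sm() {
    # $1 is left unquoted on purpose so the list splits into words.
    printf '%s\n' $1 | sort -n | head -n 1
}

highest_sm() {
    printf '%s\n' $1 | sort -n | tail -n 1
}

lowest_sm "80 90 100"
highest_sm "80 90 100"
```

Inside a Makefile, the same pipeline could presumably be wrapped in `$(shell ...)` to set LOWEST_SM. For the two-digit values used in this issue the lexical sort gives the same result, so the patch as posted is unaffected.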