Open casparvl opened 3 months ago
To answer my own question: this is not really the solution. I can run it on an H100 (CC 90), which means the forward compatibility is working. But I can't run it on an A100 (CC 80), which was one of my actual targets in SMS:
$ memMapIPCDrv
> findModulePath found file at <./memMapIpc_kernel64.ptx>
> initCUDA loading module: <./memMapIpc_kernel64.ptx>
> findModulePath found file at <./memMapIpc_kernel64.ptx>
> findModulePath found file at <./memMapIpc_kernel64.ptx>
> initCUDA loading module: <./memMapIpc_kernel64.ptx>
> initCUDA loading module: <./memMapIpc_kernel64.ptx>
> findModulePath found file at <./memMapIpc_kernel64.ptx>
> initCUDA loading module: <./memMapIpc_kernel64.ptx>
checkCudaErrors() Driver API error = 0218 "a PTX JIT compilation failed" from file <memMapIpc.cpp>, line 292.
checkCudaErrors() Driver API error = 0218 "a PTX JIT compilation failed" from file <memMapIpc.cpp>, line 292.
checkCudaErrors() Driver API error = 0218 "a PTX JIT compilation failed" from file <memMapIpc.cpp>, line 292.
checkCudaErrors() Driver API error = 0218 "a PTX JIT compilation failed" from file <memMapIpc.cpp>, line 292.
Process 0 failed!
(same for the ptxjit example, btw)
It seems it will always try to JIT compile.
I'm really not sure what this sample is supposed to do when CUDA-Samples is built for multiple SMS values. It seems like it always wants to invoke the JIT compiler on PTX code. The only thing that would create a working example is to generate the PTX code for the lowest SM instead of for the highest SM. That way, it could at least be JIT-compiled on all the SM architectures passed to SMS, even the lowest one.
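To see which architecture a generated PTX file was built for, one can look at its `.target` directive. Below is a minimal shell sketch; the helper name `ptx_target` is hypothetical (not part of CUDA-Samples), and it assumes the file is the plain-text PTX that `nvcc -ptx` writes:

```shell
# Print the SM number from the .target directive of a PTX file.
# ptx_target is a hypothetical helper name, not part of CUDA-Samples.
ptx_target() {
    # A PTX module contains a line such as ".target sm_90";
    # keep only the digits after "sm_" and stop at the first match.
    sed -n 's/^\.target[[:space:]]*sm_\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}
```

For example, `ptx_target ./memMapIpc_kernel64.ptx` would print the SM version the PTX was generated for. If that number is higher than the compute capability of the GPU (e.g. 90 on an A100), that would be consistent with the JIT failure shown above.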
The patch would then be:
diff -Nru cuda-samples-12.2.orig/Samples/3_CUDA_Features/memMapIPCDrv/Makefile cuda-samples-12.2/Samples/3_CUDA_Features/memMapIPCDrv/Makefile
--- cuda-samples-12.2.orig/Samples/3_CUDA_Features/memMapIPCDrv/Makefile 2024-07-29 12:14:28.538848000 +0200
+++ cuda-samples-12.2/Samples/3_CUDA_Features/memMapIPCDrv/Makefile 2024-07-29 13:02:45.134261829 +0200
@@ -313,6 +313,12 @@
ifneq ($(HIGHEST_SM),)
GENCODE_FLAGS += -gencode arch=compute_$(HIGHEST_SM),code=compute_$(HIGHEST_SM)
endif
+
+# Generate the explicit PTX file for the lowest SM architecture in $(SMS), so it works on all SMS listed there
+LOWEST_SM := $(firstword $(sort $(SMS)))
+ifneq ($(LOWEST_SM),)
+GENCODE_FLAGS_LOWEST_SM += -gencode arch=compute_$(LOWEST_SM),code=compute_$(LOWEST_SM)
+endif
endif
ifeq ($(TARGET_OS),darwin)
@@ -394,7 +400,7 @@
endif
$(PTX_FILE): memMapIpc_kernel.cu
- $(EXEC) $(NVCC) $(INCLUDES) $(ALL_CCFLAGS) $(GENCODE_FLAGS) -o $@ -ptx $<
+ $(EXEC) $(NVCC) $(INCLUDES) $(ALL_CCFLAGS) $(GENCODE_FLAGS_LOWEST_SM) -o $@ -ptx $<
$(EXEC) mkdir -p data
$(EXEC) cp -f $@ ./data
$(EXEC) mkdir -p ../../../bin/$(TARGET_ARCH)/$(TARGET_OS)/$(BUILD_TYPE)
diff -Nru cuda-samples-12.2.orig/Samples/3_CUDA_Features/ptxjit/Makefile cuda-samples-12.2/Samples/3_CUDA_Features/ptxjit/Makefile
--- cuda-samples-12.2.orig/Samples/3_CUDA_Features/ptxjit/Makefile 2024-07-29 12:14:28.546771000 +0200
+++ cuda-samples-12.2/Samples/3_CUDA_Features/ptxjit/Makefile 2024-07-29 13:02:38.741961008 +0200
@@ -307,6 +307,12 @@
ifneq ($(HIGHEST_SM),)
GENCODE_FLAGS += -gencode arch=compute_$(HIGHEST_SM),code=compute_$(HIGHEST_SM)
endif
+
+# Generate the explicit PTX file for the lowest SM architecture in $(SMS), so it works on all SMS listed there
+LOWEST_SM := $(firstword $(sort $(SMS)))
+ifneq ($(LOWEST_SM),)
+GENCODE_FLAGS_LOWEST_SM += -gencode arch=compute_$(LOWEST_SM),code=compute_$(LOWEST_SM)
+endif
endif
ifeq ($(TARGET_OS),darwin)
@@ -390,7 +396,7 @@
endif
$(PTX_FILE): ptxjit_kernel.cu
- $(EXEC) $(NVCC) $(INCLUDES) $(ALL_CCFLAGS) $(GENCODE_FLAGS) -o $@ -ptx $<
+ $(EXEC) $(NVCC) $(INCLUDES) $(ALL_CCFLAGS) $(GENCODE_FLAGS_LOWEST_SM) -o $@ -ptx $<
$(EXEC) mkdir -p data
$(EXEC) cp -f $@ ./data
$(EXEC) mkdir -p ../../../bin/$(TARGET_ARCH)/$(TARGET_OS)/$(BUILD_TYPE)
If I build this for SMS='80 90', it works on both A100 and H100.
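One possible caveat with the `LOWEST_SM` line in the patch: GNU make's `$(sort ...)` orders words lexically, so a list mixing two-digit and three-digit values (say `SMS='90 100'`) would put `100` first. A numeric selection could be sketched in shell as follows; the function name `lowest_sm` is hypothetical and this has not been tested inside the sample Makefiles:

```shell
# Pick the numerically lowest entry from a space-separated SMS list.
# lowest_sm is a hypothetical helper name used for illustration only.
lowest_sm() {
    # $1 is intentionally unquoted so the shell splits it into words,
    # one per line, before the numeric sort.
    printf '%s\n' $1 | sort -n | head -n 1
}
```

For example, `lowest_sm '80 90'` prints `80` and `lowest_sm '100 90'` prints `90`. In a Makefile the same pipeline could presumably be wrapped in `$(shell ...)`, though for `SMS='80 90'` the lexical sort in the patch gives the same result.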
I'm building the CUDA samples for multiple architectures, since it is documented that one can do this with the SMS option. My build command is:
I've encountered the issue with both CUDA-Samples 11.3 and 12.2. The issue is present in at least two samples: memMapIPCDrv and ptxjit. It is in this line and this line of their respective makefiles, which both read (with some context):
I believe what should be done is to store the GENCODE_FLAGS for PTX file generation separately. I.e. this line should probably read:
And then the offending section modified to:
I can at least confirm that with this diff:
On top of the CUDA-Samples 12.2 sources, it builds correctly for multiple architectures. However, what I'm not 100% sure of is whether this makes sense, so I'm hoping someone else can confirm that :)