Open brycelelbach opened 3 years ago
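The bfloat16_t case of cub.cpp17.test.device_radix_sort.minimal fails when built with ICC 2021.2.0: the key comparison reports "-0 != 0" at index 11953215. There's probably some odd floating-point behavior around signed zeros happening here. It doesn't reproduce with GCC. Disabling for now.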
[19:55:32]:wash@voyager:/home/wash/development/nvidia/cuda_linux_p4/sw/gpgpu/thrust:0:$ ci/local/build.bash -i gpuci/cccl:cuda11.3.1-devel-ubuntu20.04-icclatest cub.cpp17.test.device_radix_sort.minimal
cuda11.3.1-devel-ubuntu20.04-icclatest: Pulling from gpuci/cccl
Digest: sha256:e20e996de6f79a75754789746ad0e3535ddc82b20706fde67db489f56ca5cefc
Status: Image is up to date for gpuci/cccl:cuda11.3.1-devel-ubuntu20.04-icclatest
docker.io/gpuci/cccl:cuda11.3.1-devel-ubuntu20.04-icclatest

:: initializing oneAPI environment ...
build.bash: BASH_VERSION = 5.0.17(1)-release
:: compiler -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: tbb -- latest
:: oneAPI environment initialized ::

>>>> Determine system topology...
Logical CPUs: 12 [threads]
Physical CPUs: 6 [cores]
Total Mem: 62.57 [GBs]
Max Threads Per Core: 2 [threads/core]
Min Memory Per Threads: 4 [GBs/thread]
CPU Bound Threads: 12 [threads]
Mem Bound Threads: 15 [threads]
Parallel Level: 12 [threads]
Mem Per Thread: 5.214 [GBs/thread]

>>>> Get environment...
TBBROOT=/opt/intel/oneapi/tbb/2021.2.0/env/..
NVIDIA_VISIBLE_DEVICES=all
TOTAL_MEM=62.57
ONEAPI_ROOT=/opt/intel/oneapi
SETVARS_VARS_PATH=/opt/intel/oneapi/tbb/latest/env/vars.sh
HOSTNAME=854ed4de2e04
ACL_BOARD_VENDOR_PATH=/opt/Intel/OpenCLFPGA/oneAPI/Boards
NVIDIA_REQUIRE_CUDA=cuda>=11.3 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441 driver>=450
COVERAGE_PLAN=Minimal
APT_KEY_DONT_WARN_ON_DANGEROUS_USAGE=1
SDK_TYPE=cuda
NCCL_VERSION=2.9.9
CMAKE_BUILD_TYPE=Release
PWD=/cccl/thrust/build
NVIDIA_DRIVER_CAPABILITIES=compute,utility
LOGICAL_CPUS=12
MANPATH=/opt/intel/oneapi/debugger/10.1.1/documentation/man::/opt/intel/oneapi/compiler/2021.2.0/documentation/en/man/common:
MIN_MEMORY_PER_THREAD=4
CXX=/opt/intel/oneapi/compiler/2021.2.0/linux/bin/intel64/icpc
CPU_BOUND_THREADS=12
TZ=US/Pacific
HOME=/cccl/thrust
MEM_BOUND_THREADS=15
CUDA_VERSION=11.3.1
SETVARS_COMPLETED=1
CMAKE_PREFIX_PATH=/opt/intel/oneapi/tbb/2021.2.0/env/..:
CUDACXX=/usr/local/cuda/bin/nvcc
SDK_VER=11.3.1-devel
WORKSPACE=/cccl/thrust
INFOPATH=/opt/intel/oneapi/debugger/10.1.1/documentation/info/
TERM=xterm
LIBRARY_PATH=/opt/intel/oneapi/tbb/2021.2.0/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/compiler/2021.2.0/linux/compiler/lib/intel64_lin:/opt/intel/oneapi/compiler/2021.2.0/linux/lib:/usr/local/cuda/lib64/stubs
CMAKE_FLAGS=-DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_COMPILER='/usr/local/cuda/bin/nvcc' -DCMAKE_CUDA_FLAGS=-allow-unsupported-compiler -DCMAKE_CXX_COMPILER='/opt/intel/oneapi/compiler/2021.2.0/linux/bin/intel64/icpc' -G Ninja -DTHRUST_ENABLE_MULTICONFIG=ON -DTHRUST_MULTICONFIG_ENABLE_DIALECT_LATEST=ON -DTHRUST_MULTICONFIG_ENABLE_SYSTEM_CPP=ON -DTHRUST_MULTICONFIG_ENABLE_SYSTEM_TBB=OFF -DTHRUST_MULTICONFIG_ENABLE_SYSTEM_OMP=OFF -DTHRUST_MULTICONFIG_ENABLE_SYSTEM_CUDA=ON -DTHRUST_MULTICONFIG_WORKLOAD=SMALL -DTHRUST_INCLUDE_CUB_CMAKE=ON -DCUB_ENABLE_THOROUGH_TESTING=OFF -DCUB_ENABLE_BENCHMARK_TESTING=OFF -DCUB_ENABLE_MINIMAL_TESTING=ON -DTHRUST_AUTO_DETECT_COMPUTE_ARCHS=ON
SHLVL=2
BUILD_TYPE=gpu
OCL_ICD_FILENAMES=libintelocl_emu.so:libalteracl.so:/opt/intel/oneapi/compiler/2021.2.0/linux/lib/x64/libintelocl.so
PARALLEL_LEVEL=12
MEM_PER_THREAD=5.214
OS_TYPE=ubuntu
INTELFPGAOCLSDKROOT=/opt/intel/oneapi/compiler/2021.2.0/linux/lib/oclfpga
LD_LIBRARY_PATH=/opt/intel/oneapi/tbb/2021.2.0/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/debugger/10.1.1/dep/lib:/opt/intel/oneapi/debugger/10.1.1/libipt/intel64/lib:/opt/intel/oneapi/debugger/10.1.1/gdb/intel64/lib:/opt/intel/oneapi/compiler/2021.2.0/linux/lib:/opt/intel/oneapi/compiler/2021.2.0/linux/lib/x64:/opt/intel/oneapi/compiler/2021.2.0/linux/lib/emu:/opt/intel/oneapi/compiler/2021.2.0/linux/lib/oclfpga/host/linux64/lib:/opt/intel/oneapi/compiler/2021.2.0/linux/lib/oclfpga/linux64/lib:/opt/intel/oneapi/compiler/2021.2.0/linux/compiler/lib/intel64_lin:/opt/intel/oneapi/compiler/2021.2.0/linux/compiler/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
OS_VER=20.04
CMAKE_BUILD_FLAGS=-- -k0 cub.cpp17.test.device_radix_sort.minimal
MAX_THREADS_PER_CORE=2
PATH=/usr/local/cuda/bin:/opt/intel/oneapi/dev-utilities/2021.2.0/bin:/opt/intel/oneapi/debugger/10.1.1/gdb/intel64/bin:/opt/intel/oneapi/compiler/2021.2.0/linux/lib/oclfpga/llvm/aocl-bin:/opt/intel/oneapi/compiler/2021.2.0/linux/lib/oclfpga/bin:/opt/intel/oneapi/compiler/2021.2.0/linux/bin/intel64:/opt/intel/oneapi/compiler/2021.2.0/linux/bin:/opt/intel/oneapi/compiler/2021.2.0/linux/ioc/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
CC=/opt/intel/oneapi/compiler/2021.2.0/linux/bin/intel64/icc
INTEL_PYTHONHOME=/opt/intel/oneapi/debugger/10.1.1/dep
CTEST_FLAGS=--output-on-failure -R ^cub.cpp17.test.device_radix_sort.minimal$
CPATH=/opt/intel/oneapi/tbb/2021.2.0/env/../include:/opt/intel/oneapi/dev-utilities/2021.2.0/include:/opt/intel/oneapi/compiler/2021.2.0/linux/include
DEBIAN_FRONTEND=noninteractive
CXX_TYPE=icc
PHYSICAL_CPUS=6
OLDPWD=/cccl/thrust
CXX_VER=latest
CMAKE_LIBRARY_PATH=/opt/intel/oneapi/tbb/2021.2.0/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/compiler/2021.2.0/linux/compiler/lib/intel64_lin:/opt/intel/oneapi/compiler/2021.2.0/linux/lib:/usr/local/cuda/lib64/stubs
_=/usr/bin/env

>>>> Check versions...
icpc (ICC) 2021.2.0 20210228
Copyright (C) 1985-2021 Intel Corporation. All rights reserved.
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
Tue Jun 29 19:55:39 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GT 710      On   | 00000000:04:00.0 N/A |                  N/A |
| 40%   50C    P8    N/A /  N/A |      1MiB /  2002MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  RTX A6000           On   | 00000000:17:00.0 Off |                  Off |
| 34%   61C    P8    34W / 300W |      1MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro GV100        On   | 00000000:65:00.0  On |                  Off |
| 34%   47C    P0    27W / 250W |      0MiB / 32505MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

>>>> Configure Thrust and CUB...
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_COMPILER='/usr/local/cuda/bin/nvcc' -DCMAKE_CUDA_FLAGS=-allow-unsupported-compiler -DCMAKE_CXX_COMPILER='/opt/intel/oneapi/compiler/2021.2.0/linux/bin/intel64/icpc' -G Ninja -DTHRUST_ENABLE_MULTICONFIG=ON -DTHRUST_MULTICONFIG_ENABLE_DIALECT_LATEST=ON -DTHRUST_MULTICONFIG_ENABLE_SYSTEM_CPP=ON -DTHRUST_MULTICONFIG_ENABLE_SYSTEM_TBB=OFF -DTHRUST_MULTICONFIG_ENABLE_SYSTEM_OMP=OFF -DTHRUST_MULTICONFIG_ENABLE_SYSTEM_CUDA=ON -DTHRUST_MULTICONFIG_WORKLOAD=SMALL -DTHRUST_INCLUDE_CUB_CMAKE=ON -DCUB_ENABLE_THOROUGH_TESTING=OFF -DCUB_ENABLE_BENCHMARK_TESTING=OFF -DCUB_ENABLE_MINIMAL_TESTING=ON -DTHRUST_AUTO_DETECT_COMPUTE_ARCHS=ON
-- The CXX compiler identification is Intel 20.2.2.20210228
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/intel/oneapi/compiler/2021.2.0/linux/bin/intel64/icpc - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found CUB: /cccl/thrust/dependencies/cub/cub/cmake/cub-config.cmake (found version "1.14.0.0")
-- Found Thrust: /cccl/thrust/thrust/cmake/thrust-config.cmake (found version "1.14.0.0")
-- Performing Test CXX_FLAG__Werror
-- Performing Test CXX_FLAG__Werror - Success
-- Performing Test CXX_FLAG__Wall
-- Performing Test CXX_FLAG__Wall - Success
-- Performing Test CXX_FLAG__Wextra
-- Performing Test CXX_FLAG__Wextra - Success
-- Performing Test CXX_FLAG__Winit_self
-- Performing Test CXX_FLAG__Winit_self - Success
-- Performing Test CXX_FLAG__Woverloaded_virtual
-- Performing Test CXX_FLAG__Woverloaded_virtual - Success
-- Performing Test CXX_FLAG__Wcast_qual
-- Performing Test CXX_FLAG__Wcast_qual - Success
-- Performing Test CXX_FLAG__Wpointer_arith
-- Performing Test CXX_FLAG__Wpointer_arith - Success
-- Performing Test CXX_FLAG__Wunused_local_typedef
-- Performing Test CXX_FLAG__Wunused_local_typedef - Failed
-- Performing Test CXX_FLAG__Wvla
-- Performing Test CXX_FLAG__Wvla - Success
-- Performing Test CXX_FLAG__Wgnu
-- Performing Test CXX_FLAG__Wgnu - Failed
-- Performing Test CXX_FLAG__Wno_gnu_zero_variadic_macro_arguments
-- Performing Test CXX_FLAG__Wno_gnu_zero_variadic_macro_arguments - Failed
-- Performing Test CXX_FLAG__Wno_unused_function
-- Performing Test CXX_FLAG__Wno_unused_function - Success
-- Performing Test CXX_FLAG__diag_disable_11074
-- Performing Test CXX_FLAG__diag_disable_11074 - Success
-- Performing Test CXX_FLAG__diag_disable_11076
-- Performing Test CXX_FLAG__diag_disable_11076 - Success
-- The CUDA compiler identification is NVIDIA 11.3.109
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Thrust: Automatically detected compute architectures: sm_35 sm_70 sm_86
-- Thrust: Explicitly enabled compute architectures: sm_35 sm_70 sm_86
-- Testing for supported language standards...
-- Testing CXX11 Support: TRUE
-- Testing CXX14 Support: TRUE
-- Testing CXX17 Support: TRUE
-- Testing CUDA11 Support: TRUE
-- Testing CUDA14 Support: TRUE
-- Testing CUDA17 Support: TRUE
-- Enabling Thrust configuration: cpp.cuda.cpp17
-- 1 unique thrust.host.device.dialect configurations generated
-- CPP system found? TRUE
-- CUDA system found? TRUE
-- TBB system found? FALSE
-- OMP system found? FALSE
-- CUB: Explicitly enabled compute architectures: sm_35 sm_70 sm_86
-- Performing Test CXX_FLAG__Wno_deprecated_declarations
-- Performing Test CXX_FLAG__Wno_deprecated_declarations - Success
-- Found Thrust: /cccl/thrust/thrust/cmake/thrust-config.cmake (found suitable exact version "1.14.0.0")
-- Enabling CUB configuration: cpp17
-- 1 unique cub.dialect configurations generated
-- Configuring done
-- Generating done
-- Build files have been written to: /cccl/thrust/build
Configure Time: 0m7.171s

>>>> Build Thrust and CUB...
cmake --build . -- -k0 cub.cpp17.test.device_radix_sort.minimal -j 12
[0/2] Re-checking globbed directories...
[2/2] Linking CUDA executable bin/cub.cpp17.test.device_radix_sort.minimal
Build Time: 1m30.427s

>>>> Test Thrust and CUB...
ctest --output-on-failure -R ^cub.cpp17.test.device_radix_sort.minimal$
Test project /cccl/thrust/build
Start 299: cub.cpp17.test.device_radix_sort.minimal
1/1 Test #299: cub.cpp17.test.device_radix_sort.minimal ...***Failed 14.67 sec
Using device 0: RTX A6000 (PTX version 860, SM860, 84 SMs, 48416 free / 48685 total MB physmem, 768.096 GB/s @ 8001000 kHz mem clock, ECC off)
Sorting reference solution on CPU (5000 segments)... Done.
Testing bits [0,32) of j keys with gen-mode 2
CUB keys-only cub::DeviceRadixSort 24000000 items, 5000 segments, 4-byte keys (j) 0-byte values (N3cub8NullTypeE), descending 0, begin_bit 0, end_bit 32
Invoking segmented_kernels<<<5000, 384, 0, 0>>>(), 11 items per thread, 2 SM occupancy, current bit 0, bit_grain 5
Invoking segmented_kernels<<<5000, 384, 0, 0>>>(), 11 items per thread, 2 SM occupancy, current bit 5, bit_grain 5
Invoking segmented_kernels<<<5000, 384, 0, 0>>>(), 11 items per thread, 2 SM occupancy, current bit 10, bit_grain 5
Invoking segmented_kernels<<<5000, 384, 0, 0>>>(), 11 items per thread, 2 SM occupancy, current bit 15, bit_grain 5
Invoking segmented_kernels<<<5000, 192, 0, 0>>>(), 39 items per thread, 2 SM occupancy, current bit 20, bit_grain 6
Invoking segmented_kernels<<<5000, 192, 0, 0>>>(), 39 items per thread, 2 SM occupancy, current bit 26, bit_grain 6
Warmup done. Checking results:
Compare keys (selector 0): PASS
-------------------------------
Sorting reference solution on CPU (1 segments)... Done.
Testing bits [0,8) of h keys with gen-mode 2
CUB keys-only cub::DeviceRadixSort 24000000 items, 1 segments, 1-byte keys (h) 0-byte values (N3cub8NullTypeE), descending 0, begin_bit 0, end_bit 8
Invoking upsweep_kernel<<<1260, 256, 0, 0>>>(), 47 items per thread, 4 SM occupancy, current bit 0, bit_grain 4
Invoking scan_kernel<<<1, 512, 0, 0>>>(), 23 items per thread
Invoking downsweep_kernel<<<1260, 128, 0, 0>>>(), 47 items per thread, 3 SM occupancy
Invoking upsweep_kernel<<<1260, 256, 0, 0>>>(), 47 items per thread, 4 SM occupancy, current bit 4, bit_grain 4
Invoking scan_kernel<<<1, 512, 0, 0>>>(), 23 items per thread
Invoking downsweep_kernel<<<1260, 128, 0, 0>>>(), 47 items per thread, 3 SM occupancy
Warmup done. Checking results:
Compare keys (selector 0): PASS
Sorting reference solution on CPU (1 segments)... Done.
Testing bits [0,32) of j keys with gen-mode 2
CUB keys-only cub::DeviceRadixSort 24000000 items, 1 segments, 4-byte keys (j) 0-byte values (N3cub8NullTypeE), descending 0, begin_bit 0, end_bit 32
Warmup done. Checking results:
Compare keys (selector 0): PASS
Sorting reference solution on CPU (1 segments)... Done.
Testing bits [0,64) of y keys with gen-mode 2
CUB keys-only cub::DeviceRadixSort 24000000 items, 1 segments, 8-byte keys (y) 0-byte values (N3cub8NullTypeE), descending 0, begin_bit 0, end_bit 64
Warmup done. Checking results:
Compare keys (selector 0): PASS
-------------------------------
Sorting reference solution on CPU (1 segments)... Done.
Testing bits [0,16) of 6half_t keys with gen-mode 2
CUB keys-only cub::DeviceRadixSort 24000000 items, 1 segments, 2-byte keys (6half_t) 0-byte values (N3cub8NullTypeE), descending 0, begin_bit 0, end_bit 16
Invoking upsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 3 SM occupancy, current bit 0, bit_grain 6
Invoking scan_kernel<<<1, 512, 0, 0>>>(), 23 items per thread
Invoking downsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 1 SM occupancy
Invoking upsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 3 SM occupancy, current bit 6, bit_grain 6
Invoking scan_kernel<<<1, 512, 0, 0>>>(), 23 items per thread
Invoking downsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 1 SM occupancy
Invoking upsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 3 SM occupancy, current bit 12, bit_grain 4
Invoking scan_kernel<<<1, 512, 0, 0>>>(), 23 items per thread
Invoking downsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 1 SM occupancy
Warmup done. Checking results:
Compare keys (selector 1): PASS
Sorting reference solution on CPU (1 segments)... Done.
Testing bits [0,16) of 10bfloat16_t keys with gen-mode 2
CUB keys-only cub::DeviceRadixSort 24000000 items, 1 segments, 2-byte keys (10bfloat16_t) 0-byte values (N3cub8NullTypeE), descending 0, begin_bit 0, end_bit 16
Invoking upsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 3 SM occupancy, current bit 0, bit_grain 6
Invoking scan_kernel<<<1, 512, 0, 0>>>(), 23 items per thread
Invoking downsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 1 SM occupancy
Invoking upsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 3 SM occupancy, current bit 6, bit_grain 6
Invoking scan_kernel<<<1, 512, 0, 0>>>(), 23 items per thread
Invoking downsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 1 SM occupancy
Invoking upsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 3 SM occupancy, current bit 12, bit_grain 4
Invoking scan_kernel<<<1, 512, 0, 0>>>(), 23 items per thread
Invoking downsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 1 SM occupancy
Warmup done. Checking results:
INCORRECT: [11953215]: -0 != 0
Compare keys (selector 1): FAIL (../dependencies/cub/test/test_device_radix_sort.cu: 884)
0% tests passed, 1 tests failed out of 1
Total Test time (real) = 14.68 sec
The following tests FAILED:
299 - cub.cpp17.test.device_radix_sort.minimal (Failed)
Errors while running CTest
Test Time: 0m14.687s
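For context on the "-0 != 0" mismatch in the log above, here is a minimal sketch of why signed zeros are a common source of disagreement between a comparison-based reference sort and a bitwise radix sort. It is not taken from CUB's source or from test_device_radix_sort.cu, and the cause of this particular failure is not confirmed. It uses a 32-bit float as a stand-in for bfloat16_t, and float_bits, radix_key, less_by_value and less_by_radix_key are hypothetical helper names. The key transform shown is a widely used one for radix-sorting IEEE-754 floats, assumed here for illustration only.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: reinterpret a float's bits as a 32-bit unsigned integer.
inline std::uint32_t float_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Hypothetical helper: a common order-preserving key transform for
// radix-sorting IEEE-754 floats (assumed for illustration, not CUB's code).
// Negative values get all bits flipped; non-negative values get the sign bit set.
inline std::uint32_t radix_key(float f) {
    const std::uint32_t u = float_bits(f);
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

// Ordering seen by a comparison-based sort: -0.0f and +0.0f are equivalent.
inline bool less_by_value(float a, float b) { return a < b; }

// Ordering seen by a bitwise radix sort: -0.0f sorts strictly before +0.0f.
inline bool less_by_radix_key(float a, float b) {
    return radix_key(a) < radix_key(b);
}
```

With these definitions, -0.0f == 0.0f holds and less_by_value treats the two zeros as equivalent, while less_by_radix_key places -0.0f strictly before 0.0f. If a reference sort and a device sort disagree on the relative order of signed zeros, an element-wise comparison that prints the sign would report exactly this kind of "-0 != 0" line. Whether a compiler's floating-point settings affect how the reference handles signed zeros is one thing that could be checked.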
ICC support is deprecated, so we most likely won't investigate the cause of this issue.