NVIDIA / cccl

CUDA Core Compute Libraries

`cub::DeviceRadixSort` `bfloat16_t` unit test fails with `-0 != 0` with ICC #873

Open brycelelbach opened 3 years ago

brycelelbach commented 3 years ago

There's probably some odd floating-point behavior around signed zero happening here (the test reports `-0 != 0`). It doesn't reproduce with GCC. Disabling this test for now.
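For context, here is a minimal host-only sketch of why `-0` and `0` can trip a sort check. It is one possible explanation and is not confirmed anywhere in this thread. The helper names `bf16_bits_to_float`, `to_radix_key`, and `same_value` are hypothetical and are not CUB APIs. The key transform is the usual "flip the sign bit for positives, flip all bits for negatives" mapping used by floating-point radix sorts in general, assumed here for illustration only.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of CUB): widen a bfloat16 bit pattern to float.
// bfloat16 is the upper 16 bits of an IEEE-754 binary32 value.
float bf16_bits_to_float(std::uint16_t bits) {
    std::uint32_t wide = static_cast<std::uint32_t>(bits) << 16;
    float f;
    std::memcpy(&f, &wide, sizeof f);
    return f;
}

// Hypothetical helper: map a 16-bit floating-point pattern to an unsigned key
// whose integer order matches numeric order. Negative values get all bits
// flipped, non-negative values get only the sign bit flipped.
std::uint16_t to_radix_key(std::uint16_t bits) {
    const std::uint16_t sign = 0x8000u;
    const std::uint16_t mask = (bits & sign) ? 0xFFFFu : sign;
    return static_cast<std::uint16_t>(bits ^ mask);
}

// Hypothetical comparison by value: treats -0 and +0 as equal, unlike a
// comparison of the raw bit patterns.
bool same_value(std::uint16_t a, std::uint16_t b) {
    return bf16_bits_to_float(a) == bf16_bits_to_float(b);
}
```

With `0x0000` as `+0` and `0x8000` as `-0`, `same_value` reports the two as equal, but their radix keys differ, so a bitwise radix sort places `-0` strictly before `+0`. A reference sort that only uses `operator<` is free to leave the two zeros in either order, so an element-by-element comparison that distinguishes the sign of zero could report a mismatch depending on how the host compiler orders or compares them.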

[19:55:32]:wash@voyager:/home/wash/development/nvidia/cuda_linux_p4/sw/gpgpu/thrust:0:$ ci/local/build.bash -i gpuci/cccl:cuda11.3.1-devel-ubuntu20.04-icclatest cub.cpp17.test.device_radix_sort.minimal
cuda11.3.1-devel-ubuntu20.04-icclatest: Pulling from gpuci/cccl
Digest: sha256:e20e996de6f79a75754789746ad0e3535ddc82b20706fde67db489f56ca5cefc
Status: Image is up to date for gpuci/cccl:cuda11.3.1-devel-ubuntu20.04-icclatest
docker.io/gpuci/cccl:cuda11.3.1-devel-ubuntu20.04-icclatest

:: initializing oneAPI environment ...
   build.bash: BASH_VERSION = 5.0.17(1)-release
:: compiler -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: tbb -- latest
:: oneAPI environment initialized ::

>>>> Determine system topology...

Logical CPUs:           12 [threads]
Physical CPUs:          6 [cores]
Total Mem:              62.57 [GBs]
Max Threads Per Core:   2 [threads/core]
Min Memory Per Threads: 4 [GBs/thread]
CPU Bound Threads:      12 [threads]
Mem Bound Threads:      15 [threads]
Parallel Level:         12 [threads]
Mem Per Thread:         5.214 [GBs/thread]

>>>> Get environment...

TBBROOT=/opt/intel/oneapi/tbb/2021.2.0/env/..
NVIDIA_VISIBLE_DEVICES=all
TOTAL_MEM=62.57
ONEAPI_ROOT=/opt/intel/oneapi
SETVARS_VARS_PATH=/opt/intel/oneapi/tbb/latest/env/vars.sh
HOSTNAME=854ed4de2e04
ACL_BOARD_VENDOR_PATH=/opt/Intel/OpenCLFPGA/oneAPI/Boards
NVIDIA_REQUIRE_CUDA=cuda>=11.3 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441 driver>=450
COVERAGE_PLAN=Minimal
APT_KEY_DONT_WARN_ON_DANGEROUS_USAGE=1
SDK_TYPE=cuda
NCCL_VERSION=2.9.9
CMAKE_BUILD_TYPE=Release
PWD=/cccl/thrust/build
NVIDIA_DRIVER_CAPABILITIES=compute,utility
LOGICAL_CPUS=12
MANPATH=/opt/intel/oneapi/debugger/10.1.1/documentation/man::/opt/intel/oneapi/compiler/2021.2.0/documentation/en/man/common:
MIN_MEMORY_PER_THREAD=4
CXX=/opt/intel/oneapi/compiler/2021.2.0/linux/bin/intel64/icpc
CPU_BOUND_THREADS=12
TZ=US/Pacific
HOME=/cccl/thrust
MEM_BOUND_THREADS=15
CUDA_VERSION=11.3.1
SETVARS_COMPLETED=1
CMAKE_PREFIX_PATH=/opt/intel/oneapi/tbb/2021.2.0/env/..:
CUDACXX=/usr/local/cuda/bin/nvcc
SDK_VER=11.3.1-devel
WORKSPACE=/cccl/thrust
INFOPATH=/opt/intel/oneapi/debugger/10.1.1/documentation/info/
TERM=xterm
LIBRARY_PATH=/opt/intel/oneapi/tbb/2021.2.0/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/compiler/2021.2.0/linux/compiler/lib/intel64_lin:/opt/intel/oneapi/compiler/2021.2.0/linux/lib:/usr/local/cuda/lib64/stubs
CMAKE_FLAGS=-DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_COMPILER='/usr/local/cuda/bin/nvcc' -DCMAKE_CUDA_FLAGS=-allow-unsupported-compiler -DCMAKE_CXX_COMPILER='/opt/intel/oneapi/compiler/2021.2.0/linux/bin/intel64/icpc' -G Ninja -DTHRUST_ENABLE_MULTICONFIG=ON -DTHRUST_MULTICONFIG_ENABLE_DIALECT_LATEST=ON -DTHRUST_MULTICONFIG_ENABLE_SYSTEM_CPP=ON -DTHRUST_MULTICONFIG_ENABLE_SYSTEM_TBB=OFF -DTHRUST_MULTICONFIG_ENABLE_SYSTEM_OMP=OFF -DTHRUST_MULTICONFIG_ENABLE_SYSTEM_CUDA=ON -DTHRUST_MULTICONFIG_WORKLOAD=SMALL -DTHRUST_INCLUDE_CUB_CMAKE=ON -DCUB_ENABLE_THOROUGH_TESTING=OFF -DCUB_ENABLE_BENCHMARK_TESTING=OFF -DCUB_ENABLE_MINIMAL_TESTING=ON -DTHRUST_AUTO_DETECT_COMPUTE_ARCHS=ON
SHLVL=2
BUILD_TYPE=gpu
OCL_ICD_FILENAMES=libintelocl_emu.so:libalteracl.so:/opt/intel/oneapi/compiler/2021.2.0/linux/lib/x64/libintelocl.so
PARALLEL_LEVEL=12
MEM_PER_THREAD=5.214
OS_TYPE=ubuntu
INTELFPGAOCLSDKROOT=/opt/intel/oneapi/compiler/2021.2.0/linux/lib/oclfpga
LD_LIBRARY_PATH=/opt/intel/oneapi/tbb/2021.2.0/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/debugger/10.1.1/dep/lib:/opt/intel/oneapi/debugger/10.1.1/libipt/intel64/lib:/opt/intel/oneapi/debugger/10.1.1/gdb/intel64/lib:/opt/intel/oneapi/compiler/2021.2.0/linux/lib:/opt/intel/oneapi/compiler/2021.2.0/linux/lib/x64:/opt/intel/oneapi/compiler/2021.2.0/linux/lib/emu:/opt/intel/oneapi/compiler/2021.2.0/linux/lib/oclfpga/host/linux64/lib:/opt/intel/oneapi/compiler/2021.2.0/linux/lib/oclfpga/linux64/lib:/opt/intel/oneapi/compiler/2021.2.0/linux/compiler/lib/intel64_lin:/opt/intel/oneapi/compiler/2021.2.0/linux/compiler/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
OS_VER=20.04
CMAKE_BUILD_FLAGS=-- -k0 cub.cpp17.test.device_radix_sort.minimal
MAX_THREADS_PER_CORE=2
PATH=/usr/local/cuda/bin:/opt/intel/oneapi/dev-utilities/2021.2.0/bin:/opt/intel/oneapi/debugger/10.1.1/gdb/intel64/bin:/opt/intel/oneapi/compiler/2021.2.0/linux/lib/oclfpga/llvm/aocl-bin:/opt/intel/oneapi/compiler/2021.2.0/linux/lib/oclfpga/bin:/opt/intel/oneapi/compiler/2021.2.0/linux/bin/intel64:/opt/intel/oneapi/compiler/2021.2.0/linux/bin:/opt/intel/oneapi/compiler/2021.2.0/linux/ioc/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
CC=/opt/intel/oneapi/compiler/2021.2.0/linux/bin/intel64/icc
INTEL_PYTHONHOME=/opt/intel/oneapi/debugger/10.1.1/dep
CTEST_FLAGS=--output-on-failure -R ^cub.cpp17.test.device_radix_sort.minimal$
CPATH=/opt/intel/oneapi/tbb/2021.2.0/env/../include:/opt/intel/oneapi/dev-utilities/2021.2.0/include:/opt/intel/oneapi/compiler/2021.2.0/linux/include
DEBIAN_FRONTEND=noninteractive
CXX_TYPE=icc
PHYSICAL_CPUS=6
OLDPWD=/cccl/thrust
CXX_VER=latest
CMAKE_LIBRARY_PATH=/opt/intel/oneapi/tbb/2021.2.0/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/compiler/2021.2.0/linux/compiler/lib/intel64_lin:/opt/intel/oneapi/compiler/2021.2.0/linux/lib:/usr/local/cuda/lib64/stubs
_=/usr/bin/env

>>>> Check versions...

icpc (ICC) 2021.2.0 20210228
Copyright (C) 1985-2021 Intel Corporation.  All rights reserved.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0

Tue Jun 29 19:55:39 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GT 710      On   | 00000000:04:00.0 N/A |                  N/A |
| 40%   50C    P8    N/A /  N/A |      1MiB /  2002MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  RTX A6000           On   | 00000000:17:00.0 Off |                  Off |
| 34%   61C    P8    34W / 300W |      1MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro GV100        On   | 00000000:65:00.0  On |                  Off |
| 34%   47C    P0    27W / 250W |      0MiB / 32505MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

>>>> Configure Thrust and CUB...

cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_COMPILER='/usr/local/cuda/bin/nvcc' -DCMAKE_CUDA_FLAGS=-allow-unsupported-compiler -DCMAKE_CXX_COMPILER='/opt/intel/oneapi/compiler/2021.2.0/linux/bin/intel64/icpc' -G Ninja -DTHRUST_ENABLE_MULTICONFIG=ON -DTHRUST_MULTICONFIG_ENABLE_DIALECT_LATEST=ON -DTHRUST_MULTICONFIG_ENABLE_SYSTEM_CPP=ON -DTHRUST_MULTICONFIG_ENABLE_SYSTEM_TBB=OFF -DTHRUST_MULTICONFIG_ENABLE_SYSTEM_OMP=OFF -DTHRUST_MULTICONFIG_ENABLE_SYSTEM_CUDA=ON -DTHRUST_MULTICONFIG_WORKLOAD=SMALL -DTHRUST_INCLUDE_CUB_CMAKE=ON -DCUB_ENABLE_THOROUGH_TESTING=OFF -DCUB_ENABLE_BENCHMARK_TESTING=OFF -DCUB_ENABLE_MINIMAL_TESTING=ON -DTHRUST_AUTO_DETECT_COMPUTE_ARCHS=ON
-- The CXX compiler identification is Intel 20.2.2.20210228
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/intel/oneapi/compiler/2021.2.0/linux/bin/intel64/icpc - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found CUB: /cccl/thrust/dependencies/cub/cub/cmake/cub-config.cmake (found version "1.14.0.0")
-- Found Thrust: /cccl/thrust/thrust/cmake/thrust-config.cmake (found version "1.14.0.0")
-- Performing Test CXX_FLAG__Werror
-- Performing Test CXX_FLAG__Werror - Success
-- Performing Test CXX_FLAG__Wall
-- Performing Test CXX_FLAG__Wall - Success
-- Performing Test CXX_FLAG__Wextra
-- Performing Test CXX_FLAG__Wextra - Success
-- Performing Test CXX_FLAG__Winit_self
-- Performing Test CXX_FLAG__Winit_self - Success
-- Performing Test CXX_FLAG__Woverloaded_virtual
-- Performing Test CXX_FLAG__Woverloaded_virtual - Success
-- Performing Test CXX_FLAG__Wcast_qual
-- Performing Test CXX_FLAG__Wcast_qual - Success
-- Performing Test CXX_FLAG__Wpointer_arith
-- Performing Test CXX_FLAG__Wpointer_arith - Success
-- Performing Test CXX_FLAG__Wunused_local_typedef
-- Performing Test CXX_FLAG__Wunused_local_typedef - Failed
-- Performing Test CXX_FLAG__Wvla
-- Performing Test CXX_FLAG__Wvla - Success
-- Performing Test CXX_FLAG__Wgnu
-- Performing Test CXX_FLAG__Wgnu - Failed
-- Performing Test CXX_FLAG__Wno_gnu_zero_variadic_macro_arguments
-- Performing Test CXX_FLAG__Wno_gnu_zero_variadic_macro_arguments - Failed
-- Performing Test CXX_FLAG__Wno_unused_function
-- Performing Test CXX_FLAG__Wno_unused_function - Success
-- Performing Test CXX_FLAG__diag_disable_11074
-- Performing Test CXX_FLAG__diag_disable_11074 - Success
-- Performing Test CXX_FLAG__diag_disable_11076
-- Performing Test CXX_FLAG__diag_disable_11076 - Success
-- The CUDA compiler identification is NVIDIA 11.3.109
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Thrust: Automatically detected compute architectures: sm_35 sm_70 sm_86
-- Thrust: Explicitly enabled compute architectures: sm_35 sm_70 sm_86
-- Testing for supported language standards...
-- Testing CXX11 Support: TRUE
-- Testing CXX14 Support: TRUE
-- Testing CXX17 Support: TRUE
-- Testing CUDA11 Support: TRUE
-- Testing CUDA14 Support: TRUE
-- Testing CUDA17 Support: TRUE
-- Enabling Thrust configuration: cpp.cuda.cpp17
-- 1 unique thrust.host.device.dialect configurations generated
-- CPP system found?  TRUE
-- CUDA system found? TRUE
-- TBB system found?  FALSE
-- OMP system found?  FALSE
-- CUB: Explicitly enabled compute architectures: sm_35 sm_70 sm_86
-- Performing Test CXX_FLAG__Wno_deprecated_declarations
-- Performing Test CXX_FLAG__Wno_deprecated_declarations - Success
-- Found Thrust: /cccl/thrust/thrust/cmake/thrust-config.cmake (found suitable exact version "1.14.0.0")
-- Enabling CUB configuration: cpp17
-- 1 unique cub.dialect configurations generated
-- Configuring done
-- Generating done
-- Build files have been written to: /cccl/thrust/build

Configure Time: 0m7.171s

>>>> Build Thrust and CUB...

cmake --build . -- -k0 cub.cpp17.test.device_radix_sort.minimal -j 12
[0/2] Re-checking globbed directories...
[2/2] Linking CUDA executable bin/cub.cpp17.test.device_radix_sort.minimal

Build Time: 1m30.427s

>>>> Test Thrust and CUB...

ctest --output-on-failure -R ^cub.cpp17.test.device_radix_sort.minimal$
Test project /cccl/thrust/build
    Start 299: cub.cpp17.test.device_radix_sort.minimal
1/1 Test #299: cub.cpp17.test.device_radix_sort.minimal ...***Failed   14.67 sec
Using device 0: RTX A6000 (PTX version 860, SM860, 84 SMs, 48416 free / 48685 total MB physmem, 768.096 GB/s @ 8001000 kHz mem clock, ECC off)

Sorting reference solution on CPU (5000 segments)... Done.

Testing bits [0,32) of j keys with gen-mode 2
CUB keys-only cub::DeviceRadixSort 24000000 items, 5000 segments, 4-byte keys (j) 0-byte values (N3cub8NullTypeE), descending 0, begin_bit 0, end_bit 32
Invoking segmented_kernels<<<5000, 384, 0, 0>>>(), 11 items per thread, 2 SM occupancy, current bit 0, bit_grain 5
Invoking segmented_kernels<<<5000, 384, 0, 0>>>(), 11 items per thread, 2 SM occupancy, current bit 5, bit_grain 5
Invoking segmented_kernels<<<5000, 384, 0, 0>>>(), 11 items per thread, 2 SM occupancy, current bit 10, bit_grain 5
Invoking segmented_kernels<<<5000, 384, 0, 0>>>(), 11 items per thread, 2 SM occupancy, current bit 15, bit_grain 5
Invoking segmented_kernels<<<5000, 192, 0, 0>>>(), 39 items per thread, 2 SM occupancy, current bit 20, bit_grain 6
Invoking segmented_kernels<<<5000, 192, 0, 0>>>(), 39 items per thread, 2 SM occupancy, current bit 26, bit_grain 6
Warmup done.  Checking results:
         Compare keys (selector 0): PASS

-------------------------------

Sorting reference solution on CPU (1 segments)... Done.

Testing bits [0,8) of h keys with gen-mode 2
CUB keys-only cub::DeviceRadixSort 24000000 items, 1 segments, 1-byte keys (h) 0-byte values (N3cub8NullTypeE), descending 0, begin_bit 0, end_bit 8
Invoking upsweep_kernel<<<1260, 256, 0, 0>>>(), 47 items per thread, 4 SM occupancy, current bit 0, bit_grain 4
Invoking scan_kernel<<<1, 512, 0, 0>>>(), 23 items per thread
Invoking downsweep_kernel<<<1260, 128, 0, 0>>>(), 47 items per thread, 3 SM occupancy
Invoking upsweep_kernel<<<1260, 256, 0, 0>>>(), 47 items per thread, 4 SM occupancy, current bit 4, bit_grain 4
Invoking scan_kernel<<<1, 512, 0, 0>>>(), 23 items per thread
Invoking downsweep_kernel<<<1260, 128, 0, 0>>>(), 47 items per thread, 3 SM occupancy
Warmup done.  Checking results:
         Compare keys (selector 0): PASS

Sorting reference solution on CPU (1 segments)... Done.

Testing bits [0,32) of j keys with gen-mode 2
CUB keys-only cub::DeviceRadixSort 24000000 items, 1 segments, 4-byte keys (j) 0-byte values (N3cub8NullTypeE), descending 0, begin_bit 0, end_bit 32
Warmup done.  Checking results:
         Compare keys (selector 0): PASS

Sorting reference solution on CPU (1 segments)... Done.

Testing bits [0,64) of y keys with gen-mode 2
CUB keys-only cub::DeviceRadixSort 24000000 items, 1 segments, 8-byte keys (y) 0-byte values (N3cub8NullTypeE), descending 0, begin_bit 0, end_bit 64
Warmup done.  Checking results:
         Compare keys (selector 0): PASS

-------------------------------

Sorting reference solution on CPU (1 segments)... Done.

Testing bits [0,16) of 6half_t keys with gen-mode 2
CUB keys-only cub::DeviceRadixSort 24000000 items, 1 segments, 2-byte keys (6half_t) 0-byte values (N3cub8NullTypeE), descending 0, begin_bit 0, end_bit 16
Invoking upsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 3 SM occupancy, current bit 0, bit_grain 6
Invoking scan_kernel<<<1, 512, 0, 0>>>(), 23 items per thread
Invoking downsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 1 SM occupancy
Invoking upsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 3 SM occupancy, current bit 6, bit_grain 6
Invoking scan_kernel<<<1, 512, 0, 0>>>(), 23 items per thread
Invoking downsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 1 SM occupancy
Invoking upsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 3 SM occupancy, current bit 12, bit_grain 4
Invoking scan_kernel<<<1, 512, 0, 0>>>(), 23 items per thread
Invoking downsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 1 SM occupancy
Warmup done.  Checking results:
         Compare keys (selector 1): PASS

Sorting reference solution on CPU (1 segments)... Done.

Testing bits [0,16) of 10bfloat16_t keys with gen-mode 2
CUB keys-only cub::DeviceRadixSort 24000000 items, 1 segments, 2-byte keys (10bfloat16_t) 0-byte values (N3cub8NullTypeE), descending 0, begin_bit 0, end_bit 16
Invoking upsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 3 SM occupancy, current bit 0, bit_grain 6
Invoking scan_kernel<<<1, 512, 0, 0>>>(), 23 items per thread
Invoking downsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 1 SM occupancy
Invoking upsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 3 SM occupancy, current bit 6, bit_grain 6
Invoking scan_kernel<<<1, 512, 0, 0>>>(), 23 items per thread
Invoking downsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 1 SM occupancy
Invoking upsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 3 SM occupancy, current bit 12, bit_grain 4
Invoking scan_kernel<<<1, 512, 0, 0>>>(), 23 items per thread
Invoking downsweep_kernel<<<420, 256, 0, 0>>>(), 47 items per thread, 1 SM occupancy
Warmup done.  Checking results:
INCORRECT: [11953215]: -0 != 0   Compare keys (selector 1): FAIL

(../dependencies/cub/test/test_device_radix_sort.cu: 884)

0% tests passed, 1 tests failed out of 1

Total Test time (real) =  14.68 sec

The following tests FAILED:
        299 - cub.cpp17.test.device_radix_sort.minimal (Failed)
Errors while running CTest

Test Time: 0m14.687s

bernhardmgruber commented 1 month ago

ICC support is deprecated. I guess we will not investigate the cause of this issue.