[BUG]: A lot of thrust tests are failing when built with `-fsanitize=null`

Artem-B commented 7 months ago

Is this a duplicate?

[X] I confirmed there appear to be no duplicate issues for this bug and that I agree to the Code of Conduct

Type of Bug

Something else

Component

Thrust

Describe the bug

I'm porting thrust tests to our internal build at work.

One of the things we have enabled in our build by default is a subset of UB sanitizer, and I've noticed that thrust tests appear to have a lot of nupp pointer dereference failures. Some of the UB-enabled builds result in a crash. Enabling UB sometimes preserves the offending code, which would otherwise be removed by compiler, because it's allowed to treat UB in whatever way it wants.

Probing few of the failures deeper suggests that that the issues are real. E.g. test_async_copy_after test in async_copy.cu apparently attempts to dereference a null pointer.

How to Reproduce

Add the following lines to thrust/CMakeLists.txt:

# Force enable minimal UB sanitizer.
add_compile_options(-fsanitize-trap=all -fsanitize=null -fno-sanitize-trap=undefined)
add_link_options(-fsanitize=undefined)

Run thrust tests with:

PATH=$HOME/local/cuda-12.3.1/bin:$PATH bash -x ci/test_thrust.sh -cxx clang++-17 -std 20 -arch 80-real -cuda clang++-17

Adding -O0 will likely make even more tests fail as with high optimizations a lot of code gets eliminated before it gets a chance to be instrumented.

Observe the test failures:

with non-fatal ubsan: 79% tests passed, 77 tests failed out of 362. Test output: https://gist.github.com/Artem-B/9d59658f4c64940c2da4d59fd14096f4
with fatal ubsan: 23% tests passed, 280 tests failed out of 362

Note that non-failing tests also report UB violations, but they are hidden by the test framework. If you want to force all of them to turn into test failures (alas, it's just a crash with no useful diagnostics attached), use the following flags, and skip add_link_options():

add_compile_options(-fsanitize-trap=all -fsanitize=null)

Expected behavior

Thrust should not be relying on UB in general, and in particular when it comes to null pointers. Compiler can and does optimize code on the assumption that UB never happens. Sooner or later that will become a problem. It's possible that it already is, we just didn't notice it, yet.

Most of the ubsan reports are associated with the same few locations, so the root cause is probably fairly localized.

Reproduction link

No response

Operating System

Debian/testing

nvidia-smi output

not applicable.

NVCC version

release build of clang++-17
CUDA-12.3.1, though CUDA version is not particularly relevant here, as it's about host-side compilation w/ clang.

jrhemstad commented 7 months ago

Hey @Artem-B, thanks for the heads up.

Good timing! We're just starting the process of setting up nightly CI jobs to do things like build/run our tests with sanitizers enabled so we can find and resolve this kind of stuff.

I just added a task to https://github.com/NVIDIA/cccl/issues/1619 to include setting up jobs that build with these sanitizer options enabled. Once we get that infrastructure setup, we can go chip away at fixing the issues that come up.

bernhardmgruber commented 1 week ago

I just stumbled over this as well when trying UBSan to hunt a bug. We should get that fixed.

Similarly, I see issues reported by MSan as well.

Artem-B commented 1 day ago

This also appears to affect cub tests if they are built without optimizations.

(gdb) r
Starting program: /google/obj/workspace/59020db8998c499a49126ed0daf698aa034958bc800d56c31bc15c93b4d9bbce/ecad6e51-6ea9-4661-8eb4-75ae4e6417cc/blaze-out/k8-dbg/bin/third_party/gpus/cccl/v2_6_0/cub/catch2_test_util_device.lid_2_bin -a
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/grte/v5/lib64/libthread_db.so.1".

Thread 1 "catch2_test_uti" received signal SIGILL, Illegal instruction.
0x000055555908a373 in thrust::THRUST_200601___CUDA_ARCH_LIST___NS::reference<int, thrust::THRUST_200601___CUDA_ARCH_LIST___NS::device_ptr<int>, thrust::THRUST_200601___CUDA_ARCH_LIST___NS::device_reference<int> >::convert_to_value_type<thrust::THRUST_200601___CUDA_ARCH_LIST___NS::cuda_cub::tag> (this=0x7fffffffc550, system=0x0) at third_party/gpus/cccl/v2_6_0/thrust/thrust/detail/reference.h:330
330         return strip_const_get_value(select_system(*system));
(gdb) bt
#0  0x000055555908a373 in thrust::THRUST_200601___CUDA_ARCH_LIST___NS::reference<int, thrust::THRUST_200601___CUDA_ARCH_LIST___NS::device_ptr<int>, thrust::THRUST_200601___CUDA_ARCH_LIST___NS::device_reference<int> >::convert_to_value_type<thrust::THRUST_200601___CUDA_ARCH_LIST___NS::cuda_cub::tag> (this=0x7fffffffc550, system=0x0) at third_party/gpus/cccl/v2_6_0/thrust/thrust/detail/reference.h:330
#1  0x000055555907de86 in thrust::THRUST_200601___CUDA_ARCH_LIST___NS::reference<int, thrust::THRUST_200601___CUDA_ARCH_LIST___NS::device_ptr<int>, thrust::THRUST_200601___CUDA_ARCH_LIST___NS::device_reference<int> >::operator int (this=0x7fffffffc550)
    at third_party/gpus/cccl/v2_6_0/thrust/thrust/detail/reference.h:186
#2  0x0000555558ff1935 in C_A_T_C_H_T_E_M_P_L_A_T_E_T_E_S_T_F_U_N_C_0<metal::list<> >() () at third_party/gpus/cccl/v2_6_0/cub/test/catch2_test_util_device.cu.cc:77
#3  0x0000555558fca698 in Catch::TestInvokerAsFunction::invoke (this=0x114c3fe147a0) at third_party/catch/single_include/catch2/catch.hpp:14330
#4  0x0000555558fbfe83 in Catch::TestCase::invoke (this=0x114c3fee0b40) at third_party/catch/single_include/catch2/catch.hpp:14169
#5  0x0000555558fbfd37 in Catch::RunContext::invokeActiveTestCase (this=0x7fffffffcf70) at third_party/catch/single_include/catch2/catch.hpp:13025
#6  0x0000555558fbd305 in Catch::RunContext::runCurrentTest (this=0x7fffffffcf70, redirectedCout=..., redirectedCerr=...) at third_party/catch/single_include/catch2/catch.hpp:12998
#7  0x0000555558fbbdbb in Catch::RunContext::runTest (this=0x7fffffffcf70, testCase=...) at third_party/catch/single_include/catch2/catch.hpp:12759
#8  0x0000555558fc4a7e in Catch::(anonymous namespace)::TestGroup::execute (this=0x7fffffffcf60) at third_party/catch/single_include/catch2/catch.hpp:13352
#9  0x0000555558fc34fe in Catch::Session::runInternal (this=0x7fffffffd400) at third_party/catch/single_include/catch2/catch.hpp:13562
#10 0x0000555558fc301c in Catch::Session::run (this=0x7fffffffd400) at third_party/catch/single_include/catch2/catch.hpp:13518
#11 0x0000555559018eb0 in Catch::Session::run<char> (this=0x7fffffffd400, argc=2, argv=0x7fffffffd668) at third_party/catch/single_include/catch2/catch.hpp:13236
#12 0x0000555558fe79ca in main (argc=2, argv=0x7fffffffd668) at third_party/gpus/cccl/v2_6_0/cub/test/catch2_main.cuh:68

(gdb) x/10i $pc
=> 0x555558f1c706 <_ZL43C_A_T_C_H_T_E_M_P_L_A_T_E_T_E_S_T_F_U_N_C_0IN5metal4listIJEEEEvv+86>:   ud1    0x16(%eax),%eax
   0x555558f1c70b <_ZL43C_A_T_C_H_T_E_M_P_L_A_T_E_T_E_S_T_F_U_N_C_0IN5metal4listIJEEEEvv+91>:   mov    %rax,%rbx
   0x555558f1c70e <_ZL43C_A_T_C_H_T_E_M_P_L_A_T_E_T_E_S_T_F_U_N_C_0IN5metal4listIJEEEEvv+94>:   lea    -0x30(%rbp),%rdi

That ud1 instruction is a tell-tale sign that we've got to the point we should not have.

Artem-B commented 1 day ago

Ugh. What can possibly go wrong here...

https://github.com/NVIDIA/cccl/blob/18043cb6379c9339b7758048beb2e783f29379bd/thrust/thrust/detail/reference.h#L182

  // This is inherently hazardous, as it discards the strong type information
  // about what system the object is on.
  _CCCL_HOST_DEVICE operator value_type() const
  {
    // Avoid default-constructing a system; instead, just use a null pointer
    // for dispatch. This assumes that `get_value` will not access any system
    // state.
    typename thrust::iterator_system<pointer>::type* system = nullptr;
    return convert_to_value_type(system);
  }

... because, of course we jump straight to this:

https://github.com/NVIDIA/cccl/blob/18043cb6379c9339b7758048beb2e783f29379bd/thrust/thrust/detail/reference.h#L330

template <typename System>
  _CCCL_HOST_DEVICE value_type convert_to_value_type(System* system) const
  {
    using thrust::system::detail::generic::select_system;
    return strip_const_get_value(select_system(*system));
  }

Artem-B commented 1 day ago

@brycelelbach Looks like it was introduced by 4fd1b54cece96c56e49d6a3fc8df6c4ab1c9499c a while back.

Any suggestions on how we can guarantee that system is not dereferenced, or implement the value type check some other way that avoids undefined behavior?

jrhemstad commented 1 day ago

@miscco could you scope what it would take to fix this? It feels like there must be a better way to do that dispatch than passing around nullptrs.

Artem-B commented 1 day ago

As a temporary workaround, the convert_to_value_type function can be annotated with __attribute__((no_sanitize("null"))): https://github.com/NVIDIA/cccl/blob/18043cb6379c9339b7758048beb2e783f29379bd/thrust/thrust/detail/reference.h#L327 That unbreaks about half of the tests () for me (-fsanitize=null is enabled by default in my builds). Looks like there are other places where this would need to be applied.

NVIDIA / cccl