Open Artem-B opened 7 months ago
Hey @Artem-B, thanks for the heads up.
Good timing! We're just starting the process of setting up nightly CI jobs to do things like build/run our tests with sanitizers enabled so we can find and resolve this kind of stuff.
I just added a task to https://github.com/NVIDIA/cccl/issues/1619 to include setting up jobs that build with these sanitizer options enabled. Once we get that infrastructure set up, we can go chip away at fixing the issues that come up.
I just stumbled over this as well when trying UBSan to hunt a bug. We should get that fixed.
Similarly, I see issues reported by MSan as well.
This also appears to affect cub tests if they are built without optimizations.
(gdb) r
Starting program: /google/obj/workspace/59020db8998c499a49126ed0daf698aa034958bc800d56c31bc15c93b4d9bbce/ecad6e51-6ea9-4661-8eb4-75ae4e6417cc/blaze-out/k8-dbg/bin/third_party/gpus/cccl/v2_6_0/cub/catch2_test_util_device.lid_2_bin -a
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/grte/v5/lib64/libthread_db.so.1".
Thread 1 "catch2_test_uti" received signal SIGILL, Illegal instruction.
0x000055555908a373 in thrust::THRUST_200601___CUDA_ARCH_LIST___NS::reference<int, thrust::THRUST_200601___CUDA_ARCH_LIST___NS::device_ptr<int>, thrust::THRUST_200601___CUDA_ARCH_LIST___NS::device_reference<int> >::convert_to_value_type<thrust::THRUST_200601___CUDA_ARCH_LIST___NS::cuda_cub::tag> (this=0x7fffffffc550, system=0x0) at third_party/gpus/cccl/v2_6_0/thrust/thrust/detail/reference.h:330
330 return strip_const_get_value(select_system(*system));
(gdb) bt
#0 0x000055555908a373 in thrust::THRUST_200601___CUDA_ARCH_LIST___NS::reference<int, thrust::THRUST_200601___CUDA_ARCH_LIST___NS::device_ptr<int>, thrust::THRUST_200601___CUDA_ARCH_LIST___NS::device_reference<int> >::convert_to_value_type<thrust::THRUST_200601___CUDA_ARCH_LIST___NS::cuda_cub::tag> (this=0x7fffffffc550, system=0x0) at third_party/gpus/cccl/v2_6_0/thrust/thrust/detail/reference.h:330
#1 0x000055555907de86 in thrust::THRUST_200601___CUDA_ARCH_LIST___NS::reference<int, thrust::THRUST_200601___CUDA_ARCH_LIST___NS::device_ptr<int>, thrust::THRUST_200601___CUDA_ARCH_LIST___NS::device_reference<int> >::operator int (this=0x7fffffffc550)
at third_party/gpus/cccl/v2_6_0/thrust/thrust/detail/reference.h:186
#2 0x0000555558ff1935 in C_A_T_C_H_T_E_M_P_L_A_T_E_T_E_S_T_F_U_N_C_0<metal::list<> >() () at third_party/gpus/cccl/v2_6_0/cub/test/catch2_test_util_device.cu.cc:77
#3 0x0000555558fca698 in Catch::TestInvokerAsFunction::invoke (this=0x114c3fe147a0) at third_party/catch/single_include/catch2/catch.hpp:14330
#4 0x0000555558fbfe83 in Catch::TestCase::invoke (this=0x114c3fee0b40) at third_party/catch/single_include/catch2/catch.hpp:14169
#5 0x0000555558fbfd37 in Catch::RunContext::invokeActiveTestCase (this=0x7fffffffcf70) at third_party/catch/single_include/catch2/catch.hpp:13025
#6 0x0000555558fbd305 in Catch::RunContext::runCurrentTest (this=0x7fffffffcf70, redirectedCout=..., redirectedCerr=...) at third_party/catch/single_include/catch2/catch.hpp:12998
#7 0x0000555558fbbdbb in Catch::RunContext::runTest (this=0x7fffffffcf70, testCase=...) at third_party/catch/single_include/catch2/catch.hpp:12759
#8 0x0000555558fc4a7e in Catch::(anonymous namespace)::TestGroup::execute (this=0x7fffffffcf60) at third_party/catch/single_include/catch2/catch.hpp:13352
#9 0x0000555558fc34fe in Catch::Session::runInternal (this=0x7fffffffd400) at third_party/catch/single_include/catch2/catch.hpp:13562
#10 0x0000555558fc301c in Catch::Session::run (this=0x7fffffffd400) at third_party/catch/single_include/catch2/catch.hpp:13518
#11 0x0000555559018eb0 in Catch::Session::run<char> (this=0x7fffffffd400, argc=2, argv=0x7fffffffd668) at third_party/catch/single_include/catch2/catch.hpp:13236
#12 0x0000555558fe79ca in main (argc=2, argv=0x7fffffffd668) at third_party/gpus/cccl/v2_6_0/cub/test/catch2_main.cuh:68
(gdb) x/10i $pc
=> 0x555558f1c706 <_ZL43C_A_T_C_H_T_E_M_P_L_A_T_E_T_E_S_T_F_U_N_C_0IN5metal4listIJEEEEvv+86>: ud1 0x16(%eax),%eax
0x555558f1c70b <_ZL43C_A_T_C_H_T_E_M_P_L_A_T_E_T_E_S_T_F_U_N_C_0IN5metal4listIJEEEEvv+91>: mov %rax,%rbx
0x555558f1c70e <_ZL43C_A_T_C_H_T_E_M_P_L_A_T_E_T_E_S_T_F_U_N_C_0IN5metal4listIJEEEEvv+94>: lea -0x30(%rbp),%rdi
That ud1 instruction is a tell-tale sign that we've reached a point we should never have reached.
Ugh. What can possibly go wrong here...
// This is inherently hazardous, as it discards the strong type information
// about what system the object is on.
_CCCL_HOST_DEVICE operator value_type() const
{
  // Avoid default-constructing a system; instead, just use a null pointer
  // for dispatch. This assumes that `get_value` will not access any system
  // state.
  typename thrust::iterator_system<pointer>::type* system = nullptr;
  return convert_to_value_type(system);
}
... because, of course we jump straight to this:
template <typename System>
_CCCL_HOST_DEVICE value_type convert_to_value_type(System* system) const
{
  using thrust::system::detail::generic::select_system;
  return strip_const_get_value(select_system(*system));
}
@brycelelbach Looks like it was introduced by 4fd1b54cece96c56e49d6a3fc8df6c4ab1c9499c a while back.
Any suggestions on how we can guarantee that system is not dereferenced, or implement the value type check some other way that avoids undefined behavior?
@miscco could you scope what it would take to fix this? It feels like there must be a better way to do that dispatch than passing around nullptrs.
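One possible direction, sketched here only as an illustration, is to dispatch on an empty wrapper type that carries the system's static type, so that neither a system object nor a null `System*` is needed. Every name below (`host_system`, `device_system`, `system_type`, `load_value`, `toy_reference`) is hypothetical and not part of Thrust. The sketch also relies on the same assumption the original comment states: that reading the value does not need any system state.

```cpp
#include <cassert>

// Hypothetical stand-ins for Thrust system types; not real Thrust API.
struct host_system {};
struct device_system {
  // Deliberately not default-constructible, to mirror the concern in the
  // original comment about default-constructing a system.
  explicit device_system(int id) : stream_id(id) {}
  int stream_id;
};

// Empty wrapper that carries only the static type of a system.
template <typename System>
struct system_type {};

enum class path { host, device };

struct loaded_value {
  int value;
  path taken;
};

// Overloads are selected from the wrapper's type alone, so no System
// object is constructed and no System* is dereferenced.
inline loaded_value load_value(system_type<host_system>, const int* p) {
  return {*p, path::host};
}
inline loaded_value load_value(system_type<device_system>, const int* p) {
  // A real device path would copy from device memory; this toy reads directly.
  return {*p, path::device};
}

template <typename System>
class toy_reference {
public:
  explicit toy_reference(const int* p) : ptr_(p) {}

  loaded_value load() const { return load_value(system_type<System>{}, ptr_); }
  operator int() const { return load().value; }

private:
  const int* ptr_;
};
```

Converting a `toy_reference<device_system>` to `int` then picks the device overload at compile time without ever forming a null reference. Whether this shape fits Thrust's actual `select_system` machinery is an open question.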
As a temporary workaround, the convert_to_value_type function can be annotated with __attribute__((no_sanitize("null"))):
https://github.com/NVIDIA/cccl/blob/18043cb6379c9339b7758048beb2e783f29379bd/thrust/thrust/detail/reference.h#L327
That unbreaks about half of the tests for me (-fsanitize=null is enabled by default in my builds). Looks like there are other places where this would need to be applied.
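As a rough sketch of what such an annotation looks like, the toy function below mirrors the shape of the pointer-based dispatch. The macro `TOY_NO_SANITIZE_NULL` and the names `toy_system`, `select_value`, and `convert_to_value` are hypothetical. The attribute only disables the null-pointer instrumentation for that function; it does not make dereferencing a null pointer well-defined.

```cpp
#include <cassert>

// Hypothetical helper macro: expands to the attribute only where the
// compiler reports support for no_sanitize, and to nothing otherwise.
#if defined(__has_attribute)
#  if __has_attribute(no_sanitize)
#    define TOY_NO_SANITIZE_NULL __attribute__((no_sanitize("null")))
#  endif
#endif
#ifndef TOY_NO_SANITIZE_NULL
#  define TOY_NO_SANITIZE_NULL
#endif

// Hypothetical stand-in for a system tag.
struct toy_system {};

// Ignores its system argument, just as the original comment assumes
// get_value does.
inline int select_value(toy_system&, const int* p) { return *p; }

// The annotated function: UBSan's null check is not emitted here, so
// `*system` is no longer instrumented.
template <typename System>
TOY_NO_SANITIZE_NULL int convert_to_value(System* system, const int* p) {
  return select_value(*system, p);
}
```

The sketch is only exercised with a valid `toy_system` object, since the annotation hides the report rather than fixing the underlying problem.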
Is this a duplicate?
Type of Bug
Something else
Component
Thrust
Describe the bug
I'm porting thrust tests to our internal build at work.
One of the things we have enabled in our build by default is a subset of UB sanitizer, and I've noticed that thrust tests appear to have a lot of null pointer dereference failures. Some of the UBSan-enabled builds result in a crash. Enabling UBSan sometimes preserves the offending code, which would otherwise be removed by the compiler, because it's allowed to treat UB in whatever way it wants.
Probing a few of the failures more deeply suggests that the issues are real. E.g. the test_async_copy_after test in async_copy.cu apparently attempts to dereference a null pointer.
How to Reproduce
Add the following lines to thrust/CMakeLists.txt:
Run thrust tests with:
Adding -O0 will likely make even more tests fail, as with high optimizations a lot of code gets eliminated before it gets a chance to be instrumented.
Observe the test failures:
79% tests passed, 77 tests failed out of 362. Test output: https://gist.github.com/Artem-B/9d59658f4c64940c2da4d59fd14096f4
23% tests passed, 280 tests failed out of 362
Note that non-failing tests also report UB violations, but they are hidden by the test framework. If you want to force all of them to turn into test failures (alas, it's just a crash with no useful diagnostics attached), use the following flags, and skip add_link_options():
Expected behavior
Thrust should not be relying on UB in general, and in particular when it comes to null pointers. The compiler can and does optimize code on the assumption that UB never happens. Sooner or later that will become a problem. It's possible that it already is and we just haven't noticed it yet.
Most of the ubsan reports are associated with the same few locations, so the root cause is probably fairly localized.
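To illustrate the point about optimizations, here is a minimal sketch (the function names are made up for this example) of how the order of a dereference and a null check matters. In the first function the dereference comes first, so an optimizer is entitled to assume the pointer is non-null and may drop the later check. The second function checks before reading and stays well-defined for a null argument.

```cpp
#include <cassert>

// Hazardous ordering: because *p is evaluated first, the compiler may
// assume p != nullptr and treat the check below as dead code.
// Calling this with a null pointer is undefined behavior.
inline int read_then_check(const int* p) {
  int v = *p;
  if (p == nullptr) {
    return -1;
  }
  return v;
}

// Safe ordering: the check happens before any dereference, so a null
// argument takes the early return and is never read through.
inline int check_then_read(const int* p) {
  if (p == nullptr) {
    return -1;
  }
  return *p;
}
```

Both functions agree for valid pointers; only `check_then_read` may be called with `nullptr`.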
Reproduction link
No response
Operating System
Debian/testing
nvidia-smi output
not applicable.
NVCC version