NVIDIA / cccl

CUDA Core Compute Libraries
Other
1.07k stars 127 forks source link

Device-side launch of thrust::lower_bound is creating wrong results #814

Open zasdfgbnm opened 2 years ago

zasdfgbnm commented 2 years ago
#include <thrust/binary_search.h>
#include <thrust/execution_policy.h>
#include <stdio.h>

__global__ void lowerbound(float inp_val) {
    constexpr int size = 6;
    float a[size] = {0.1, 0.2, 0.4, 0.6, 0.8, 1.};
    auto result = thrust::lower_bound(
        thrust::device, a, a + size, inp_val);
    printf("%ld\n", result - a);
}

int main() {
    lowerbound<<<1,1>>>(0.0);
    lowerbound<<<1,1>>>(0.1);
    lowerbound<<<1,1>>>(0.2);
    lowerbound<<<1,1>>>(0.3);
    lowerbound<<<1,1>>>(0.4);
    lowerbound<<<1,1>>>(0.5);
    cudaDeviceSynchronize();
}

I get

0
0
0
0
0
0

on CUDA 11.7 with the latest thrust

zasdfgbnm commented 2 years ago

Related customer issue: https://github.com/pytorch/pytorch/pull/80714

alliepiper commented 2 years ago

This is likely the same issues as NVIDIA/cccl#767.

zasdfgbnm commented 2 years ago

Wait, is the thrust::lower_bound here a CDP launch? From nsys nvprof, I only see one kernel, which is the one I wrote:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)        Name       
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  -----------------
    100.0          347,426          6  57,904.3  56,032.0    54,560    69,057      5,541.6  lowerbound(float)
gevtushenko commented 2 years ago

@zasdfgbnm CDP kernels don't show up in profiles

ngimel commented 2 years ago

Is this thrust regression, compiler regression or CDP regression? If it's compiler regression it might show up in other circumstances, even if users are not calling thrust::lower_bound.

ngimel commented 2 years ago

This is definitely not a cdp launch 1) when compiling, rdc=true -lcudadevrt isn't required 2) nsys shows all 6 kernels launched, whereas when cdp is used, not only does it not show the kernels that contain cdp launches, it doesn't show all subsequent regular kernels. So either thrust source changed leading to wrong results, or compiler regressed (not related to cdp).

Also, since it's not cdp, thrust shouldn't be deprecating it (?) and thus it should be fixed, as silent wrong results are the worst.

alliepiper commented 2 years ago

Updated the title (removed CDP launch info), prioritizing for the 2.1 milestone.

jrhemstad commented 2 years ago

In an offline discussion with @allisonvacanti I was surprised to learn that thrust::device is expected to work in device code when relocatable device code (rdc) is not enabled.

In this scenario, it is expected that the algorithm will behave equivalently to thrust::seq.

I verified that explicitly using thrust::seq in @zasdfgbnm's reproducer produces the correct results:

#include <thrust/binary_search.h>
#include <thrust/execution_policy.h>
#include <stdio.h>
#include <type_traits>

__global__ void lowerbound(float inp_val) {
    constexpr int size = 6;
    float a[size] = {0.1, 0.2, 0.4, 0.6, 0.8, 1.};
    auto result = thrust::lower_bound(thrust::seq, a, a + size, inp_val);
    printf("%ld\n", result - a);
}

int main() {
    lowerbound<<<1,1>>>(0.0);
    lowerbound<<<1,1>>>(0.1);
    lowerbound<<<1,1>>>(0.2);
    lowerbound<<<1,1>>>(0.3);
    lowerbound<<<1,1>>>(0.4);
    lowerbound<<<1,1>>>(0.5);
    cudaDeviceSynchronize();
}
...
0
0
1
2
2
3

So clearly something bad is happening in the translation of thrust::device to the equivalent of thrust::seq/thrust::generic.

jrhemstad commented 2 years ago

I'm still baffled by the expectation that thrust::device inside a kernel does not imply CDP.

Looking at the generated PTX, invoking thrust::lower_bound(thrust::device,...) is definitely attempting to use CDP: https://godbolt.org/z/K7fMsPPKK

fkallen commented 2 years ago

I think the issue is in the usage of cross-system copy_n: https://github.com/NVIDIA/thrust/blob/main/thrust/system/detail/generic/binary_search.inl#L153

For example, for lowerbound<<<1,1>>>(0.5); using thrust::device works if copy_n is replaced by ordinary assignments

  thrust::detail::temporary_array<T,DerivedPolicy>          d_value(exec,1);
  thrust::detail::temporary_array<OutputType,DerivedPolicy> d_output(exec,1);

  (*d_value.begin()) = value;

  // perform the query
  thrust::system::detail::generic::detail::binary_search(exec, begin, end, d_value.begin(), d_value.end(), d_output.begin(), comp, func);

  OutputType output;
  output = *d_output.begin();

  return output;

I also tried to add assert((*d_value.begin()) == value); after the first copy_n. This assertion fails in above test case when thrust::device is used.

Stepping through the code using cuda-gdb seems impossible because of an Illegal Instruction error during execution when compiled with -G

weiguangzhao commented 1 year ago

Looking forward to the updated version. And ”thrust::seq“ works for me.