NVIDIA / cccl

CUDA Core Compute Libraries
https://nvidia.github.io/cccl/

[BUG]: Combining proclaim_return_type and make_zip_function breaks in certain cases #2344

Open · pauleonix opened 2 months ago

pauleonix commented 2 months ago

Is this a duplicate?

Type of Bug

Compile-time Error

Component

Not sure

Describe the bug

With CCCL 2.2.0 it was possible to combine these two on a device lambda to take input from a zip_iterator. Since CCCL 2.3.x/CUDA 12.4 this no longer compiles when the lambda returns a thrust::tuple and the wrappers are nested in this order:

thrust::make_zip_function(
    cuda::proclaim_return_type<thrust::tuple<...>>(
        [] __device__ (...) {...}))

Swapping the two seems to have solved the issue for me, i.e.

cuda::proclaim_return_type<thrust::tuple<...>>(
    thrust::make_zip_function(
        [] __device__ (...) {...}))

The compiler error is:

/opt/compiler-explorer/cuda/12.4.1/bin/../targets/x86_64-linux/include/cuda/functional:92:26: error: could not convert 'cuda::std::__4::__invoke<__nv_dl_wrapper_t<__nv_dl_tag<int (*)(), main, 1> >&, int, int>(((cuda::__4::__detail::__return_type_wrapper<cuda::std::__4::tuple<int, int>, __nv_dl_wrapper_t<__nv_dl_tag<int (*)(), main, 1> > >*)this)->cuda::__4::__detail::__return_type_wrapper<cuda::std::__4::tuple<int, int>, __nv_dl_wrapper_t<__nv_dl_tag<int (*)(), main, 1> > >::__fn_, (* & cuda::std::__4::forward<int>((* & __as#0))), (* & cuda::std::__4::forward<int>((* & __as#1))))' from 'int' to 'cuda::std::__4::tuple<int, int>'
   92 |     return _CUDA_VSTD::__invoke(__fn_, _CUDA_VSTD::forward<_As>(__as)...);
      |        ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                          |
      |                          int

How to Reproduce

The reproducer is basically a one-liner. I chose thrust::tuple<int, int> as both input and output of the device lambda, which is just an identity operation here.

I also needed to combine this construct with an actual thrust::zip_iterator for the compiler error to materialize in the reproducer, so I added a combination of transform_iterator, zip_iterator and counting_iterators:

// Compile with: nvcc --extended-lambda
#include <cuda/functional>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <thrust/zip_function.h>
int main() {
    auto iter = thrust::make_transform_iterator(
        thrust::make_zip_iterator(
            thrust::make_counting_iterator(0),
            thrust::make_counting_iterator(42)),
        thrust::make_zip_function(
            cuda::proclaim_return_type<thrust::tuple<int, int>>(
                [] __device__ (int a, int b) {
                    return thrust::make_tuple(a, b);
                })));
}

Expected behavior

It would be nice if this compiled independent of the order, as it did with CCCL 2.2.0. Naively, the failing ordering seems to make more sense, because then the compiler knows which return type the zip_function should "inherit".

Reproduction link

https://cuda.godbolt.org/z/McznoKnGx

Operating System

No response

nvidia-smi output

No response

NVCC version

No response

miscco commented 2 months ago

I believe this is actually a very intricate bug in the compiler, where device lambdas interact strangely with the deduced return types of the invoke machinery.

This comes from instantiating result_of_adaptable_function, which in turn instantiates __invoke_of for the proclaim_return_type-wrapped lambda.

However, it does not use the explicit return type of proclaim_return_type, but instead tries to match it against the "return type" of the device lambda, which is not the actual return type.
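
For illustration, a minimal sketch of the callable in question in isolation (my addition, not code from the thread; the trait names come from the comment above and the error message):

// Compile with: nvcc --extended-lambda
#include <cuda/functional>
#include <thrust/tuple.h>

int main() {
    // The wrapped callable from the reproducer, in isolation.
    auto f = cuda::proclaim_return_type<thrust::tuple<int, int>>(
        [] __device__ (int a, int b) { return thrust::make_tuple(a, b); });

    // Thrust's result_of_adaptable_function instantiates the internal
    // __invoke_of trait on decltype(f). Per the comment above, that deduction
    // does not stop at the proclaimed return type but reaches into the
    // extended __device__ lambda, whose host-side "return type" (the stray
    // `int` in the error message) is not the actual one.
    (void)f;
}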

miscco commented 2 months ago

@pauleonix I tried working around this a bit more but it seems that there is indeed a compiler bug that we need to reduce.

In the meantime you can work around the issue by adding a trailing return type to the lambda.
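
Applied to the reproducer, that might look like the following (my sketch; the only change is the explicit trailing return type):

auto iter = thrust::make_transform_iterator(
    thrust::make_zip_iterator(
        thrust::make_counting_iterator(0),
        thrust::make_counting_iterator(42)),
    thrust::make_zip_function(
        cuda::proclaim_return_type<thrust::tuple<int, int>>(
            // Trailing return type spelled out, as suggested above.
            [] __device__ (int a, int b) -> thrust::tuple<int, int> {
                return thrust::make_tuple(a, b);
            })));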

Note that depending on your CTK version you might be able to skip the proclaim_return_type workaround entirely in that case (CTK 12.4 and above).
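
In that case the composition might reduce to the following (my sketch, assuming the trailing return type alone is enough on CTK 12.4 and above):

auto iter = thrust::make_transform_iterator(
    thrust::make_zip_iterator(
        thrust::make_counting_iterator(0),
        thrust::make_counting_iterator(42)),
    thrust::make_zip_function(
        // No proclaim_return_type: the trailing return type carries the
        // type information, assuming CTK 12.4 or newer per the comment above.
        [] __device__ (int a, int b) -> thrust::tuple<int, int> {
            return thrust::make_tuple(a, b);
        }));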

pauleonix commented 2 months ago

> Note that depending on your CTK version you might be able to skip the proclaim_return_type workaround entirely in that case (CTK 12.4 and above).

That is great news (to me)! 🎉