[BUG]: Latest CCCL breaks CuPy

NVIDIA / cccl

CUDA Core Compute Libraries

https://nvidia.github.io/cccl/

Other

1.17k stars 140 forks source link

[BUG]: Latest CCCL breaks CuPy #2109

Open leofang opened 2 months ago

leofang commented 2 months ago

Is this a duplicate?

[X] I confirmed there appear to be no duplicate issues for this bug and that I agree to the Code of Conduct

Type of Bug

Compile-time Error

Component

Thrust

Describe the bug

CCCL diff (between yesterday and today): https://github.com/cupy/cccl/compare/2246a8772a78346d91e580e121cedb820da2648f...d4f928e45a6ab452ff6a9de02769d3c9532ee423

Yesterday it built fine. Today we see this (from https://ci.preferred.jp/cupy.linux.cuda-build/166014/):

[00:03:14.784005 STDOUT 1352]](https://ci.preferred.jp/cupy.linux.cuda-build/166014/#L5136)   cupy/cuda/cupy_thrust.cu(156): error: inherited member is not allowed 
[00:03:14.784282 STDOUT 1352]](https://ci.preferred.jp/cupy.linux.cuda-build/166014/#L5137)     bool thrust::less<complex<float>>::operator() ( 
[00:03:14.784834 STDOUT 1352]](https://ci.preferred.jp/cupy.linux.cuda-build/166014/#L5138)                                        ^    
[00:03:14.785276 STDOUT 1352]](https://ci.preferred.jp/cupy.linux.cuda-build/166014/#L5139)     
[00:03:14.785932 STDOUT 1352]](https://ci.preferred.jp/cupy.linux.cuda-build/166014/#L5140)   cupy/cuda/cupy_thrust.cu(156): error: declaration is incompatible with "bool cuda::std::__4::less<_Tp>::operator()(const _Tp &, const _Tp &) const [with _Tp=thrust::THRUST_200600_700_NS::complex<float>]" (declared at line 350 of /tmp/tmp.8dn9ZcMlIG/cupy/_core/include/cupy/_cccl/libcudacxx/cuda/std/__functional/operations.h) 
[00:03:14.786311 STDOUT 1352]](https://ci.preferred.jp/cupy.linux.cuda-build/166014/#L5143)     bool thrust::less<complex<float>>::operator() ( 
[00:03:14.786743 STDOUT 1352]](https://ci.preferred.jp/cupy.linux.cuda-build/166014/#L5144)                                        ^    
[00:03:14.787154 STDOUT 1352]](https://ci.preferred.jp/cupy.linux.cuda-build/166014/#L5145)     
[00:03:14.787625 STDOUT 1352]](https://ci.preferred.jp/cupy.linux.cuda-build/166014/#L5146)   cupy/cuda/cupy_thrust.cu(154): error: "constexpr" is not valid here   
[00:03:14.788046 STDOUT 1352]](https://ci.preferred.jp/cupy.linux.cuda-build/166014/#L5147)     constexpr   
[00:03:14.788484 STDOUT 1352]](https://ci.preferred.jp/cupy.linux.cuda-build/166014/#L5148)     ^   
[00:03:14.788905 STDOUT 1352]](https://ci.preferred.jp/cupy.linux.cuda-build/166014/#L5149)

How to Reproduce

Build https://github.com/cupy/cupy/pull/8412 from source

Expected behavior

No compile-time errors.

Reproduction link

No response

Operating System

No response

nvidia-smi output

No response

NVCC version

No response

leofang commented 2 months ago

Build https://github.com/cupy/cupy/pull/8412 from source

To unblock myself I've force-pushed back to the snapshot from yesterday, but starting from this commit https://github.com/cupy/cupy/commit/08e6a3c63fe734c259a83f76381ed973d99cd7bd it should be reproducible. By tracing the code change following the compiler error, I believe the culprit is https://github.com/NVIDIA/cccl/pull/1872, in particular https://github.com/NVIDIA/cccl/pull/1872#pullrequestreview-2122194270 (cc: @bernhardmgruber for vis).

bernhardmgruber commented 2 months ago

@leofang thank you for reporting this and I am sorry for the breakage!

The specializations of Thrust's API that cupy performs looks quite extraordinary to me (I actually had to educate myself on the syntax) and I would like to suggest a better way.

In your current implementation, you specialize a member of thrust::less directly:

bool thrust::less<float>::operator() (
    const float& lhs, const float& rhs) const {
    return _real_less<float>(lhs, rhs);
}

which relies on thrust::less having an actual operator(). We changed this in #1872, since we now pull in the operator from ::cuda::std::less. If you specialize the entire functor instead, your code should work before and after the changes of #1872:

template <>
struct thrust::less<float> {
    bool operator() (float lhs, float rhs) const {
        return _real_less<float>(lhs, rhs);
    }
};

This is arguably safer and will also be forward compatible when we eventually alias thust::less to cuda::std::less entirely (we cannot do this yet, because we need to give some notice with the deprecation warnings).

As a general remark: I think redefining the meaning of thrust::less<float> is quite dangerous, since some algorithms depend on the exact meaning of this function object to provide more efficient implementations. E.g., thrust::sort detects whether the comparator is thrust::less<T> to branch into radix sort, and then takes the mathematical properties of "less than" as given, independently of how thrust::less<T> is defined.

leofang commented 2 months ago

Hi Bernhard, thanks for your reply.

which relies on thrust::less having an actual operator(). We changed this in https://github.com/NVIDIA/cccl/pull/1872 ... (we cannot do this yet, because we need to give some notice with the deprecation warnings).

To me this change of behavior requires a deprecation cycle, though I do have a hard time to see where compile-time warnings can be injected for this particular case.

As a general remark: I think redefining the meaning of thrust::less<float> is quite dangerous, ... E.g., thrust::sort detects whether the comparator is thrust::less<T> to branch into radix sort, and then takes the mathematical properties of "less than" as given, independently of how thrust::less<T> is defined.

Is this expectation documented anywhere? Would it be a concern if the comparator fulfills this "less than" property?

To give some context, CuPy had to define a custom comparator in order to follow NumPy behavior, where NaNs are sorted to the end of an array (see code comments here).

An extra note: we cannot keep breaking critical users like CuPy just casually (https://github.com/NVIDIA/cccl/issues/1494). I've been following @jrhemstad's advertisement and encouraging the CuPy team to stay at CCCL head, but my push would be questionable if they keep hitting maintenance issues.

jrhemstad commented 2 months ago

To me this change of behavior requires a deprecation cycle

It does not. The CuPy code is incorrect in how it provides a custom comparison operator. The proper way to provide a custom comparison operator is to pass one as an argument, not reach into internal bits of the library and modify them at will.

This is covered under our compatibility guidelines: https://github.com/NVIDIA/cccl?tab=readme-ov-file#compatibility-guidelines

Do not add any declarations to the thrust::, cub::, nv::, or cuda:: namespaces unless an exception is noted for a specific symbol, e.g., specializing a type trait.

I'll grant that this could be improved to say "Do not add any declarations or template specializations...", but the point remains the same.

All that said, we don't like breaking people (even when it wasn't our fault), so we will be helping to contribute a fix to CuPy.

I look forward to when you have the time to help us set up https://github.com/NVIDIA/cccl/issues/1494!

bernhardmgruber commented 2 months ago

PR for CuPi: https://github.com/cupy/cupy/pull/8446

leofang commented 2 months ago

I had a somewhat lengthy discussion with Jake offline. Below is a summary of what I asked regarding specializing thrust::less vs thrust::less::operator noted above (and the offline thread), for posterity:

At one point we were even encouraged to specialize a CUB internal template (in the context of 64-bit support), so you must forgive CuPy developers to have the impression that everything template is customizable at will.

Perhaps let me ask this way: Why is specializing thrust::less ok (as suggested above and done in https://github.com/cupy/cupy/pull/8446) but specializing thrust::less::operator is not? As per the (being updated) rule number 1, CCCL discourages users from specializing all things under thrust:: anyway. But as a user, all we see in the public API of thrust::less is that there’s one and only one member function we need to define, so it’s unclear where the line is between the two approaches from user perspective.

jrhemstad commented 2 months ago

everything template is customizable at will

The standard practice in C++ is that you should not specialize a template that does not belong to you unless you've been explicitly granted permission to do so. Everyone is at a different place with their experience and expertise in C++, so it's not unreasonable that someone may not be familiar with this yet.

Being given permission to specialize one template in one library does not grant permission to specialize another template in a different library 😅 .

Why is specializing thrust::less ok (as suggested above and done in https://github.com/cupy/cupy/pull/8446) but specializing thrust::less::operator is not?

I explained in https://github.com/cupy/cupy/pull/8446#discussion_r1698853932 that I don't think either specialization is correct. I think @bernhardmgruber was trying to be courteous and do the minimal possible set of changes to unblock cupy, but ultimately, the specialization is still ill-formed.

The good news is that it's very easy to fix by just using a custom comparator type instead of using thrust::less directly.