alliepiper commented 2 years ago

Overview

Some cub::Device* algorithms are/were documented to be run-to-run deterministic, but the implementations no longer fulfill that guarantee. This has been a major pain point for several users who were relying on this behavior. Their code does not behave as expected, and they have no way of fixing this without writing their own kernels.

An additional facet of this issue is that the existing non-deterministic algorithms are significantly faster than the former deterministic implementations, and the regression was accidentally introduced years ago specifically to provide better performance. Fixing this involves a trade-off between fixing functionally-regressed code and introducing performance regressions.

Both deterministic and non-deterministic implementations must continue to be available through public API for both sets of users.

Affected Algorithms

The following algorithms are known to violate current or former guarantees:

DeviceReduce::ReduceByKey is currently documented to provide run-to-run determinism, but doesn't.
DeviceScan::* used to provide the same broken guarantee, and the documentation was temporarily updated pending proper resolution of this issue.

These algorithms also provide such guarantees:

DeviceReduce::Reduce (and derived Sum, Min, Max, etc)
DeviceSegmentedReduce::Reduce

Requirements

Add explicit determinism tests for all affected algorithms that do/did provide determinism guarantees.
Update any algorithms that fail the new tests so that the deterministic behavior is restored.
Expose both the existing non-deterministic and the updated deterministic implementations via public API.
Document variants clearly.
Run new determinism tests in CI to guard against future regressions.

Open Questions

How to handle naming?

How should we spell the deterministic and non-deterministic APIs? The choice is between backward compatibility or a consistent, sensible API. I'm interested in hearing user feedback on which solution would be preferable:

Option 1: Prioritize backwards compatibility.

All existing algorithm names that promise(d) determinism have determinism restored and preserved. Non-deterministic implementations of these algorithms will be introduced under a new name, probably prefixed/suffixed (suggestions welcome). New deterministic APIs of existing non-deterministic algorithms will be introduced under a new name with a different prefix/suffix.

Pros:

Fixes functional regressions.

Cons:

Introduces performance regressions.
Inconsistent naming: The "basic" algorithm spelling will be deterministic for some algorithms but non-deterministic for others. Some algorithms will be suffixed when deterministic, other will be suffixed when non-deterministic.

Option 2: Prioritize consistency.

Algorithm implementations with no special guarantees are exposed through the basic spelling of the algorithm (e.g. Reduce).

Algorithms with determinism guarantees are prefixed (e.g. DeviceDeterministicReduce for run-to-run determinism on the same device, DeterministicReduce for consistently deterministic in all cases).

Pros:

Naming is consistent.
Matches how special requirements and guarantees are handled in other APIs (e.g. DeviceMergeSort has SortKeys, StableSortKeys, and SortKeysCopy).

Cons:

Introduces functional regressions as determinism guarantees on existing APIs are broken.

Scope

This issue is limited to testing and fixing pre-existing determinism guarantees in the explicitly listed algorithms. Requests for new or stronger determinism guarantees should go into new issues.

dkolsen-pgi commented 2 years ago

I prefer option 2. The people who were depending on the determinism are already upset, and I am guessing that they will be (mostly) satisfied with option 2 (because CUB developers are unlikely to make the same mistake again when the function has "Deterministic" in its name). If option 1 is chosen, then a whole new group of users (who care about performance more than determinism) will become upset. Option 2 will result in a happier community (on average, if not everyone individually) and is a better long-term solution.

maddyscientist commented 2 years ago

I also prefer option 2, and completely agree with the points raised above by @dkolsen-pgi.

leofang commented 2 years ago

cc: @kmaehashi (CuPy switched to use CUB by default despite the reproducibility/determinism concern, cupy/cupy#4897)

ekelsen commented 2 years ago

Glad to see this happening! Although not directly affected anymore, my preference would be Approach 2.

duncanriach commented 2 years ago

I also prefer option 2.

lilohuang commented 2 years ago

Option 2 :) thanks.

bhack commented 2 years ago

Is HistogramEven currently deterministic or not?

alliepiper commented 2 years ago

Is HistogramEven currently deterministic or not?

Yes, assuming that your histogram output is an integral type that is either unsigned or won't overflow at runtime.

gevtushenko commented 1 year ago

Update on this issue. We incline towards having the option 2 with some changes. Since there are many forms of determinism (run-to-run, GPU-to-GPU, etc), having everything embedded into the name would explode the number of functions. Instead of encoding requirements in the name, we need an API that would allow users to provide functional requirements as a parameter.

Apart from determinism guarantees, we'd like to co-design this new API with the notion of operator complexity. Users provide quite complex operators to CUB. @elstehle provided an example of such operator:

constexpr int element_count = 5;
using element_t = int;
using item_t = cuda::std::array<element_t, element_count>;
constexpr item_t identity{0, 1, 2, 3, 4};

struct op_t
{
    __host__ __device__ __forceinline__
    item_t operator()(item_t lhs, item_t rhs)
    {
        item_t result{};
        for(int i=0; i < element_count; i++)
        {
            result[i] = rhs[lhs[i]];
        }
        return result;
    }
};

Having something along the "compute-optimal" vs "memory-optimal" requirement in the set of functional requirements alongside determinism requirements seems reasonable. Compute-optimal solution would not be allowed to perform extra invocations of the binary operator.

Since we have to implement a new deterministic scan algorithm anyways, we might experiment with implementing an algorithm that is both deterministic and work optimal.

NVIDIA / cccl

Address issues with existing determinism guarantees #886