alliepiper opened this issue 2 years ago
I prefer option 2. The people who were depending on the determinism are already upset, and I am guessing that they will be (mostly) satisfied with option 2 (because CUB developers are unlikely to make the same mistake again when the function has "Deterministic" in its name). If option 1 is chosen, then a whole new group of users (who care about performance more than determinism) will become upset. Option 2 will result in a happier community (on average, if not everyone individually) and is a better long-term solution.
I also prefer option 2, and completely agree with the points raised above by @dkolsen-pgi.
cc: @kmaehashi (CuPy switched to use CUB by default despite the reproducibility/determinism concern, cupy/cupy#4897)
Glad to see this happening! Although not directly affected anymore, my preference would be Approach 2.
I also prefer option 2.
Option 2 :) thanks.
Is `HistogramEven` currently deterministic or not?
Yes, assuming that your histogram output is an integral type that is either unsigned or won't overflow at runtime.
Update on this issue. We are inclined toward option 2, with some changes. Since there are many forms of determinism (run-to-run, GPU-to-GPU, etc.), embedding every guarantee into the name would explode the number of functions. Instead of encoding requirements in the name, we need an API that lets users provide functional requirements as a parameter.
Apart from determinism guarantees, we'd like to co-design this new API with the notion of operator complexity. Users provide quite complex operators to CUB; @elstehle provided an example of such an operator:
```cpp
constexpr int element_count = 5;
using element_t = int;
using item_t = cuda::std::array<element_t, element_count>;
constexpr item_t identity{0, 1, 2, 3, 4};

struct op_t
{
  __host__ __device__ __forceinline__
  item_t operator()(item_t lhs, item_t rhs) const
  {
    item_t result{};
    for (int i = 0; i < element_count; i++)
    {
      result[i] = rhs[lhs[i]];
    }
    return result;
  }
};
```
Having a "compute-optimal" vs. "memory-optimal" choice in the set of functional requirements, alongside the determinism requirements, seems reasonable. A compute-optimal solution would not be allowed to perform extra invocations of the binary operator.
Since we have to implement a new deterministic scan algorithm anyway, we might experiment with implementing an algorithm that is both deterministic and work-optimal.
Overview
Some `cub::Device*` algorithms are/were documented to be run-to-run deterministic, but the implementations no longer fulfill that guarantee. This has been a major pain point for several users who were relying on this behavior: their code does not behave as expected, and they have no way of fixing it without writing their own kernels.

An additional facet of this issue is that the existing non-deterministic algorithms are significantly faster than the former deterministic implementations, and the regression was accidentally introduced years ago by a change made specifically to improve performance. Fixing this involves a trade-off between repairing functionally regressed code and introducing performance regressions.
Both deterministic and non-deterministic implementations must continue to be available through the public API for both sets of users.
Affected Algorithms
The following algorithms are known to violate current or former guarantees:

- `DeviceReduce::ReduceByKey` is currently documented to provide run-to-run determinism, but doesn't.
- `DeviceScan::*` used to provide the same broken guarantee, and the documentation was temporarily updated pending proper resolution of this issue.

These algorithms also provide such guarantees:

- `DeviceReduce::Reduce` (and derived `Sum`, `Min`, `Max`, etc.)
- `DeviceSegmentedReduce::Reduce`
Requirements
Open Questions
How to handle naming?
How should we spell the deterministic and non-deterministic APIs? The choice is between backward compatibility and a consistent, sensible API. I'm interested in hearing user feedback on which solution would be preferable:
Option 1: Prioritize backwards compatibility.
All existing algorithm names that promise(d) determinism have determinism restored and preserved. Non-deterministic implementations of these algorithms will be introduced under a new name, probably prefixed/suffixed (suggestions welcome). New deterministic APIs of existing non-deterministic algorithms will be introduced under a new name with a different prefix/suffix.
Pros:
Cons:
Option 2: Prioritize consistency.
- Algorithm implementations with no special guarantees are exposed through the basic spelling of the algorithm (e.g. `Reduce`).
- Algorithms with determinism guarantees are prefixed (e.g. `DeviceDeterministicReduce` for run-to-run determinism on the same device, `DeterministicReduce` for consistently deterministic in all cases).

Pros:

- Consistent with existing naming precedent (`DeviceMergeSort` has `SortKeys`, `StableSortKeys`, and `SortKeysCopy`).

Cons:
Scope
This issue is limited to testing and fixing pre-existing determinism guarantees in the explicitly listed algorithms. Requests for new or stronger determinism guarantees should go into new issues.