parlay::kth_smallest fails to handle duplicated elements (segmentation fault)

Hi developers,

The parlay::kth_element() function throws a segmentation fault when inputs are highly duplicated. I.e.:

parlay::sequence<int> a( 2000, 1 );
int k = 1000;
auto less = []( int p, int q ) { return p < q; };
// auto less_or_equal = []( int p, int q ) { return p <= q; };
std::cout << *( parlay::kth_smallest( a, k, less ) ) << std::endl;

Using comparison less_or_equal will throw another segmentation fault.

This is because that when inputs are highly duplicated, the sampling based bucket method will sieve all elements into one bucket and continue recursion without triggering the granularity control.

A naive solution is to special judge the case that all elements are distributing into one bucket, i.e.:

template<typename Range, typename Compare = std::less<>>
auto kth_smallest_copy(Range&& in, size_t k, Compare&& less = {}) {
  // no change

  // Determine which of the 32 buckets each key belongs in
  internal::heap_tree ss(pivots);
  auto ids = parlay::tabulate(n, [&](long i) -> unsigned char { return ss.rank(it[i], less); });

  // Count how many in keys are each bucket
  auto sums = parlay::histogram_by_index(ids, sample_size + 1);

  //! special judge
  if (std::count(sums.begin(), sums.end(), 0) == sample_size) {
    return parlay::sort(std::forward<Range>(in), std::forward<Compare>(less))[k];
  }

  // no change
}

cmuparlay / parlaylib

parlay::kth_smallest fails to handle duplicated elements (segmentation fault) #65