Add `bifurcate(_:)` - Githubissues

mdznr commented 3 years ago

Description

Adds a bifurcate(_:) algorithm. This is very similar to filter(_:), but instead of just getting an array of the elements that did match the given predicate, also get a second array for the elements that did not match the given predicate.

This is more performant than calling filter(_:) twice on the same input with mutually-exclusive predicates since:

It only requires a single pass of the elements
If the input has a known number of elements, the cumulative space for both returned arrays is known and can avoid array buffer resizing.

let cast = ["Vivien", "Marlon", "Kim", "Karl"]
let (shortNames, longNames) = cast.bifurcate({ $0.count < 5 })
print(shortNames)
// Prints "["Kim", "Karl"]"
print(longNames)
// Prints "["Vivien", "Marlon"]"

Detailed Design

extension Sequence {
  @inlinable
  public func bifurcate(
    _ belongsInFirstCollection: (Element) throws -> Bool
  ) rethrows -> ([Element], [Element])
}

Naming

It was hard to find a word to describe this behavior without using terms like “split” or “separate”, which might have other interpretations (like only being able to get the prefix and suffix), so I chose “bifurcate”, but definitely open to hearing different names.

For consistency with filter(_:), I avoided calling this function something like bifurcating(_:), bifurcated(_:), or giving the parameter a name at the call-site, like bifurcate(where:).

Documentation Plan

Inline documentation for each new function
Comments in the implementation
Updated README.md
Added Guides/Bifurcate.md

Test Plan

Adds unit tests for example given in documentation
Adds unit tests for various inputs
Adds unit tests for empty input

Source Impact

This is purely additive

Checklist

[x] I've added at least one test that validates that my change is working, if appropriate
[x] I've followed the code style of the rest of the project
[x] I've read the Contribution Guidelines
[x] I've updated the documentation if necessary

timvermeulen commented 3 years ago

Thanks for suggesting this @mdznr! This operation is commonly called partition in other languages. The Swift standard library happens to already have an in-place partition(by:) (with a stable variant residing in this package) and this could be considered the not-in-place version of that. What do you think about calling this method partitioned(by:)?

Your implementation of the Collection overload is nicely able to avoid any array resizing, but in doing so it takes a couple of (potentially) O(n) hits. We should probably run some benchmarks using Swift Collections Benchmark to make sure this added complexity is worthwhile. There are some additional optimizations we could consider:

Don't allocate two new arrays for the return values but instead return ArraySlices that both point to the same buffer.
Taking that to extremes, don't reverse the elements on the right side at all, and instead return (a wrapper around) a ReversedCollection<ArraySlice>.
Another direction entirely could be to not allocate a new buffer with the same size of the input collection at all, but instead store which elements satisfy the predicate in a bitset. Once that is done we'll know exactly how much capacity to reserve for the return values.

There are some trade-offs here regarding performance, implementation simplicity, and having a clean API. A good next step would be to get our hands on some benchmark results that might tell us which directions are worth pursuing.

mdznr commented 3 years ago

Thanks for suggesting this @mdznr! This operation is commonly called partition in other languages. The Swift standard library happens to already have an in-place partition(by:) (with a stable variant residing in this package) and this could be considered the not-in-place version of that. What do you think about calling this method partitioned(by:)?

I like that idea a lot. I don’t know how I missed the existing stable partition function in this package. 🤦‍♂️

Your implementation of the Collection overload is nicely able to avoid any array resizing, but in doing so it takes a couple of (potentially) O(n) hits. We should probably run some benchmarks using Swift Collections Benchmark to make sure this added complexity is worthwhile. There are some additional optimizations we could consider:

Don't allocate two new arrays for the return values but instead return ArraySlices that both point to the same buffer.

Taking that to extremes, don't reverse the elements on the right side at all, and instead return (a wrapper around) a ReversedCollection<ArraySlice>.

I like that idea. That allows the caller to decide whether the conversation to Array is needed or not.

Another direction entirely could be to not allocate a new buffer with the same size of the input collection at all, but instead store which elements satisfy the predicate in a bitset. Once that is done we'll know exactly how much capacity to reserve for the return values.

I see _UnsafeBitset in swift-collections, but it needs to know the size ahead of time. Is there another API I should be looking at for this?

There are some trade-offs here regarding performance, implementation simplicity, and having a clean API. A good next step would be to get our hands on some benchmark results that might tell us which directions are worth pursuing.

Sounds like a good idea. I'll use https://github.com/apple/swift-collections-benchmark to guide this.

mdznr commented 3 years ago

GitHub won’t let me re-open this PR because I’ve force-pushed changes onto the branch. Following up with https://github.com/apple/swift-algorithms/pull/152

apple / swift-algorithms

Add `bifurcate(_:)` #151

Description

Detailed Design

Naming

Documentation Plan

Test Plan

Source Impact

Checklist