Open lorentey opened 3 years ago
Hey @lorentey have you kicked this work off? I'd love to contribute and help out if possible.
Just want to add that the types SortedSet
and SortedDictionary
would also allow for finally adding binary search to Swift. The API could look very familiar, but under the hoods it would be much faster by using binary search:
let sortedNumbers: SortedSet = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
if let firstMatchingIndex = sortedNumbers.index(where: { $0 > 5 }) {
let firstMatch = sortedNumbers[firstMatchingIndex]
}
The index(where:)
method would then include the binary search algorithm.
Also, @lorentey I don't quite understand why you're restricting adding sorted variants to Set
and Dictionary
only, I think Array
deserves one just as much. The only reason there's no OrderedArray
in this library is because Array
is order by definition. But it's not sorted, so we should also have a SortedArray
.
See also my own SortedArray
implementation that I have been already using in production for years.
I think the SortedBag
that @lorentey mentions at the end of the issue would effectively be the same as a SortedArray
. It's interesting to consider what the more approachable name is.
@kylemacomber I was already wondering what the SortedBag
is supposed to mean. Never heard Bag
as a type name before, maybe it's just me but I find it confusing. We should use known terms, I would say. Array seems an obvious one.
"Bag" is a well-established name for a set that allows duplicates; it's less clinical sounding as "multiset", but that is definitely a viable name for it. (For what it's worth, multiset seems to be more popular these days; see e.g. Guava, LLVM, or Boost. However, contrast this with Eclipse, Apache Commons, and others. Knuth also prefers the term multiset, and that probably seals the deal for me. 😉)
At this early in the implementation stage though, I think it would probably be better to think about the abstract construct and the operations it would provide, rather than the exact name we would end up labeling it under.
(E.g., note that there is a distinction between a multiset that stores each individual duplicate, and one that merely counts duplicates. For a sorted collection intended to serve as a replacement for O(n^2) sorted array implementations, I expect we'd need the former variant.)
The word "array" has some strong connotations with random-access indexing to me. I don't know if I'd like to use the term SortedArray
to mean anything other than an Array
that is sorted. (And to be clear, I don't think adding a SortedArray
in this sense would be a high priority for this package.)
\
What would the process generally look like for someone to contribute? Both from a personal aspect as well as from a professional one if possible?
Just want to add that the types
SortedSet
andSortedDictionary
would also allow for finally adding binary search to Swift.
That should probably be in the Algorithms library, so any collection that stores its elements in-order can take advantage of it, not just specific container types. (In fact, I think it has one right now, and I've proposed at least another in the past.)
Provisionally putting on the 1.1.0 milestone. This may get bumped off if we take too much time on persistent collections, but let's assume otherwise for now.
Bumping to 1.2.0 acknowledging just how much work the rest of the new features in 1.1.0 is taking. (I'm hoping 1.2.0 will come soon after 1.1.0 though.)
It would be nice if SortedSet
didn’t require Element
to conform to Comparable
, and instead could take a comparator closure ((Element, Element) -> Bool
), like the sort(by:)
family of functions. This would be useful in cases where you might want to have the same type sorted in different ways in different collections. Of course, if Element
is Comparable
, there’d be a convenience initializer for that.
For example, if you wanted a sorted list of people (SortedSet<Person>
), you may wish to have them sorted by either their given name or family name. This different valid mechanisms for sorting isn’t something that could really easily be defined in a Comparable
implementation on a Person
type.
Before:
class Person: Equatable, Hashable {
let givenName: String
let familyName: String
init(givenName: String, familyName: String) {
self.givenName = givenName
self.familyName = familyName
}
static func == (lhs: Person, rhs: Person) -> Bool {
return (lhs.givenName == rhs.givenName) &&
(lhs.familyName == rhs.familyName)
}
func hash(into hasher: inout Hasher) {
hasher.combine(self.givenName)
hasher.combine(self.familyName)
}
}
class GivenNameSortedFirstPerson: Person, Comparable {
static func < (lhs: GivenNameSortedFirstPerson, rhs: GivenNameSortedFirstPerson) -> Bool {
if lhs.givenName < rhs.givenName {
return true
} else if lhs.givenName > rhs.givenName {
return false
} else {
return lhs.familyName < rhs.familyName
}
}
}
let givenNameSortedPeople = SortedSet<GivenNameSortedFirstPerson>()
class FamilyNameSortedFirstPerson: Person, Comparable {
static func < (lhs: FamilyNameSortedFirstPerson, rhs: FamilyNameSortedFirstPerson) -> Bool {
if lhs.familyName < rhs.familyName {
return true
} else if lhs.familyName > rhs.familyName {
return false
} else {
return lhs.givenName < rhs.givenName
}
}
}
let familyNameSortedPeople = SortedSet<FamilyNameSortedFirstPerson>()
After:
struct Person: Equatable, Hashable {
let givenName: String
let familyName: String
}
extension Person {
static func compareByGivenNameFirst(_ lhs: Self, _ rhs: Self) -> Bool {
if lhs.givenName < rhs.givenName {
return true
} else if lhs.givenName > rhs.givenName {
return false
} else {
return lhs.familyName < rhs.familyName
}
}
static func compareByFamilyNameFirst(_ lhs: Self, _ rhs: Self) -> Bool {
if lhs.familyName < rhs.familyName {
return true
} else if lhs.familyName > rhs.familyName {
return false
} else {
return lhs.givenName < rhs.givenName
}
}
}
let givenNameSortedPeople = SortedSet<Person>(comparator: Person.compareByGivenNameFirst(_:_:))
let familyNameSortedPeople = SortedSet<Person>(comparator: Person.compareByFamilyNameFirst(_:_:))
It would be nice if
SortedSet
didn’t requireElement
to conform toComparable
, and instead could take a comparator closure ((Element, Element) -> Bool
), like thesort(by:)
family of functions.
This would have too many undesirable consequences to accept. The core sorted collection types will definitely not work this way.
Closure values are inherently not Equatable. Therefore, there is no way to prove that a collection that's sorted by a closure is using a particular ordering -- the order is neither reflected in the type system, nor it is verifiable at runtime without manually scanning the entire contents of the collection. This makes closure-based collection designs inherently weaker than ones that make use of protocol conformances.
In my (very, very strongly held) opinion, two SortedSet
values of the same type must be guaranteed to sort their members precisely the same way. This is a crucial feature of the core sorted collection types; violating it is not an option.
A function that takes a SortedSet
must encode its ordering expectation within the Swift type system, and the natural way to do this is to rely on the Comparable
protocol. SortedSet
and SortedDictionary
will therefore follow Set
and Dictionary
in making full use of Swift's protocols. Instead of Hashable
, they will naturally require their Element
type to conform to Comparable
.
Beyond the type-safety/expressivity issue of not actually expressing the ordering in the type system, using a closure-based ordering would also have dire technical consequences.
It would make it impossible to efficiently implement crucial methods for combining sets, such as intersection
or union
: there would be no way to quickly verify if two sets use the same ordering relation, so combining two sets would require element-wise processing with O(n*log(n)) worst-case performance. This would not be acceptable behavior for the basic sorted collection types. Instead, the statically typed SortedSet
will guarantee a linear upper bound for merges. Additionally, it will be able to recognize identical subtrees in its inputs and process them in O(1) time -- which is going to be a crucial feature for discovering differences between diverged values.
In addition, while the Swift compiler is able to specialize the implementation of generic types for specific type arguments, it is nowhere near capable of specializing functions for specific closure arguments -- so a closure-based API is expected to be generally slower (by a constant, but potentially significant factor) than ones that make good use of the type system.
Having said all this, though, I would be also open to shipping additional collection types that take closures -- once we have shipped the core sorted collection types. I would very strongly object to shipping closure-based collections before we ship the type-safe ones, or to implement type-safe sorted collections on top of lesser, closure-based variants.
Instead of designing closure-taking APIs, we also have the option of using mixin type parameters instead:
enum ComparisonResult {
case orderedAscending
case orderedSame
case orderedDescending
}
protocol ComparatorProtocol {
associatedtype Value
static func compare(_ first: Value, _ second: Value) -> ComparisonResult
}
struct GeneralizedSortedSet<Element, Comparator: ComparatorProtocol> {
...
}
struct Person: Equatable, Hashable {
let givenName: String
let familyName: String
struct ComparatorByGivenNameFirst: ComparatorProtocol {
typealias Value = Person
static func compare(_ first: Value, _ second: Value) -> ComparisonResult {
...
}
}
struct ComparatorByFamilyNameFirst: ComparatorProtocol {
typealias Value = Person
static func compare(_ first: Value, _ second: Value) -> ComparisonResult {
...
}
}
}
let givenNameSortedPeople = GeneralizedSortedSet<Person, Person.ComparatorByGivenNameFirst>()
let familyNameSortedPeople = GeneralizedSortedSet<Person, Person.ComparatorByFamilyNameFirst>()
However, this is by no means necessary -- it's simply a mechanical transformation of the far more pedestrian option of creating trivial wrapper types over Person
:
struct PersonOrderedByGivenName: Comparable {
var value: Person
func ==(left: Self, right: Self) -> Bool { left.value == right.value }
func <(left: Self, right: Self) -> Bool { ... }
}
struct PersonOrderedByGivenName: Comparable {
var value: Person
func ==(left: Self, right: Self) -> Bool { left.value == right.value }
func <(left: Self, right: Self) -> Bool { ... }
}
let givenNameSortedPeople = SortedSet<PersonOrderedByGivenName>()
let familyNameSortedPeople = SortedSet<PersonOrderedByFamilyName>()
GeneralizedSortedSet
can be built on top of SortedSet
, or vice versa.
Note that GeneralizedSortedSet
would share the type safety of the core SortedSet
-- you cannot pass a set that's sorted by given name to a function that expects to get a set that's ordered by family name. I generally consider this a feature, not a bug.
(Like closure-based data structures, GeneralizedSortedSet
is also firmly in the "possible future direction" box -- the fundamental problem we need to tackle first is shipping the SortedSet
type that requires its Element
to be Comparable
. Once we have done that, we'll have a sound technical foundation on which we can build other things.)
In current implementation. It is not possible to look up for non existing Key.
It would great if SortedDictionary
and SortedSet
provides this kind of feature.
Like java.util.NavigableMap
and java.util.NavigableSet
extension SortedDictionary {
// returns the existing Index, if key exist, else return Index which contains bigger key than the given key.
func ceilingIndex(for key:Key) -> Index
// returns the existing Index, if key exist, else return Index which contains smaller key than the given key.
func floorIndex(for key:Key) -> Index
}
extension SortedSet {
func ceilingIndex(for item:Element) -> Index
func floorIndex(for item:Element) -> Index
}
I hope it won't be too hard, since _BTree
already has lastIndex(forKey:)
and firstIndex(forKey:)
OrderedSet
andOrderedDictionary
work great when we need to keep their elements in the order they were inserted, or if we only need to infrequently reorder/sort them. However, inserting (or removing) an element from the middle of the collection takes linear time in both cases, which makes these generic types less suitable for maintaining a sorted collection. It would be very much desirable for Swift developers to have access to efficient sorted collection types.Self-balancing search trees naturally keep their elements sorted, and they implement efficient (O(log(n))) insertions and removals from anywhere in the collection -- so they seem an ideal choice for the implementation of a standard suite of sorted collections.
Binary search trees and similar low-fanout search tree variants (such as AVL trees or red-black trees) need to maintain a multitude of internal references, so they come with an inappropriately large memory overhead. High fanout search trees such as in-memory B-trees organize their elements into a tree of small contiguous buffers (up to a couple thousand items or so), and this makes them far more efficient in practice -- in terms of both space and time.
Unlike collections that store their elements in a single, contiguous buffer, tree-based collections also allow different versions of the same tree to share some of their storage, by simply referring to the same nodes within the tree. This makes them potentially more efficient than the existing standard collection types when implementing the copy-on-write optimization. (In B-tree's case, we can maintain a nice balance between lookup and CoW performance by having the inner nodes of the tree have a lower maximum fanout number than leaves.)
We'd like this package to implement a great set of sorted collection types based on an in-memory B-tree implementation.