OrderedSet, OrderedMap with Treap, SkipList as underlying structures

rapiz1 commented 4 years ago

Background

Previously I proposed Treap and SkipList https://github.com/chapel-lang/chapel/issues/15670 https://github.com/chapel-lang/chapel/issues/15658 . After discussing about the unified interface of Vector, I realized Treap and SkipList could also follow this pattern. They have pros&cons but the interfaces are completely identical and the use case is overlapped. These structures can be organized in a wiser way. Also, they should be named in a better way to indicate their usage in programming, rather than names of structures. So I propose OrderedSet and OrderedMap here.

Chapel already has Set and Map based on Hash Table, which depends on hashing. What I'm proposing here is dependent on comparing.

OrderedSet

OrderedSet can be based on any fast structures that support querying and inserting in log(n) time, like Skip List and various Search Trees. Instead of a hash function, it requires comparable elements and supports more operations. The interface is identical to Set, but with additional procedures.

kth(idx: int): return k-th element in the sorted sequence, throw an error if out of range
lowerBound(x: eltType): (bool, eltType) same as std::lower_bound(), return false in the tuple if no occurence.
upperBound(x: eltType): (bool, eltType) same as std::upper_bound(), return false in the tuple if no occurence.
predecessor(x: eltType,): (bool, eltType) return the predecessor of one element, return false in the tuple if it's the first one.
successor(x: eltType): (bool, eltType) return the successor of one element, return false in the tuple if it's the last one.
init(type eltType, param parSafe = false, param implType: impl) enum setImpl { treap, skipList };
OrderedMap

We can easily build a kind of map on OrderedSet by storing tuples like (key, value) in the set. Interfaces are identical to Map except the initialization.
init(type eltType, param parSafe = false, param implType: impl) enum mapImpl { treap, skipList };

I'm not sure about the name of OrderedMap. In Java, there is a similar strcuture namedTreeMap but that's built on Red-Black Tree. We have SkipList option here so it's not appropriate.

LouisJenkinsCS commented 4 years ago

Why not name them OSet and OMap? I prefer concise naming conventions.

rapiz1 commented 4 years ago

Why not name them OSet and OMap? I prefer concise naming conventions.

Interesting. But I haven't seen these names before so I'm a little worried about that's too subtle. Can you provide some examples from other languages that use these names? Maybe I'm just being naive.

LouisJenkinsCS commented 4 years ago

Prolog Library 1 Library 2 Library 3 Internal Library Library 4

bradcray commented 4 years ago

Why not name them OSet and OMap? I prefer concise naming conventions.

I feel as though, historically, we've usually had more regrets about choosing too-terse names than too-descriptive ones. Note that with a descriptive name, a user can always use a type alias to call it something shorter.

LouisJenkinsCS commented 4 years ago

Then can a type alias be provided with it? I.E., in the file including the definition of the type, add an official type alias.

bradcray commented 4 years ago

It could be, though we usually try to avoid introducing aliases so that people don't have to learn "What's the difference between x and y... oh, nothing?" (whereas if a user adds the alias to their code, it's self-documenting).

Both my comments here should just be interpreted as opinions, not requirements.

e-kayrakli commented 4 years ago

@bradcray suggested (under https://github.com/chapel-lang/chapel/issues/15950) making these just an implementation of set and map. This idea follows from a similar one about list https://github.com/chapel-lang/chapel/issues/15913

@Rapiz1 was opposed to the idea and what he said here resonated for me:

OSet/OMap is designed to utilize Treap, SkipList, or some ordered structures(e.g. Red-Black Tree) we may add later, while Map utilizes HashTable. I think the differences like O(1) vs O(lgN), ordered structures vs unordered structures, are significant enough to split up Set and OSet. In addition, possibly we have far more ordered structures(various trees) than unordered structures. It makes more sense to group them together.

This tells me that the ordered-ness of the collection can be provided by different data structures with different characteristics. As such, I see OrderedSet in the same class as List and Treap in the same class as buffer and multibuffer.

In other words if we want to parametrize set that determines the ordered-ness, than how do we pick the underlying structure? Do we choose between hashtable, treap, and skiplist when we are creating a set? Then the ordered-ness would be implicit and I don't like that.

I think in general it is a good practice to use the same collection interface that can choose between different data structures, but in this case it would seem too contrived for me, with my current understanding. To me, we can have Set and OrderedSet just like we have Block and Cyclic for example.

Assuming we go this way: I am against OSet and OMap and having such aliases. OrderedSet and OrderedMap are really good I think. Not too wordy, tells about what they do clearly etc.

cassella commented 4 years ago

I agree with OrderedSet and OrderedMap as names for the type. I wouldn't expect to have to type out the name of the type very often in code using the objects, so I prefer the more descriptive names. And as a reader of someone else's code, the longer names would be more self-documenting.

I don't see the point of making an OrderedSet be a variant of set like with List. If you ever want an ordered set, it's because you're using it in a way that depends on it maintaining the order. Therefore swapping the param argument to get an unordered set won't "just work" with your code. And if you have code using an unordered set, why would you swap to a more expensive ordered set if you're not going to rely on the order? So the benefit of being able to easily swap one for the other doesn't seem like it'd ever actually be a benefit. It's not just a matter of untidiness that one implements more functions than the other -- they have different interfaces so aren't interchangeable.

rapiz1 commented 4 years ago

@bradcray @e-kayrakli @cassella @krishnadey30 I also want to further propose to put orderedSet and orderedMap in one module named Ordered (or something else). They share the underlying implementation and also the hierarchy to chose between implementation. For example, the enum they use to choose implementation should be the same.

e-kayrakli commented 4 years ago

@bradcray @e-kayrakli @cassella @krishnadey30 I also want to further propose to put orderedSet and orderedMap in one module named Ordered (or something else). They share the underlying implementation and also the hierarchy to chose between implementation. For example, the enum they use to choose implementation should be the same.

I am not so sure whether that decomposition is the right approach. If you do that, use Ordered and import Ordered would bring in both data structures to the scope, whereas only one might be needed. What about a separate module OrderedCommon.chpl or something (not fond of the name, myself) that includes all the underlying implementation. Then OrderedMap and OrderedSet modules doing a private use OrderedCommon and providing the interface around the common implementation?

My main worry is the precision of use and import statements, and if I am missing a way to workaround that, let me know.

rapiz1 commented 4 years ago

If you do that, use Ordered and import Ordered would bring in both data structures to the scope, whereas only one might be needed.

Fair enough. In that case, I prefer to remain the same as before.

cassella commented 4 years ago

I think you can put both modules OrderedMap and OrderedSet into one file. Then as long as that file is in the module search path, users can just use OrderedMap or import OrderedSet, and the compiler can find it. (At least it does when I add an extra module to Set.chpl). Then that file could include a private module OrderedCommon and the other two modules can private use it as Engin suggests.

bradcray commented 4 years ago

I think you can put both modules OrderedMap and OrderedSet into one file. Then as long as that file is in the module search path, users can just use OrderedMap or import OrderedSet, and the compiler can find it.

I don't think that's right. Specifically, if a file Filename.chpl contains a module M.chpl, use M; I don't believe the module will be found unless something else does a use Filename; or import Filename; (which is what I suspect is happening in your Set example, Paul).

rapiz1 commented 4 years ago

I tried to make Treap a submodule of OrderedSet in separate files but came across https://github.com/chapel-lang/chapel/issues/16094

rapiz1 commented 4 years ago

lowerBound(x: eltType, ref result: eltType): bool same as std::lower_bound(), return false if no occurence.
upperBound(x: eltType, ref result: eltType): bool same as std::upper_bound(), return false if no occurence.
predecessor(x: eltType, ref result: eltType): bool return the predecessor of one element, return false if it's the first one.
successor(x: eltType, ref result: eltType): bool return the successor of one element, return false if it's the last one.

I wrote ref result: eltType because I wanted to support types that can't be default initializable, which are actually non-nilable classes, by avoiding returning result: eltType, which should be nil when no hit. But I realize that it makes more sense to return a tuple (found: bool, result: eltType) and just give out a compiler error when trying to call these methods on non-nilable classes. Any thoughts? @bradcray @e-kayrakli @cassella @krishnadey30

e-kayrakli commented 4 years ago

But I realize that it makes more sense to return a tuple (found: bool, result: eltType) and just give out a compiler error when trying to call these methods on non-nilable classes.

This looks way better than returning via a ref argument to me.

bradcray commented 4 years ago

I also tend to prefer returning tuples in Chapel over returning via arguments, at least if the value being returned is fairly intrinsic to the routine's nature. If we were to continue returning these cases by argument for any reason, I think using out probably would communicate the intent better than ref (?)

mppf commented 4 years ago

kth(idx: int): return k-th element in the sorted sequence, throw an error if out of range
lowerBound(x: eltType): (bool, eltType) same as std::lower_bound(), return false in the tuple if no occurence.

Have we considered making all of these throw if they are used in a way that cannot return an element? kth already does so; could lowerBound et all also throw if they cannot return an element? That way, the return slot would still be the normal return slot.

Returning a (bool, eltType) just seems like a poor version of an option type which I think is something we hope to have at some point (and it might even be written just as eltType?). It's probably not reasonable to try to do that as part of GSoC but it would be possible in the near term to make a record option that contained your tuple, so at least the name and concept might match with something we want in the future.

So anyway, my preference is probably to have all of these return an option type or to throw (and probably in that order but I think both are reasonable).

Regarding this part: init(type eltType, param parSafe = false, param implType: impl) enum setImpl { treap, skipList }; - I know there has been a bunch of discussion that I haven't been very active in - but I generally don't like the idea of parameterizing totally different data structures. (For closely related data structures I am more OK with it, but even there I lean more towards separate types). I think the APIs for treap and skipList might be identical but that we should handle that in the future by making interfaces/constrained generics rather than trying to put all of the data structures into one generic type. The main problem I see with the param approach here is that to add another implementation, somebody has to modify orderedSet. Whereas if orderedSet is an interface/constraint, then anybody can create something that implements orderedSet. As a result, I would prefer to see treap and skipList be separate record types in separate modules. They can share a helper module if that is useful. The documentation could define an orderedSet in terms of what functions are available (which would one day become the interface/constrained generic). treap and skipList documentation can mostly link to this orderedSet document.

e-kayrakli commented 4 years ago

Regarding this part: init(type eltType, param parSafe = false, param implType: impl) enum setImpl { treap, skipList }; - I know there has been a bunch of discussion that I haven't been very active in - but I generally don't like the idea of parameterizing totally different data structures. (For closely related data structures I am more OK with it, but even there I lean more towards separate types). I think the APIs for treap and skipList might be identical but that we should handle that in the future by making interfaces/constrained generics rather than trying to put all of the data structures into one generic type. The main problem I see with the param approach here is that to add another implementation, somebody has to modify orderedSet. Whereas if orderedSet is an interface/constraint, then anybody can create something that implements orderedSet. As a result, I would prefer to see treap and skipList be separate record types in separate modules. They can share a helper module if that is useful. The documentation could define an orderedSet in terms of what functions are available (which would one day become the interface/constrained generic). treap and skipList documentation can mostly link to this orderedSet document.

The constrained generic approach you outline here would be nice to have in the future.

If we foresee such a future, do you think that we should drop implType argument as it would be a breaking change when/if we switch to that approach (as far as I can see, that's a future-looking argument in any case and has no use from the first PR)? Do you think there are other issues with the current orderedSet design in that regard?

mppf commented 4 years ago

If we foresee such a future, do you think that we should drop implType argument as it would be a breaking change when/if we switch to that approach

Yes. Regarding the current PR, if I understand correctly, it would rename the implementation to treap. orderedSet would become a documentation-only type (for now) and an interface (in the future).

rapiz1 commented 4 years ago

I think having orderedSet as an interface is great. But I don't like having a documentation-only type for some time and I have to use Treap when I actually want orderedSet. Also, I don't like having a module called Treap. Though we have Heap in standard, Treap is less familiar to some users and even less descriptive than Heap.

I wonder can we combine these two approaches together? We can introduce something like orderedSetInterface in the future to enable users to add implementations. In the meantime, for orderedSet, parameters can still be used to quickly switch between built-in implementations. OrderedSet can also be a place to group Treap, SkipList together, and give them a more descriptive name. No breaking change is involved.

rapiz1 commented 4 years ago

Have we considered making all of these throw if they are used in a way that cannot return an element? kth already does so; could lowerBound et all also throw if they cannot return an element? That way, the return slot would still be the normal return slot.

Oh I forgot to update that. kth actually returns a tuple now.

in the near term to make a record option that contained your tuple

Can you describe more about the record option or link to some discussion? I haven't heard of it.

mppf commented 4 years ago

Can you describe more about the record option or link to some discussion? I haven't heard of it.

From #12614

A type modifier is available to indicate a class type is possibly nil. This proposal follows the Swift syntax for these. (see https://developer.apple.com/documentation/swift/optional ). We expect that this syntax will eventually be generalized to create optional record types (or optional int, etc) however the focus of this proposal is on the nilability of class instances.

You could look at Swift's optional type for some motivation. But a sketch of a prototype of it would be:

record optional {
  type valueType;
  var value: valueType;
  var isNone: bool;
  proc init(type valueType) {
    this.valueType = valueType;
    // default initializes this.value -- in the future we would avoid that
    this.isNone = true
  }
  proc init(in value) {
    this.valueType = value.type;
    this.value = value;
    this.isNone = false;
  }
}

However a problem with this approach is that then we have to do API design on optional if we want the orderedSet interface to have any stability. Even so, I think I'd prefer a prototype optional to a tuple.

But you didn't answer my first question yet:

Have we considered making all of these throw if they are used in a way that cannot return an element?

rapiz1 commented 4 years ago

But you didn't answer my first question yet:

I needed further information about optional to give feedback. : )

Have we considered making all of these throw if they are used in a way that cannot return an element?

I generally consider throw statement as a way to handle errors. Procedures like lowerBound are more like searching. Is it reasonable to throw an error when a search returns no occurrence?

So I lean to the optional record. Do you mean this module needs to contain a prototype of the optional record and return it instead?

e-kayrakli commented 3 years ago

OrderedSet was added with #16124, OrderedMap was added with https://github.com/chapel-lang/chapel/pull/16271.

chapel-lang / chapel

OrderedSet, OrderedMap with Treap, SkipList as underlying structures #15921

Background

OrderedSet

OrderedMap