eclipse-archived / ceylon

The Ceylon compiler, language module, and command line tools
http://ceylon-lang.org
Apache License 2.0
399 stars 62 forks source link

pull 'Set'-producing operations | & ~ up to 'Collection' #7435

Open xkr47 opened 6 years ago

xkr47 commented 6 years ago

I was longing for Set operators/operations in Range instances. Namely union, intersection but why not also complement and exclusiveUnion.

Sure, I can do set(3..5).union(set(4..6)), but is there any reason that Range couldn't satisfy Set? I read through #3557 and the current interfaces and their implementations and although I didn't perhaps catch everything, I felt I should ask.

Since all operations except intersection may result in split ranges, the implied return value of Set is actually just fine. One could of course have a more specific Range<Element> intersectionWithRange(Range<Element> other) for convenience.

For these set operations one would naturally not care if the ranges are increasing or decreasing - the result would be the same i.e. the result would be normalized.

As for an actual implementation it would perhaps be beneficial to override the default implementations from the Set interface for more efficient operation (as long as the "other" set is of same type as "this" set), but they would still be "just Sets"...

Thanks for listening..

xkr47 commented 6 years ago

Oh and yes, you could then do (4..8) | (6..10) and get a set containing 4, 5, 6, 7, 8, 9, 10. Or (monday..friday) & (thursday..sunday) and get a set with thursday, friday.

jvasileff commented 6 years ago

I'm not necessarily disagreeing, but the performance characteristics are much more clear when written as set(3..5).union(set(7..9)), which matters for large ranges.

CPColin commented 6 years ago

Yeah, you'd probably want these operations to be lazy, especially for something like ~1..2.

xkr47 commented 6 years ago

@jvasileff wrote:

I'm not necessarily disagreeing, but the performance characteristics are much more clear when written as set(3..5).union(set(7..9)), which matters for large ranges.

Did you mean that they are unclear when written as (4..8) | (6..10) ? I was thinking one would implement a (probably module-local) MultiRange class that looks like a Set but internally keeps track of the ranges making up the set instead of having individual numbers stored. It would itself be immutable; further set operators would just create new instances.

@CPColin wrote:

Yeah, you'd probably want these operations to be lazy, especially for something like ~1..2.

You mean (0..10) ~ (1..2) ? Afaik ~ is not defined as an unary operator?

But yeah the MultiRange proposed above could well be lazy, but one has to be careful not to kill the performance in case e.g. incremental updates to such a Set are performed. There are several ways to mitigate such problems I guess.. flatten-by-threshold, flatten-manually (something analoguous to .sequence() of Iterable, flatten-on-read-access, ...

CPColin commented 6 years ago

You mean (0..10) ~ (1..2) ? Afaik ~ is not defined as an unary operator?

Oops, you're right, my bad.

gavinking commented 6 years ago

Sure, I can do set(3..5).union(set(4..6)),

You can do set(3..5) | set(4..6).

but is there any reason that Range couldn't satisfy Set?

Yep, the reason is that Ranges are Lists, and therefore can't be Sets.

So what you're really asking for here is an operator for appending lists, something I've often wondered about adding. For Strings we use + for this. I suppose we could extend that to Lists-of-same-element-type. (List<T> is also a semigroup.) I have not looked closely at that option.

So you would be able to write: (3..5) + (4..6).

I dunno, I've never been in love with the idea of using + to mean append, though it is intellectually consistent.

gavinking commented 6 years ago

I suppose we could extend that to Lists-of-same-element-type. (List<T> is also a semigroup.)

Oh, well, now, I remember, or course the problem here was that Summable<T> is invariant and so if List<Character> were a Summable<List<Character>> then String could not be a Summable<String>.

And nor could Range<T> be a Summable<Range<T>>, since the "sum" of two ranges isn't a range.

gavinking commented 6 years ago

Interestingly, and subversively, nothing's stopping us from making List<T> a Multiplicable<T>, letting you write it using the * operator! (I'm not serious.)

gavinking commented 6 years ago

I suppose we could extend that to Lists-of-same-element-type. (List<T> is also a semigroup.)

Oh, well, now, I remember, or course the problem here was that Summable<T> is invariant and so if List<Character> were a Summable<List<Character>> then String could not be a Summable<String>.

And nor could Range<T> be a Summable<Range<T>>, since the "sum" of two ranges isn't a range.

Ah, excuse me, that's all rubbish, because the append() method is defined not by List, but by Sequential, and Strings are not sequences in Ceylon, but Ranges are.

So I think it would indeed be possible to make Sequential<T> satisfy Summable<T[]>, giving you a + operator for appending immutable sequences (but not possibly-mutable Lists).

That doesn't actually sound like terrible thing to do.

gavinking commented 6 years ago

Oh, I'm wrong again, the real problem is that Sequential is covariant, and Summable is invariant. So it can't be done at that level either.

gavinking commented 6 years ago

So the only thing to do would be, I guess, to factor out Unionable and Intersectable interfaces and make Map and Set implement both (union and intersection for Map are intuitive, since maps are intuitively sets of entries), and make List implement just Unionable.

That's a possible—and reasonable—thing to do. The only question is do we think it's a good idea to let people non-destructively join lists using |?

luolong commented 6 years ago

What is the proper union of this expression:

map{"key" -> "v1"} | map{"key" -> "v2"}
gavinking commented 6 years ago

@luolong oops, sorry, you're right, intersection can be defined for Map, but not union.

xkr47 commented 6 years ago

@gavinking wrote:

Yep, the reason is that Ranges are Lists, and therefore can't be Sets.

I understand that it is convenient that ranges can be expected to have a predictable order of elements, but aren't they also effectively Sets? I'm of course not asking to actually make them mutually inclusive, just as a thought exercise..

And nor could Range be a Summable<Range>, since the "sum" of two ranges isn't a range.

Could Range instead satisfy Summable<MultiRange<T>> ?

Anyway I would still miss the intersection, exclusiveunion etc. I guess I could have a shot at implementing a separate MultiRange class that would satisfy Set and allow for easy co-operation with Range instances.

I'm closing this now but feel free to keep discussing.

gavinking commented 6 years ago

I understand that it is convenient that ranges can be expected to have a predictable order of elements, but aren't they also effectively Sets?

Well, sure, every Iterable is effectively a Set.

gavinking commented 6 years ago

One possible reasonable thing to do would be to define | and & for the Collection interface, but have them always produce Sets. That is, simply pull the current union() and intersection() methods up to Collection from Set.

That would be a change with pretty minimal impact on the language, that makes the existing | and & operators much more useful.

For example, [1->"hello", 3->"world"] | map { 1->"goodbye", 2->"sweet" } would be equal to set { 1->"hello", 3->"world", 1->"goodbye", 2->"sweet" }.

xkr47 commented 6 years ago

@gavinking wrote:

Well, sure, every Iterable is effectively a Set.

Hmm, through using Java I have empirically developed the (perhaps incorrect) understanding that Sets guarantee uniqueness of entries. So in that context, a [ 1, 1 ] would not be considered to be a Set..?

Your example with a sequence and a map is of course fine because all the elements of the set are unique. But based on my hypothesis above I would expect [ 1, 1, 3 ] | set { 4, 1 } would be equal to set { 1, 1, 3, 4, 1 } which is equal to set { 1, 3, 4 }.

gavinking commented 6 years ago

So in that context, a [ 1, 1 ] would not be considered to be a Set..?

Well, I mean, mathematically, a set is a value with just one operation, ∊, which in Ceylon we write as in. So as long as an object has a contains() method, it can be considered a set.

In particular, [ 1, 1 ] can be "considered" a Set by applying the set() function to it.

(Note that the definition of set in mathematics doesn't say anything at all about iterating values, or any of the other things that distinguish Sets from Lists.)

But based on my hypothesis above I would expect [ 1, 1, 3 ] | set { 4, 1 } would be equal to set { 1, 1, 3, 4, 1 } which is equal to set { 1, 3, 4 }.

Right, set { 1, 1, 3, 4, 1 } == set { 1, 3, 4 }.

xkr47 commented 6 years ago

One possible reasonable thing to do would be to define | and & for the Collection interface, but have them always produce Sets. That is, simply pull the current union() and intersection() methods up to Collection from Set.

The idea sounds really nice.

So one would put default implementations in Collection and assumably override in Range with versions checking whether the "other" collection is also a Range, and fall back to the Collection implementation if not?

(Reopening since there is now light again :)

gavinking commented 6 years ago

override in Range with versions checking whether the "other" collection is also a Range, and fall back to the Collection implementation if not?

They could be overridden on Sequential to make use of immutability, yes.

gavinking commented 6 years ago

I've implemented this idea (it was really straightforward) and it seems to be working well. I'll push it to a branch later.

gavinking commented 6 years ago

I've pushed my work to the 7435 branch. Now we need to see if there is really a consensus in favor of this. Feedback, please!

davidfestal commented 6 years ago

I'm in favor of this.

fill0llif commented 6 years ago

@gavinking wrote:

Well, sure, every Iterable is effectively a Set.

Hmm, through using Java I have empirically developed the (perhaps incorrect) understanding that Sets guarantee uniqueness of entries. So in that context, a [ 1, 1 ] would not be considered to be a Set..?

Your example with a sequence and a map is of course fine because all the elements of the set are unique. But based on my hypothesis above I would expect [ 1, 1, 3 ] | set { 4, 1 } would be equal to set { 1, 1, 3, 4, 1 } which is equal to set { 1, 3, 4 }.

[ 1, 1 ] is indeed not considered to be a mathematical set, because a set is a collection of distinct objects. Therefore an iterable object may or may not be a set. If Set will still retain its defining feature, that is, having distinct objects, then why not.

gavinking commented 6 years ago

[ 1, 1 ] is indeed not considered to be a mathematical set, because a set is a collection of distinct objects.

This isn't really correct, at least not in modern axiomatic set theory, where "set" is a primitive notion. The only operation a mathematical set has is ∊, and one can't distinguish [1] from [1,1] using ∊.

fill0llif commented 6 years ago

This isn't really correct, at least not in modern axiomatic set theory, where "set" is a primitive notion. The only operation a mathematical set has is ∊, and one can't distinguish [1] from [1,1] using ∊.

Both of them are theories anyway, that's not what I really care about. The thing is, I believe having a collection that deals with distinct objects is very useful. Though if a Set doesn't specify anymore if it contains distinct objects or not, what's the difference between a Set and a Collection then?

gavinking commented 6 years ago

if a Set doesn't specify anymore if it contains distinct objects or not, what's the difference between a Set and a Collection then?

There is no suggestion to change the semantics of Set. The iterator for a Set always returns distinct values.

fill0llif commented 6 years ago

Then it's more than ok. That would be very useful.

someth2say commented 6 years ago

Sorry I came a bit late to this discussion, but I think I should drop my five cents here.

I don't think | & and ~ are actually part of the semantics for Collection, but just for Set. I'm not saying they are not useful, but just no the right place.

I.e. what's the meaning of the union of two Lists? it is not unreasonable thinking that may retain elements from even if they are repeated (i.o.w. work as a concatenation); but also may just add the "missing" elements to the end of the list; or remove duplicate elements before concatenate...

Instead, I propose creating utility methods in collections, with similar meaning, that can be used to define Set methods. Say:

Those methods can serve as foundations for Set operations, but also can be refined to adapt the results to different collection kinds.

gdejohn commented 6 years ago

@someth2say The idea is that arbitrary collections can be trivially viewed as sets, because conceptually a set is just something that tells you whether or not it contains a given element. Taking the union of two lists means that you want to treat them as sets, you don't care about order, and you want to get back a set containing each distinct element that appears in either list. That's a useful thing to do.

I.e. what's the meaning of the union of two Lists? it is not unreasonable thinking that may retain elements from even if they are repeated (i.o.w. work as a concatenation); but also may just add the "missing" elements to the end of the list; or remove duplicate elements before concatenate...

The Set return type of these operations means that it is unreasonable to think that. It's another one of those things where you might get the wrong idea the very first time you encounter it, before you understand what it's supposed to be doing, but you only have to learn it that one time.

someth2say commented 5 years ago

@gdejohn

The idea is that arbitrary collections can be trivially viewed as sets, because conceptually a set is just something that tells you whether or not it contains a given element

That's one of the root arguments I don't agree with. Quoting @gavinking: mathematically, a set is a value with just one operation, ∊. But practically, sets have many other characteristics we should not ignore (like all contained elements being unequal...and equality is really a hairy topic!). If you are just keeping the ∊ operation, we usually use the name Collectionor Bag.

Taking the union of two lists means that you want to treat them as sets... It's another one of those things where you might get the wrong idea the very first time you encounter it...

That's another argument against pulling methods up. You should learn that, when you apply Setmethods to two Collections, those will "automagically" be transformed to Sets, and then the method will be applied; It is not intuitive what the result will be (at least not for me), as a hidden transformation is applied. And Ceylon always tried to avoid hidden transformations.

I'm proposing specific operations for all Collections (so you can write [3,4,5].with([6,7,8])) while keeping Setspecific operations as they already are (so you can still use set(3..5) | set(4..6)). It is just matter of making explicit when you are working with Sets and when with Collections.

Also, as @jvasileff said, performance impact is much more clear when set transformation is explicit.

gdejohn commented 5 years ago

But practically, sets have many other characteristics we should not ignore (like all contained elements being unequal...and equality is really a hairy topic!). If you are just keeping the ∊ operation, we usually use the name Collection or Bag.

Conceptually, we only care about ∊ with respect to the collection operands, but we care about more than just ∊ with respect to the result, which is why we return a Set. The return type communicates all of that.

That's another argument against pulling methods up. You should learn that, when you apply Set methods to two collections, those will "automagically" be transformed to sets, and then the method will be applied; It is not intuitive what the result will be (at least not for me), as a hidden transformation is applied. And Ceylon always tried to avoid hidden transformations.

How is that a hidden transformation in some undesirable way that any other function isn't? In general, you don't know how any function is implemented, how it transforms its inputs. Why is that a problem here?

Also, as @jvasileff said, performance impact is much more clear when set transformation is explicit.

I could understand the objection if the time complexity jumped from linear to quadratic or something, but it's still linear, so what's the big deal?