rename list-set - Githubissues

shriram commented 8 years ago

Can we please rename list-set to something else (maybe linear-set)? Two reasons:

list-set doesn't actually reveal the underlying complexity, because you don't know how the lists are implemented.
More to the point, in PAPL I now introduce lists and sets together, as two different datatypes. It's therefore going to be incredibly confusing to see something like:

> [list: 1, 2, 3]
[list: 1, 2, 3]
> [set: 1, 2, 3]
[list-set: 1, 2, 3]

Natural reaction: “Wait: is a set a list or not a list or … what?”

jpolitz commented 8 years ago

I like linear-set.

Related question: how big of a deal are you making in PAPL about the different complexities of linear vs. tree sets? Because it seems like having the set constructor called just set is useful, but really the user should pick which one to use at construction time (and having set just be an alias for linear-set betrays an odd bias).

shriram commented 8 years ago

Initially, I don't make a big deal at all. I just say "there are lists and there are sets". However, I do have a whole chapter devoted to sets of different complexity —two list-based (with and without duplicates) and one tree-based— because it's one of the nicest ways I've seen of getting basic algorithmic issues (and representation independence) across. (There's also the subtle point that the list-based reps only need equality comparison but the tree-based one requires ternary.)

Previously the sets were ill-motivated, but now I want to work them in as a contrast to lists. As I've mentioned in the past, sets are quite pervasive in our assignments; but (a) we defaulted to lists for no historical reasons, and (b) even once we added sets, the Pick interface was too ugly to teach early on, so we stuck with lists. Hopefully the new for-based style will address that and give sets a reasonable place in the early curriculum. In which case, the chapter on complexity re. sets is even more well-placed. I would also drag join lists into the main book, so that we end up with two representation chapters for collective data: sets as list+dup, list-dup, tree; list as cons, append. I think it would be beautiful.

We have talked about set being called just set before. I would frankly prefer that early on. Maybe just pick duplicate-free lists as the default representation and call that set. The problem is always with printed output…

@blerner was involved in the decision to rename set, so let's have him weigh in too.

jpolitz commented 8 years ago

Several modern languages have built-in tree and/or hash sets. Even OCaml defines a partial ordering over all values (except functional values), and uses that to make Pervasives.compare work well for structural comparisons, and hence sets are straightforward.

I wonder if the chapter on two set implementations could just be "here's how the internals of sets work", and Pyret should define an ordering on values to get a reasonably efficient built-in set implementation – essentially providing default _greater methods on data would give us most of this for free. Then there would just be one default set implemented by, say, an AVLTree, and we wouldn't have this problem of which set to use in the language.

From a pragmatic point of view, I've always had this sense that linear sets are a bit silly and slow to put front-and-center.

But maybe this is too complicated; previously it had seemed reasonable to keep the simplest set implementation as the default, and not complicate the language on its account. I'm less sure of that given this kind of discussion about what to print for linear vs. tree sets when all we want is a notion of "a set".

shriram commented 8 years ago

Don't worry about the book. Let's focus on what the language should do without conflating the two. The book is anyway about concepts in Pyret, not about the details of Pyret.

I agree that having the default notion of set be inefficient is a bad decision overall. We don't want users trying to do something reasonable on a meaningful dataset and then being burned because the didn't read some random chapter of PAPL: that's utterly would annoy the heck out of a user (rightly) and would be self-defeating for us as language creators.

So I would be totally in favor of making a functional tree version the default.

@blerner?

blerner commented 8 years ago

Just thinking through possibilities here. Pyret values include numbers, roughnums, booleans, strings, nothing, raw-arrays, functions, methods, objects, and data values. We'll also soon have tuples. Which of these are comparable? Said another way, if I wanted to create a Set<Any>, is there a plausible total ordering on these values? We can drop functions and methods, because they don't make sense in comparison with anything, so having a set of them is nonsense. We'd need to lift the notion to lexicographic ordering for objects and tuples and raw-arrays, and that's doable. I think the others can all incorporate a notion of ordering, but I don't know how to extend it across types.

I wouldn't mind having a more efficient set implementation by default, and wouldn't mind a cleaner name than list-set and tree-set, but I don't understand the semantics yet.

jpolitz commented 8 years ago

I believe assigning each data declaration, newtype declaration, opaque FFI constructor, and built-in type a natural number will suffice for between-type comparisons. This is most easily done dynamically with a per-runtime counter.

This allows the runtime to only call the greater method when those numbers are equal, and decide on its own when they differ.

There may need to be a new runtime error for newtype-branded values that don't define comparison methods.

blerner commented 8 years ago

But that can't be done in a separate-compilation manner, right?

jpolitz commented 8 years ago

I think the place to insert the counter increment is in makeVariantConstructor, which would set up the numbers at runtime while instantiating modules. So data would work fine with separate compilation.

Similarly, when declaring newtypes, there's a call into the runtime called namedBrander, which could close over the same counter. So newtype is fine.

We'd have to change the way opaque constructors are registered with the runtime. There are relatively few of those, so it shouldn't be too intrusive to create a makeOpaqueConstructor entrypoint into runtime that tracks a bit more information about wrapped types like images.

jpolitz commented 8 years ago

Separately, it's probably a good idea to not make this comparison be the behavior of < – there are too many errors that could result from comparing a number and a string that would be masked by this.

A new built-in function, probably called compare, would be the way to go.

blerner commented 8 years ago

Sorry, I wasn't clear. This isn't incompatible with separate-compilation, but it is somewhat non-deterministic with respect to loading. Since our loader makes no guarantees about the relative order that two independent modules load in, we'll make no guarantees about the comparison order between datatypes provided by those two modules. But our loader is deterministic, which means that most of the time, comparisons between datatypes will be stable (for example, all datatypes defined by the compiler will always sort before user datatypes, because the compiler loads first).

I'd rather that we have either fully-deterministic and stable comparison orders, or specified-to-be-randomized comparison orders.

As for a builtin function called compare: do we want to expose that to user-writable code, or make it compiler-internal? If the latter, then users can't write their own ordered collections over arbitrary data. If the former, then how do we educate users that this is a hazardous function to use?

jpolitz commented 8 years ago

Good point on loader determinism!

We could fully-specify the load order to make this deterministic, because import declarations appear in order. This requires specifying a particular topological sort (rather than any topological sort), but if we fix that, then there will be a unique dependency list for every single whole-program.

Of course, that means re-ordering import statements could change the program, which is quite a consequence!

So it may in fact be better to do something random so programs cannot rely on the load order of data definitions for compare, and programmers will notice that it is deterministic across runs.

compare should absolutely be user-visible. I think we simply write good documentation, I don't see it as hazardous, just necessarily not deterministic across different runs.

blerner commented 8 years ago

Couldn't we just have the order be dependent on the type name? Lexicographic ordering is stable regardless of load order...

jpolitz commented 8 years ago

URI + type name would be unique.

brownplt / pyret-lang

rename list-set #684