Check for IDL hash collisions in Motoko type checker

nomeata commented 5 years ago

I think our type checker needs to reject code that tries to use records (or variants) as sharable types when the field’s IDL hashes collide, e.g. in { foo_ : (); foo : ()}, or in cases where the hashes collide.

We could either

consider such records not sharable (and give a helpful error message), or
ban them outright, even if they are not used in a shared context.

The latter ensures that programmers can be sure that all record types with shareable fields are shareable? Otherwise there is a risk that the programmer later wants to send a type over the wire and finds that he has to refactor the code, renaming some fields, in order to do so.

Assigning to Andreas for the policy decision.

rossberg commented 5 years ago

Agreed. I have a slight preference for the first option, but the second seems simpler, so I'd be fine with that as well, provided we have a hash function that does not get in the way.

Do "foo" and "foo_" actually collide or was that just meta? If they do then we should pick a better hash function. There is no good reason why there should be any collision for names with length <=4.

nomeata commented 5 years ago

Do "foo" and "foo_" actually collide or was that just meta? If they do then we should pick a better hash function. There is no good reason why there should be any collision for names with length <=4.

They collide because of the field escaping; we want to treat async_ as async, because we can’t write async as a field name (see https://github.com/dfinity-lab/actorscript/issues/695 for a summary of the unescape function).

If we choose the IDL’s has function for AS as well (see #733) (or maybe the last 31 bits of it) then we have less conflicts. It’s a bit dirty again to co-design the IDL with AS, but maybe we should just be pragmatic.

rossberg commented 5 years ago

Maybe the best solution would be to have an escaping mechanism for identifiers in AS itself, so that any string can be used as an identifier. Maybe _'...' or \'...'?

nomeata commented 5 years ago

That would work too, no strong opinions here. But we'd need two, right, one for arbitrary stings, and another one to encode a specific IDL hash value (_1_ under the current plan)

nomeata commented 5 years ago

We still need a plan here. But I guess a first good step is to make the typechecker complain if there are collisions, and think about a possible escaping mechanims later.

rossberg commented 5 years ago

SGTM. Your point that we would need two forms makes me less enthusiastic about an escape mechanism.

nomeata commented 5 years ago

One alternative, viable now that we have the type description header, is of course drop field hashing in the IDL, and just put the strings in the type description (with an indication into a list of strings). This would also make pretty-printing or dynamic use of the IDL format much nicer.

The downside is of course a bigger message (but still better than JSON) and that dynamic use of the IDL format is much nicer.

rossberg commented 5 years ago

Interesting idea. The other disadvantage I can see is yet more complexity, at least if you want to avoid duplication.

What about numeric labels? I suppose you'd need to encode them as text, inverting the current scheme (into something more like JavaScript).

nomeata commented 5 years ago

Interesting idea. The other disadvantage I can see is yet more complexity, at least if you want to avoid duplication.

Yes, that’s what I meant with an ~indication~ indirection into a list of strings.

It doesn't sound so bad… unti you wonder if you want it to be sorted :-)

Requiring it to be sorted makes serialization harder; essentially requires an additional pass over the data (or at least the types). Requiring it not to be sorted might be fine, if we require all lists of field names to be sorted by field name (and not just field name index). So that would work.

What about numeric labels?

What would that gain us?

rossberg commented 5 years ago

What would that gain us?

I meant, how to encode tuples, which some langs have? Currently, they naturally are just records with indices 0,1,... With textual labels, that becomes more cumbersome, too, and you'll need to add all the text encodings of numbers to the string list.

nomeata commented 5 years ago

Right. We could

allow both numbers and texts as field names (basically what happens when we promote the syntactic shorthand to a real thing)
address tuple fields by a string, ("1" etc…), as you say
add proper tuples to the IDL (and maybe use them instead of the second-class sequences for functions)

Neither is particularly pretty. But saying “we hash record field names because of tuples” is also a bit odd. Weird.

dfinity / motoko

Check for IDL hash collisions in Motoko type checker #744