dfinity / motoko

Simple high-level language for writing Internet Computer canisters
Apache License 2.0

Add identifier escaping #1668

Open rossberg opened 4 years ago

rossberg commented 4 years ago

On a number of occasions, it is useful to have identifiers that do not follow the usual lexical conventions or collide with keywords:

Here is one possible suggestion:

I believe that char literals are rare enough that this shouldn't be a significant problem, and it frees up the use of a precious ASCII character. Thoughts?

nomeata commented 4 years ago

to work around names being reserved (e.g., and, or)

Would we also allow keywords in things like Bool.and and maybe variant { #and; #or }? (This would probably require treating .foo or #foo as a parser atom instead, which I think was rejected before.)

For char literals, we use double quotes and rely on the existing overloading mechanism.

We could go a step further and have Char <: Text, with characters being single-character text values, like Nat <: Int. One could use # with Char and Text alike and mix them, and one would need less annotation. But maybe this is more confusing than helpful, and probably not nice since we have chosen to constrain our subtyping by Candid’s subtyping relation…

rossberg commented 4 years ago

Would we also allow keywords in things like Bool.and and maybe variant { #and; #or }? (This would probably require treating .foo or #foo as a parser atom instead, which I think was rejected before.)

Well, I dismissed it exactly because it should use the same lexical syntax token as identifiers. But here we're talking about changing that very fact.

That said, I don't think it requires making them a single token. We can also do it with an "extended id" production in the parser that includes keywords.
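To illustrate the "extended id" idea, here is a minimal sketch of a toy parser in which keywords stay separate tokens in the lexer but are accepted as names after a dot. The keyword set, the token shapes, and the names tokenize and parse_path are all hypothetical and chosen for illustration; this is not how the Motoko parser is actually written.

```python
import re

# Hypothetical keyword subset, for illustration only.
KEYWORDS = {"and", "or", "not", "class", "stable"}

TOKEN_RE = re.compile(r"\s*(?:(?P<word>[A-Za-z_][A-Za-z0-9_]*)|(?P<dot>\.))")


def tokenize(src):
    """Split src into (kind, text) tokens; kind is 'kw', 'id', or 'dot'."""
    tokens, pos = [], 0
    src = src.rstrip()
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        word = m.group("word")
        if word is not None:
            # Keywords keep their own token kind; the lexer is unchanged.
            tokens.append(("kw" if word in KEYWORDS else "id", word))
        else:
            tokens.append(("dot", "."))
        pos = m.end()
    return tokens


def parse_path(tokens):
    """Parse `id ('.' ext_id)*`, where ext_id ::= id | keyword."""
    if not tokens or tokens[0][0] != "id":
        raise SyntaxError("expected an identifier")
    path = [tokens[0][1]]
    i = 1
    while i < len(tokens):
        if tokens[i][0] != "dot":
            raise SyntaxError("expected '.'")
        # The extended production: after a dot, a keyword is also a valid name.
        if i + 1 >= len(tokens) or tokens[i + 1][0] not in ("id", "kw"):
            raise SyntaxError("expected a name after '.'")
        path.append(tokens[i + 1][1])
        i += 2
    return path
```

With this, parse_path(tokenize("Bool.and")) is accepted, while a keyword in head position, as in "and.x", is still rejected.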

Char <: Text

That would force Char into a non-scalar representation. Doesn't that seem rather undesirable?

nomeata commented 4 years ago

That would force Char into a non-scalar representation. Doesn't that seem rather undesirable?

I wouldn't be overly worried. If the programmer experience is better this way (which I am not sure about, just brainstorming here), I think a slight performance hit is justifiable. In general we already have to bit-tag Char values, so putting them into the heap is not much more. And if it does matter, then we could start to put small text strings (<= 3 bytes) into such tagged scalars, which may improve Text code as well, e.g. occurrences of "\n" or "" would no longer need to be heap allocated.

But either way, these are optimizations that seem much less relevant than the developer ergonomics of whether they have to type-annotate "\n" when they want this to be a Char.
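As a rough illustration of the small-string idea above, the following sketch packs texts of at most 3 UTF-8 bytes into a single 32-bit word. The bit layout (2 tag bits, 2 length bits, 24 payload bits), the tag value, and the function names are assumptions made up for this example; they do not describe the actual Motoko runtime representation.

```python
SMALL_TEXT_TAG = 0b10  # hypothetical 2-bit tag marking "inline text"


def pack_small_text(text):
    """Return a 32-bit tagged word for texts of at most 3 UTF-8 bytes, else None."""
    data = text.encode("utf-8")
    if len(data) > 3:
        return None  # too long: would need a heap-allocated representation
    payload = int.from_bytes(data.ljust(3, b"\x00"), "little")
    # Layout: bits 0-1 tag, bits 2-3 length, bits 8-31 payload bytes.
    return (payload << 8) | (len(data) << 2) | SMALL_TEXT_TAG


def unpack_small_text(word):
    """Inverse of pack_small_text for words carrying SMALL_TEXT_TAG."""
    if word & 0b11 != SMALL_TEXT_TAG:
        raise ValueError("not an inline text word")
    length = (word >> 2) & 0b11
    payload = word >> 8
    return payload.to_bytes(3, "little")[:length].decode("utf-8")
```

Storing the length explicitly keeps "" and a text consisting of a single NUL byte distinct, which padding alone would not.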

nomeata commented 3 years ago

I just ran into this while trying to run the latest Candid test suite in Motoko. There we have values like (func "aaaaa-aa"."🐂"), but there is no way to express such types or values in Motoko.

Our IDL-Motoko design doc (https://github.com/dfinity/motoko/blob/master/design/IDL-Motoko.md) as well as the implementation in mo_idl/idl_to_mo.ml currently say we escape/unescape method names like record fields (including falling back to the hash), but that’s pretty bogus, as it's not reversible.
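To make the reversibility problem concrete, here is a small sketch of the 32-bit hash that Candid uses for record field labels, as I understand the Candid spec (a polynomial in 223 over the UTF-8 bytes, modulo 2^32). The function name candid_field_hash is made up for this example.

```python
def candid_field_hash(name):
    """32-bit hash of a field label: sum of utf8 bytes times powers of 223, mod 2**32."""
    h = 0
    for byte in name.encode("utf-8"):
        h = (h * 223 + byte) % 2**32
    return h
```

Every name, however long, maps to one of 2^32 numbers, so distinct names necessarily collide and the original string cannot be recovered from the hash alone. That is harmless for record fields, which are identified by hash on the wire, but not for method names, which are transmitted as strings.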

rossberg commented 3 years ago

You mean that's bogus because method names are not actually hashes, unlike record labels? Yeah, using that mapping seems broken then.

Agreed, Motoko can't express these. But whatever we do for Motoko, neither can many other languages. So, I'm not sure this can be solved in general. If somebody defines an interface with exotic method names, they're asking for trouble.

One could argue that we should not allow anything exotic in the first place, but that would feel over-restrictive -- there might be niches where it is useful. So perhaps this rather is something to put into an interface style guide?

nomeata commented 3 years ago

Maybe… in that case I’ll beef up the Candid test suite runner to skip tests not expressible in Motoko

chenyan-dfinity commented 3 years ago

hmm, I was under the impression that method names are sorted by the hash as well. We don't need reversibility. If Candid has a method 🐂, Motoko can just call the hash value of 🐂?

nomeata commented 3 years ago

hmm, I was under the impression that method names are sorted by the hash as well.

No, method names are stored as strings:

T : <methtype> -> i8*
T(<name>:<datatype>) = leb128(|utf8(<name>)|) i8*(utf8(<name>)) I(<datatype>)
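The rule above can be sketched as follows: the method name is written as its UTF-8 byte length in unsigned LEB128, followed by the UTF-8 bytes themselves, followed by the encoded type reference. The helper names are made up for this example, and the datatype_ref argument is assumed to be the already-encoded bytes of I(<datatype>), which this sketch does not compute.

```python
def leb128(n):
    """Unsigned LEB128 encoding of a non-negative integer."""
    if n < 0:
        raise ValueError("leb128 expects a non-negative integer")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more groups follow: set continuation bit
        else:
            out.append(byte)
            return bytes(out)


def encode_method(name, datatype_ref):
    """T(<name>:<datatype>): length-prefixed UTF-8 name, then I(<datatype>)."""
    raw = name.encode("utf-8")
    # datatype_ref is assumed to be the pre-encoded bytes of I(<datatype>).
    return leb128(len(raw)) + raw + datatype_ref
```

Since the name travels as an arbitrary length-prefixed UTF-8 string, names such as "🐂" or even the empty string are representable at the wire level.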

We don't need reversibility. If Candid has a method 🐂, Motoko can just call the hash value of 🐂?

I think we do need it for the type import/export, see first bullet point of https://github.com/dfinity/motoko/blob/master/design/IDL-Motoko.md#notes

nomeata commented 2 years ago

Just ran into this again; I used the empty string as a method name in Candid (to have shorter hand-written Candid data) in the test suite, but that didn’t work.

It’s only a matter of time until someone offers a service on the IC with a method name such as class or stable…

To solve this, we could do method name escaping somehow at the interface to Candid, not changing Motoko. Or we could add arbitrary identifier escaping to Motoko.

crusso commented 2 years ago

BTW, does the Candid spec specify the ordering used to sort method names? Is there actually a standard ordering on UTF-8 we can appeal to?

crusso commented 2 years ago

I think adding method name escaping to Motoko makes a lot of sense and don't see much difficulty doing it. C# has a similar feature and I'm sure many other industrial languages do too.

crusso commented 2 years ago

... and then we could be tempted to use zero width spaces to encode type identifier stamping (tempting but I'm not actually serious.)

nomeata commented 2 years ago

BTW, does the Candid spec specify the ordering used to sort method names? Is there actually a standard ordering on UTF-8 we can appeal to?

I assume it's lexicographic ordering of the utf8 encoding, but being explicit is helpful of course.
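A minimal sketch of that ordering, assuming it really is plain byte-wise lexicographic comparison of the UTF-8 encodings (the thread does not confirm what the spec says); the function name is made up for this example:

```python
def sort_method_names(names):
    """Sort names by lexicographic order of their UTF-8 bytes."""
    return sorted(names, key=lambda name: name.encode("utf-8"))
```

One convenient property of UTF-8 is that byte-wise order coincides with Unicode code point order, so this ordering is locale-independent and matches a plain code-point sort.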