encoding mini-languages into the type system

CeylonMigrationBot commented 11 years ago

[@gavinking] Now that we have tuple types, it's possible to encode the grammar of a simple "data format" or "mini-language" into the type system itself, obviating the need for some kind of regex language or compiler plugin.

What's going on here is that since we have alternative |, repetition [], and sequencing, [X,Y,Z], we can express a BNF in a type expression.

For example, the BNF of a time could be written as the following type:

alias TimeFormat =>
        [Integer, COLON, Integer, String=]
       |[Integer, COLON, Integer, PLUS|MINUS, Integer];

which would accept the following tuples:

[12, colon, 30]
[12, colon, 30, 'GMT']
[12, colon, 30, plus, 5]

The BNF for a date could be written as this type:

alias DateFormat =>
         [Integer,SLASH,Integer,SLASH,Integer]
        |[Integer,MINUS,Integer,MINUS,Integer]
        |[Integer,Month,Integer];

which would accept the following tuples:

[1,slash,1,slash,2012]
[2012,dash,1,dash,1]
[1, january, 2012]

Now, it's rather ugly to write out those tuples like that, but if we add a little escape syntax, which suppresses the usual interpretation of Ceylon's punctuation characters, we could write them like this:

\[12:30]
\[12:30 'GMT']
\[12:30+5]

\[1/1/2012]
\[2012-1-1]
\[1 january 2012]

And then, given:

class Time(TimeFormat tf) { ... }
class Date(DateFormat df) { ... }

We can write:

Time time = Time(\[12:30+5]);
Date date = Date(\[1 january 2012]);

And get a typing error if the format of the time or date is incorrect.

Since this is something can be implemented almost completely within the typechecker, it's a candidate for inclusion in Ceylon 1.0.

[Migrated from ceylon/ceylon-spec#522]

CeylonMigrationBot commented 11 years ago

[@gavinking] Note: actually it's not quite right to say that this supports repetition. It supports only an extremely limited kind of repetition: [A, B, C...], not sometime more general like [A, [B,C]...]. This is probably not a problem for things like dates, times, durations, cron expressions. It is a problem for URLs and email addresses. Is there a workaround?

CeylonMigrationBot commented 11 years ago

[@gavinking]

It is a problem for URLs and email addresses.

Actually a second problem with URLs and email addresses is that we would need, contrary to the above, to be able to interpret an unquoted "word" as string data instead of as an identifier, so it would be:

\[12:30 GMT]

And in:

\[1 January 2012]

The text January would not be validated at compile time.

Well, that's OK, I suppose.

Then an email address could be written:

\[name@domain.org]

But the best we would do in terms of validating it at compile time would be the following "BNF":

alias EmailFormat =>
        [String, AT, STRING, DOT, STRING|DOT...];

which unfortunately accepts name@domain..org..

What I really want to write is:

alias EmailFormat =>
        [String, AT, STRING, (DOT, STRING)...];

But I don't know if we can represent that in our type system.

CeylonMigrationBot commented 11 years ago

[@loicrouchon] I'm not a big fan of \[] syntax to disable interpretation of Ceylon's punctuation characters.

Actually, I'm don't really like to disable those characters interpretation, because if I understand well your proposition, it means that only Ceylon's punctuation characters can be used as separtor and that no expression will be allowed inside and we'll have to use the already existing syntax [1, now.month, now.year + 1]

For the [A, [B,C]...] problem, I don't understand why it's not working. Introducing an alias alias D => [B,C] wouldn't allow [A, D...]?

CeylonMigrationBot commented 11 years ago

[@gavinking] @ loicrouchon we probably need an "unescape" sequence. To be able to write something like this:

\[ 1 / 1 / {year} ]

Or, given your example:

\[ 1 / {now.month} / {now.year + 1} ]

CeylonMigrationBot commented 11 years ago

[@gavinking]

For the [A, [B,C]...] problem, I don't understand why it's not working.

Because that is a tuple containing an A then an arbitrary number of pairs of type [B,C].

The type of [a, b, c, b, c] (or \[a b c b c]) is [A,B,C,B,C], not [A,[B,C],[B,C]], unfortunately.

CeylonMigrationBot commented 11 years ago

[@gavinking] So the following alias would (partially) solve the problem:

alias Repeat<X,Y> 
        => Tuple<X|Y,X,Tuple<X|Y,Y,Repeat<X,Y>>>;

But this blows up the typechecker as noted in #3630, and is probably undecidable. (@RossTate WDYT?)

To solve the problem completely, my guess is that you would need to make repetition [A, (B,C)...] a primitive construct in the type system, not just a syntax sugar the way [A, B...] is today.

But, in fact, I think there is a simpler solution to the practical problem, which I'll outline in the next comment.

CeylonMigrationBot commented 11 years ago

[@loicrouchon] So the problem is that it would force us to write using this syntax [a,[b,c],[b,c]] for type [A,[B,C]...] if we don't have a Repeat<X,Y> construct as you proposed.

CeylonMigrationBot commented 11 years ago

[@gavinking] So, at the cost of losing a little flexibility in terms of the patterns that are supported, we can tighten up the validation of the patterns as follows:

input of the form x.y.z has the type DotSeparated<[X,Y,Z]>
input of the form x y, w z has the type CommaSeparated<[[X,Y],[W,Z]]>
input of the form ( x ) has the type Parenthesized<X>
input of the form [ x ] has the type Braced<X>

So, for example, the syntax:

\[float f(float x, float y)]

Would actually mean:

['float', 'f', Parenthesized(CommaSeparated([['float', 'x'],['float', 'y']]))];

And would match this pattern:

alias FunSigFormat => 
        [String,String,Parenthesized<CommaSeparated<[String,String][]>>];

And we could define:

alias EmailFormat =>
        [String, AT, DotSeparated<[String,String...]>];

Now, of course, this means assigning fixed precedences to certain symbols, that might not be appropriate for every imaginable format. But if we nail certain important cases, that's probably good enough. This is not supposed to be—and can't possibly be—a totally general parsing facility.

An open question is: if we kept the usual precedence of symbols in Ceylon expressions, would that be sufficiently flexible to be able to express most of the data formats we're interested in? Probably, I imagine.

CeylonMigrationBot commented 11 years ago

[@gavinking] Upon further reflection, I would probably go for a syntax like fun(\ ..... ) and fun { arg=\ ..... ; } instead of fun(\[ ..... ]).

a { href=\http://hibernate.org; }

by(\Gavin King, name@domain.org)

Address(\3340 'Peachtree Rd', Atlanta GA, 30325)

Datetime(\12/12/2012 12:30+5)

I feel that this approach, despite its manifest limitations, is going to be better than some kind of compile-time-regex based solution, since it doesn't just validate the format, it also parses the input into a data structure. Plus it's pretty cool how it re-uses the type system to express the grammar.

However, I'm not enough sure of this to want to implement this in Ceylon 1.0. Indeed, it's extremely unclear that any of the above is sufficiently more readable and typesafe than:

a { href='http://hibernate.org'; }

by(Name('Gavin', 'King'), Email('name@domain.org'))

Address(3340, 'Peachtree Rd', 'Atlanta', 'GA', 30325)

Datetime(Date(12, 12, 2012), Time(12,30,+5))

to justify the admitted complexity of the feature.

CeylonMigrationBot commented 11 years ago

[@loicrouchon] To come back to the \ notation to disable interpretation of Ceylon's punctuation characters, what is the type of this \0.0?

Is it [Integer, String, Integer] or [Float]?

If the . is not interpreted this would mean that we can't use float for \ notation which is pretty annoying. But if it is, then, we cannot determine (or at least, I don't see how) the type without having a look at the left side which as far as I know is not the way it should work in ceylon.

CeylonMigrationBot commented 11 years ago

[@gavinking]

what is the type of this [0.0]?

That's one I ran into too. It would be a Float, because that's how it is tokenized by the lexer. This could be a problem with the idea in the case of stuff like IP addresses.

CeylonMigrationBot commented 11 years ago

[@gavinking] Here's a scaled-back version of this idea. This time around I feel like using single quotes to delimit the data format ;-)

The contents of a single-quoted literal is tokenized by the lexer.
A single-quoted list of tokens results in a tuple, with one element for each token.
The type of an integer literal is Integer.
The type of a float literal is Float.
The type of a double-quoted string literal is String.
The type of an identifier, or a dot-separated list of identifiers is String.
The type of a symbol token depends on the symbol (i.e. Comma for a ,, Slash for a /, Colon for a colon, etc).
The type of an interpolated expression is the expression type.

This scheme would be sufficient for very basic validation of simple data formats like the one's we're looking at here.

You would be able to write all of the following:

a { href='http://hibernate.org'; }

by('Gavin King, name@domain.org')

Address('3340 "Peachtree Rd", Atlanta GA, 30325')

Datetime('12/12/2012 12:30+5')

The expressions are equivalent to:

a { href=["http", Colon(), Slash(), Slash(), "hibernate.org"]; }

by(["Gavin", "King", Comma(), "name", At(), "domain.org"])

Address([3340, "Peachtree Rd", Comma(), "Atlanta", "GA", Comma(), 30325])

Datetime([12, Slash(), 12, Slash(), 2012,12, Colon(), 30, Plus(), 5])

For more complex URLs and email addresses, you would need to pass the address as a string, and you would lose the validation.

CeylonMigrationBot commented 11 years ago

[@RossTate] What happens with floats and IP addresses?

CeylonMigrationBot commented 11 years ago

[@gavinking]

What happens with floats and IP addresses?

You need to quote IP addresses. Like I said, it's the ordinary lexer tokenizing this.

CeylonMigrationBot commented 11 years ago

[@gavinking] Oh, and you would be able to write:

Date('1\1\' year '')

CeylonMigrationBot commented 11 years ago

[@loicrouchon] Using simple quotes, it's much better :) But reading the work the lexer will, it is going to work for IPv4 address? 0.0.0.0 will be of type understood as [Integer, Integer, Integer, Integer] or [Float, String, Float] ([0.0, '.', '0.0])?

Note that this may not be a problem if by identifier you mean valid ceylon identifier, in that case, the dot after 0.0 will not be understood as a String but maybe as the dot punctuation character and in that case, it will be [Float, Dot, Float].

So my question is how the lexer is going to deal with that? is it greedy? is there some backtrack mechanism?

CeylonMigrationBot commented 11 years ago

[@luolong] Yeah, I concur with Loic - While reading this proposal, I immediately started asking myself, if you are going to use the escape character, why not use single quotes for escape char. Gavin beat me to it :)

As for the more exotic formats, the version numbers and IP4 addresses come to my mind, where the parser rules get a bit ambiguous.

I am just thinking out loud here and I have next to no knowledge of the internals of the Ceylon compiler (yet), so bear with me if I'm talking rubbish:

Parser could just choose not to tokenize floats while in quoted mode. This way, the expression '192.128.0.11' would evaluate to [Integer,Dot,Integer,Dot,Integer,Dot,Integer]. This is probably the easiest to implement and would also offer the element of least surprise.

Given that the quoted literal syntax is not very likely to be used in the context where the intended type of the expression is unknown, the compiler could also perform some voodo to check that if the type expects a Float, the [Integer, Dot, Integer] combo would be collapsed into Float instead.

CeylonMigrationBot commented 11 years ago

[@loicrouchon]

Given that the quoted literal syntax is not very likely to be used in the context where the intended type of the expression is unknown, the compiler could also perform some voodo to check that if the type expects a Float, the [Integer, Dot, Integer] combo would be collapsed into Float instead.

The problem with that "Voodo" is when there is no left side or when left side is value What would it print 192 or 192.168?

print('192.128.0.11'.first.string);

and

value ip = '192.128.0.11';
print(ip.first.string);

So I don't think we can rely on expected type.

Maybe a solution would be to consider to analyze the single quoted expression excluding Float. This would give [Integer, Dot, Integer, X, ...] and then transforms into [Float, X, ...] If X is not a Dot. But even that solution seems weird to me.

CeylonMigrationBot commented 11 years ago

[@gavinking] Problem is that Float and Integer are distinguished in the lexer, not in the parser.

CeylonMigrationBot commented 11 years ago

[@loicrouchon] Just to know, what would be the current behavior when trying to analyze an IP using the lexer?

CeylonMigrationBot commented 11 years ago

[@luolong] @loicrouchon: Yeah, if there is no left side, the whole thing falls apart, but the examples you picked are also quite artificial and stretched.

In fact, in the sane world both expressions - print('192.168.0.11') and value ip = '192.168.0.11', the quoted expression should be treated as a String.

The whole minilanguage is only useful if there is a type to validate against. Otherwise the single quoted string should be treated as a String.

@gavinking: So if I understand you correctly, the trouble is that in the Lexer we don't still have enough type information to validate and "magically" adjust the expression?

eclipse-archived / ceylon

encoding mini-languages into the type system #3628