dart-lang / language

Design of the Dart language
Other
2.66k stars 205 forks source link

[records] parse errors when trying to annotate a record type class field #2469

Closed pq closed 1 year ago

pq commented 2 years ago

The following code generates what look like parser errors.

image

error: Annotation must be either a const variable reference or const constructor invocation. (invalid_annotation at [sandbox] lib/public.dart:27)
error: Variables must be declared using the keywords 'const', 'final', 'var' or a type name. (missing_const_final_var_or_type at [sandbox] lib/public.dart:28)

/fyi @jensjoha

bwilkerson commented 2 years ago

My first guess is that we're parsing the record type as an argument list for the annotation. Not sure whether the proposal says how we should disambiguate this case.

@munificent

lrhn commented 2 years ago

Ouch, that is an ambiguity. Not sure there is any good solution.

@foo 
(int, int) Function(int) bar;

is completely ambiguous.

There is no simple disambiguation. We can't add something to @constantName to show that it doesn't want the argument. We can't add something to (int, int) Function(int) to show that it does want the (otherwise optional) return type.

The best hacks would be:

@foo 
@DUMMY()
(int, int) Function(int) x;

or

typedef typeof<T> = T;

@foo 
typeof<(int, int) Function(int)> x;

Neither is something we can reasonably ask people to do.

bwilkerson commented 2 years ago

@stereotype441

pq commented 2 years ago

@leafpetersen

leafpetersen commented 2 years ago

I guess one (unfortunate) option is to disambiguate based on space. @foo(...) is an annotation constructor call, @foo (...) is an annotated record type. We'd want to be sure that dartfmt didn't change the spacing in that case though.

munificent commented 2 years ago

Moving this over to the language repo because this is an ambiguity in the actual language feature and not a parser bug. We'll have to change the proposal to handle this. :(

munificent commented 2 years ago

One idea is to put Record at the beginning of record types, like:

class C {
  @deprecated
  Record(int, int) pair;
}

@foo
Record(int, int) Function(int) bar;

I don't like the verbosity, but it's consistent with function type syntax and resolves this ambiguity. It also makes the zero-element record type Record() instead of () record, which I kind of like because the latter feels a little terse and opaque for a type annotation.

I'll think about this some more and see if I can come up with any other ideas.

So how about them types on the right?

munificent commented 2 years ago

Another (probably bad) idea: We could extend the language to support named type parameters. That could arguably be useful for other user-defined generic types because (as with optional named parameters) it would let use sites omit the type parameters can be inferred/defaulted and write only the ones the user cares about.

If we had that, then record types (and in principle function types) could look like Record<int, String s>.

But given that we already have function types which use parentheses and users seem to be OK with, this feels like the wrong approach.

lrhn commented 2 years ago

So we can either change the record type grammar, or the annotation grammar, or add something that users can use to disambiguate in cases where there is ambiguity.

The things we can do without changing the grammar are annoyances to users, and pretty much undiscoverable (like the typeof<T> wrapper).

We could introduce <type> as grouping parentheses for types, without the leading typeof, so you can do @foo <(a, b)> bar(){} and have (a, b) be a type instead of a parameter list. Will probably cause other ambiguities instead, though.

We can change the annotation grammar. Lots of options, all of them breaking and requiring migration, but probably something that can be automated.

We could decide to disallow annotations without parentheses. It would be slightly annoying for a few annotations. Mainly @override, which should be a language feature anyway. And probably a bunch of others that I don't know because I never use them anyway.

We could change the annotation grammar entirely to use, say, the C#-style: [expression] instead of @expression. If we are changing something, we might as well change everything 😁 .

Or we can introduce some sort of delimiter you can put after annotations to separate them from the following thing. Trailing comma? Optional semicolon after annotations? (Giving people optional semicolons, just not the ones they want!)

Changing the record type grammar is less breaking because it doesn't exist yet, so we won't break any code. It's just so. darn. simple. right now, and any change will make it worse.

Options like Record(current record type) or #(current record type) disambiguate a lot. They're also verbose and somewhat unintuitive. The # is better than Record, mainly from being shorter, and it won't conflict with symbols. Both are better than using Record<...>, that looks too much like a generic thing. I'd use the same rules as for Function: if followed by ( it's parsed specially, otherwise it's just the type Record itself. Also, then the type won't look like the destructuring syntax, or the literal syntax.

bwilkerson commented 2 years ago

FWIW, I actually find Record(int, {String b}) to be intuitive, and I especially like the symmetry with function types because the two kinds of types have a lot of similarity.

stereotype441 commented 2 years ago

I think my favorite of the suggestions above is using Record(int, {String b}) to denote a record type. I really like the similarity to what we do for function types.

One other random idea: add an optional : to the end of the annotation syntax (which maybe we make required in Dart 3.0). Then the user disambiguates via the placement of the colon, e.g.:

@deprecated:
(int, int) pair;

vs.

@Deprecated('message'):
foo;

If the user leaves off the : we should assume the thing in parentheses is an annotation argument, for backwards compatibility.

pq commented 2 years ago

FWIW, I lean towards Record(int, {String b}) for record types too and agree that the symmetry with functions is nice.

lrhn commented 2 years ago

I like the Record(actualRecordType) syntax on principle, it's consistent, readable and easy to parse, but I worry it is too verbose.

When you want to use a record type, it's in a place where you don't want the ceremony of introducing a class. Adding extra verbosity to those places makes it less convenient, and might just remove some of the lure of having records. (Not all, but some.) Or maybe I'm over-thinking it.

scheglov commented 2 years ago

Looking on the motivation in the specification, I see three reasons: (1) no behavior, (2) verbosity, and (3) coupled to a class definition.

Surprisingly, I don't see performance, i.e. I'd expect that because we don't need allocating a new object instance, and then GC it, and instead just return a few values on the stack, it is slightly faster.

I don't understand (3) as well. What is wrong with being coupled to a class name?

For (2) I think the biggest issue is not verbosity itself, but non-locality. I have to scroll down to the bottom of the file to declare a new class that I will return from this method. Not a huge work, I admit, but still. Typing Record is still more lightweight than scrolling. Although of course adds some verbosity.

And I also like the symmetry with Function.

lrhn commented 2 years ago

I'll admit, I also find Function too verbose. And I can't even typedef my way around it, because even with typedef Fn = Function, the int Function(int) syntax is a special syntactic form unrelated to the similarly named class, so the alias doesn't apply.

Compare Verbose Concise Overly Brief?
Record(int, int) Rd(int, int) #(int, int)
int Function(int) int Fn(int) (int)->int

I think a point could be made for shorter keywords. :)

Jetz72 commented 2 years ago

Could you make the word Record optional perhaps? Allowing both Record(int, int) and the shorter (int, int)? The latter would probably be used in most cases, but if you needed to precede it with an annotation (or some other ambiguous case comes up down the line) or you just wanted to be more explicit, you could write out the full Record.

mateusfccp commented 2 years ago

And I can't even typedef my way around it, because even with typedef Fn = Function, the int Function(int) syntax is a special syntactic form unrelated to the similarly named class, so the alias doesn't apply.

Could we fix this?

TimWhiting commented 2 years ago

I prefer the more concise syntax for records types. I agree with @lrhn that already Function is a bit verbose.

Compare:

Record(String, String) doSomething(Record(String, String) Function(Record(int, int), String Function(int)) transform) {
 return transform(((1, 2), (i) => '$i'));
}

with the current (with allowing Fn)

(String, String) doSomething((String, String) Fn((int, int), String Fn(int)) transform) {
...
}

Or:

doSomething(transform: ((int, int), int -> String) -> (String, String)) -> (String, String) {
...
}

With disallowing : for default named parameters I think @munificent hinted that would make it easier for types on the right, but that would be a hugely breaking change. Would that make it easier for optional semicolons and other requests though? :) One can hope.

In any case, I know this is a contrived example, but Record seems unnecessarily wordy. Especially for a language that has void main() rather than public static void main(String[] args).

I like @stereotype441's suggestion of adding an optional : after metadata (could even do a ;) for disambiguation, but if you really want a prefix I liked @lrhn's suggestion of #(int, int)

stereotype441 commented 2 years ago

Another idea that came up in this morning's language team meeting is to disambiguate the two types of annotations based on whether there's whitespace between the identifier and the (. So e.g. @foo(...) would parse as an annotation that constructs an instance of foo, whereas @foo (...) would parse as the annotation @foo followed by a record type.

Although technically breaking, this disambiguation rule would probably do the right thing in 99.9% of cases.

leafpetersen commented 2 years ago

Summarizing discussion from this morning.

There are two broad approaches we could take: do something with the annotation syntax to disambiguate, or change the record type syntax.

On the first approach:

On the second approach:

There isn't broad agreement on the team as to a direction yet, we'll keep discussing this.

leafpetersen commented 2 years ago

One possibility that we didn't discuss yet is using a different infix operator for the type syntax. An idea which probably doesn't work is to use , without parentheses:

The ML family of languages uses infix *.

A variant of the above would be to use parentheses with an infix * as a delimiter:

Other delimiters we could use (with or without parentheses):

jensjoha commented 2 years ago

I had hoped to stay out of this discussion, but feel I have to voice my dislike of using spaces (and lack of spaces) as disambiguation. On an objective level, we've never done anything like that before and e.g. actively chose a potential exponential (as I recall) parsing for disambiguating between null-aware bracket or a conditional. As also mentioned elsewhere here it will also invalidate assumptions for everything having to do with dart code (say for instance code generation and code formatting). On a completely subjective level I also just don't like it.

(And now that I'm already writing a comment:)

As for, e.g., Rd I'd point out that I'd probably not get through a code review with code including such a weird abbreviation.

As for having two different syntaxes so that one can use one in normal cases and the other if the normal case is ambiguous seems complete bonkers to me. To me, at least, having a single consistent syntax would be much preferable.

Also I personally don't mind the verbosity of Record. My mind would have an easier time remembering that, than some (combination of) punctuation(s) or abbreviation(s). That it becomes too long when nesting enough of them doesn't sound like a good argument to me --- with enough nesting it's going to be completely unreadable no matter what we do.

lrhn commented 2 years ago

The Record(...) syntax is definitely the safe and rational choice.

It's consistent with Function. It's safe in all the right ways, least likely to set us up for problems in the future, because the grammar is well known. Accepts all record types without problems.

Using significant whitespace may fix this particular problem, but it leaves us wide open to the next conflict with an untagged parenthesis occurring after something else. It locks us down with regard to what other syntax we can add. I'd be happy using the significant whitespace solution, but I don't think it's the most forward-looking choice. (Edit: And we already had another one: on (e, s) {} after a try. Raw parentheses are darn dangerous.)

The problem with Record(...) is that it doesn't bring me joy. And I know how much writing Function annoys me today. It's a constant annoyance, something abrasive that adds to my daily amount of wear and tear every single time I write it. I doesn't go away. It doesn't get better. It's just "here we go again". It's a cost of using Dart that I have to write Function when I want a function type. And I think that matters.

In that comparison, I'd prefer #(...) for the type, and not for the values, even if it takes up precious grammatical space, because it will keep records a low syntactic overhead. I won't mind writing it. It's still searchable (#( is unique). And if we can't use our syntax for something as central as the product type, what can we use it for?

ds84182 commented 2 years ago

I really like the colon suggestion from @stereotype441. I think allowing it (not making it required) wouldn't be a breaking change right? The point is there needs to be some delimiter between annotations and what they're annotating. Of all the solutions presented, it feels like the cleanest.

Also this applies to annotated parameters as well. int foo(@annotation (int, int) latLong) has the same ambiguity.

Wdestroier commented 2 years ago

Sometimes I feel like the parenthesis are extra unnecessary code, let's say, compared to Golang.

Example: int, int getOrigin() { return 0, 0; } vs (int, int) getOrigin() { return (0, 0); }.

However, it's not too much code and may make the code more readable.

In my opinion, having to write Record defeats the main advantage of records by requiring so much boilerplate and making the code not look good.

TimWhiting commented 2 years ago

As another consideration. Right now the Records feature does not break user written classes named Record because the user written declaration shadows the dart:core Record, and the record type does not need the name Record in it. If you make it mandatory to write Record as part of the type, it might make the disambiguation a little weird.

// User written
class Record {}

Record(int, int) f0(Record(int, int) value);

Record f1(Record value);

I'm assuming you would do the right thing here and resolve to the dart:core Record type for f0, but the user written class for f1, (they are disambiguated because of the parentheses, but it definitely makes it confusing to read).

lrhn commented 2 years ago

We'd have to be absolutely certain that we can distinguish whether Record(int) should be a type or a constructor invocation. We generally can do that. A lot of Dart would break down if we couldn't distinguish whether a position should be a type or an expression, since identifiers can be both. So it would probably work out. Like for Function, the following ( force-disambiguates it into a type in a type-position, but not in an expression position.

TimWhiting commented 2 years ago

I can't think of any true ambiguity (like you say the following ( disambiguates like Function) but there will be a lot during editing. Here are a few I can think of.

Record(int, int) f0(Record(int, int), value);
//                         ^ record parameter missing a name, or is this a function parameter named Record 
// (function parameter as written)

void main() {
  Record(int, String)
 { //  ^ is this a local function (yes), but also could be 
// an object instantiation / function call with a missing semicolon
// a variable declaration with a missing name & semicolon
// (Makes it harder to be rid of semicolons), though I guess this is an existing problem with blocks
  }
}
natebosch commented 2 years ago

Would using #( for types, but not for expressions, open up the possibility for record type literals? If we use either #( or Record do we plan to allow for record type literals?

munificent commented 2 years ago

Would using #( for types, but not for expressions, open up the possibility for record type literals?

Yes, it would.

If we use either #( or Record do we plan to allow for record type literals?

I don't think we support Function(...) for function type type literals. I believe it's because relying on a built-in identifier (and not a true reserved word) makes that weird. So we probably wouldn't support record type literals if we use Record(...). Honestly, I think type literals are often more trouble than they're worth. I'm not inclined to double down on them because making the type annotation syntax overlap the expression syntax just paints us into a corner in a lot of ways. (Like the fact that we couldn't use class names as tear-offs for the unnamed constructor.) So I'm not inclined to make record type type literals a priority. But if we picked a syntax like #(...) which made it clearly unambiguous, we definitely could allow them.

lrhn commented 2 years ago

I too would avoid allowing type literals for anything new. If it doesn't follow the existing grammar (identifier, identifier + type parameters), then I won't introduce new syntax to allow the type literal. It takes syntax away from much more worthwhile features (aka. any other feature).

We can define typedef typeof<T> = T; today and allow typeof<any type> as a type literal for any type. Use $ if you want it even shorter. Creating a type literal for a type is not a problem which needs a syntactic solution.

Also #2393, for the extreme position.

munificent commented 2 years ago

We discussed this in the language meeting today. While the alternatives on the table have their merits, we prefer the original syntax that's currently in the proposal.

In order to resolve the ambiguity, we will take @leafpetersen's original suggestion to make whitespace significant. That means:

// Metadata with argument list on function with no return type:
@annotation(a, b) function() {}

// Metadata with no arguments on function returning a record:
@annotation (a, b) function() {}

Note that we'll make any whitespace significant, not just newlines. Making newlines significant would work for function and field declarations where idiomatic formatted code puts annotations on the previous line. But it doesn't work for annotations on function parameters which are kept inline, as in:

function({
  @deprecated (int, int) pair
}) { ... }

The idea is that we'll try this approach out while the feature is still behind an experiment flag. During the implementation, we'll get a better sense of whether making the parser whitespace-sensitive in this way feels brittle or regretful. And we can try out this behavior to see how it feels in practice. If it feels like a bad choice, we can revisit it. But if it works (and we think it will), then it lets us keep the syntax we like the most.

We have not figured out exactly how whitespace is sensitive here yet. Is it only for cases that are actually ambiguous? Or do we never treat a parenthesized thing after a metadata annotation as its argument list if there is space before it?

Here's an initial suggestion:

Thoughts?

bwilkerson commented 2 years ago

Here's an initial suggestion: ...

@jensjoha

I have to wonder how that will impact the performance of the parser. Today the parser is very eager when it comes to invoking methods on the listener, so this will effectively require it to parse every piece of metadata twice (once to decide how to parse it and once to actually parse it). It might not be a huge hit, but my understanding that the performance of the parser is fairly important.

There was some discussion in the past about building a general mechanism to support recovery (basically capture the events that would have been sent to the listener and then either actually send them or completely discard them based on whether the attempted parse succeeded. It might be time to revisit some of those discussions.

leafpetersen commented 2 years ago

We have not figured out exactly how whitespace is sensitive here yet. Is it only for cases that are actually ambiguous? Or do we never treat a parenthesized thing after a metadata annotation as its argument list if there is space before it?

The latter is my preference, and I think it shouldn't have any performance implications at all. If you see @<identifier><whitespace> you're done with the metadata, otherwise whatever is there instead of the whitespace is an argument list. Does that fall down somewhere?

bwilkerson commented 2 years ago

That would certainly resolve the parsing issues. There's no need for backtracking under that definition.

munificent commented 2 years ago

That sounds good to me too.

lrhn commented 2 years ago

We need to decide where whitespace can go when there is a type argument.

We can choose any of:

@foo<int>(bar) @foo <int>(bar) @foo<int> (bar) @foo <int> (bar)
No internal space ✔️
No space before arguments ✔️ ✔️
No space after identifier ✔️ ✔️
No ambiguity if type arguments ✔️ ✔️ ✔️ ✔️

The remaining combinations would be inconsistent.

We can definitely say that any whitespace after identifier ends the metadatum, that reduces it to "no space after identifier" and "no internal space", both valid options.

The grammar production for those could be:

No internal whitespace:

'@' <identifier> ( ~<WHITESPACE> <typeArguments>? ~<WHITESPACE> <argumentList>)?

No whitespace after identifier:

'@' <identifier> ( ~<WHITESPACE> <typeArguments>? <argumentList>)?

No whitespace before arguments:

'@' <identifier> ( <typeArguments>? ~<WHITESPACE> <argumentList>)?

No ambiguity if type arguments:

'@' <identifier> ( <typeArguments> <argumentList> | ~<WHITESPACE> <argumentList>)?

(I'm fairly sure I'm stretching the grammar by using negative lexical tokens in grammatical rules as negative lookahead, where in lexical rules they match an actual character other than the negated set.)

leafpetersen commented 2 years ago

We can definitely say that any whitespace after identifier ends the metadatum, that reduces it to "no space after identifier" and "no internal space", both valid options.

I don't have extremely strong opinions here, but I feel like "no space after the identifier" is easy to explain, and should be easy to implement (you could in principle do it in the tokenizer, I think).

lrhn commented 2 years ago

The one risk of doing "no space after identifier" vs "no internal space" is that if we ever want to allow explicit generic instantiation as metadata, @foo<typeargs> and nothing more, we would introduce another ambiguity. If we go with "no internal space" now, we don't have to change it later.

On the other hand, if we never allow @foo<typeargs> by itself (which would be weird, since we're then annotating with either a function or a Type object), making our current life easier is also valid.

We can change it later. The formatter will have ensured that no existing formatted code at that time will be affected by such a change.

So, "no space after identifier" SGTM.

munificent commented 2 years ago

I hate myself for even asking this, but what about:

@foo/*comment*/(a, b) thing;
lrhn commented 2 years ago

Don't hate yourself. Hate society for bringing us to this point!

Having a comment isn't even completely unreasonable at that point. Consider a generic annotation, and

@Marker/*<int|String>*/(const bool.hasEnvironment("flooNum") 
    ? const int.fromEnvironment("flooNum") 
    : const String.fromEnvironment("floo"));

Ok, a little speculative maybe, it's in a position where no reader benefits from the comment.

I'm very tempted to say that if the next source token (ignoring comments and whitespace) after the identifier does not start right where the identifier ends, it's not part of the annotation. So no comments either.

If we allow comments, we get into cases like:

@foo// Nyah, nyah
(int, int) get foo => 1;

and we'll have to decide whether the newline is part of the comment or counts as a separate whitespace. (Or forget to consider some similar case, and have implementations do different things.)

The KISS is (/< must be right after the identifier. That's unambiguous. Allowing anything there opens us up to unforeseen complications and special cases.

There are cases of:

@deprecated // deprecated

in the world today, but I don't see any cases of the / being flushed against the identifier (thank you Dart formatter!).

That brings up another way to consider it: What would the formatter do? It does:

@C /*x*/ ()
@C /*x*/ <int>()
@C<int> /*x*/ ()
@C<int>() /*x*/
@C /*x
  y*/
()
@C /*x
  y */
<int>()
@C<int> /*x
  y*/
    ()
@C<int>() /*x
  y*/
@C // x
<int>()
@C<int> // x
    ()
@C<int>() // x
void main() {}

The formatter wants to put spaces between comments and other things. All existing formatted code will have the space, which means that existing code with comments between identifier and </( will stop working anyway. Making the comment itself breaking won't change that. (If there was any such existing code,)

stereotype441 commented 2 years ago

That brings up another way to consider it: What would the formatter do?

I want a bumper sticker that says "What would the formatter do?" 😃

The KISS is (/< must be right after the identifier. That's unambiguous. Allowing anything there opens us up to unforeseen complications and special cases.

Agreed. And it's consistent with the way the scanner treats comments elsewhere in the language. E.g. int/* */x means int x, not intx. And i-/* */- means i- - (a syntax error), not i--.

lrhn commented 2 years ago

I wouldn't say that int/* */x means int x, just that comments are token separators the same way as whitespace. It means int and x are separate tokens, and i-/**/- is three tokens instead of two.

For @id(), the id and ( are going to be separate tokens anyway, with or without intermediate token separators. Adding or removing whitespace or comments between id and ( doesn't currently affect parsing at all.

The difference for the "whitespace sensitive parsing" is that we make the parsing depend not only on the sequence of tokens, but also on their physical location, the presence or absence of non-token source characters between them. That's something we don't do anywhere else in the current grammar (as far as I know).

The formatter is guaranteed to work, because all it does is to change the position of tokens, but it makes sure to not change the order or the tokenization (no making two identifiers touch). With whitespace sensitive parsing, the formatter must also make sure to not put space between the id and (, or remove such space, which is a safe operation everywhere else.

It's nothing insurmountable, but it is a new thing, and a special case that must be handled specifically by new code that doesn't already exist.

I guess you could say that we introduce new tokes IDLT : IDENTIFIER '<' and IDPAREN : IDENTIFIER '(' and make the grammar of metadatum be:

 '@' (<identifier> | IDLT <typeList> '>' <arguments> | IDPAREN (<argumentList> ','?)? ')'

(and then everywhere we otherwise accept identifier '<'/identifier '(' we should also accept IDLT/IDPAREN too.)

Well, except that it's not IDENTIFIER, it's qualified, so we need to tweak that too.

munificent commented 2 years ago

That brings up another way to consider it: What would the formatter do?

I want a bumper sticker that says "What would the formatter do?" 😃

"But, doctor, I am Pagliacci!"

The formatter wants to put spaces between comments and other things.

It does have some logic to not put spaces before /* and after */ in certain contexts, mainly to try to make the old generic method comments look right. That logic is heuristic-based and pretty brittle.

It's nothing insurmountable, but it is a new thing, and a special case that must be handled specifically by new code that doesn't already exist.

It's not entirely new. In addition to not adhering identifiers together (which is obvious), the formatter also has to take care to insert a space between nested unary minus so that it doesn't turn - -2 into --2.

lrhn commented 2 years ago

The formatter is probably more aware of whitespace than any other tool. It makes sense that it is already doing things with it. The new code I was talking about was also in the parser, which should not have anything today which checks whether two tokens are adjacent. Maybe it's easy, but it will be new.