dart-lang / language

Design of the Dart language

URI shorthands, allow reserved words. #3984

Closed lrhn closed 2 months ago

lrhn commented 3 months ago

I propose to allow reserved words to occur in unquoted URIs like any other identifier, because URI/file-system paths are not Dart code, and have no reason to be restricted by what happens to be Dart reserved words.

The proposal for URI shorthands allows a sequence of tokens of the form <dottedIdentifierList> ('/' <dottedIdentifierList>)* as an unquoted import.

It's a syntactic grammar, not a lexical grammar, which means that it does not affect tokenization, which would otherwise need to be context sensitive, and so it can reuse existing grammar productions.

An example import could be import somepackage/somelibrary;.

By using <dottedIdentifierList>, it only allows identifiers as parts of the path segments of the URI, which means that reserved words are not allowed. Directory names and URI path segments have no notion of Dart identifiers, so restricting the shorthand to Dart identifiers seems unnecessarily restrictive, and may disallow using the syntax for URIs like:

import package.for.dart/banana; // contains `for`.
import mypkg/src/new/library; // contains `new`

Dart reserved words are small common words that can occur as directory names, or as parts of dotted names (if anyone used dotted names).

It's easy to allow reserved words in unquoted URIs. Rather than using <dottedIdentifierList>, use:

<dottedUriWords> ::= (<uriWord> '.')* <uriWord> 
<uriWord> ::= <identifier> | <RESERVED_WORD>

It does nothing except allow reserved words too. Grammatically it should be no harder to work with than <dottedIdentifierList>, and it allows users to not worry about whether their file paths contain Dart reserved words, which have no reason to be special in that context.
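The difference between the two productions can be sketched at the string level. This is only an illustration, not the specified grammar: the proposal operates on tokens, the helper name isShorthandPath and its allowReserved flag are hypothetical, and reservedWordsSubset is a small subset of Dart's reserved words chosen from the examples in this thread.

```dart
// Illustrative subset of Dart's reserved words; NOT the full list.
const reservedWordsSubset = {'class', 'do', 'for', 'if', 'is', 'new'};

// Lexically, reserved words and identifiers have the same shape.
final _word = RegExp(r'^[A-Za-z_$][A-Za-z0-9_$]*$');

/// Hypothetical helper: checks a whitespace-free path against
/// <dottedUriWords> ('/' <dottedUriWords>)*.
/// With [allowReserved] set to false it mimics the stricter
/// <dottedIdentifierList> rule instead.
bool isShorthandPath(String path, {bool allowReserved = true}) {
  for (final segment in path.split('/')) {
    for (final word in segment.split('.')) {
      // An empty word means a leading, trailing, or doubled separator.
      if (!_word.hasMatch(word)) return false;
      if (!allowReserved && reservedWordsSubset.contains(word)) return false;
    }
  }
  return true;
}

void main() {
  print(isShorthandPath('mypkg/src/new/library')); // true
  print(isShorthandPath('mypkg/src/new/library', allowReserved: false)); // false
  print(isShorthandPath('foo/bar.')); // false: trailing '.'
}
```

The only behavioral difference between the two modes is the reserved-word check, which matches the claim that the new production does nothing except allow reserved words too.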

Further, if we add more reserved words in the future, we will not introduce compile-time errors if someone used that word as a directory name.

(Why only disallow reserved words, and not built-in identifiers? If it's because there is no reason to disallow built-in identifiers, then there is equally no reason to disallow reserved words.)

We should consider that URIs can be followed by if in conditional imports:

import foo/bar.
  if (condition) foo/qux;

A mistaken trailing ., like the one after bar above, would make the parser include the if in the URI and then complain about the (. That won't happen if we don't allow reserved words. It also won't happen if we disallow internal whitespace (#3983), and it's not significantly different from import foo/bar. hide banana;, which we do allow. I don't think it's a reason to not allow reserved words.

munificent commented 3 months ago

> Further, if we add more reserved words in the future, we will not introduce compile-time errors if someone used that word as a directory name.

I don't think that buys us much because adding reserved words will still break plenty of other things even if they don't break imports.

For what it's worth:

So far, I haven't found a widely used language with unquoted imports that does allow reserved words.

tatumizer commented 3 months ago

My main argument is that there's no language (AFAIK) that disallows quoting the path. All examples of (unquoted) dot-separated names in other languages refer to hierarchical module names, not pathnames. But as soon as the language provides a way of mapping one hierarchy to another, the pathname gets quoted, e.g. #[path = "foo.rs"] in Rust.

The idea of angle brackets was to support quoted paths; it's just that the quote symbols are different (<...> rather than '...'). There's a precedent for this (C). The fact that <a/b/c> means something different from 'a/b/c' won't surprise anybody. But that import a/b/c is different from import 'a/b/c' is quite surprising IMO.

munificent commented 3 months ago

> There's a precedent for this (C). The fact that <a/b/c> means something different from 'a/b/c' won't surprise anybody. But that import a/b/c is different from import 'a/b/c' is quite surprising IMO.

Technically, the angle brackets and quotes aren't part of C at all but are part of the preprocessor.

That matters because the includes and angle brackets are gone before the C lexer ever sees them, so it doesn't have to worry about having different lexical grammar rules inside the angle brackets versus when angle brackets are used for comparison operators.

tatumizer commented 3 months ago

I see. In Dart, import is not a "first-class" reserved word. There can be an identifier named import, and if it happens to be the name of a generic function, import<int>(5) will clash with the package import import <int> (the equivalent of import 'package:int/int.dart'). This conflict can be resolved by taking into account the whitespace after import. The lexer can treat the sequence import, whitespace, < as a trigger for parsing everything between the angle brackets in <a/b/c> as a string, producing the same output as when parsing import 'package:a/b/c.dart'.

munificent commented 2 months ago

We discussed this in the language meeting this morning and in another meeting following that. Based on those discussions, we've decided that, yes, we will allow reserved words as path components in unquoted imports. So this will be allowed and valid:

import if.for/do.class;

Obviously, users should rarely rely on this support. It's certainly better style to avoid path component names that collide with reserved words.

But allowing them means that tools that generate unquoted imports aren't required to know the full set of reserved words and carefully route around them. Also, for better or worse, Dart has a long and somewhat confusing set of reserved words amended by an even longer and more confusing set of "built-in identifiers" and "contextual keywords". Given that, it's actually fairly difficult for a user to know whether a given identifier is fully reserved or not. Is class? (Yes.) What about mixin? (No.) How about is? (Yes.) And as? (No.)

The grammar of unquoted imports/exports is restricted enough that we can allow reserved words there without ambiguity and it avoids users having to worry about accidentally stepping on a reserved word.

I'll write up a PR to update the spec and close this issue when that lands.

bwilkerson commented 2 months ago

There is an implication here for the UX. If a user is in the middle of typing in an import there could now be an ambiguity that wasn't there before. In particular, consider the following (where ^ indicates the cursor position):

import some.^

class C {}

Because the parser is greedy, it will, by default, decide that this should be read as

import some.class;

C {}

That will result in a poor UX in which diagnostics are generated that are not helpful to solving the real issue.

It's likely that this will be rare enough that we'll choose to ignore it, but I wanted to make sure it has been considered.

munificent commented 2 months ago

That's a good point. Even without allowing reserved words, that UX problem exists:

import some.^

SomeType x;

Again, the parser will try to make SomeType part of the import and then report a confusing error on x.

I definitely don't like making it the parser implementer's job to conjure up a good UX here, but I suspect that's going to be necessary no matter what. Do you think allowing reserved words makes this problem noticeably worse?

bwilkerson commented 2 months ago

In a different issue you wrote:

> We've decided that as this issue proposes, we will be restrictive and not allow internal whitespace or comments inside the path part of an unquoted import.

If I'm understanding correctly, that ought to mostly prevent this kind of problem from occurring.
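A rough sketch of why the no-whitespace rule helps, assuming a parser that greedily consumes word and separator tokens after import. The Tok class, scan, and pathAfterImport are hypothetical stand-ins for illustration, not the real Dart scanner or parser.

```dart
// Hypothetical token type: a lexeme plus its source offset.
class Tok {
  final String lexeme;
  final int offset;
  const Tok(this.lexeme, this.offset);
  int get end => offset + lexeme.length;
}

// Splits source into word tokens and single-character punctuation,
// skipping whitespace. A stand-in for a real scanner.
List<Tok> scan(String source) => [
      for (final m
          in RegExp(r'[A-Za-z_$][A-Za-z0-9_$]*|\S').allMatches(source))
        Tok(m[0]!, m.start),
    ];

/// Greedily collects the path tokens that follow `import`.
/// When [allowGaps] is false, a token is consumed only if it starts
/// exactly where the previous path token ended.
String pathAfterImport(String source, {required bool allowGaps}) {
  final toks = scan(source);
  final wordStart = RegExp(r'^[A-Za-z_$]');
  final path = <Tok>[];
  for (var i = 1; i < toks.length; i++) {
    // toks[0] is assumed to be `import`.
    final t = toks[i];
    // Words and separators must alternate, starting with a word.
    final expectWord = path.length.isEven;
    final fits = expectWord
        ? wordStart.hasMatch(t.lexeme)
        : (t.lexeme == '.' || t.lexeme == '/');
    final adjacent = path.isEmpty || t.offset == path.last.end;
    if (!fits || !(allowGaps || adjacent)) break;
    path.add(t);
  }
  return path.map((t) => t.lexeme).join();
}

void main() {
  const source = 'import some.\n\nclass C {}';
  print(pathAfterImport(source, allowGaps: true)); // some.class
  print(pathAfterImport(source, allowGaps: false)); // some.
}
```

With gaps disallowed, the path ends at the dangling . so a diagnostic can point at the incomplete import rather than at the class declaration that follows it.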