should we use whitespace-sensitive operator fixity?

zygoloid commented 3 years ago

For full details, see #168, and in particular this section.

It would be useful to be able to use * as all of:

a prefix operator (for dereference),
an infix operator (for multiplication), and
a postfix operator (for forming pointer types)

... but there are problems with the same operator being both the second and third kind. For example, if we also allow (say) + as both an infix operator (for addition / type-type composition) and a prefix operator (like C++'s unary operator +), then the expression a * + b is ambiguous: it could be either (a *) + b, or a * (+ b).

There are a few ways to handle this, as detailed in #168. In this issue, I'd like to determine whether we're happy with Swift's answer to this: the fixity of an operator depends on its surrounding whitespace. That is:

a* + b is (a*) + b
a * +b is a * (+b)
Any other whitespace positioning (eg a * + b or a* +b) is an error

More generally:

There can be no whitespace between a prefix or postfix symbolic operator and its operand.
There must be whitespace surrounding every infix operator.
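
A minimal sketch of the two rules above in Python, assuming single-character operators and treating the start and end of the input as whitespace. The function name `fixity` and the "undetermined" result are hypothetical. The rules as stated do not settle cases such as an operator written directly after an opening bracket, so this sketch leaves an operator with no whitespace on either side unclassified.

```python
def fixity(source: str, index: int) -> str:
    """Classify the one-character operator at source[index] by nearby whitespace."""
    # Assumption: the edges of the input count as whitespace.
    space_before = index == 0 or source[index - 1].isspace()
    space_after = index + 1 == len(source) or source[index + 1].isspace()
    if space_before and space_after:
        return "infix"    # whitespace on both sides
    if space_before:
        return "prefix"   # attached to the operand that follows
    if space_after:
        return "postfix"  # attached to the operand that precedes
    return "undetermined"  # no whitespace at all; not decided by this sketch


print(fixity("a* + b", 1), fixity("a* + b", 3))
print(fixity("a * +b", 2), fixity("a * +b", 4))
```

On the two accepted spellings from this comment, the `*` in `a* + b` comes out as postfix and the `+` as infix, while in `a * +b` the `*` is infix and the `+` is prefix.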

We would treat a.foo and a->foo and a[i] and a(args) as postfix. For non-symbolic unary operators (eg not), we can't avoid whitespace in general, but I don't think we anticipate having the same non-symbolic operator with multiple fixities, so that seems unproblematic.

Are those rules acceptable?

josh11b commented 3 years ago

Making whitespace significant around operators seems like a big cost to me:

Presence or absence of whitespace is subtle.
It is a divergence from C/C++.
There are edge cases around using multiple unary operators (++*p, **p, etc.).

Since I view these rules as a cost, I have to ask "what is the benefit and is the benefit worth the cost?" And to answer that I feel like I need to know what the alternatives are.

Is this only for *? In general it seems like we are pushing towards single meanings for operator symbols and keywords, so wanting to support multiple meanings for * is mostly to be consistent with C/C++. I think this means that we are not likely to need to disambiguate other symbols. So I think we should be asking a different question. Instead of asking "should we make whitespace around operators significant so we can distinguish binary from postfix uses (of *)?", I think we should be asking "given that we want to use binary * for multiplication, how should we designate pointer types and dereference?" I think there are a lot of options to consider there:

Use prefix * for dereference, postfix * for forming pointer types, and use whitespace to disambiguate whether operators are prefix, postfix, or infix.
Use prefix * for dereference, postfix * for forming pointer types, and use precedence rules to disambiguate.
Use prefix * for dereference, postfix * for forming pointer types, but assume we are not going to have any binary operators on pointer types, and use that to disambiguate (not sure this works, but maybe?).
Use prefix * for dereference, prefix & for forming pointer types (analogy to "type of the tuple is the tuple of the types" here is "type of a pointer is a pointer to the type").
Use prefix * for dereference, Ptr(T) for forming pointer types.
Use ^ for some pointer operations, stop using it for binary xor.
(Probably a lot of other choices.)

I think phrasing the question this way makes it much clearer what the costs and alternatives are.

geoffromer commented 3 years ago

Another drawback of making whitespace significant in this way: it seems like it would make Carbon substantially more difficult to parse using tools like lex, yacc, and their derivatives. The best option I've come up with so far is for the lexer to have separate tokens for "+ followed by whitespace", "* not preceded by whitespace", etc., but that would substantially complicate the grammar, and I'm not 100% sure it's feasible with a lowest-common-denominator scanner generator, since it requires some form of lookahead and lookbehind (both positive and negative).
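
To make the "separate tokens per whitespace context" idea concrete, here is a rough sketch using Python's `re` module rather than lex. The token names (`infix`, `prefix`, `postfix`, `tight`) and the tiny grammar (identifiers plus `*` and `+`) are assumptions for illustration only. It relies on negative and positive lookbehind and lookahead, which is exactly the feature a lowest-common-denominator scanner generator may lack.

```python
import re

# Each operator alternative encodes its whitespace context with lookarounds.
# (?<!\S) is "not preceded by a non-space", so it also holds at start of input.
TOKEN = re.compile(r"""
    (?P<ident>[A-Za-z_]\w*)
  | (?P<infix>(?<!\S)[*+](?!\S))     # whitespace (or input edge) on both sides
  | (?P<prefix>(?<!\S)[*+](?=\S))    # whitespace before only
  | (?P<postfix>(?<=\S)[*+](?!\S))   # whitespace after only
  | (?P<tight>(?<=\S)[*+](?=\S))     # whitespace on neither side
  | (?P<ws>\s+)
""", re.VERBOSE)


def tokenize(text):
    """Return (kind, spelling) pairs, dropping whitespace tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if match.lastgroup != "ws":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("a* + b"))
print(tokenize("a * +b"))
```

A grammar built on top of this would then refer to the `infix`, `prefix` and `postfix` token kinds instead of a single `*` or `+` token.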

There may also be accessibility issues; for example, will screen readers reliably surface the difference between a* + b and a * +b?

zygoloid commented 3 years ago

I think we should be asking "given that we want to use binary * for multiplication, how should we designate pointer types and dereference?"

I've moved that question out into #523. I'd like to keep this issue focused on the specific approach of using whitespace for this purpose, though I expect the arguments here to drive #523 and vice versa, because the * operator is the only currently known case where we might want an operator to have both infix and postfix forms.

chandlerc commented 3 years ago

I really like the approach used by Swift here: it is clear and principled, and it gives us a strong evolutionary path.

While we don't currently plan on user-defined operators, this leaves that door open in the future. It also aligns really well with the goals for ensuring language evolution by providing a large space to evolve into with operators in the future, even if they aren't user-defined.

Whenever we choose an operator here, we can still be very cautious on a case-by-case basis that any collision with other uses of the same symbols doesn't become confusing. I really do agree that this is something with high cost and to be generally avoided. Maybe we want to deal with the different meanings of * because of long legacy from C/C++ here, but it is far from a clear trade-off and I'm glad to have a specific issue called out for it.

All this said, I find the point raised by @geoffromer really concerning. Making Carbon intractably hard to parse with tools like yacc/bison seems like a pretty unfortunate consequence. I'd like to understand if there is any reasonable way to address this with these kinds of tools... Because if not, I think that's a high cost.

FWIW, I also want to address the meta-level cost raised by @josh11b around giving whitespace this level of significance. I originally was reluctant to follow Swift's lead here for exactly these reasons. However, I changed my mind because this is specifically the presence of whitespace being significant, and not the amount of whitespace. To me at least, that seems like a really important difference. Literals, identifiers, keywords, and other syntactic elements are defined by the presence of whitespace as well. And there is a language that has experimented with these rules and I've not heard from anyone that these end up being a source of user confusion with Swift in practice. So while I originally shared the high level concern of leaning on whitespace in this way, these factors have largely convinced me that this would be fine for humans.

Regarding divergence from C/C++, changes to encode widespread practice and gain greater evolutionary freedom seem reasonable to me. Especially early on, I would expect us to have a strong ability to diagnose common mistakes here.

Regarding edge cases around repeated unary operators -- I feel like this is a somewhat orthogonal issue to choosing fixity of symbols based on whitespace. Either approach provides similar challenges around **p and ++*p. We'll have to choose how to tokenize these symbol sequences. We can either allow those examples or not. We'll have to make similar choices for even more challenging cases like ---x. The whitespace fixity question may force the use of parentheses to disambiguate rather than whitespace (--(-x) vs. -- -x), but at least for me, I would prefer the parentheses where necessary (potentially minimizing how often they are necessary) rather than whitespace regardless of the outcome of this question.
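
One possible tokenization choice is the longest-match ("maximal munch") rule that C and C++ lexers use. The sketch below is only an illustration of that option, not a decision from this thread; the operator set and the name `split_symbols` are hypothetical.

```python
# Hypothetical operator set, listed longest first so the longest match wins.
OPERATORS = ["++", "--", "+", "-", "*"]


def split_symbols(run: str) -> list[str]:
    """Split a run of operator characters using longest-match tokenization."""
    tokens = []
    i = 0
    while i < len(run):
        for op in OPERATORS:
            if run.startswith(op, i):
                tokens.append(op)
                i += len(op)
                break
        else:
            raise ValueError(f"no operator matches at {run[i:]!r}")
    return tokens


print(split_symbols("---"))  # the symbols in ---x
print(split_symbols("++*"))  # the symbols in ++*p
print(split_symbols("**"))   # the symbols in **p
```

Under this choice, `---x` is read as `--` followed by `-`, so getting the other grouping requires whitespace or parentheses, which is the trade-off described above.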

My several cents here...

geoffromer commented 3 years ago

All this said, I find the point raised by @geoffromer really concerning. Making Carbon intractably hard to parse with tools like yacc/bison seems like a pretty unfortunate consequence. I'd like to understand if there is any reasonable way to address this with these kinds of tools... Because if not, I think that's a high cost.

From the limited digging around I've done, my guess is that it's closer to "ugly and annoying" than "intractable". I'd recommend that we treat this concern as non-blocking for now, but ask that any concrete proposals along these lines include a prototype implementation in executable-semantics, at least until we build confidence that it's manageable.

zygoloid commented 3 years ago

I'd [...] ask that any concrete proposals along these lines include a prototype implementation in executable-semantics, at least until we build confidence that it's manageable.

I think that's a reasonable request. I've put together an example change showing how this can be done for a * operator.

chandlerc commented 3 years ago

There was a bunch of discussion of this (in the context of #523 where we want postfix * for pointer type and prefix * for dereference).

The open discussion minutes have some more details, but the suggested initial rules are:

binary operators must have space on both sides
prefix operators must not have a space after
postfix operators must not have a space before
unary operators might not have any whitespace on either side

We talked a bunch about whether we can recover well and correct common errors here, and there don't seem to be big problems there for catching the common mistakes. Having some real world experience will also be good for recovery.

We may eventually discover enough pain points and need to move toward a set of rules closer to what Swift uses so that we accept more different formatting patterns. But it seems reasonable to wait for those pain points to emerge before we adopt the more complex rules.

This also seems to match what Richard has prototyped w/ Flex and Bison.

Last but not least, the goal is still to be very cautious in the use of this flexibility. It looks useful for * because it is heavily motivated by a desire to match C++ syntax for familiarity. But we shouldn't leverage this without good reasons and a clear understanding of any human confusion that might be caused.

So, what do folks think? This a reasonable place to start?

zygoloid commented 3 years ago

Summary of some discussion from open discussion sessions follows.

The proposal described in the previous comment received push-back in two directions:

Will this create accessibility problems, for example with tools like screen readers, which might not distinguish between the presence or absence of whitespace when translating the code to another medium?
Will this create readability problems, for example where an expression such as 5*x*x + 3*x + 2 is more readable when written without spaces around the * operator than when written with the spaces?

However, the desire to use * for both multiplication and the formation of pointer types is sufficiently strong that we wanted to keep exploring this direction and see if those concerns can be addressed.

For the accessibility concern, we observe that the rule we are considering, for the specific case of pointer types and multiplication, will typically be resolving only the failure of the grammar to be LR(1) (or indeed LR(k) for any k), not an actual ambiguity, and to a human we expect the parse to typically be obvious without whitespace cues. In particular, we expect this to be the case because we don't expect types to appear as arbitrary subexpressions much, and instead to mostly appear in the constrained domain of an argument to a function call (where the type will always be followed by ) or ,), a function return type (where the type will always be followed by { or ;), a variable's type (followed by = or ;), the pointee type of a pointer (followed by another * to which the same constraints apply), or similar. Given that observation, it was not clear that this would be more severe than existing cases such as n * *p where the multiple meanings of * can be resolved by there being only one possible / plausible parse. However, this is certainly not our area of expertise, and we would welcome more feedback on the potential for accessibility issues here.

For the readability concern, we agreed that this was a real concern, and noted that this is in fact a pre-existing problem with automatic formatters for other languages, often handled by turning off the automatic formatter for the code in question. That outcome seems far from ideal. We further noted that in the motivating cases, the characters / tokens immediately adjacent to the operator directly indicate the intended interpretation: ...x*y... is clearly multiplication, but ...x*[... is clearly not (assuming that expressions can't start with [ -- but this might be an array type, depending on array syntax).

We suggest revising the rule as follows:

An operator token is interpreted as infix if:
1. it has whitespace on both sides, or
2. it has whitespace on neither side, and the preceding token is any closing bracket (), }, ]) or an identifier or literal, and the following token is ( or an identifier or literal (but not { or [).
An operator token is interpreted as prefix / postfix otherwise, and there shall not be whitespace between the operator and the operand.
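
The revised rule can be sketched as a small predicate in Python. The function names and the way tokens are recognised are assumptions: identifiers and literals are taken to start with a letter, digit, underscore or quote, which is a simplification and not a statement about Carbon's actual lexer.

```python
CLOSING_BRACKETS = {")", "}", "]"}


def is_identifier_or_literal(token: str) -> bool:
    # Assumption: identifiers and literals start with a letter, digit, "_" or a quote.
    return bool(token) and (token[0].isalnum() or token[0] in "_\"'")


def is_infix(prev: str, space_before: bool, space_after: bool, following: str) -> bool:
    """Apply the two infix conditions; anything else is prefix / postfix."""
    if space_before and space_after:
        return True  # condition 1: whitespace on both sides
    if not space_before and not space_after:
        # condition 2: whitespace on neither side, operand-like neighbours
        ends_operand = prev in CLOSING_BRACKETS or is_identifier_or_literal(prev)
        starts_operand = following == "(" or is_identifier_or_literal(following)
        return ends_operand and starts_operand
    return False  # whitespace on exactly one side


print(is_infix("x", False, False, "y"))  # x*y
print(is_infix("x", False, False, "["))  # x*[
print(is_infix("a", False, True, "+"))   # the * in a* + b
```

With these assumptions, `x*y` and `5*x` are treated as multiplication, while the `*` in `x*[` and in `a* + b` falls through to the prefix / postfix case.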

This is somewhat closer to the Swift rule than we were previously, but still rejects cases that the Swift rule might accept. Note that the "shall not be whitespace" rule for unary operators is not essential, but we would like to try the more-constraining rule first and only consider relaxing it if we discover it to be a source of friction.

Some additional considerations (not discussed in the open discussion session):

This functionality should still be used sparingly. In particular, I'd suggest we avoid use of this approach except in situations where we expect there to either be no genuine ambiguities or for ambiguities to be vanishingly rare.
The formatting tool could, and perhaps should, respect the programmer's choice to include or exclude whitespace around binary operators. This would remove one reason to disable the formatting tool for a region of code, but would also mean that in the common case, explicit spaces would need to be added when typing code prior to running the tool. This will likely need some consideration when the tool is built.
We could, and perhaps should, reject code where the presence or absence of whitespace is inconsistent with the operator precedence rules. For example, we might reject 5+x * 6+y on this basis.
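
As a rough illustration of the last point, the check below flags an expression when a spaced infix operator binds tighter than a neighbouring unspaced one. This is only one possible reading of "inconsistent with the operator precedence rules"; the precedence table, the function name, and the decision to accept operators of equal precedence are all assumptions.

```python
# Assumed precedence table; a higher number binds tighter.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def spacing_matches_precedence(ops):
    """ops lists the infix operators of one flat expression, left to right,
    as (operator, has_surrounding_whitespace) pairs."""
    for (op_a, spaced_a), (op_b, spaced_b) in zip(ops, ops[1:]):
        if spaced_a == spaced_b:
            continue  # same spacing style gives no visual grouping hint
        spaced_op, tight_op = (op_a, op_b) if spaced_a else (op_b, op_a)
        # The tightly written operator should bind at least as tightly.
        if PRECEDENCE[spaced_op] > PRECEDENCE[tight_op]:
            return False
    return True


# 5+x * 6+y
print(spacing_matches_precedence([("+", False), ("*", True), ("+", False)]))
# 5*x*x + 3*x + 2
print(spacing_matches_precedence(
    [("*", False), ("*", False), ("+", True), ("*", False), ("+", True)]))
```

Under these assumptions the first expression is rejected and the second is accepted.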

What do people think of this revised approach?

chandlerc commented 3 years ago

BTW, I checked with folks in the C++ #include group (they have a dedicated accessibility forum) to understand how much constructs like a * *b cause problems in practice. Got a lot of great info, although I don't know that it changes much. My summary follows:

Lots of existing ways to handle this. They're not perfect, but also not a large or even medium problem.

Screen readers can often be configured to be more verbose to help w/ stuff like this, some users even have that bound to a key to re-read lines that weren't clear.
Other users may have a braille display where this isn't any different from visually reading the code. But not everyone has one (quite expensive) so not something to rely on.

Overall, it doesn't seem to be a pressing problem in need of solving. But (similarly to the visual and parsing side) it also isn't something we would want happening all over the place. So the direction of trying to minimize and/or avoid code having patterns where this might be confusing is basically the right direction. Seems unlikely that we need to stress about the edge cases here given the tools available, provided they really are edge cases.

chandlerc commented 3 years ago

Closing this with the decision in @zygoloid's recent comment: https://github.com/carbon-language/carbon-lang/issues/520#issuecomment-852574230
