Concrete syntax

evincarofautumn commented 4 years ago

I’d like to use Hap as an opportunity to explore some fun new syntax ideas, while being careful not to blow the weirdness budget entirely, deferring to conventional dynamic imperative languages like JavaScript when there’s no good reason to break familiarity. Some ideas:

Clearly differentiate initialisation, reassignment/update, reactive binding, and equality
Expression-oriented syntax (e.g. if can be used as a statement or expression, on and async return handler IDs, &c.)
Operator precedence grammar (e.g. if X Y and X else Y as operators)
Multi-word identifiers (e.g. left arrow key) disambiguated by syntax/keywords/stropping
Iteration operators to subsume common looping patterns
Range operators
Event- and game-specific features (e.g. continuous operations and time expressions like every (1 second) { … })
Anaphora and type-based references (it, the Image)

evincarofautumn commented 4 years ago

Sketching out some thoughts…

Clearly differentiate initialisation, reassignment/update, reactive binding, and equality

= has a strong precedent for definition, initialisation, and assignment as in var name = initialiser; or variable = expression;
I would prefer to use a single = to denote equality comparison
== has a strong precedent for equality
Confusion between = for assignment and == for equality can lead to bugs if they’re permitted in the same context
=/== confusion can be mitigated by making = for assignment a statement or, if it is an expression, having it return a unit value that cannot be mistaken for a conditional/Boolean
Other languages generally lack a notion of reactive binding, so we’re free to use any evocative symbol; I like := for this
A leftward arrow <- is evocative of updating (put the rhs into the lhs), but it conflicts with the desire to include unary range operators: <-5 could be interpreted as either (<- (5)) or (< (- (5)))
Restricting to ASCII, leftward arrows and range operators could be differentiated by whitespace or brackets, and it’s easy to provide good error messages: “The expression <-5 looks like an assignment expression missing its left operand. If you wanted a range instead, use < -5 or <(-5).”
A rightward arrow -> or => is also suitable for assignment (the lhs “goes to” or “becomes” the value of the rhs)
We could use keywords instead, either as operators (e.g. is for initialisation, gets, becomes, or is now for reassignment; means or is always for reactive binding; and is, equals, or eq for equality) or statement-like forms (let name be expression;, set lvalue to expression;, def name as expression;)
There is a conflict between descriptive, verbose keywords for these primitive operations and allowing multi-word identifiers
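
As a rough illustration of the error message suggested above for <-5, here is a minimal Python sketch. The function name diagnose_leading_arrow and the regular expression are assumptions made for this example; they are not part of Hap, and a real parser would raise this from its grammar rather than from a string match.

```python
import re
from typing import Optional

# Hypothetical helper: matches an arrow with no left operand, e.g. "<-5".
LEADING_ARROW = re.compile(r"\s*<-\s*(\S+)\s*$")

def diagnose_leading_arrow(source: str) -> Optional[str]:
    """Return a hint if `source` is `<-` followed directly by an operand."""
    match = LEADING_ARROW.match(source)
    if match is None:
        return None  # e.g. "< -5", "<(-5)", or "x <- 5": nothing to report
    operand = match.group(1)
    return (
        f"The expression {source.strip()} looks like an assignment expression "
        f"missing its left operand. If you wanted a range instead, "
        f"use < -{operand} or <(-{operand})."
    )

print(diagnose_leading_arrow("<-5"))
print(diagnose_leading_arrow("< -5"))  # None
```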

Expression-oriented syntax

This is fairly easy to do with an operator precedence parser, but introduces considerably more flexibility in combining forms that must be accounted for; for instance, if else can be used as an operator separately from if, then it must have a consistent meaning in other contexts.

This is an attractive approach, though, because it makes it easy to add new, user-defined syntactic forms in a consistent and predictable way, and makes parsing simple and regular. It also requires accounting for all combinations of the semantics of things that can be combined syntactically.

I think statement forms should be able to indicate some form of failure, perhaps by returning an error value rather than raising an exception, and then else can check the success or failure of its left operand and evaluate its right operand if the left failed. Conditional operators like if succeed if they evaluate their body. Loops like while succeed if they evaluate their body at least once, so the pattern while (A) { B } else { C } executes C if A was initially false and thus B was never evaluated, and similarly for for each (A) { B } else { C }. Asynchronous loops like for all succeed as long as there are elements in their container operand, and fail when that operand becomes empty, enabling e.g.:

for all (block : blocks) {
  when (overlapping(block, any(goal))) {
    remove (block) from (blocks);
  }
} else {
  // The player has beaten the level when all the blocks are in goals.
  win();
}

An undefined value or user-specified failure allows the use of else to select alternative or default values, e.g. input source = (controller) else (keyboard);.
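
To make the proposed else semantics concrete, here is a small Python model. Everything in it (the FAILED marker, hap_while, or_else) is a hypothetical name invented for this sketch, and operands are passed as thunks so that the right operand is evaluated only when the left one fails. Note that this differs from Python's own while/else, whose else branch also runs when the body did execute and the loop simply ended without break.

```python
class Failed:
    """Hypothetical marker value for a form that did not succeed."""
    def __repr__(self):
        return "FAILED"

FAILED = Failed()

def hap_while(condition, body):
    """Run body while condition() holds; fail if the body never ran."""
    iterations = 0
    while condition():
        body()
        iterations += 1
    return iterations if iterations > 0 else FAILED

def or_else(left, right):
    """Model `left else right`: evaluate right only if left failed."""
    result = left()
    return right() if result is FAILED else result

# while (A) { B } else { C } with A initially false: only C runs.
log = []
or_else(lambda: hap_while(lambda: False, lambda: log.append("B")),
        lambda: log.append("C"))
print(log)  # ['C']

# input source = (controller) else (keyboard); with no controller available.
print(or_else(lambda: FAILED, lambda: "keyboard"))  # keyboard
```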

Multi-word identifiers

The idea is to interpret a series of adjacent name parts as a single name. Just like many languages use a pattern like /[A-Za-z_][0-9A-Za-z_]*/ for identifiers, requiring that names begin with an alphabetic character or underscore but allowing them to contain digits, name parts might have different constraints in the head or tail of a multi-word name, such as allowing digit-only name parts after the first, e.g. player 1.
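
A minimal Python sketch of that coalescing rule, ignoring keywords entirely: adjacent word parts merge into one name, and a digit-only part may continue a name but not start one. The token kinds and the tokenize function are assumptions for illustration, and the pattern is the ASCII one quoted above, so accented names like séance are not handled.

```python
import re

# One lexeme: a word part, a run of digits, or any other single symbol.
LEXEME = re.compile(
    r"\s*(?:(?P<word>[A-Za-z_][0-9A-Za-z_]*)|(?P<digits>[0-9]+)|(?P<symbol>\S))"
)

def tokenize(source):
    """Split source into (kind, text) tokens, merging adjacent name parts."""
    source = source.rstrip()
    tokens = []
    position = 0
    while position < len(source):
        match = LEXEME.match(source, position)
        position = match.end()
        kind = match.lastgroup
        text = match.group(kind)
        continues_name = bool(tokens) and tokens[-1][0] == "name"
        if continues_name and kind in ("word", "digits"):
            # Tail parts, including digit-only ones, extend the current name.
            tokens[-1] = ("name", tokens[-1][1] + " " + text)
        elif kind == "word":
            tokens.append(("name", text))
        elif kind == "digits":
            tokens.append(("number", text))
        else:
            tokens.append(("symbol", text))
    return tokens

print(tokenize("player 1 + player 2"))
# [('name', 'player 1'), ('symbol', '+'), ('name', 'player 2')]
```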

This requires some disambiguation with keywords to prevent them from coalescing. A simple approach is to disallow names from containing or beginning with a keyword, but since keywords tend to be common, short words, this may be too limiting; it would disallow names like all levels or switch on (if all and on are keywords). Keywords could be allowed in identifiers and separated using symbolic syntax, e.g. case of supplies is a name but case (of supplies) is a keyword followed by a bracketed name (if case is a keyword).

A uniform way to deal with this is stropping, marking either keywords or variable names explicitly. Most languages that allow keywords to be used as variable names do so by marking the variables and leaving keywords unmarked, as in C# class (keyword) vs. @class (identifier) or F# let (keyword) vs. ``let`` (identifier). The problem with this is that existing code can still break when new keywords are added. Stropping the keywords instead avoids that problem, but unfortunately there’s a strong precedent against it, and it leads to visual clutter, e.g.:

\for each (ghost \in ghosts) {
  \let e = ghost.ectoplasm level;
  \if e < séance.power {
    cross over(ghost);
  } \else {
    ++ghost.anger;
  }
}

So another option is to lexically distinguish keywords from names so they can’t collide at all, such as by capitalising one or the other:

For Each (ghost In ghosts) {
  Let e = ghost.ectoplasm level;
  If e < séance.power {
    cross over(ghost);
  } Else {
    ++ghost.anger;
  }
}

for each (Ghost in Ghosts) {
  let E = Ghost.Ectoplasm Level;
  if E < Séance.Power {
    Cross Over(Ghost);
  } else {
    ++Ghost.Anger;
  }
}

This introduces difficulty for beginners, though, who already struggle with case-sensitivity.

It’s also valid to strop all identifiers, which solves the problem of adding new keywords and allows using more keywords rather than symbols, but also introduces considerable noise:

for each [ghost] in [ghosts] {
  let [e] = [ghost].[ectoplasm level];
  if [e] < [séance].[power] {
    [cross over]([ghost]);
  } else {
    ++[ghost].[anger];
  }
}

for each [ghost] in [ghosts] do
  let [e] be [ectoplasm level] of [ghost];
  if [e] is less than [power] of [séance] then
    [cross over] ( [ghost] );
  else
    increment [anger] of [ghost];
  end if;
end for

Iteration operators

For parallel iteration, some expression e containing subexpressions of the form each e, e₀[each e₁, …, each e_n], is equivalent to zip with (λx₁. … λx_n. e₀[x₁/each e₁, …, x_n/each e_n]) e₁ … e_n, that is, all of the containers are zipped together with the expression, so each [1, 2, 3] + each [4, 5, 6] = [5, 7, 9].
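
In Python terms, the desugaring of each can be sketched as a zip followed by a map. The helper name each and the explicit function argument are assumptions of this sketch; in Hap the function would be derived from the surrounding expression.

```python
from operator import add

def each(function, *containers):
    """Apply function to the containers' elements position by position."""
    return [function(*items) for items in zip(*containers)]

# each [1, 2, 3] + each [4, 5, 6]
print(each(add, [1, 2, 3], [4, 5, 6]))  # [5, 7, 9]
```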

For nested iteration, e₀[every e₁, …, every e_n], is equivalent to flat map (λx₁. … flat map (λx_n. e₀[x₁/every e₁, …, x_n/every e_n]) e_n …) e₁, so every [1, 2, 3] * every [5, 7, 11] = [5, 7, 11, 10, 14, 22, 15, 21, 33].
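
The nested form can be sketched the same way, with the Cartesian product standing in for the chain of flat maps. As before, the helper name every and the explicit function argument are assumptions for illustration.

```python
from itertools import product
from operator import mul

def every(function, *containers):
    """Apply function to all combinations; the first container varies slowest."""
    return [function(*items) for items in product(*containers)]

# every [1, 2, 3] * every [5, 7, 11]
print(every(mul, [1, 2, 3], [5, 7, 11]))
# [5, 7, 11, 10, 14, 22, 15, 21, 33]
```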

In the simple case of a single filter parameter, e₀[which e₁] = filter (λx₁. e₀[x₁/which e₁]) e₁, so which [5, 10, 15] <= 10 = [5, 10]. When multiple parameters are involved, they are combined as if by every with tupling, and the condition is tested on each tuple: e₀[which e₁, …, which e_n] = filter (λ (x₁, …, x_n). e₀[x₁/which e₁, …, x_n/which e_n]) (e₁ × … × e_n), where × is the Cartesian product, so which [5, 10] < which [10, 20] = [(5, 10), (5, 20), (10, 20)], that is, all combinations of values from each container such that the condition is true: W(P, e₁, …, e_n) = { (x₁, …, x_n) | x₁ ∈ e₁, …, x_n ∈ e_n, P(x₁, …, x_n) }.
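
A Python sketch of which that reproduces both worked examples: a single container is filtered directly, while several containers are combined as all combinations (as with every) and filtered as tuples. The helper name and the explicit predicate argument are assumptions of this sketch.

```python
from itertools import product

def which(predicate, *containers):
    """Filter one container, or the tuples of all combinations of several."""
    if len(containers) == 1:
        return [x for x in containers[0] if predicate(x)]
    return [items for items in product(*containers) if predicate(*items)]

# which [5, 10, 15] <= 10
print(which(lambda x: x <= 10, [5, 10, 15]))  # [5, 10]

# which [5, 10] < which [10, 20]
print(which(lambda x, y: x < y, [5, 10], [10, 20]))
# [(5, 10), (5, 20), (10, 20)]
```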

all, some, none, and how many operate like which, filtering the Cartesian product of their container operands, except that they summarise the number of tuples for which the condition held: all, some, and none return Booleans, and how many returns a count.

all returns whether which with the opposite condition would return empty, or equivalently, whether how many with the opposite condition would return zero: ∀x. P(x), ¬∃x. ¬P(x), or |W(¬P, ê)| = 0
some returns whether which would not return empty, or whether how many would return nonzero: ∃x. P(x), ¬∀x. ¬P(x), or |W(P, ê)| > 0
none returns whether which would return empty, or whether how many would return zero: ¬∃x. P(x), ∀x. ¬P(x), or |W(P, ê)| = 0
how many returns the size of the result of which: |W(P, ê)|
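
These definitions translate almost directly into Python. In this sketch, matching_tuples plays the role of W(P, ê), and the hap_ prefixes only avoid shadowing Python's built-in all; all names are assumptions for illustration.

```python
from itertools import product

def matching_tuples(predicate, *containers):
    """W(P, e1, ..., en): tuples of the Cartesian product that satisfy P."""
    return [items for items in product(*containers) if predicate(*items)]

def how_many(predicate, *containers):
    return len(matching_tuples(predicate, *containers))

def hap_all(predicate, *containers):
    # |W(not P, e)| = 0; vacuously true for empty containers.
    return how_many(lambda *xs: not predicate(*xs), *containers) == 0

def hap_some(predicate, *containers):
    return how_many(predicate, *containers) > 0  # |W(P, e)| > 0

def hap_none(predicate, *containers):
    return how_many(predicate, *containers) == 0  # |W(P, e)| = 0

print(how_many(lambda x, y: x < y, [5, 10], [10, 20]))  # 3
print(hap_all(lambda x, y: x < y, [5, 10], [10, 20]))   # False: (10, 10) fails
```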

These could have several derived forms based on other English determiners/quantifiers, but these seem less generally useful:

one returns whether how many would return exactly 1
multiple returns whether how many would return more than 1
proportion returns the result of how many divided by the product of the sizes of the inputs
most returns whether proportion exceeds 1/2

where, on indexed containers, performs a selection, returning the set of keys for which a condition is true, rather than the values, so if xs = [1, 2, 3, 4], then where(xs) < 3 = {0, 1} because xs[0] < 3 and xs[1] < 3, and if m = { a: 1, b: 2, c: 3 }, then where(m) mod 2 <> 0 = { "a", "c" } because m.a mod 2 <> 0 and m.c mod 2 <> 0.
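
A possible Python reading of where, treating lists as indexed by position and dicts as indexed by key. The function signature, with the predicate passed explicitly, is an assumption of this sketch.

```python
def where(container, predicate):
    """Return the set of keys (or indices) whose values satisfy predicate."""
    if isinstance(container, dict):
        pairs = container.items()
    else:
        pairs = enumerate(container)
    return {key for key, value in pairs if predicate(value)}

xs = [1, 2, 3, 4]
print(where(xs, lambda x: x < 3) == {0, 1})  # True

m = {"a": 1, "b": 2, "c": 3}
print(where(m, lambda x: x % 2 != 0) == {"a", "c"})  # True
```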

To confine the scope of the iteration to a subexpression rather than a whole expression, it may be necessary to introduce some form of scoping, but I think it’s preferable to keep these expressions simple and prefer factoring out separate expressions rather than using complex nesting.

Range operators

Unary relational operators such as <x, =x, and >=x return ranges that allow union, intersection, testing for membership, testing for emptiness, and use in case branches.
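
One way to picture such range values is as intervals over the reals with open or closed ends. The Python sketch below covers membership, intersection, and emptiness under that assumption; union is left out because the union of two intervals is not always a single interval. The class and constructor names (Range, less_than, at_least, equal_to) are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Range:
    low: float = -math.inf
    high: float = math.inf
    low_closed: bool = False
    high_closed: bool = False

    def __contains__(self, value):
        above = value > self.low or (self.low_closed and value == self.low)
        below = value < self.high or (self.high_closed and value == self.high)
        return above and below

    def __and__(self, other):
        # Tighter lower bound: larger value, and open beats closed on a tie.
        low, low_open = max((self.low, not self.low_closed),
                            (other.low, not other.low_closed))
        # Tighter upper bound: smaller value, and open beats closed on a tie.
        high, high_closed = min((self.high, self.high_closed),
                                (other.high, other.high_closed))
        return Range(low, high, not low_open, high_closed)

    def is_empty(self):
        if self.low < self.high:
            return False
        # A single point is non-empty only if both ends include it.
        return not (self.low == self.high and self.low_closed and self.high_closed)

def less_than(x):   # <x
    return Range(high=x)

def at_least(x):    # >=x
    return Range(low=x, low_closed=True)

def equal_to(x):    # =x
    return Range(x, x, True, True)

valid = at_least(0) & less_than(10)
print(0 in valid, 10 in valid)                   # True False
print((less_than(0) & at_least(0)).is_empty())   # True
```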

Continuous operations and time expressions

Numbers can be equipped with time units, and used in operations that run continuously or at intervals, such as after (1 second) denoting a Boolean that becomes true when 1 second has elapsed after the evaluation of the expression, or every (1 second) for a repeating timer (although this collides with every for iteration).

Likewise, events could be related in time: when (1 second after x = 0) { f(); } is equivalent to something like when (x = 0) { wait(1 second); f(); }.
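
To illustrate the intended behaviour of after and the timer sense of every, here is a Python sketch driven by a simulated clock rather than real time. The Clock class, the polling style, and the name every_interval (chosen to sidestep the collision with every for iteration) are all assumptions; an actual runtime would presumably schedule these as events.

```python
class Clock:
    """Simulated clock, advanced manually, with time in seconds."""
    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds

def after(clock, seconds):
    """Return a condition that becomes true once `seconds` have elapsed."""
    start = clock.now
    return lambda: clock.now - start >= seconds

def every_interval(clock, seconds):
    """Return a poll function giving how many times the timer fired since the last poll."""
    start = clock.now
    fired = [0]
    def poll():
        due = int((clock.now - start) // seconds)
        newly_fired = due - fired[0]
        fired[0] = due
        return newly_fired
    return poll

clock = Clock()
one_second_passed = after(clock, 1.0)
tick = every_interval(clock, 1.0)

clock.advance(0.5)
print(one_second_passed(), tick())  # False 0
clock.advance(0.5)
print(one_second_passed(), tick())  # True 1
clock.advance(2.5)
print(tick())                       # 2
```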

Anaphora and type-based references

A limited form of anaphora to refer to values by things other than their names could be useful, although it could make code difficult to read if it has complex rules or encourages excessive use. the (type) to refer to the nearest in-scope value matching type seems to strike a good balance amongst utility, readability, and maintainability. it would be suitable for short anonymous functions in a similar vein to the iteration quantifiers above: e[it] = λx. e[x/it], so 5 * it = function (x) { return 5 * x }. This also has the issue of scope, though: how big is the lambda? “As large as possible” and “as small as possible” are both the wrong heuristic in some common circumstances.

evincarofautumn commented 4 years ago

A major source of inspiration for the syntactic–semantic design here is Pane, Ratanamahatana, & Myers: Studying the Language and Structure in Non-Programmers’ Solutions to Programming Problems.

Hap already has the following, or they’re in progress:

The overall program structure is biased toward events, with imperative actions secondary
Iteration quantifiers (described above) provide container-level and function-level iteration over sets and subsets, rather than object-level loops; the vast majority of looping is implicit
State is maintained using mainly behaviours attached to entities, with a minority described using explicit updates
and is primarily Boolean conjunction, secondarily sequencing
or is primarily Boolean disjunction, secondarily “else”, “otherwise”, clarification, or restatement
then is primarily sequencing
Conditionals are specified primarily using mutually exclusive rules, or a general case with exceptions (using e.g. but), and secondarily with Boolean logic
Time and motion are continuous; relationships between past and present are implicit in events, or specified using time relations like after

Other questions directly from or inspired by the paper that don’t have a clear answer yet:

How should Hap differentiate between sequences of actions that can be interrupted vs. those that must execute as a unit? The tentative idea is to differentiate e.g. the discrete/imperative while (evaluate the body as a unit) from the continuous/event-oriented as long as (as soon as the condition becomes false, at any point in the body, it stops evaluating) but this raises hairy questions of “transactions” (such as needing to use atomically within as long as to group statements)
How should the user specify constraints or invariants that should always hold (“the player cannot move outside the screen”) or declarative specifications of situations (e.g. “there are 4 blocks”)?
There should be some way to talk about all instances of an object or entity (#2) and refer to nearby/obvious objects anaphorically
What is the perspective of program structures? First-person as the user or as the programmer, second-person as the programmer addressing the user, or third-person narrator?
Iteration constructs and quantifiers should allow talking about negation or inverses of sets, even if they aren’t actually enumerable
What is the precedence of not? Boolean logic uses a convention of high precedence, but the default English interpretation has low precedence

evincarofautumn commented 4 years ago

Some recent decisions:

- [x] Require delimiters around statement bodies, to avoid both “dangling else” and ambiguity with map/set literals
- [x] Split keywords into primary/secondary/contextual; an identifier may contain any of them, but may not begin with a primary keyword, and will only parse as a contextual keyword when it appears alone
  - [ ] Make anything that doesn’t need special parsing not a keyword and just handle it in name resolution
- [x] Standardise on spaces as word/digit separators
  - [ ] Allow other reasonable word characters like apostrophes and dashes
  - [x] Remove _ as a word character
  - [x] Use _ as the subscript operator, freeing up [ ] brackets
- [x] Don’t bother with exponential/scientific notation
- [ ] Use # as a number prefix for alternate bases
  - [ ] Make default base 16 for things like colour codes: slate gray = #708090
  - [ ] Subscript for explicit radix, like mathematical notation: #CAFE BABE_16, #1010_2, #Aa0+/=_64
- [x] Use nested quotes for text splices instead of backslash escapes (need to supply character name constants in standard library instead)
  - [ ] Support curved quotes
  - [ ] Support multi-line text (leaning toward blockquote style, with prefix on each line)
- [ ] Retain comments instead of discarding them
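
For the proposed # number literals, a rough Python sketch of how they might be read: the default radix is 16, an explicit radix follows the _ subscript, and spaces act as digit separators. The function name parse_number is an assumption, and because it relies on Python's built-in int, only radices 2 to 36 work; the base-64 example #Aa0+/=_64 would need its own digit alphabet and is rejected here.

```python
def parse_number(literal):
    """Parse a hypothetical Hap literal such as '#708090' or '#1010_2'."""
    if not literal.startswith("#"):
        raise ValueError("expected a literal starting with '#'")
    body, separator, radix_text = literal[1:].partition("_")
    radix = int(radix_text) if separator else 16  # base 16 when no subscript
    digits = body.replace(" ", "")                # spaces separate digit groups
    return int(digits, radix)                     # ValueError outside 2..36

print(parse_number("#708090"))        # 7372944
print(parse_number("#CAFE BABE_16"))  # 3405691582
print(parse_number("#1010_2"))        # 10
```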

Possible next directions:

- [ ] Limited whitespace sensitivity
  - [ ] Allow newline+noindent as sugar for semicolon as statement terminator
- [ ] Add generic block statement with keyword (e.g. do { … }, but maybe not do)
- [ ] Add multi-line comment notation (not a lot of evidence that this is intuitive/usable/desirable)
- [ ] Merge statements and expressions (“no sublanguages”)

evincarofautumn / hap-hs

Concrete syntax #4