asoffer / Icarus

An experimental general-purpose programming language
Apache License 2.0
9 stars 2 forks source link

Pattern Matching #65

Open asoffer opened 3 years ago

asoffer commented 3 years ago

Compile-time Pattern Matching

Pattern matching is the ability to concisely describe the form that a value takes on in a declarative manner. In the context of generic programming, this comes up frequently. For example, we may want to restrict a function to only accept arrays, though we do not care about the length of the array or the type of the array elements.

We propose a new operator ~, which allows matching expressions to patterns, provided that the matching can be executed entirely at compile-time. The left-hand side of this operator is the expression to be matched, and the right-hand side is the pattern we intend to match. If the expression does not match the pattern, an error is emitted.

Defining and Using Patterns

The simplest pattern is a constant. Constants can be used to enforce compile-time invariants. For a contrived example,

T ::= compute_some_type()
T ~ [2; i64]  // Asserts that T must be [2; i64]

More useful is a pattern to which we can bind further identifiers. To do this, we introduce a new declaration syntax using a backtick (`). Syntactically, the backtick accepts an identifier. It denotes a pattern which matches anything. A match also binds the identifer to the value being matched within the local scope.

For example,

T ::= comute_some_type()
T ~ [2; `ElementType]  // Asserts that T must be a two-element array,
                       // emitting an error otherwise. If so, it also
                       // binds ElementType to the type of the elements
                       // of the array.

While pattern matching against types is a common use case, we allow patterns to match against any values.

// A contrived example with integers
N ::= 100
N ~ 2 + 2 * `FORTY_NINE
FORTY_NINE ~ 7 * `SEVEN

// A slightly less contrived example:
T ::= compute_some_type()
T ~ [2 * `N; `ElementType]  // Asserts that T is an array of even length.

in some contexts, we also allow ~ to be used as a unary operator. Specifically in contexts where inference is allowed, the unary ~ will match the pattern against the value that would be inferred. The most common case would be with type-inference in declarations.

p: ~*`T = returns_a_pointer()

a: ~[2 * `N; `E] = returns_an_array_of_even_length()

In each of these examples, the pattern in the declaration's type-expression matches the type inferred from the initial value. This constrains p to be a pointer and a to be an array of even length.

Note that because ~ is a unary operator, it is syntactically valid to write something such as

a: [3; ~`T] = something()

The semantics of such an expression may seem strange at first. ~`T still matches the type of something(), but then the type of a becomes [3; T], rather than T. Such a declaration will then require an implicit conversion from T (the type of something()) to [3; T] (the type of a). Such a conversion does not exist for T in general, but there may be a user-defined conversion for this specific type. Whether or not a conversion exists, the pattern match will succeed, because the pattern `T matches everything. However the type of a is inferred to be [3; T] and no conversion exists from T to that, so an error will be emitted due to this missing conversion.

Function paremeters

Pattern matching in function parameters is no different syntactically or semantically from above but is ubiquitous for generic programming and worth calling out explicitly. Just as in the example with declarations, a pattern placed in a declarations type-expression matches against the type inferred for the declaration. In the case of function parameters, this is the type of the argument passed in at the call-site.

Here are a few examples:

// Identity function
id ::= (x: ~`T) -> T { return x }

// Takes the value by reference
by_reference ::= (x: *~`T) -> () { ... }

// Accepts a constant N, and partitions an array whose length is a multiple of N
// into an array of length-N arrays of the same elements.
partition ::= (N :: u64, x: ~[N * `M; `T]) -> [M, N; T] { ... }

Implicit conversion subtleties

There are some subtleties worth calling out. For instance, should the pattern *`T match [*]i64? Should an array-type match a slice-type Should functions with parameter names match the corresponding nameless function type (e.g., (n: i64) -> i64 versus i64 -> i64)? The underlying question here is whether a type should match a pattern to which it can be implicitly converted. While this seems like a nice feature in general, it can be difficult or impossible to determine the right implicit conversion without knowing the type we want to convert to (just a pattern for that type). There may even be multiple such conversions. It therefore seems prudent to disallow conversions entirely. Because buffer-pointer-to-pointer is a conversion, the pattern *Twill not match[*]T`.

However, this opens up a trap for Hyrum's law. Users of a library may match a library function's return value against a pointer such as *`T, thereby causing breakages when the library author decides to change the return type to a buffer pointer, which would otherwise be allowed. We still believe that not matching through implicit conversions is the right solution, but we would want to

  1. Advise users not to match patterns against types they don't own
  2. Build a tool that can fix such issues.

Syntactic conflicts

Both ` and ~ are currently currently used symbols. The symbol ` is used to denote a character literal, and the symbol ~ is used to invert flags. We believe these symbols are more appropriate for pattern matching. Specifically, the tilde fits well because it is commonly used in contexts of similarity. It can be read as "looks like." The reason for preferring a backtick with declarations is a bit more tenuous. We want a symbol that does not have many pixels set so that the focus is on the identifier name. Moreover, we expect ` to (in later additions to the language) accept a left-hand argument further constraining the identifier. For this reason, the whitespace in the sigil ` makes a`b easier to read than something else like a!b.

To make this happen, we need change character literals to be prefixed with a different symbol. We propose ! for this symbol.

Similarly, we need a replacement for flag inversion. For this we propose the keyword not. In fact, there's no reason to have &, | or ^ when we have and, or, and xor as well.

User-defined patterns

At the moment, we do not provide any mechanism allowing users to define their own patterns. This is a feature we are interested in, but is not crucial.

Run-time pattern matching.

It would be ideal to entirely support run-time pattern matching as well, however this comes with two primary difficulties.

  1. A pattern could fail to match, in which case we would need some failure-handling mechanism.
  2. Pattern matching is likely to generate less efficient code than hand-written approaches. We would prefer to nudge users towards the more efficient option.

If we can come up with an unobtrusive syntax for failure-handling and find that the performance costs are minimal and often optimizable away, we should revisit this decision for run-time pattern matching.

For the time being we support pattern matching against runtime values, but only if the pattern can be matched at compile-time. The identifier being declared must be the unary ~ operator applied to a pattern.

For example, we allow array or struct expansion (a la C++ structured bindings) because the structure of the array is only dependent on the length which is a property of the type.

Arrays

// Match an array of 3 elements. Fails to compile if the array does not have length 3.
~[`x, `y, `z] := function_returning_array()

// Use both type matching and decomposition together.
~[`x, `y, `z]: ~[3; `T] = function_returning_array()

Note: We may also want to add ... as is done with designated initializers (see below).

Designated initializers

For designated initializers, there is a question about what happens for fields that are not bound as part of the pattern. Neither solution is particularly nice. We could silently drop the fields, but then users are likely to miss necessary values and end up with incorrect structs. We could also require all fields to bound, but this also runs a risk with Hyrum's law, making it difficult for a library author to add new fields to a struct.

we propose the following solution: By default, all fields must be present for a match. However, users may also add a ... marker inside the designated initializer to indicate that more values are being dropped. This allows library authors to add fields to structs, coupled with a tool that inserts ... appropriately.

Also as a small amount of syntactic sugar, we allow omitting the type leading the designated initializer in favor of simply .{ ... }, as the type must be deduced anyway.

// Decompose a struct
S ::= struct { n: i64 \\ b: bool }
~S.{ n = `n \\ b = `b } := function_returning_s()
~S.{ n = `n \\ ... } := function_returning_s()

// Use decomposition in function parameters.
f ::= (~.{ n = `n \\ b = `b }: S) -> () { ... }
wrhall commented 3 years ago

Whether or not a conversion exists, the pattern match will succeed. However if no such conversion exists, but an error will be emitted due to the missing conversion.

I do not understand this section; edit for clarity?

asoffer commented 3 years ago

Hope that helps

wrhall commented 3 years ago

To make this happen, we need change character literals to be prefixed with a different symbol. We propose ! for this symbol.

I don't love this In c languages, the implicit conversion from char to bool (and vice versa) makes this syntax surprising. While I do like symbols in Ruby and elixir, : might be a better option for char literal

wrhall commented 3 years ago

Looking at this

// Accepts a constant N, and partitions an array whose length is a multiple of N
// into an array of length-N arrays of the same elements.
partition ::= (N :: u64, x: ~[N * `M; `T]) -> [M, N; T] { ... }

Is it possible to build a pattern for chunk, ie where the overflow goes into the last "part" of the array?

asoffer commented 3 years ago

To make this happen, we need change character literals to be prefixed with a different symbol. We propose ! for this symbol.

I don't love this In c languages, the implicit conversion from char to bool (and vice versa) makes this syntax surprising. While I do like symbols in Ruby and elixir, : might be a better option for char literal

: might be hard to parse because of its use as in declarations. Does it help that Icarus doesn't use ! for logical negation nor does it have that conversion? Like, it's not ideal, but at least you get decent error messages. For the first time I'm really starting to feel the sigil real estate issues.

asoffer commented 3 years ago

Looking at this

// Accepts a constant N, and partitions an array whose length is a multiple of N
// into an array of length-N arrays of the same elements.
partition ::= (N :: u64, x: ~[N * `M; `T]) -> [M, N; T] { ... }

Is it possible to build a pattern for chunk, ie where the overflow goes into the last "part" of the array?

It should be, though it's either quite subtle or not achievable at all right now. If we allowed users to define their own patterns it would be quite attainable. Or if we allowed binding with further matching. For example, I could imagine also having ` take a left-hand side:

partition ::= (N :: u64, x ~[(N * `M + Length % N)`Length; `T] -> ...

In this case we would first bind Length, then checking to ensure it satisfied the pattern, which would further end up binding M.

wrhall commented 3 years ago

Honestly, as I read further I was fairly well won over by !. It's probably untenable if you also wanted it as the not operator, but given that it isn't I think I'm ok with your choices.

I might need a sigil map to help me understand what they all do. Maybe that's just a view of really structured documentation (eg all single character reserved words)


Can you build up multiple line patterns / bindings?

asoffer commented 2 years ago

Can you build up multiple line patterns / bindings?

Am I correct that you're asking about composability? I think this requires making patterns first-class citizens (at compile-time) in the language. I'm not opposed to this, but I don't see an elegant way to do it right now. The problem is, you wouldn't want to do the naive thing:

even_length_array ::= [2 * `N; `T]

// Many lines later

[4; i64] ~ even_length_array
assert(N == 2)  // Woah, where did `N` come from?

The problem is the binding variables are declared with the pattern, but actually bound when they're used with ~.

You'd really want to do something that forces users to name the bindings at the usage cite. Something like:

even_length_array ::= pattern(half_length, element_type; [2 * `half_length, `element_type])
`
[4; i64] ` even_length_array(`N, `T)

The syntax here is really gross though. I don't have good idea, because it should look something like a function definition, but we need to pass variable bindings as parameters, which is just weird.