degory / ghul

compiler for the ghūl programming language
https://ghul.dev
GNU Affero General Public License v3.0
4 stars 0 forks source link

Tagged unions #1132

Open degory opened 6 months ago

degory commented 6 months ago

Tagged Union Types

Introduction

This issue proposes the addition of tagged union types to the ghūl programming language. Tagged unions (also known as discriminated unions or sum types) allow defining a type as a union of distinct cases, each of which can carry its own data.

Syntax

The syntax for defining tagged unions in ghūl is:

union Shape is
    CIRCLE(radius: float);
    SQUARE(side: float);
si

union Option[T] is
    SOME(value: T);
    NONE;
si

union Result[T, E] is
    OK(value: T);
    ERROR(error: E);
si
union_definition ::= "union" identifier type_parameters? modifiers? "is" variant_definition+ "si"

type_parameters ::= "[" type_parameter ("," type_parameter)* "]"
type_parameter ::= identifier

variant_definition ::= identifier variant_fields? ";"
variant_fields ::= "(" variant_field ("," variant_field)* ")"
variant_field ::= identifier ":" type_expression

type_arguments ::= "[" type_expression ("," type_expression)* "]"

identifier ::= ...
type_expression ::= ...
modifiers ::= ...

Semantics

Immutability

Construction

Unions are constructed using the new keyword followed by the qualified variant name (including any necessary actual generic type arguments) and any necessary field values. Note that type arguments are applied to the union type, not the variant:

let circle: Shape = new Shape.CIRCLE(10.0);
let square: Shape = new Shape.SQUARE(5.0);
let some_int: Option[int] = new Option[int].SOME(42);
let ok_result: Result[int, string] = new Result[int, string].OK(42);
let error_result: Result[int, string] = new Result[int, string].ERROR("Error message");

Although unit variants contain no fields, construction still requires parentheses, for consistency with use of new with other parameterless constructors in ghūl:

let none_int: Option[int] = new Option[int].NONE();

For convenience, constructor functions can be defined for commonly used unions, to avoid having to specify generic argument types:

some[T](value: T) -> Option[T] => new Option[T].SOME(value);
none[T]() -> Option[T] => new Option[T].NONE();
ok[T, E](value: T) -> Result[T, E] => new Result[T, E].OK(value);
error[T, E](error: E) -> Result[T, E] => new Result[T, E].ERROR(error);

Representation

The compiler can choose the most efficient representation for each union type based on its characteristics:

The initial implementation may be limited representing a union and variants as a base class and derived classes, in which case struct representation support will be delivered under an separate issue.

Requirements Checklist

Optional:

Implementation Notes

The parser, the type expression syntax trees, and the generic type system all need some changes in order to handle variants of a generic union type.

Type Expression Syntax Trees

Need to update syntax trees so we can represent type member expressions where the part to the left of the dot is a generic type application

Parser

Need to update the parser to handle type member expressions where the part to the left of the dot is a generic type application

Type System and Generic Specialization

Specialization of variants of generic unions needs special handling.

In principle we could specialize variants in the usual way, treating them as members of their owning union and replacing all references to the parent unions formal type arguments within the variant with the actual type arguments supplied in the generic application type expression.

However, the .NET representation of the variants will be as classes in their own right. Hence it's more appropriate to make them generic classes with the same formal type arguments as the owning union as this will better align with the actual .NET representation.

So

union Option[T] is
    SOME(value: T);
    NONE;
si

will be represented as

class Option[T] is
    _tag: ubyte; // unless we use a type check or a virtual method to determine variant type
si

class SOME[T]: Option[T] is
    value: T;
si

class NONE[T]: Option[T] is
  // even though NONE has no fields, it needs to be assignment compatible with
  // Option[T] so it still needs a formal type argument T that is forwarded to Option[T]
si

Implementation Notes

The parser, the type expression syntax trees, and the generic type system all need some changes in order to handle variants of a generic union type.

Type Expression Syntax Trees and Parser

Need to update syntax trees and the parser to represent and handle type member expressions where the part to the left of the dot is a generic type application, such as Option[T].SOME.

Type System and Generic Specialization

Specialization of variants of generic unions needs special handling.

In principle, we could specialize variants in the usual way, treating them as members of their owning union and replacing all references to the parent union's formal type arguments within the variant with the actual type arguments supplied in the generic application type expression.

However, the .NET representation of the variants will be as classes in their own right. Hence, it's more appropriate to make them generic classes with the same formal type arguments as the owning union, as this will better align with the actual .NET representation.

For example:

union Option[T] is
    SOME(value: T);
    NONE;
si

will be represented as:

class Option[T] is
    _tag: ubyte; // unless we use a type check or a virtual method to determine variant type
si

class SOME[T]: Option[T] is
    value: T;
si

class NONE[T]: Option[T] is
  // even though NONE has no fields, it needs to be assignment compatible with
  // Option[T] so it still needs a formal type argument T that is forwarded to Option[T]
si

When declaring symbols, we need to ensure:

Because variants contain no code that could reference any symbols, there's actually no need for them to be inside their parent's scope. It may actually be easier, however, not to attempt to break them out of it - it likely won't matter in practice either way. Plus, the one symbol in the parent we do potentially want to reference from the variants (the tag, which we might need to reference from compiler-generated code) will be inherited.

For a type expression like Option[int].SOME, we can't construct a GENERIC type or a GENERIC symbol that represents it, nor do we want to since we actually want to apply the type arguments to the variant, not the union. Hence, we should transform Option[int].SOME into Option.SOME[int].

We could do this blindly in the parser, before we even know what kind of symbol Option will be declared as, since the only kinds of generic types that contain (accessible) type members will be unions. This wouldn't be very future-proof, however, and it might result in confusing error messages.

It would be preferable to defer rewriting Option[int].SOME into Option.SOME[int] until the resolve-types phase, at which point we'll be able to check that Option really is a union, that it's generic and takes one type parameter, and that SOME is a member of it. We can then pretend the user had actually written a qualified identifier generic type application Option.SOME[int] and proceed to specialize that. In the case of any errors resolving the type, though, we will still have the information needed to report it as Option[int].SOME.

We should also notify the symbol use listener that both Option and SOME are referenced here, to ensure hover, symbol navigation, and rename all work (in particular, rename will fail to rename Option otherwise as it won't be referenced in any phase after resolve-types).

For a type expression Option[int].SOME, we should not attempt to specialize Option[int] (making a copy of Option[T] with int substituted for T) and search for member SOME within it. Instead, we should search the unspecialized non-generic Option class for its member SOME[T] and then specialize that with actual type argument int.

Within the body of a variant, any references in type expressions to other variants of the containing union should not be legal. References to the containing union are allowed - this is required to support recursive union types such as lists and trees. If a union is generic, then references to its type within the body of one of its variants need appropriate formal type arguments, which can (and probably should) be the formal type arguments of the union itself.

union Tree[T] is
    NODE(value: T, left: Tree[T], right: Tree[T]);
    LEAF;
si

We need to synthesize an init(...) method for each variant that takes an argument for each of the variant's fields and then initializes the fields accordingly

union Tree[T] is
    NODE(value: T, left: Tree[T], right: Tree[T]);
    LEAF;
si
...
class NODE[T]: Tree[T] is
    value: T;
    left: NODE[T];
    right: NODE[T];

    init(value: T, left: Tree[T], right: Tree[T]) is
        self.value = value;
        self.left = left;
        self.right = right;
    si
si

This can be done similarly to synthesized read and assign accessor methods - we can just construct appropriate syntax trees and inject them into the variant's syntax tree before the define symbols pass (e.g. in add accessors for properties pass)

However, unit variant instances need to be singletons and this complicates things. If the code that handles new expressions needs to be updated to handle this case, then perhaps instead we can add two new IR values - one for instantiating variants with arguments and the other for getting the singleton value for a given unit variant

Implementation Checklist

degory commented 6 months ago

Prerequisite for #1134