Storyyeller / cubiml-demo

A simple ML-like programming language with subtyping and full type inference.

Free type variables and extracting type information #1


zesterer commented 3 years ago

Hi,

I'm very interested in using cubic biunification in a project of mine. However, I have two questions about the algorithm that I've not been able to answer from the codebase and the associated blog posts alone.

1) How might one declare free type variables? (i.e. require the type inference to work backwards to infer the types of variables, as in the following case)

let x = fun x -> x + 5

// `x` is inferred as `Int -> Int`

2) How might one extract the inferred type information once checking is complete, to pass to later compiler stages?

Storyyeller commented 3 years ago

For 1, you should be able to just use a regular type variable. Perhaps it would help if you explained in more detail what you are trying to do.

zesterer commented 3 years ago

Thanks for the quick response!

For 1, you should be able to just use a regular type variable

I thought this might be the case but I was unsure.

Perhaps it would help if you explained in more detail what you are trying to do.

For (2), I'm trying to take a Value and use the existing data within the type checker core (after all type checking has completed) to extract a fully-inferred type.

For the sake of this question, I guess you could say that this is equivalent to printing the type of an expression after type-checking.

Storyyeller commented 3 years ago

If you want to print out the type, then you just follow the graph nodes and print it as desired. Note that fully simplifying it can take exponential time in pathological cases. One of the key reasons that the base algorithm has worst case O(n^3) time is that it doesn't try to simplify anything, but you'll probably want to simplify it for display purposes.

zesterer commented 3 years ago

Apologies, I'm still a little confused.

In the demo on this page: https://blog.polybdenum.com/2020/08/08/subtype-inference-by-example-part-6-numeric-types-and-operators.html

I can type in the following expression without a type error.

if true
then 42
else "forty two"

Is this intentional? This seems like a malformed program that the type checker is accepting.

If you want to print out the type, then you just follow the graph nodes and print it as desired.

I'm not entirely sure what you mean by this. As far as I can see, the existing type checker doesn't provide a way to 'follow' a Value recursively to its VTypeHead, and there doesn't seem to be an obvious way to implement something like that without hitting a TypeNode::Var along the way.

Storyyeller commented 3 years ago

if true then 42 else "forty two" just has the type int | str. There's no problem there.

zesterer commented 3 years ago

Ah, the code is implicitly generating sum types? In this case, I am, if anything, more confused as to how one extracts that information from the type checker core.

Storyyeller commented 3 years ago

I think it's best to explain in more detail what you are trying to actually accomplish so we can figure out what the best way to do it is. Printing out inferred types is generally not useful, because the inferred types are often too large and complex for a human to understand anyway.

zesterer commented 3 years ago

To clarify: in cubiml-demo, the type checker simply enforces that the program is well-typed and then transpiles the code to a dynamically-typed language, JavaScript. If one wishes to actually compile the program to a more efficient representation (i.e. machine code), then the types themselves need to be extracted from the type checker so that they can be used during code generation (to determine sizes, and other such things).

In practice (at least, in past compiler projects) I've done this by taking the inferred types from the typechecker and 'baking' them into concrete types (instantiating polymorphic functions as necessary), producing a program where every term is associated with a concrete type so that code generation can proceed.

My problem is that, currently, I cannot see a simple way to extract this sort of type information from cubiml's typechecker.

Storyyeller commented 3 years ago

The type system shown in cubiml was designed for transpilation to JS and was not designed for efficient static compilation. If that's what you want to do, you'll probably want to use a different type system. You can still use the same algorithm, and the same code, but you'll want to define the actual types and subtype relations differently. For example, you'll probably want to get rid of width subtyping for records.

Doing so of course opens up new issues, like how to identify which record type a given field access is supposed to belong to. I'd recommend looking at OCaml and Haskell, which have wrestled with these issues. OCaml doesn't allow anonymous record types at all, while Haskell requires all field names to be unique.

Storyyeller commented 3 years ago

Once you've done that, you can extract all the type information you want from the type graph. For a variable, just look at the set of VTypeHeads and/or UTypeHeads that flow to it.
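
As a rough illustration, here is a minimal sketch of that extraction against a simplified stand-in for the core. The shapes below (`TypeNode`, `VTypeHead`, the `flows_from` table) are assumptions for the sketch, not cubiml's exact API; in particular, the real core also has use-side heads and a separate Reachability structure.

```rust
// Simplified stand-in for the post-checking type graph.
enum VTypeHead {
    Bool,
    Int,
    Str,
    Func { arg: usize, ret: usize }, // indices of other type nodes
}

enum TypeNode {
    Var,              // inference variable with no head of its own
    Value(VTypeHead), // a value-side type head
}

struct Core {
    types: Vec<TypeNode>,
    // For each node, the value nodes that flow into it (read off the
    // reachability relation once checking is complete).
    flows_from: Vec<Vec<usize>>,
}

// Walk the graph and render a display type. Cycles are cut off with "..."
// rather than printed as proper recursive types.
fn show(core: &Core, node: usize, seen: &mut Vec<bool>) -> String {
    if seen[node] {
        return "...".to_string();
    }
    seen[node] = true;
    let s = match &core.types[node] {
        TypeNode::Value(VTypeHead::Bool) => "bool".to_string(),
        TypeNode::Value(VTypeHead::Int) => "int".to_string(),
        TypeNode::Value(VTypeHead::Str) => "str".to_string(),
        TypeNode::Value(VTypeHead::Func { arg, ret }) => {
            format!("({} -> {})", show(core, *arg, seen), show(core, *ret, seen))
        }
        // A bare variable displays as the union of the heads flowing into it.
        TypeNode::Var => {
            let parts: Vec<String> = core.flows_from[node]
                .iter()
                .map(|&v| show(core, v, seen))
                .collect();
            if parts.is_empty() {
                // Unconstrained: a real printer would emit a fresh variable.
                "?".to_string()
            } else {
                parts.join(" | ")
            }
        }
    };
    seen[node] = false;
    s
}
```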

zesterer commented 3 years ago

Thanks for the information! I think perhaps I'll take some inspiration from cubiml but attempt implementing some alternative algorithm. Once again, thanks for being patient with me!

devdtc commented 2 years ago

First, let me thank you for the tutorial and sample code. This kind of system has practically all of the qualities I wanted for my toy language, and the tutorial is extremely thorough, informative and persuasive. :)

A few (naive) questions:

  1. I am also curious about zesterer's example of if true then 42 else "forty two". How would one constrain the arms of a match/if to be equal? I know non-sum union types are there because of JS, but would they be difficult to remove from the system?
  2. This is probably a large or vague question, but how would one go about implementing something like derive macros that inspect the type? E.g. if I wanted to automatically derive [de]serialization routines, would I want to try to walk the graph for just the use or just the value type and quit when I see a cycle or a non-serializable head structure?
  3. When you say one could reuse the algorithm but change the type system, what would that look like in practice? Would most of the changes be in check_expr and check_heads while leaving flow() and Reachability as they are?
  4. How would you express a list of values with a common upper bound in Cubiml? E.g. I want something like:
     type List 'a = `Cons 'a | `Nil unit
     let cons = fun x -> fun l -> `Cons {head=x; tail=l}
     let nil = `Nil unit
     let l = cons {a=1; b="hi"} (cons {a=2; c="bye"} nil) : List ('x <: {a: int})
  5. What other changes might you suggest to allow efficient compilation to native code? Is it infeasible to--once the type checker deems everything's safe--have the compiler walk the expression tree and monomorphize where possible except for cases like (4), but keep features like width subtyping?

Thanks again!

LPTK commented 2 years ago

To @devdtc (and @zesterer may be interested). I don't mean to derail this discussion, but you may want to have a look at my Simpler-sub project, which proposes a very simple algorithm for subtype inference by starting from Simple-sub (an analogue of cubiml) and cutting some corners. The result is much simpler, and it has the property that accessing inferred types is straightforward – just look at the type syntactically, and at its bounds if it's a type variable. Bounds are inferred and merged automatically as least upper bounds or greatest lower bounds, avoiding the complicated graphs of the other approach. Any of the simplifications made in Simpler-sub can be reverted individually; for example, if you do need structural recursive types, you can always do them as in Simple-sub or cubiml and still retain the nice and easy type-variable-bounds approach. I'm sorry if posting this here is inappropriate (feel free to remove), but I just wanted to help out.
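
To picture the difference, here is a rough Rust rendering of that bounds-based representation (the names are hypothetical; Simpler-sub itself is written in Scala):

```rust
// Toy type syntax for the sketch.
#[derive(Clone, Debug)]
enum Ty {
    Top,
    Bot,
    Int,
    Str,
    Var(usize), // index into the checker's table of variables
}

// Each inference variable keeps one merged lower and one merged upper
// bound, updated in place as constraints arrive, instead of a graph of
// individual constraint edges.
struct VarState {
    lower: Ty, // least upper bound of everything flowing into the variable
    upper: Ty, // greatest lower bound of everything the variable flows into
}
```

Reading off an inferred type then amounts to looking at the type's syntax and, for variables, at these two bounds.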

Storyyeller commented 2 years ago

@devdtc

1.

The simplest approach would be to just hack in a special Use type head that contains mutable state which tracks what types have been checked against it previously and returns a type error in the case of "incompatible" types.
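
As a rough sketch of that hack (everything here is hypothetical; no such head exists in cubiml): the use head remembers the first head kind checked against it and rejects any later mismatch.

```rust
use std::cell::RefCell;

// The kinds of value heads this equality-enforcing use head distinguishes.
#[derive(Clone, Copy, PartialEq, Debug)]
enum HeadKind {
    Bool,
    Int,
    Str,
    Func,
}

// A use head with mutable state: the first head checked against it is
// recorded, and every subsequent head must have the same kind.
struct EqualUse {
    seen: RefCell<Option<HeadKind>>,
}

impl EqualUse {
    fn check(&self, head: HeadKind) -> Result<(), String> {
        let mut seen = self.seen.borrow_mut();
        match *seen {
            None => {
                *seen = Some(head);
                Ok(())
            }
            Some(prev) if prev == head => Ok(()),
            Some(prev) => Err(format!(
                "incompatible branch types: {:?} vs {:?}",
                prev, head
            )),
        }
    }
}
```

Checking both arms of an `if` against one such head would then reject `if true then 42 else "forty two"` instead of inferring int | str.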

A more technically correct approach would be to use a hybrid of unification and biunification depending on which behavior you want in which parts of the type system, but that makes the typechecker more complex.

2.

It is a vague question. Assuming you want to generate code based on inferred types, that is a big no-no. The most important rule of global type inference is that types are derived from code, rather than the other way around. If you have both code derived from types and types derived from code, then your system no longer has well-defined or predictable behavior. One way around this is to have a stratified type system, where you have two different kinds of types: one that is explicitly specified and usable in code generation, and one that is inferred from the code.

That being said, there are ways to do the sort of thing derive macros are used for without the actual code generation part. For example, you could arrange things so that the same code can be used for any component type, thus obviating the problem.

  3. Yes, that is correct.

  4. You would just say List {a: int}. Thanks to subtyping, you get this behavior for free.

  5. This is something I've been thinking a lot about, but it's a tricky subject, so I'll just do a brief brain dump.

The fundamental problem with existing systems languages is that their type systems combine multiple levels of abstraction, leading to much confusion and bugs. I think the real way forward is to have a language which makes the different levels of abstraction explicit.

I'm not sure how that might look - the particular design would be a UX question. But what you really need is some way to specify how types behave logically, as well as controlling how they are represented after optimization, etc. Depending on the focus of the language, you might start with just high level types and have optional annotations to provide optimization hints when the default behavior doesn't do what you want. Or you might do the reverse - start with low level physical types only and add the ability to build logical refinement types on top of them. It's all up to you.

Of course, realistic compilers have many different optimization passes, and the different passes often themselves operate at different levels of abstraction. To be fully correct, you'd want to be able to expose and control behavior at each intermediate stage. But this would obviously get out of hand pretty quickly, and I'm not sure what a good way to do it would be. On the bright side, even trying already puts you ahead of the state of the art. Real world compilers like LLVM are full of bugs because they can't even keep the levels of abstraction consistent within the compiler itself, let alone what is exposed to the programmer.

Another problem is that a Cubiml-like approach might work if all you care about is emulating C, but any serious language designer will want a Rust-like lifetime system at the very least, and I do not think there is any way to have full, decidable lifetime inference. This means that you're going to need to require type annotations, at least for lifetimes, and that means a lot of language design questions start going down different paths. I'm not sure what a good solution to this problem is.


@LPTK The problem with exposing inferred types is that they can be exponentially large, and hence you automatically get exponential complexity. What makes Cubiml unique is the worst-case cubic type inference, but that necessarily comes at the cost of not normalizing (and hence not exposing) inferred types.

That being said, I'm not sure which approach is best. But losing the polynomial complexity guarantees is a major downside, and it means that you'll probably need the programmer to supply at least partial type annotations in order to keep compilation time manageable. And that in turn opens the thorny question of how to specify them and how to teach the programmer when and where they need to add annotations.


Anyway, I hope that helped. Please let me know if you have any more questions.

devdtc commented 2 years ago

This is fantastic and much more than I expected--thank you both!


@LPTK I saw you post about simple-sub on r/ProgrammingLanguages but I had not seen simpler-sub, which might be a great fit. Will definitely take a look!


@Storyyeller this is very helpful (especially the brain dump!). A few follow-ups:

1. Another more basic part of this question that I probably should have started with is: what is the meaning of a type like int | string in a statically typed functional language like Cubiml? In TS you'd use typeof to branch on the type, but is the idea that you'd pattern match it somehow? It seems like the kind of thing more likely to result from a typo than from user intent, but maybe I think that only because I haven't used non-sum unions before.

2. That makes a lot of sense. Maybe a better thing to do is have macros like derive work on the AST of the type alias definition or use some EDSL for types that the macro can understand.

5. This is a very interesting way to frame the problem. By performant code, though, I didn't mean quite to the extent of Rust. My impression is you probably don't need to complicate the system too much to get good, predictable performance. C#/F# get by for gaming because their value structs allow performance-critical stuff to be written in a way that minimizes allocations. Maybe all that's needed is a distinction between something akin to Cubiml objects and simple C-like structs that do not support subtyping and can be stored inline in arrays. GC overhead might be mitigated by configurable collection strategies (e.g. a function that runs a closure while telling the GC to allocate into an arena and defer collections until the closure returns, at which point the arena can simply be freed). Anyway, just some ideas. :)

I think you're right that there probably isn't a good way to give full control over the generated code while maintaining the level of abstraction of languages like the MLs. My hope is that this probably isn't necessary for most of what I care about, like writing games.


Re. the cost of normalizing inferred types: it should still be feasible to (eventually) have the compiler do it, even if only for the REPL, right? I think that's the only thing I'd want it for, playing around with Cubiml. I get the argument that if the system is intuitive enough, printing types everywhere shouldn't be necessary, but it'd be helpful for people learning the system to see what's going on. zesterer's example is one that was counterintuitive to me coming from Haskell/OCaml (and having never used TS), for instance.

Storyyeller commented 2 years ago

int | string is equivalent to top in Cubiml, but it could be different depending on how you design your language (for example, including typeof as you suggested).

LPTK commented 2 years ago

@devdtc

I should have mentioned that in Simpler-sub, you don't need to have a top type. You can actually decide to reject things like int | string as they arise, since types are normalized on the fly – for this, simply add an appropriate error in the corresponding case of the least-upper-bound computation.

This actually allows you to implement types that have different, incompatible representations, and to compile them efficiently. I think this is specifically what you were asking for. A term like if true then 42 else "forty two" would simply yield an error saying that types 42 and "forty two" are not compatible, since their respective supertypes int and string are not compatible.

In Simple-sub and cubiml, we could also achieve the same thing, but it'd be quite a bit more complicated. Simple-sub already normalizes inferred types in the simplification phase, so such bad least upper bounds could be reported at simplification time. But there are subtleties around record types which require complicating the engine further (specifically, when computing the LUB of {x: A} and {x: B}, we should not make it {x: LUB(A, B)} until we're sure A and B have compatible representations – otherwise the LUB should be {}). Though honestly, such record types with width subtyping are probably not what you want in a performance-oriented language, as they require dynamic dictionary passing, a very inefficient way of implementing fields.
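
Here is a minimal sketch of such a normalizing least upper bound, covering both the rejection of incompatible primitives and the record subtlety just described (a toy `Ty` type invented for the sketch; Simpler-sub's actual representation differs):

```rust
use std::collections::BTreeMap;

// Toy type syntax for the sketch.
#[derive(Debug)]
enum Ty {
    Int,
    Str,
    Record(BTreeMap<String, Ty>),
}

// Normalizing LUB: incompatible representations are rejected outright at
// the top level, so `if true then 42 else "forty two"` is a type error.
fn lub(a: &Ty, b: &Ty) -> Result<Ty, String> {
    match (a, b) {
        (Ty::Int, Ty::Int) => Ok(Ty::Int),
        (Ty::Str, Ty::Str) => Ok(Ty::Str),
        // Width subtyping: keep only the fields common to both records.
        (Ty::Record(fa), Ty::Record(fb)) => {
            let mut fields = BTreeMap::new();
            for (name, ta) in fa {
                if let Some(tb) = fb.get(name) {
                    // Keep the field only when its two types have a
                    // representable LUB; otherwise drop it, so that
                    // LUB({x: A}, {x: B}) degrades to {} as described.
                    if let Ok(t) = lub(ta, tb) {
                        fields.insert(name.clone(), t);
                    }
                }
            }
            Ok(Ty::Record(fields))
        }
        _ => Err(format!("incompatible types: {:?} vs {:?}", a, b)),
    }
}
```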


@Storyyeller

@LPTK The problem with exposing inferred types is that they can be exponentially large, and hence you automatically get exponential complexity. What makes Cubiml unique is the worst-case cubic type inference, but that necessarily comes at the cost of not normalizing (and hence not exposing) inferred types.

I'm not convinced this really is a problem in practice. The original HM type inference system itself has exponential worst-case complexity, and that has not prevented its use in languages like OCaml, where programmers often leave no type annotations at all in their code bases. Human-written code does not tend to exhibit those sorts of pathological cases.

Now, subtyping adds its own potential pathological cases, but I don't see why things would be fundamentally different in this context. If you always normalize and simplify types, you keep them small (or at a size similar to what you'd get in ML), and so the size of the program doesn't matter that much – it's the size of individual definitions that tends to matter, but these are generally not very big (or the program needs some refactoring). A few hundred lines of code at most.

That being said, I'm not sure which approach is best. But losing the polynomial complexity guarantees is a major downside

To be clear, I'm pretty sure Simple-sub has the same complexity as cubiml. It's the simplification and normalization that's more expensive. But again, if program definitions remain of reasonable size and you do simplify and normalize each inferred type, you don't actually trigger pathological cases, in my (admittedly limited) practical experience.

Storyyeller commented 2 years ago

I suspect that the main reason that it is not a problem in "real world" codebases is that 1) in practice, people supply type annotations everywhere anyway and 2) in practice, people make relatively little use of the type system compared to what is possible and 3) even if they did, subtyping adds a lot of power compared to unification, and that comes at a (compile time) performance cost.

The whole point of having worst case guarantees is that they let you actually make meaningful statements about the properties of the system, rather than just saying "well it works in practice if you don't actually make use of it much". You could say that about anything!

LPTK commented 2 years ago

1) in practice, people supply type annotations everywhere anyway

I don't think that's it. There are many significant codebases in languages like SML, OCaml, F#, and even Haskell (before the convention to annotate the signatures of exported definitions was established) where basically no type annotations are provided. I have never heard of type inference being a bottleneck in these cases. In practice, it still behaves close to linearly with the size of the program, since the size of individual definitions is bounded.

3) subtyping adds a lot of power compared to unification, and that comes at a (compile time) performance cost.

Again looking at OCaml, you can see that they essentially emulate subtyping with row polymorphism in many places of the language – most notably in polymorphic variants and object types. Pure row polymorphism has some usability limitations, but note that for both of these features, OCaml enables recursive types by default, which places their expressiveness pretty close to MLsub. While it's true not many people use these features often, I have never heard of them causing type inference bottlenecks.

The whole point of having worst case guarantees is that they let you actually make meaningful statements about the properties of the system, rather than just saying "well it works in practice if you don't actually make use of it much". You could say that about anything!

A complexity analysis is not very useful if it doesn't take into account the actual shape of the data that's meant to be processed. It just turns out that the shapes of programs humans write have very strong invariants, so that should be what's optimized for IMHO.

I agree it's nice to have worst-case complexity bounds in any algorithm (even when they won't normally be exercised), but it's a pretty big tradeoff that I wouldn't be willing to make. Most practical languages have horrible worst-case type-checking complexity — in fact, most languages we actually use have undecidable type checking, including Rust, OCaml, and Haskell! https://3fx.ch/typing-is-hard.html#how-hard-is-type-checking-your-favorite-language

dgreensp commented 2 years ago

I'm starting to write an implementation of cubiml for learning purposes, and I think what people are missing in the if-then-number-else-string scenario is that if no code ever uses the result, of course it will type-check. That's just part of the deal with this kind of type checker. If some code does use the value, it had better use it in a way where it can be either a string or an int. At least, that's my understanding. And I suppose it's equivalent to Top "operationally" if you can't do anything with this type of value besides pass it around (including distinguishing it from other sorts of values besides int and string)... though in another sense, it is not the same as the set of all values, which would include functions, records, etc., right?

Anyway, despite the repeated assurances that the types from this checker are unwieldy and useless--and should not be shown to the user or used for further compiler features--I am still hoping to leverage them to provide an experience like the type-checking in TypeScript or Scala. I'm curious to look at them.

Storyyeller commented 2 years ago

That is correct.