dart-lang / language

Design of the Dart language

Patterns spec uses an ill-defined notion of structural equivalence of types. #2641

Open stereotype441 opened 1 year ago

stereotype441 commented 1 year ago

The patterns spec currently says this, in the "Pattern variable sets" section (emphasis mine):

...The pattern variable set for a pattern is:

Logical-or: The pattern variable set of either branch. It is a compile-time error if the two branches do not have equal pattern variable sets. Two pattern variable sets are equal if they have the same set of names and each corresponding pair of variables have the same finality and their types are structurally equivalent after NORM().

I believe the intent here is that "structurally equivalent after NORM()" basically means "the same type".

But I can't find anything in the spec or in the language repo that defines the notion of "structurally equivalent". For instance, are the types void Function<T>(T) and void Function<U>(U) considered structurally equivalent? I think we want to treat these two types as "the same type", but NORM() doesn't rename type variables, and without a definition of "structurally equivalent", it's unclear whether we should do so.

A similar issue arises with the ordering of named parameters in function types and named fields in records. It seems like types that differ only in the order of named parameters or named fields should be considered the "same type". But NORM() doesn't sort them, so without a definition of "structurally equivalent", it's unclear what the spec intends. (The records spec defines a notion of "structural equivalence" for record objects but not for record types.)
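Both wrinkles can be made concrete with a small toy model in Python (the class and function names here are invented for illustration; this is a sketch of the intended equivalence, not how the analyzer or CFE represents types):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TypeVar:
    name: str

@dataclass(frozen=True)
class FunctionType:
    type_param: str     # a single bound type parameter, e.g. `void Function<T>(T)`
    positional: tuple   # positional parameter types; order matters
    named: frozenset    # (name, type) pairs; a set, so order cannot matter

def same_type(a: FunctionType, b: FunctionType) -> bool:
    """Structural equality up to renaming of the bound variable and
    reordering of named parameters (a toy sketch, not the real algorithm)."""
    def canon(t):
        rename = lambda x: "#0" if x == TypeVar(t.type_param) else x
        return (tuple(rename(p) for p in t.positional),
                frozenset((n, rename(ty)) for n, ty in t.named))
    return canon(a) == canon(b)

# void Function<T>(T) vs void Function<U>(U): alpha-equivalent, so equal.
f1 = FunctionType("T", (TypeVar("T"),), frozenset())
f2 = FunctionType("U", (TypeVar("U"),), frozenset())
print(same_type(f1, f2))  # True

# void Function({int a, String b}) vs void Function({String b, int a}).
g1 = FunctionType("T", (), frozenset({("a", "int"), ("b", "String")}))
g2 = FunctionType("T", (), frozenset({("b", "String"), ("a", "int")}))
print(same_type(g1, g2))  # True
```

The point of the sketch is that both equivalences are easy to state once types are terms rather than source text: bound variables get a canonical placeholder, and named parameters live in an unordered collection.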

Also note that we aren't entirely consistent, even within the patterns spec, about how we specify the notion of the "same type". The "shared case scope" section uses the phrase "same type" instead of "structurally equivalent after NORM()", but I'm pretty sure the same meaning is intended:

If n is defined in every pattern variable set in vs and has the same type and finality, then introduce n into s with the same type and finality. This is a shared variable and is available for use in the body.

Note that the analyzer and the front end implementations both implement operator == on their DartType classes, so the implementations do have a notion of what the "same type" means. Notably, these implementations of operator == do not operate by taking NORM() and then comparing the results structurally. However, I believe they behave equivalently to taking NORM() and then doing a structural comparison that ignores renames of bound type variables and ignores the sort order of named parameters and named fields.

I think we should define the notion of "the same type" in one place in the spec, make sure it matches the behaviour of operator == in the implementations, and then elsewhere in the spec (for instance in the patterns spec), simply say "the same type".

stereotype441 commented 1 year ago

CC @munificent @leafpetersen @jakemac53 @lrhn @mit-mit @natebosch @eernstg @kallentu

stereotype441 commented 1 year ago

Hmm, after further examination, it appears that the analyzer and front end notions of "same type" are not equivalent to taking NORM() and then doing a structural comparison that ignores renames and sort order. For example, the analyzer and front end do not consider FutureOr<Object> to be the "same type" as Object, even though NORM(FutureOr<Object>) = Object.

However, I'm certain that the analyzer and front end implement other normalization rules as part of their operator ==. For instance, neither implementation is capable of distinguishing the types T?? and T?.

I haven't thoroughly read through all the code to see precisely what the implementations consider to be the "same" vs "different" types. So more research may be in order here.
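For reference, the two normalization behaviors mentioned above can be sketched as a toy term rewriter in Python (a hypothetical fragment of NORM covering only the nested-`?` and FutureOr<Object> rules; the real function has many more cases):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Nullable:   # the type T?
    inner: object

@dataclass(frozen=True)
class FutureOr:   # the type FutureOr<T>
    arg: object

def norm(t):
    """Toy fragment of NORM: NORM(T??) = NORM(T)? and
    NORM(FutureOr<Object>) = Object."""
    if isinstance(t, Nullable):
        inner = norm(t.inner)
        # A nested `?` collapses: T?? normalizes to T?.
        return inner if isinstance(inner, Nullable) else Nullable(inner)
    if isinstance(t, FutureOr):
        arg = norm(t.arg)
        # FutureOr<Object> normalizes to Object.
        return "Object" if arg == "Object" else FutureOr(arg)
    return t

print(norm(Nullable(Nullable("int"))) == norm(Nullable("int")))  # True: T?? ~ T?
print(norm(FutureOr("Object")) == "Object")                      # True
```

So a "same type after NORM" check would equate FutureOr<Object> with Object, which is exactly where it diverges from the implementations' operator ==.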

eernstg commented 1 year ago

Here's a specification angle on type equality. Before null safety we did not have a well-defined notion of equality for types (at least, not one that was consistently implemented). With null safety, we could use the following:

This section defines run-time type equality for instances of Type in a way that implies a corresponding notion of equality for the corresponding types (named S1 and S2):

the normal forms NORM(S1) and NORM(S2) are syntactically equal up to equivalence of bound variables

At the end of this sentence we have 'and ignoring * modifiers on types', but that shouldn't make any difference here. This notion of equality is used to determine canonicalization (e.g., const <S1>[] and const <S2>[] denote the same object iff S1 is equal to S2 in this sense).

It sounds like this could actually be the definition that the patterns spec is referring to.

The PR that updates the language specification to include null safety makes this explicit (github may not show that line if you click on the link because 'Large diffs are not rendered by default', but you can look up line 24475 manually):

We define what it means for two types to be the same as follows:
Let $T_1$ and $T_2$ be types.
Let $U_j$ be the transitive alias expansion
(\ref{typedef}) of $T_j$, for $j \in 1 .. 2$,
and let $S_j$ be \NormalizedTypeOf{$U_j$}, for $j \in 1 .. 2$
(\ref{typeNormalization}).
We then say that $T_1$ and $T_2$ are the \Index{same type}
if{}f $S_1$ and $S_2$ have the same canonical syntax.

('Canonical syntax' is just a way to denote types that eliminates spurious differences: prefix1.C and prefix2.C can be the same type, and two occurrences of the bare name C can be different types, because those terms occurred in different binding environments.)

The rule about function type subtyping needs a notion of "same type" for type variable bounds, but that one is specified to mean mutual subtyping.

The notion of type equality mentioned here is more strict. This is useful when the topic is, say, inherited member signatures: It does matter which signature a given abstract class gets if it implements two different superinterfaces I and J, and I has void m(dynamic) and J has dynamic m(Object?). With mutual subtyping they correspond to "the same" function type, but it does make a difference for clients whether they're calling a method returning void or dynamic, and similarly for the parameter.

This also makes me think that type equality as defined above is the useful requirement in the patterns spec, and mutual subtyping is too permissive.

leafpetersen commented 1 year ago

('Canonical syntax' is just a way to denote types that eliminates spurious differences: prefix1.C and prefix2.C can be the same type, and two occurrences of the bare name C can be different types, because those terms occurred in different binding environments.)

This isn't quite enough because of bound variables, unless that's also implied. In the NNBD spec I talk about this in terms of syntactic equality which means essentially alpha-equivalence. That is, two types are syntactically equal if they are structurally exactly equal for some choice of their bound variables, modulo different prefixes on the same type. I always assume that type aliases have also been fully expanded out.

Semantic type equality is defined as mutual subtyping.

In places, a slightly different variant of equality is used: namely syntactic equality after normalization. This is almost, but not quite equivalent to mutual subtyping (e.g. it continues to distinguish between top types).

I would suggest using syntactic equality here. That is, we distinguish between Object and FutureOr<Object>; between Object? and dynamic; etc. But we do not distinguish between F<int> and void Function(int) where typedef F<T> = void Function(T), and we do not distinguish between void Function<T>(T) and void Function<S>(S).

lrhn commented 1 year ago

It sounds like we agree that we have some notion of "structural type equivalence modulo textual representation", whether we call it syntactic equality, or define a canonicalization function and take its equivalence classes. I also think we all agree what it should do, it's just specifying it formally which is (as usual) an exercise in crossing t's and dotting i's (and slashing ø's and circling å's). But, we should make sure we do agree precisely which equivalences we include.

It definitely includes alpha-equivalence of bound type parameters, full expansion of type aliases, distinguishing the actual types referenced as opposed to the syntax used to refer to them (there might not even be a context anywhere in the program source where we can write a syntax representing the type, because it includes types that are not all in scope at the same time), and ignoring the ordering of named function parameters. Maybe more.

I think this comes back to the fundamental question: What is a type?

We talk about types all the time. They exist as entities in our static and dynamic semantic models. Each such type entity can have a number of properties, like some being "class types" which can be linked to a class definition, and we have a (partial ordering) subtyping relation defined over all the type entities.

But what are those entities, really?

I believe the model we use (implicitly or explicitly, deliberately or not) is that our type entities are terms of an abstract syntax model, an AST for types. They can be described structurally, as, e.g., TypeTerm ::= GenericTypeReference<TypeTerm1, ..., TypeTermN> | ..., where GenericTypeReference uniquely identifies a type declaration, which must be generic and take N type arguments. We textually represent this as List<int>, but that's assuming a context where List and int reference those particular declarations. It's a shorthand. The real term uniquely references the List and int declarations of dart:core, or rather the type entities for the types introduced by those declarations, in a completely non-ambiguous way. (This is what @eernstg's canonicalized textual form achieves syntactically, by assuming a context where all libraries are imported with a unique prefix.)

The abstract model does not care about the names of bound type variables, or about the ordering of named parameters, and aliases have always been expanded (type aliases do not introduce type entities, they only introduce new syntax to denote them). The model either does not have any bound type variable names or named parameter ordering at all, or alternatively we just use a more textual model to represent the types, and our semantic type entities are then equivalence classes of that textual model modulo the things that shouldn't matter. (Requires us to prove that it actually defines equivalence classes. And that subtyping is compatible with it too.)

In any case, for equivalent terms, there is only one entity in our semantic model. We cannot distinguish equivalent terms at all, it's the same type in every way. They're no more different than the integers denoted by 1 and 0.99999...

Then we have NORM which tries to figure out which different type terms can safely be considered to be semantically equivalent types. It's entirely structural in the type AST, it does not need to know the actual subtyping relation on types (but that's also defined structurally, so subtyping is just incorporated in the definition of NORM where it's necessary). The NORM function is defined to be idempotent and total (if we didn't make any mistakes), so we can use it for canonicalization, and therefore it too introduces an equivalence relation. It's designed to be compatible with semantic type equivalence (mutual subtyping), so if NORM(T1) is T2, then the type of T1 and the type of T2 are semantically equivalent. And semantic type equivalence is an equivalence relation by definition (if subtyping is reflexive and transitive, which it should be).
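The idempotence property can be spot-checked on a toy one-rule normalizer (hypothetical, not the spec's NORM):

```python
# Represent T? as the tuple ("?", T); the single rule collapses nested `?`,
# mirroring NORM(T??) = NORM(T)?.
def norm(t):
    if isinstance(t, tuple) and t[0] == "?":
        inner = norm(t[1])
        return inner if isinstance(inner, tuple) and inner[0] == "?" else ("?", inner)
    return t

t = ("?", ("?", ("?", "int")))   # int???
print(norm(t))                   # ('?', 'int')  -- i.e. int?
print(norm(norm(t)) == norm(t))  # True: idempotent, so it induces an equivalence
```

Idempotence is what lets us use NORM for canonicalization: "same type" becomes "same term after norm", which is automatically reflexive, symmetric, and transitive.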

That brings us back to the question of which equivalence to use when.

Take int? vs int??. Those are different type terms, but equivalent wrt. NORM (both normalize to int?).

Are int? and int?? distinct types in our semantic type model? I'd say yes, as currently defined. They are explicitly normalized by NORM, which means they must exist as separate terms before doing NORM. (And they are indeed mutual subtypes.)

Which again means that in any place of the language where we don't want to distinguish int? and int??, we need to invoke NORM.

The alternative is to say that our semantic type entities are equivalence classes of type ASTs wrt. NORM, but then we cannot distinguish FutureOr<Object> from Object at all, and we definitely do not want that.

I don't know what all this means for or-patterns. If we can somehow introduce an int?? type on one side, and int? on the other, the pattern would be invalid, but we've made it really hard to introduce int?? by not allowing it as syntax, and doing NORM in places where we might otherwise add a ? to a type. It's probably not a problem to not normalize.

eernstg commented 1 year ago

@leafpetersen wrote, about using 'canonical syntax' to avoid having different terms for the same type:

This isn't quite enough because of bound variables, unless that's also implied

It would be better to handle that in the same section (and to make it a bit more explicit), but the language specification does have the following (and it has been in there for several years):

Two function types are considered equal if consistent renaming of type parameters can make them identical. A common way to say this is that we do not distinguish function types which are alpha-equivalent.

On the other hand, it's not obvious to me how it would be possible to interpret the term 'syntactically equal' in such a way that we get the compile-time error we should have in this case:

// Library 'lib1a.dart'.
class C {}

// Library 'lib1b.dart'.
class C {}

// Library 'lib2a.dart'.
import 'lib1a.dart';
C f() => C();

// Library 'lib2b.dart'.
import 'lib1b.dart';
C? c;

// Library 'main.dart'.
import 'lib2a.dart';
import 'lib2b.dart';

void main() {
  c = f(); // Right hand side has type `C`, left hand side has type `C`, are they not syntactically equal?
}

In any case, I agree with @lrhn that we all agree on what we need to say, and we'll surely find a way to say it.

At this point I think we have two well-established ways to express sameness of types (both assuming that typedefs have been fully expanded first):

  1. Mutual subtyping.
  2. The 'same type' test that invokes NORM and then performs a structural equality check.

These two approaches are well-established, because 1 is currently used with function type subtyping for type variable bounds, and 2 is currently used during constant canonicalization.

Mutual subtyping does not distinguish top types (like dynamic and void). The NORM approach distinguishes some of them. Both equate FutureOr<Object> and Object. Both equate T?? and T?.
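To illustrate the difference on top types, here is a tiny hypothetical fragment of the subtype relation in Python (just enough to show why approach 1 is coarser than approach 2):

```python
TOPS = {"dynamic", "void", "Object?"}  # a few of Dart's top types

def is_subtype(s, t):
    # Toy fragment: every top type is a subtype of every other top type;
    # otherwise only reflexivity. The real relation is much richer.
    return s == t or (s in TOPS and t in TOPS)

def mutual_subtype_equal(s, t):   # approach 1
    return is_subtype(s, t) and is_subtype(t, s)

def norm_equal(s, t):             # approach 2; NORM leaves these terms unchanged
    return s == t

print(mutual_subtype_equal("dynamic", "void"))  # True: approach 1 equates them
print(norm_equal("dynamic", "void"))            # False: approach 2 keeps them apart
```
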

I'd suggest that we use (2), which is basically what the patterns feature spec says already:

their types are structurally equivalent after NORM().

leafpetersen commented 1 year ago

On the other hand, it's not obvious to me how it would be possible to interpret the term 'syntactically equal' in such a way that we get the compile-time error we should have in this case:

I don't see any issue here. Equality of identifiers is not string equality on the raw identifiers, it is identity on the canonical URI reference.

stereotype441 commented 1 year ago

@lrhn said:

I don't know what all this means for or-patterns. If we can somehow introduce an int?? type on one side, and int? on the other, the pattern would be invalid, but we've made it really hard to introduce int?? by not allowing it as syntax, and doing NORM in places where we might otherwise add a ? to a type. It's probably not a problem to not normalize.

There's a complication here because the implementations don't actually represent types in the same way we spec them (nor do they implement them in the same way as each other). I've been digging through the analyzer and CFE's definitions of operator == on types, and my understanding is:

  • Both the analyzer and CFE ignore the order of named function parameters and named record fields when comparing types; they also account for alpha-equivalence of generic function type parameters.
  • The analyzer and CFE don't distinguish T??, T*?, and T?. This is because they use a single enum to record whether a type ends in ?, *, or neither.
  • The analyzer and CFE don't distinguish dynamic, dynamic?, and dynamic*; they have just one implementation of the dynamic type. Same with void.
  • The CFE doesn't distinguish Null, Null?, and Null*. The analyzer does.
  • The analyzer and CFE do distinguish Never? from Null.

In practice this probably doesn't matter much, because (a) the analyzer and CFE agree with each other, and (b) most of the differences above are disallowed by the Dart grammar. It would sure be nice if our spec's definition of "the same type" agreed with what the implementations actually do, though.

leafpetersen commented 1 year ago
  • Both the analyzer and CFE ignore the order of named function parameters and named record fields when comparing types; they also account for alpha-equivalence of generic function type parameters.

I meant to comment on this. This is the behavior I would specify.

  • The analyzer and CFE don't distinguish T??, T*?, and T?. This is because they use a single enum to record whether a type ends in ?, *, or neither.

Yeah, I avoided going into this, but I remember discussing this. Again, this is fine - it's finicky to specify, but the grammar stops you from writing it explicitly, so it's just a matter of making sure that the specification of substitution maintains the invariant. Finicky, annoying, but it's fine in practice.

  • The analyzer and CFE don't distinguish dynamic, dynamic?, and dynamic*; they have just one implementation of the dynamic type. Same with void.

Similar to the above, except that the grammar does allow it. We'd have to decide how to specify this. Again, finicky.

  • The CFE doesn't distinguish Null, Null?, and Null*. The analyzer does.

Probably should reconcile this. Again, similar to dynamic above, I'm sure we can spec it, just finicky.

  • The analyzer and CFE do distinguish Never? from Null.

Yes, this is reasonable.

munificent commented 1 year ago

Is it sufficient to define type equality directly like:

Two types A and B are equal iff after resolving type aliases:

  • They are both interface types and:

    • They refer to the same declaration.
    • Each corresponding pair of type arguments has equal types.
  • They are both type parameters and refer to the same declaration.
  • They are both function types and:

    • They have the same number of type parameters and each corresponding pair of type parameters have equal bounds.
    • They have the same number of mandatory positional parameters with corresponding parameters having equal types.
    • They have the same number of optional positional parameters with corresponding parameters having equal types.
    • For every required named parameter in A, there is a corresponding required named parameter in B with equal type, and vice versa.
    • Ditto for non-required named parameters.
  • They are both record types and:

    • They have the same number of positional fields and each corresponding pair of positional fields have equal types.
    • For every named field in A, there is a corresponding named field in B with equal type, and vice versa.
  • They are both nullable types and the corresponding underlying types are equal.
  • They are both FutureOr<T> for some corresponding types T that are equal.
  • They are both void.
  • They are both dynamic.
  • Otherwise, they are not equal.

Legacy types, other stuff, etc.

I'm guessing there's some subtlety I'm missing here that is covered by trying to define it "syntactically"?

eernstg commented 1 year ago

@leafpetersen wrote:

Equality of identifiers is not string equality on the raw identifiers, it is identity on the canonical URI reference.

That is not obvious from the term 'syntactically equal'. However, this is exactly the same kind of information that I'm proposing we make available using 'canonical syntax' (which is basically transforming C and prefix.C into the canonical form aCanonicalPrefix.C). There would be a need to separate the analysis into phases, though: We can't consider identifiers to somehow contain a canonical URI reference during name resolution (such as lexical lookups), because the canonical URI is one of the deliverables from name resolution. So we'd need to have something like "identifiers" and "identifiers with built-in URI". (Perhaps "resolved identifiers" would work?)

In any case, we clearly agree that this information is required in order to make identifier comparison well-defined (during checks for subtype or same-type relationships, at least), and there are obviously many different ways to say it.

stereotype441 commented 1 year ago

@munificent wrote:

Is it sufficient to define type equality directly like:

Two types A and B are equal iff after resolving type aliases:

  • They are both interface types and:

    • They refer to the same declaration.
    • Each corresponding pair of type arguments has equal types.
  • They are both type parameters and refer to the same declaration.

We'd need to add some extra verbiage to account for promoted type parameter types (denoted with & elsewhere in our specs, e.g. T&U meaning "type parameter T promoted to U")

  • They are both function types and:

    • They have the same number of type parameters and each corresponding pair of type parameters have equal bounds.
    • They have the same number of mandatory positional parameters with corresponding parameters having equal types.
    • They have the same number of optional positional parameters with corresponding parameters having equal types.
    • For every required named parameter in A, there is a corresponding required named parameter in B with equal type, and vice versa.
    • Ditto for non-required named parameters.
  • They are both record types and:

    • They have the same number of positional fields and each corresponding pair of positional fields have equal types.
    • For every named field in A, there is a corresponding named field in B with equal type, and vice versa.
  • They are both nullable types and the corresponding underlying types are equal.
  • They are both FutureOr<T> for some corresponding types T that are equal.
  • They are both void.
  • They are both dynamic.
  • Otherwise, they are not equal.

Legacy types, other stuff, etc.

I think for legacy types we could just say:

And, as mentioned somewhere above, we need to decide whether to distinguish Null vs. Null? vs. Null*. My gut feeling is that we should consider those all equivalent, in which case we'd need an additional rule:

I'm not aware of any other cases that need to be covered.

I'm guessing there's some subtlety I'm missing here that is covered by trying to define it "syntactically"?

Nothing I can think of. I like your approach!
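For concreteness, the recursive definition quoted above translates almost mechanically into code. This is a hypothetical sketch over invented toy term classes (Interface, Func, etc.); it collapses mandatory and optional positional parameters into one tuple for brevity and models void, dynamic, and type parameters as plain strings — it is not the analyzer's or CFE's representation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interface:       # e.g. List<int>; `decl` uniquely identifies the declaration
    decl: str
    args: tuple = ()

@dataclass(frozen=True)
class Func:
    bounds: tuple      # type-parameter bounds, in order
    positional: tuple  # positional parameter types (mandatory then optional)
    named: tuple       # (name, type, required) triples
    ret: object = "void"

@dataclass(frozen=True)
class Record:
    positional: tuple
    named: tuple       # (name, type) pairs

@dataclass(frozen=True)
class Nullable:
    inner: object

@dataclass(frozen=True)
class FutureOr:
    arg: object

def all_equal(xs, ys):
    return len(xs) == len(ys) and all(equal(x, y) for x, y in zip(xs, ys))

def named_equal(xs, ys):
    # Match named entries by name; declaration order is irrelevant.
    dx = {e[0]: e[1:] for e in xs}
    dy = {e[0]: e[1:] for e in ys}
    return dx.keys() == dy.keys() and all(
        equal(dx[n][0], dy[n][0]) and dx[n][1:] == dy[n][1:] for n in dx)

def equal(a, b):
    if isinstance(a, Interface) and isinstance(b, Interface):
        return a.decl == b.decl and all_equal(a.args, b.args)
    if isinstance(a, Func) and isinstance(b, Func):
        return (all_equal(a.bounds, b.bounds)
                and all_equal(a.positional, b.positional)
                and named_equal(a.named, b.named)
                and equal(a.ret, b.ret))
    if isinstance(a, Record) and isinstance(b, Record):
        return all_equal(a.positional, b.positional) and named_equal(a.named, b.named)
    if isinstance(a, Nullable) and isinstance(b, Nullable):
        return equal(a.inner, b.inner)
    if isinstance(a, FutureOr) and isinstance(b, FutureOr):
        return equal(a.arg, b.arg)
    # void, dynamic, and type parameters, modeled as strings here.
    return isinstance(a, str) and a == b

# ({int a, String b}) and ({String b, int a}) are the same record type.
r1 = Record((), (("a", Interface("int")), ("b", Interface("String"))))
r2 = Record((), (("b", Interface("String")), ("a", Interface("int"))))
print(equal(r1, r2))  # True
```

Note that, like the quoted definition (and unlike a NORM-based check), this keeps FutureOr<Object> and Object distinct, which matches the analyzer/CFE behavior observed earlier in the thread.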

lrhn commented 1 year ago

I'm all for giving our equivalence rules a look-over and possibly a tweak, but I believe that what we have now is there for a good reason.

We have a term system to represent the types, which needs to be closed under substitution, and which can express and distinguish all the types we need in our static type system (which is more complicated than our runtime type system, because it has to deal with unbound and promoted type variables, and it interacts with type schemas too). Whether that system can generate int?? depends on how we define substitution to work.

Then we can define equality relations on those terms. Each such equivalence relation should have a purpose, so we should be clear which purpose they each serve.

The term system above is likely not going to be able to represent each of our types uniquely. If we use identifiers for bound type variables, it at least needs alpha equivalence, and possibly function type named parameter order equivalence on top. The first equivalence should probably be identifying things that are different representations of the same type structure, and the language semantics should be defined in terms of those equivalence classes, because we never want to distinguish type terms that only differ in arbitrary representation choices. If we want to collapse more things here, that's fine too, as long as we accept that those things will, and must, never be distinguishable in any way.

Then we can add more equivalence relations for other purposes, like we have NORM-canonicalization and mutual-subtyping. We should be precise about each of these equivalence relations, why they exist and where they apply.

Whether Null, Null* and Null? are the same type (and different from Never??) is something we need to decide based on what that precise equality will be used for.

munificent commented 1 year ago

This is a surprisingly long comment thread about a part of the language that feels more fundamental to me than pattern matching. Is there anything actionable here that we want to do for the patterns proposal?

lrhn commented 1 year ago

For pattern matching, we need to decide when two different declarations have "the same type", because those declarations are in different cases with a shared body, so we need to join those types when creating the case body scope.

If we can do that, we're good. The rest is something we can noodle on later.

Each case provides a declared and possibly promoted variable.

We should probably check, first, that the declared types of the variables are at least (and probably just) all mutual subtypes. If not, we reject.

We should then join those promotion chains (maybe remember types of interest too, or maybe not), and use the result as the declared and promoted types of the new variable of the case body.

I don't know whether we normalize before joining promotion chains, but I think we don't.

I do think our runtime type system normalizes types, or at least that the static type exposed at runtime, when captured by inference, is normalized. (If I print the reification of the static type of a FutureOr<Object> variable, it prints Object.)

import 'dart:async';

// `staticType` is not an SDK member; a helper extension like this one
// reifies the static (promoted) type of the receiver:
extension StaticType<T> on T {
  Type get staticType => T;
}

void test2(bool unknown) {
  FutureOr<Object>? o;
  print("0: ${o.staticType}");  // 0: Object?
  if (unknown) {
    if (o is! FutureOr<Object>) {
      o = "a" as FutureOr<Object>;
    }
    print("1: ${o.staticType}"); // 1: Object
  } else {
    if (o is! Object) {
      o = "a" as Object;
    }
    print("2: ${o.staticType}"); // 2: Object
  }
  print("3: ${o.staticType}"); // 3: Object?
}

The result suggests that FutureOr<Object> is NORM-alized to Object at some point at runtime, but FutureOr<Object> and Object are not considered the same during compile-time promotion chain merging.

Still, it's a consistent result, so if we do that for case-variables, it's still fine.

The problem will be if the declared types cannot be canonicalized, because they are different types that are just mutual subtypes. We may have to use NNBD_TOP_MERGE for those. I believe it is guaranteed to always give some result.

eernstg commented 1 year ago

We already have a notion of 'same type'. It's used, e.g., for canonicalization of constants. I'd like to repeat my proposal that we use this concept in this setting as well.

It is defined here in the nnbd feature spec. Here is a definition of the same thing which is expressed without references to constants or legacy types:

We define what it means for two types to be the same as follows:
Let $T_1$ and $T_2$ be types.
Let $U_j$ be the transitive alias expansion
(\ref{typedef}) of $T_j$, for $j \in 1 .. 2$,
and let $S_j$ be \NormalizedTypeOf{$U_j$}, for $j \in 1 .. 2$
(\ref{typeNormalization}).
We then say that $T_1$ and $T_2$ are the \Index{same type}
if{}f $S_1$ and $S_2$ are structurally identical.

The notion of being 'structurally identical' has different names. @leafpetersen might want to call it 'syntactically identical', and I've proposed to use a notion of 'canonical syntax', but the purpose and meaning of this concept is exactly the same: It is a check that the two types are both the same atomic type (so C and prefix.C may be the same atomic type because they both resolve to the same declaration), or they have the same structure (e.g., they are both a generic class that takes two type arguments, or they are both a function type of the form _ Function({required _ name})), and the corresponding subterms are recursively structurally identical.

The function \NormalizedTypeOf is written as NORM in several feature specs.

Note that normalization preserves the distinction between dynamic and void and other top types, which is significant because the affordances offered with a variable of type dynamic and void are different (and that's true for List<dynamic> vs. List<void> as well, etc). I believe it's useful that those types are not the same even though they are subtypes of each other, and I'd prefer that developers must write an explicit type here and there until consistency has been achieved (that is, there will not be an NNBD_TOP_MERGE that silently jumps in and fixes things).

munificent commented 1 year ago

I'd like to repeat my proposal that we use this concept in this setting as well.

That sounds fine with me. Are other folks OK with that? If so, I'll just update the patterns proposal to literally link to that definition in the NNBD spec.

lrhn commented 1 year ago

SGTM.