APrioriInvestments / typed_python

An llvm-based framework for generating and calling into high-performance native code from Python.
Apache License 2.0
197 stars 8 forks source link

There should be exactly one Type* for a given identityHash #439

Open braxtonmckee opened 1 year ago

braxtonmckee commented 1 year ago

Right now we have a problem where it's possible to get two copies of the same Type* that are essentially identical, and which have identical identityHash, but which are not the same object. As a result, we have code in the c++ layer where we check if two types are "identical" by computing the identity hash and checking for equality.

In this issue doc, I propose that we implement changes to guarantee that if two Type instances have the same identity hash, then the Type are equal. This will have a number of benefits, including making it easier to ensure that our compiler cache works correctly, since we can now assume that there is one and only one copy of a given Type floating around in the system.

Implementing this change will require changes to the TP api, largely because the way we have currently implemented Forward types makes it impossible to guarantee that there is only one python object representing a given type. For instance, you can write the following code:

A = Forward("A")
B = Forward("B")

A = A.define(Alternative("A", Empty={}, NotEmpty=dict(b=B)))
B = B.define(Alternative("A", Empty={}, NotEmpty=dict(a=A)))

If we were to execute this exact code from two different entrypoints (say, during deserialization in the compiler cache, and also regular module import), then its impossible for us to guarantee that we will always get the same instance for 'A', because at the moment A is fully defined, we do not yet know the type of B.

To solve this, I propose that we modify the way Forward types work, so that all type definitions involving forwards need to be fully "unwrapped" to actual valid subclasses of Type. In this model, we formally divide the world into Type objects that have an instance of Forward reachable inside of them and those that don't, regardless of whether the Forward type graph is fully defined. Call types that can see Forward "ForwardDefined" types and those that can't "ResolvedTypes". Currently, a ForwardDefied type is 'sort of usable', in the sense that TP attempts to pretend, once all the reachable forwards have definitions, that its a regular type. Instead, we should insist that such types are not usable, and force the user to 'realize' the ForwardDefined type as a copy of the type-graph that has no Forwards defined within it. The process of 'realizing' a Forward is where we can apply our memo and guarantee that we always return the same Type and python 'type' instance for a given identityHash. From an API perspective, we can implement this by adding a '.realize' method to all Type* instances (ForwardDefined) or not that throws an exception if there is an undefined forward.

Once we have this, we can work through our two primary mechanisms for producing recursive types, TypeFunction and Module, and make sure that using them makes it possible to just ignore this issue.

As part of this change, we will also need to define a new form of 'identityHash'. The current 'identityHash' walks the object graph and produces a sha-hash that should dictate how the Type or object in question behaves once compiled - two objects with the same identityHash should produce identical compiled code. This is actually a stronger definition than we need for our type identity, and should be renamed 'compilerHash'. The identityHash we use for types should stop walking the graph as soon as it finds a globally visible named object, such as a module or module dict. Its important that we don't walk into the definitions of modules in particular because we will need the identityHash to work during the module import phase, before we can guarantee that all module members are fully defined. The compilerHash of a particular type may change after a type is created (but before the type is used in the compiler) since the type may refer to functions or other module members that are defined after it. However, our identityHash simply needs to guarantee that two Types with the same identityHash behave the same way at runtime, and so there is no reason to be hashing the internals of a named module object. In order to implement this correctly, we should modify the 'identityHash' C++ infrastructure to take an 'enum' indicating whether we are computing the identityHash or the compilerHash of an object, and ensure that all the data structures contained within the MutuallyRecursiveTypeGroup framework are indexed by this flag - this will let us largely re-use the infrastructure we have already built, with the flag determining how deeply into the graph the CompilerVisibleObjectVisitor is willing to walk.