Leverage Type Introspection (via edgedb-rust) with Deserialized ESDL AST (via Python EDB)

dmgolembiowski commented 3 years ago

Abstract

We propose to use ESDL abstract syntax trees as the basis for code generation in the Edgemorph framework.

Motivation

To Edgemorph, EdgeDB's compiler is the root of all magic. By digesting an SDL module into AST tokens, we obtain a common tongue shared by every supported programming language. To this end, it is reasonable to use a strongly typed programming language like Rust or TypeScript to build boilerplate code structures from a user's schema definitions.

Warning: Do not stand in front of the EdgeDB bullet train.

Since EdgeDB's capabilities continue to evolve rapidly, and since its SDL continues to mature in ways that enrich each user's experience, it becomes imperative for Edgemorph to jump out of the way and trek behind, following the smokestacks. We do this by capturing abstract syntax structures during the edm make process and disk-caching them for interpretation during edm make install.

We believe this approach offers the greatest amount of backward and forward compatibility between successive EdgeDB releases, because each tag (i.e. alpha-3, alpha-4, ...) will correspond to its own variant of AST deserialization requirements. Moreover, whenever EdgeDB announces a new release, e.g. version alpha-N, the AST changes introduced by alpha-N will only need to be developed on a fork of the latest Edgemorph edition (the one corresponding to EdgeDB version alpha-(N-1)).

To be clear, this is a high-effort, high-maintenance approach, but the tradeoff is guaranteed backwards compatibility with EdgeDB.

Type Specifications

The purpose of this RFC's Type Specifications section is to identify generic templates that will be coded in Rust. For example, the serialized abstract syntax below must have each of its fields meaningfully converted into a Rust type at compile time. (Note: The following list of types is not complete, but it does cover the most common AST token kinds.)

Example of a serialized module's abstract syntax tree

<TreeNode id=139713944389664, name='ModuleDeclaration', children=edb.common.checked.CheckedList[edb.common.markup.elements.lang.TreeNodeChild]([<TreeNodeChild id=None, label='name', node=<TreeNode id=139713904379744, name='ObjectRef', children=edb.common.checked.CheckedList[edb.common.markup.elements.lang.TreeNodeChild]([<TreeNodeChild id=None, label='name', node=<String str='etest' at 0x7f119fe79880> at 0x7f119ed86580>]) at 0x7f119ed86100> at 0x7f119f0cb400>, <TreeNodeChild id=None, label='declarations', node=<List id=139713904382336, items=edb.common.checked.CheckedList[edb.common.markup.elements.base.Markup]([<TreeNode id=139713904380176, name='CreateObjectType', children=edb.common.checked.CheckedList[edb.common.markup.elements.lang.TreeNodeChild]([<TreeNodeChild id=None, label='name', node=<TreeNode id=139713904380224, name='ObjectRef', children= ...

`TreeNode`

id: <u64> (the ids in the serialized example, e.g. 139713944389664, exceed the i32 range)

name: N_T such that N_T ∈ T′ and T′ satisfies the size requirements for each of the following identifiers:

{ 'BinOp',
  'CreateAlias', 
  'CreateConcreteLink', 
  'CreateConcreteProperty',
  'CreateFunction',
  'CreateIndex',
  'CreateLink', 
  'CreateObjectType', 
  'CreateScalarType',
  'ForQuery',
  'FuncParam',
  'FunctionCall', 
  'FunctionCode',
  'InsertQuery', 
  'IntegerConstant',
  'ModuleAliasDecl', 
  'ModuleDeclaration',
  'ObjectRef', 
  'Path', 
  'Ptr',
  'Schema', 
  'SelectQuery', 
  'Set',
  'SetAnnotation',
  'SetField', 
  'ShapeElement', 
  'ShapeOperation',
  'StringConstant', 
  'TypeCast', 
  'TypeName' }
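One way to give this name set a fixed-size Rust representation is a fieldless enum with a `FromStr` implementation. The sketch below is only an illustration: the enum name `NodeKind` is hypothetical, and only a handful of the identifiers above are spelled out. The remaining names would follow the same pattern.

```rust
use std::str::FromStr;

/// Hypothetical enum covering a subset of the node names listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeKind {
    BinOp,
    CreateFunction,
    CreateObjectType,
    ModuleDeclaration,
    ObjectRef,
    Schema,
}

impl FromStr for NodeKind {
    type Err = String;

    /// Maps the `name='...'` value of a serialized TreeNode to a variant.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "BinOp" => Ok(NodeKind::BinOp),
            "CreateFunction" => Ok(NodeKind::CreateFunction),
            "CreateObjectType" => Ok(NodeKind::CreateObjectType),
            "ModuleDeclaration" => Ok(NodeKind::ModuleDeclaration),
            "ObjectRef" => Ok(NodeKind::ObjectRef),
            "Schema" => Ok(NodeKind::Schema),
            // Names outside the known set are reported instead of guessed.
            other => Err(format!("unknown node name: {}", other)),
        }
    }
}
```

Because the enum has no payload, every variant has the same known size, and an unknown name surfaces as an error at deserialization time rather than later during codegen.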

children: CheckedList<TreeNodeChild, Markup>

TreeNodeChild
id: Option<u64>
label: String s, such that s ∈ L = { "name", "target", "maintype", "declarations" }
node: enum <String; TreeNode; List>, with the following corollaries:
- node::String → <String str = '%s'>;
- node::TreeNode → &'a Sized<RefCell<Weak<TreeNode<'a>>>>. 'a is the lifetime parameter of the TreeNode being referenced. Sized<T> denotes a type whose size is known at compile time (in Rust, Sized is a marker trait rather than a wrapper type). RefCell<T> is a mutable memory location with dynamically checked borrow rules¹. Weak<T> is a pointer that holds a non-owning reference to the managed allocation². TreeNode is an EdgeDB language markup base object subtype.
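The TreeNode and TreeNodeChild shapes above could be sketched in Rust roughly as follows. This is a simplified illustration, not the proposed final layout: it uses owned `Box`/`Vec` children instead of the `RefCell<Weak<...>>` reference scheme, stores ids as `u64` because the ids in the example dump (e.g. 139713944389664) do not fit in an `i32`, and the names `Node`, `child`, and `example_object_ref` are hypothetical.

```rust
/// The three kinds of value a TreeNodeChild can carry.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Str(String),
    Tree(Box<TreeNode>),
    List(Vec<Node>),
}

#[derive(Debug, Clone, PartialEq)]
pub struct TreeNodeChild {
    pub id: Option<u64>,
    pub label: String,
    pub node: Node,
}

#[derive(Debug, Clone, PartialEq)]
pub struct TreeNode {
    pub id: u64,
    pub name: String,
    pub children: Vec<TreeNodeChild>,
}

impl TreeNode {
    /// Returns the node of the first child carrying `label`, if any.
    pub fn child(&self, label: &str) -> Option<&Node> {
        self.children
            .iter()
            .find(|c| c.label == label)
            .map(|c| &c.node)
    }
}

/// Rebuilds the `ObjectRef` node for module `etest` from the example dump.
pub fn example_object_ref() -> TreeNode {
    TreeNode {
        id: 139713904379744,
        name: "ObjectRef".to_owned(),
        children: vec![TreeNodeChild {
            id: None,
            label: "name".to_owned(),
            node: Node::Str("etest".to_owned()),
        }],
    }
}
```

Owned children keep the sketch short; a variant with shared or parent-pointing references would need `Rc` and `Weak` instead.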

`Schema`

declarations: Vec<ModuleDeclaration>

ModuleDeclaration
name: ObjectRef<&str>
declarations: Vec<Declaration>

Declaration
CreateAlias:

CreateObjectType:

name: String,
commands: Vec<CreateConcreteProperty> | Vec<CreateConcreteLink>

CreateFunction:

name: String
params: Vec<FuncParam>
returning: Option<TreeNodeChild>
returning_typemod: Option<TreeNodeChild>
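As a rough illustration of how the Schema, ModuleDeclaration, and Declaration shapes could nest, here is a minimal Rust sketch. Several things are assumptions made for brevity: names and return types are plain `String`s rather than `ObjectRef` or `TreeNodeChild`, the `Command` enum stands in for `CreateConcreteProperty` and `CreateConcreteLink`, and the `User` type, `greet` function, and `object_type_names` helper are hypothetical.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct FuncParam {
    pub name: String,
}

/// Stand-in for the two command kinds an object type can hold.
#[derive(Debug, Clone, PartialEq)]
pub enum Command {
    Property(String),
    Link(String),
}

#[derive(Debug, Clone, PartialEq)]
pub enum Declaration {
    CreateObjectType {
        name: String,
        commands: Vec<Command>,
    },
    CreateFunction {
        name: String,
        params: Vec<FuncParam>,
        returning: Option<String>,
        returning_typemod: Option<String>,
    },
}

#[derive(Debug, Clone, PartialEq)]
pub struct ModuleDeclaration {
    pub name: String,
    pub declarations: Vec<Declaration>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub declarations: Vec<ModuleDeclaration>,
}

impl Schema {
    /// Lists every object type as `module::Type`.
    pub fn object_type_names(&self) -> Vec<String> {
        let mut names = Vec::new();
        for module in &self.declarations {
            for decl in &module.declarations {
                if let Declaration::CreateObjectType { name, .. } = decl {
                    names.push(format!("{}::{}", module.name, name));
                }
            }
        }
        names
    }
}

/// A hypothetical schema with one module, one object type and one function.
pub fn example_schema() -> Schema {
    Schema {
        declarations: vec![ModuleDeclaration {
            name: "etest".to_owned(),
            declarations: vec![
                Declaration::CreateObjectType {
                    name: "User".to_owned(),
                    commands: vec![Command::Property("name".to_owned())],
                },
                Declaration::CreateFunction {
                    name: "greet".to_owned(),
                    params: vec![FuncParam { name: "who".to_owned() }],
                    returning: Some("str".to_owned()),
                    returning_typemod: None,
                },
            ],
        }],
    }
}
```

Using a single `Vec<Command>` lets properties and links be interleaved in declaration order, which two separate vectors would not preserve.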

`BinOp`

left: Expr
op: String
right: Expr
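BinOp is a useful case for the known-size requirement, because an expression can contain another expression. A minimal sketch, assuming both operands are expressions and that `Expr` has only the three hypothetical variants shown, boxes the recursive case so the enum keeps a fixed size:

```rust
/// Hypothetical expression type with just enough variants to nest BinOps.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    IntegerConstant(i64),
    StringConstant(String),
    // Boxed so that `Expr` has a size known at compile time.
    BinOp(Box<BinOp>),
}

#[derive(Debug, Clone, PartialEq)]
pub struct BinOp {
    pub left: Expr,
    pub op: String,
    pub right: Expr,
}

/// Renders an expression back to a parenthesized string.
pub fn render(expr: &Expr) -> String {
    match expr {
        Expr::IntegerConstant(v) => v.to_string(),
        Expr::StringConstant(s) => format!("'{}'", s),
        Expr::BinOp(b) => format!("({} {} {})", render(&b.left), b.op, render(&b.right)),
    }
}

/// Convenience constructor for a boxed binary operation.
pub fn binop(left: Expr, op: &str, right: Expr) -> Expr {
    Expr::BinOp(Box::new(BinOp {
        left,
        op: op.to_owned(),
        right,
    }))
}
```

Without the `Box`, the compiler rejects the type as infinitely sized, which is the same constraint the deserializer has to respect for every recursive node kind.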

Implementation Considerations

It can be difficult to write a parser for any complex serialized AST where all types must have known sizes at compile time. While a "TT (token tree) muncher" seems like a viable option, it is impractical at this scale: the sheer volume of terms and tokens to match (or discard) makes it difficult to maintain. A more suitable approach would be to write the deserializer against a modular formal grammar. My preference leans toward PEG and Pest.
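To give a feel for the input such a deserializer has to handle, here is a small standard-library-only scan that pulls the `(id, name)` pair out of every `<TreeNode id=..., name='...'` header in the serialized text. It is deliberately not a PEG grammar and not a Pest example; it is a hand-rolled stand-in, the function name `tree_node_headers` is hypothetical, and it ignores children entirely.

```rust
/// Collects `(id, name)` for every `<TreeNode id=N, name='X'` header in `repr`.
pub fn tree_node_headers(repr: &str) -> Vec<(u64, String)> {
    const OPEN: &str = "<TreeNode id=";
    const NAME: &str = ", name='";

    let mut found = Vec::new();
    let mut rest = repr;
    while let Some(start) = rest.find(OPEN) {
        // Always advance past the opener so the loop terminates.
        rest = &rest[start + OPEN.len()..];

        // The id is the run of ASCII digits right after `id=`.
        let digits_end = rest
            .find(|c: char| !c.is_ascii_digit())
            .unwrap_or(rest.len());
        let id = match rest[..digits_end].parse::<u64>() {
            Ok(id) => id,
            Err(_) => continue,
        };

        // The name must follow immediately as `, name='...'`.
        if let Some(tail) = rest[digits_end..].strip_prefix(NAME) {
            if let Some(end) = tail.find('\'') {
                found.push((id, tail[..end].to_string()));
                rest = &tail[end..];
            }
        }
    }
    found
}
```

Note that `<TreeNodeChild id=None, ...>` entries are skipped, since they do not match the `<TreeNode id=` opener. A grammar-based deserializer would replace this scan with rules for TreeNode, TreeNodeChild, String, and List, which is where the modularity argument above comes in.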

Regardless of the approach taken for the intermediate deserialization step, Edgemorph will either need to create the innermost leaf nodes and run to_owned() once inside their owner's ::new(...) method, or Edgemorph could adapt the builder methods in datastructures.rs to stitch distinct nodes together within their owners. Lastly, Edgemorph could operate upon the AST and match against a giant enum-like structure to allocate each of the codegen Rust types.
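Two of the options above can be combined in a small sketch: leaf values arrive as borrowed slices and are copied with `to_owned()` exactly once inside the owner's `new(...)`, while a `match` over an enum decides which codegen type to allocate. All names here (`StructDef`, `AstNode`, `allocate`) are hypothetical, the enum has three variants rather than the full set, and none of this reflects the actual contents of datastructures.rs.

```rust
/// Hypothetical codegen output: one Rust struct definition per schema type.
#[derive(Debug, Clone, PartialEq)]
pub struct StructDef {
    pub name: String,
    pub fields: Vec<String>,
}

impl StructDef {
    /// Borrowed leaf values are copied into owned Strings once, here.
    pub fn new(name: &str, fields: &[&str]) -> Self {
        StructDef {
            name: name.to_owned(),
            fields: fields.iter().map(|f| (*f).to_owned()).collect(),
        }
    }
}

/// Borrowed view of an AST node, as a deserializer might hand it over.
pub enum AstNode<'a> {
    CreateObjectType {
        name: &'a str,
        properties: Vec<&'a str>,
    },
    CreateScalarType {
        name: &'a str,
    },
    Other(&'a str),
}

/// Matches on the node kind and allocates the corresponding codegen type.
pub fn allocate(node: &AstNode) -> Option<StructDef> {
    match node {
        AstNode::CreateObjectType { name, properties } => {
            Some(StructDef::new(name, properties))
        }
        AstNode::CreateScalarType { name } => Some(StructDef::new(name, &[])),
        // Node kinds without a codegen counterpart are skipped.
        AstNode::Other(_) => None,
    }
}
```

Keeping the AST view borrowed and the codegen types owned means the cached AST text can be dropped as soon as allocation finishes.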

References

dmgolembiowski commented 3 years ago

@tailhook, do you have any suggestions on the Implementation Considerations section above?

I'd rather know ahead of time if I'm entering the danger zone.

dmgolembiowski commented 3 years ago

ToDo: Revise pre-RFC to include multi-module validation using EdgeQL introspection. In particular, referencing outside user-defined SDL modules within another SDL module is not checked by the ql_parser.

tailhook commented 3 years ago

@tailhook, do you have any suggestions on this topic:

I'm not sure I understand matters here. But have you considered using EdgeDB introspection for codegen? I.e., you apply the schema to an EdgeDB instance and then execute queries. Here is how you can get all the properties of the User object:

SELECT schema::ObjectType {
    properties: {name}
}
FILTER .name = 'default::User';

More docs here: https://www.edgedb.com/docs/edgeql/introspection/objects
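On the Rust side, a codegen step would need this query once per object type. The helper below is a minimal sketch that only builds the query text for a given type name; it does not talk to a database, the function name `properties_query` is hypothetical, and the name check is an assumption added so that the value can be embedded in the string literal safely.

```rust
/// Builds the introspection query above for a fully qualified type name
/// such as `default::User`.
pub fn properties_query(type_name: &str) -> Result<String, String> {
    // Accept only ASCII alphanumerics, `_` and `:` so the name cannot
    // break out of the quoted string literal.
    let valid = !type_name.is_empty()
        && type_name
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == ':');
    if !valid {
        return Err(format!("invalid type name: {:?}", type_name));
    }

    Ok(format!(
        "SELECT schema::ObjectType {{\n    properties: {{name}}\n}}\nFILTER .name = '{}';",
        type_name
    ))
}
```

The resulting string would then be handed to whichever EdgeDB client the codegen tool uses.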

We are going to rewrite the EdgeQL parser in Rust at some point, so tying your implementation to a specific Python AST now might introduce more churn than needed.

dmgolembiowski commented 3 years ago

We are going to rewrite the EdgeQL parser in Rust at some point, so tying your implementation to a specific Python AST now might introduce more churn than needed.

Excellent; Yury also shared some useful points with me in another conversation related to this pre-RFC. Introspection seems to be the way to go, so I'll table this issue until the Rust EdgeQL parser is stable. Thank you!

dmgolembiowski commented 3 years ago

Have you considered applying the schema into the live EdgeDB and then using introspection for getting type info for making codegen? Or do you use edgedb submodule for different purposes?

No, I hadn't considered that. Thanks for the suggestion; it may come in handy once I begin the edm port to Rust. I chose this approach because back at alpha-1 or alpha-2, when I was just getting started, I realized it wasn't convenient to check syntax for validity unless you were already on the CLI. In those earlier versions, the error reporting was less refined than it is now, so debugging was a chore. But for compilation I'm using the edgedb.edb submodule to deserialize a user's module files into AST so that only valid schemas can be "installed". For the time being, it's smelly to hack the non-public submodule into edgemorph, but it's only temporary.

Oh, I've seen this comment after commenting about the same in another issue. You may disregard that comment. Originally posted by @tailhook in https://github.com/dmgolembiowski/edgemorph/pull/8#issuecomment-717112607

Type introspection should work in concert with markup serialization and EDB QL parsing.

Reopening this pre-RFC with the knowledge that QL parsing is already provided by the EdgeDB submodule via:

from edb.common.markup import _serialize as serialize
from edb.edgeql import parser as qlparser

(p.s. Yury, thank you for the Gist. I'll definitely incorporate it into this RFC.)
