gabledata / recap

Work with your web service, database, and streaming schemas in a single format.
https://recap.build
MIT License

recap Schema Spec Draft (thoughts and comments) #209

Closed cpard closed 1 year ago

cpard commented 1 year ago

Intro

A useful distinction for data schemas is between schemas intended for transporting data between systems and schemas intended for storing and querying it.

The first category includes representations like Avro, Protobuf, Thrift, and JSON.

The second category includes anything relational, as well as serializations like Parquet and ORC.

This distinction is important because the standards in each category tend to focus on different things, as they try to address different use cases and needs.

But being able to transform from one category to the other is very important. Data will be shared through an API in JSON, moved into a Kafka topic in Avro, land in S3 as Avro, be transformed from there into Parquet, flow from there through different relational pipelines, and finally be served as JSON through a cache or something similar to be consumed by an app.

The above high-level data lifecycle is common in every org that is seriously working with data. It takes many different forms depending on the use case, but what remains the same is the need to switch from one data serialization to another.
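
To make the conversion problem concrete, here is a minimal sketch (the field name and the exact type spellings are illustrative, not taken from any one spec) of how a single logical field ends up being declared at each hop of such a pipeline:

    # Illustrative only: one logical field (a required 64-bit integer "id") as it
    # might be declared at each hop. Every hop speaks a different schema dialect,
    # which is the gap a shared intermediate representation is meant to bridge.
    json_schema_fragment = {"id": {"type": "integer"}}     # web service payload
    avro_field           = {"name": "id", "type": "long"}  # Kafka topic / S3 files
    parquet_column       = "id: INT64"                     # columnar storage
    sql_column           = "id BIGINT NOT NULL"            # relational warehouse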

Type System

When it comes to defining a type system that can cover the schema, and potentially the data transformation, from any of the above serializations to another, we need to consider the minimum possible subset of types that allows us to do that.

To do that, I find it useful to distinguish types into the following categories:

This is important!! To figure out the set of types that can serve us well for the purposes of Recap, it is useful to first consider a fixed set of formats that we want to be able to represent with it. For example, Thrift supports a Set type but Avro and Protobuf don't. Does it make sense to include a Set type in recap?

My suggestion here is to start by defining a set of serializations we want to be always compatible with and try to keep this set as minimal as possible while maximizing the surface area of use cases.

Parameterized Types

In SQL it's pretty common to have parameterized types, think of VARCHAR(120) for example, or many of the timestamp types. Do we want the type system to support these and, if yes, how can we deal with type coercion? Trino supports timestamps with up to picosecond resolution and no other engine does, so how would we map a timestamp(picos) to a timestamp(milliseconds)?
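
A small sketch of why that mapping is inherently lossy (the helper below is hypothetical, just to make the arithmetic explicit):

    # Hypothetical helper: narrowing a picosecond timestamp to milliseconds drops
    # everything below the millisecond, which is the coercion question above.
    PICOS_PER_MILLI = 10**9  # 1 ms = 10^9 ps

    def picos_to_millis(ts_picos: int) -> int:
        return ts_picos // PICOS_PER_MILLI  # integer division discards precision

    ts = 1_700_000_000_000_000_000_000_123  # picoseconds, with sub-millisecond detail
    assert picos_to_millis(ts) * PICOS_PER_MILLI != ts  # round-tripping loses information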

Primitive Types

I believe a good idea here is to consider Arrow's core types as a guide for designing the primitive types of recap. The main reason is to be opportunistic about the adoption of Arrow: as more and more systems adopt it, the more Arrow's core types will be used, which will make it easier to maintain recap and add new mappings in the future.

The recap Schema

The use of "schema" in recap is a bit confusing. To me, a schema is a logical grouping of unique entities, but the way it is used in recap, a schema can also be considered what is usually referred to as a record in other languages.

I would recommend trying to be consistent with the more generally adopted semantics around these terms, to avoid confusion and help with the adoption of the spec. So:

Transpilers

I believe the transpilers should be as simple as possible and as much of the transpiling logic as possible should be part of the type system and schema spec.

Allowing the transpiler to decide how to transform a type can easily lead to semantic issues and reduces the value that a standard like recap can offer.

For this reason, I think that some types, like enums, which are supported by pretty much every serialization used today, should be core types of the spec.

In general, transpilers should be as dumb as possible and the spec should encapsulate as many semantics as possible.

criccomini commented 1 year ago

This is a great write-up. Thank you for taking the time to do it!

First, I think it might be helpful for me to clarify some things.

  1. Regarding schemas intended for transportation vs. intended for storage and querying, Recap is meant to be used for transportation. More specifically, it's meant to be used for transportation between systems with different schemas. Even more specifically, it's meant to be used to talk about schemas that are being transported between two systems with different schema formats. Moving Protobuf data to MySQL, MySQL data to Kafka, Kafka data to Snowflake, and so on. I call this out as it's neither the use case of something like Protobuf (which I consider squarely in the service IDL camp) nor is it the use case of something like Parquet or a DB schema.

  2. You raise a good point about where to put complexity: in the type system or in the transpiler. I think the right answer is somewhere in the middle; I don't think I'm as convinced as you are that a complex type system is a good way to go. IMO, CUE goes too far in the "fancy type system" direction.

  3. It's important that we clarify whether Recap should support the union of schemas from all systems, or whether it should gracefully degrade. I am of the position that it should gracefully degrade, but provide an escape hatch (e.g. logical types in Kafka Connect schemas) for systems that wish to have fancier types. So, do I think Recap should support picoseconds? No, I do not. I think Recap can and should be lossy in outlier cases.

Some comments inline:

My suggestion here is to start by defining a set of serializations we want to be always compatible with and try to keep this set as minimal as possible while maximizing the surface area of use cases.

💯 Here's what I propose:

Avro, Protobuf, JSON schema, Parquet, BigQuery, Snowflake, Athena, Redshift, MySQL, PostgreSQL, Trino

Again, I don't think Recap should have 100% coverage of all of these schemas. I think it needs to have enough coverage of each to do:

  1. Schema compatibility checks (backward, forward, full)
  2. Data validation checks (or constraints in CUE terminology)
  3. IDL and DDL generation for common data types (again, this can be lossy for fringe types)
  4. Provide a model that, when implemented in-memory, allows frameworks to execute common data manipulation tasks (e.g. a stream processor could implement the Recap schema spec as a data model, and use those data models in its API, exactly as Kafka Connect schemas are used).
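
As an illustration of (1), here is a minimal sketch of a backward-compatibility check over a very reduced struct model (the data structures and the rule are hypothetical, not Recap's actual model):

    # Hypothetical reduced model: a struct is a dict of field name -> (type, nullable).
    # "Backward compatible" here: every field added in the new schema must be nullable,
    # and no existing field may change type.
    Struct = dict[str, tuple[str, bool]]

    def backward_compatible(old: Struct, new: Struct) -> bool:
        for name, (typ, nullable) in new.items():
            if name in old and old[name][0] != typ:
                return False            # type changed
            if name not in old and not nullable:
                return False            # new required field
        return True

    old = {"id": ("int64", False), "email": ("string", True)}
    new = {"id": ("int64", False), "email": ("string", True), "age": ("int32", True)}
    assert backward_compatible(old, new)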

I believe a good idea here is to consider Arrow core types

Indeed. Arrow's data types look pretty robust. The Recap spec covers most of them. I do like that Arrow models "large" and "small" for things like string, binary, and list. I think this addresses a bit of your VARCHAR(120) comment.

One thing to be conscious of is that these are OLAP/DB-centric. It's worth going through the exercise of mapping this schema set to the IDLs I outlined above (Avro, Proto, and JSON schema). At a glance I think it should map fairly well, though I notice it lacks enum, for example.

how we can deal with type coercion

I think this is left to the transpiler for types that are unsupported. For supported types, the coercion should be defined in the Recap spec.

To me, a schema is a logical grouping of unique entities, but the way it is used in recap, a schema can also be considered what is usually referred to as a record in other languages.

Yea, Schema is a disaster of a term. information_schema completely breaks what schema means to most people. I agree it's confusing.

It does strike me that your definition of "schema" is a bit OLAP/DB centric. I'm not sure IDL folks would agree with your definition.

I'll try and clean it up.

Allowing the transpiler to decide how to transform a type can easily lead to semantic issues and reduces the value that a standard like recap can offer.

I need to think on this. At face-value, I agree. OTOH, a complex type system can, itself, be hard to implement properly (yes, there can be test suites, but when I look at something like CUE lang, I don't think I'd want to implement it in another language. In fact, that's probably why it's only in Go right now 🤷 ).

/cc @gunnarmorling

cpard commented 1 year ago

This is a great write-up. Thank you for taking the time to do it!

First, I think it might be helpful for me to clarify some things.

  1. Regarding schemas intended for transportation vs. intended for storage and querying, Recap is meant to be used for transportation. More specifically, it's meant to be used for transportation between systems with different schemas.

This is great and very helpful for me to understand the scope.

I had in my mind a schema that has to be equally expressive for transportation and storage/querying.

  2. You raise a good point about where to put complexity: in the type system or in the transpiler. I think the right answer is somewhere in the middle; I don't think I'm as convinced as you are that a complex type system is a good way to go. IMO, CUE goes too far in the "fancy type system" direction.

100% agree on this; the tricky and important part is finding the right balance. CUE has a rich type system, but it also ends up offering a non-Turing-complete language for doing everything, and that complicates things a lot. I do think, though, that having the goal of making the transpiler as simple as possible is important from a DX perspective, as it will allow a developer to iterate faster and avoid bugs at the same time.

  3. It's important that we clarify whether Recap should support the union of schemas from all systems, or whether it should gracefully degrade. I am of the position that it should gracefully degrade, but provide an escape hatch (e.g. logical types in Kafka Connect schemas) for systems that wish to have fancier types. So, do I think Recap should support picoseconds? No, I do not. I think Recap can and should be lossy in outlier cases.

100% agree on this too, but the semantics have to be explicit; that's another reason I prefer not to have the transpiler decide some of that stuff. Think of the case where someone builds pipelines for financial data: let's say the SEC requires nanosecond granularity, and by accident someone transforms the data into millisecond granularity.

Can we safeguard the user from introducing these bugs?
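
One possible safeguard, sketched here with hypothetical names: refuse lossy coercions unless the user opts in explicitly, so precision is never dropped by accident.

    # Hypothetical guard: a granularity-narrowing conversion fails unless the
    # caller explicitly accepts the loss (e.g. nanoseconds -> milliseconds).
    UNITS_PER_SECOND = {"second": 1, "millisecond": 10**3, "microsecond": 10**6, "nanosecond": 10**9}

    def coerce_timestamp(value: int, src: str, dst: str, allow_lossy: bool = False) -> int:
        src_f, dst_f = UNITS_PER_SECOND[src], UNITS_PER_SECOND[dst]
        if dst_f < src_f and not allow_lossy:
            raise ValueError(f"lossy coercion {src} -> {dst}; pass allow_lossy=True to accept")
        return value * dst_f // src_f   # narrowing drops sub-unit precision

    coerce_timestamp(1_700_000_000_123_456_789, "nanosecond", "millisecond", allow_lossy=True)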

Some comments inline:

My suggestion here is to start by defining a set of serializations we want to be always compatible with and try to keep this set as minimal as possible while maximizing the surface area of use cases.

💯 Here's what I propose:

Avro, Protobuf, JSON schema, Parquet, BigQuery, Snowflake, Athena, Redshift, MySQL, PostgreSQL, Trino

This looks good! (btw Athena is based on Trino, so supporting one of the two should cover the other). I had to edit my comment to add this, but:

we can probably just substitute BQ, Snowflake, Athena, Redshift, MySQL, PSQL, and Trino with standard (ANSI) SQL. It will simplify the design and it should cover the majority of what these systems have.

Again, I don't think Recap should have 100% coverage of all of these schemas. I think it needs to have enough coverage of each to do:

  1. Schema compatibility checks (backward, forward, full)
  2. Data validation checks (or constraints in CUE terminology)
  3. IDL and DDL generation for common data types (again, this can be lossy for fringe types)
  4. Provide a model that, when implemented in-memory, allows frameworks to execute common data manipulation tasks (e.g. a stream processor could implement the Recap schema spec as a data model, and use those data models in its API, exactly as Kafka Connect schemas are used).

this looks great!

One thing to be conscious of is that these are OLAP/DB-centric. It's worth going through the exercise of mapping this schema set to the IDLs I outlined above (Avro, Proto, and JSON schema). At a glance I think it should map fairly well, though I notice it lacks enum, for example.

Yeah, OLAP/DBs do not have enums, and they are an important part of IDLs, but I think you have defined Union, right? Enums are syntactic sugar over unions (happy to be corrected by PL people on that).

how we can deal with type coercion

I think this is left to the transpiler for types that are unsupported. For supported types, the coercion should be defined in the Recap spec.

That's my main objection, but mainly because I've been burned so many times by data quality issues, and I feel that safety when transforming from one type to another is really important.

To me, a schema is a logical grouping of unique entities, but the way it is used in recap, a schema can also be considered what is usually referred to as a record in other languages.

Yea, Schema is a disaster of a term. information_schema completely breaks what schema means to most people. I agree it's confusing.

It does strike me that your definition of "schema" is a bit OLAP/DB centric. I'm not sure IDL folks would agree with your definition.

it's completely OLAP/DB centric!! That's where I come from :D

Allowing the transpiler to decide how to transform a type can easily lead to semantic issues and reduces the value that a standard like recap can offer.

I need to think on this. At face-value, I agree. OTOH, a complex type system can, itself, be hard to implement properly (yes, there can be test suites, but when I look at something like CUE lang, I don't think I'd want to implement it in another language. In fact, that's probably why it's only in Go right now 🤷 ).

I totally agree that a CUE lang situation should be avoided (although I like CUE). Finding the right balance between expressivity and simplicity is very important in building a successful spec and it will def require iterations. But it's a fun and exciting opportunity!

criccomini commented 1 year ago

100% agree on this too but the semantics have to be explicit, that's another reason I prefer to not have the transpiler decide some of that stuff.

Was thinking more on this last night. Two thoughts:

  1. Define coercions for unsupported types explicitly in the spec (e.g. Trino picoseconds are converted to millis).
  2. Provide a transpiler validation suite so transpilers can be validated against the spec.
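
A sketch of what (2) could look like (the mapping table and converter interface are hypothetical): the spec owns a table of expected conversions, and any transpiler implementation is asserted against it.

    # Hypothetical validation suite: a spec-owned table of expected conversions.
    # `to_avro` stands in for whichever transpiler implementation is under test.
    EXPECTED_AVRO = {"int32": "int", "int64": "long", "float64": "double", "string": "string"}

    def validate_transpiler(to_avro) -> list[str]:
        failures = []
        for recap_type, avro_type in EXPECTED_AVRO.items():
            got = to_avro(recap_type)
            if got != avro_type:
                failures.append(f"{recap_type}: expected {avro_type!r}, got {got!r}")
        return failures

    # A toy transpiler that just reads the table passes the suite.
    assert validate_transpiler(lambda t: EXPECTED_AVRO[t]) == []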

we can probably just substitute BQ, Snowflake, Athena, Redshift, MySQL, PSQL and Trino with standard (ANSI) SQL

💯 Excellent. So then: Avro, Protobuf, JSON schema, Parquet, ANSI SQL, Arrow.

I am putting a Google doc together that lists the types for each schema format, and how they should convert to/from Recap.

criccomini commented 1 year ago

OK, so I have a Google doc that compares Avro, Protobuf, JSON Schema, ANSI SQL, Parquet, Arrow, and CUE. It's a little fuzzy and hand-wavy in places, but it was an informative exercise.

https://docs.google.com/spreadsheets/d/1_gXOf8yjodZGFuNpskKd0KBL7Zv6Qjg_Ed-4vfEerTk/edit?usp=sharing

Some takeaways:

  1. I don't think ANSI SQL is worth targeting. It's too fuzzy with most data types (such as UNSIGNED, INT sizes, CHAR sizes, etc).
  2. I'm coming around to CUE's constraint system--having just float with a constraint for 32-bit floats and a different constraint for 64-bit floats. Given that Recap is expressing only logical types, I think this is the most flexible way to capture type sizes. I think Recap could use this for ints, floats, bytes, strings, date, time, timestamp, etc.
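
For example (a sketch of the idea only, not the spec's syntax), a sized integer is just the logical int plus a range constraint, using the int32 bounds quoted later in this thread:

    # Sketch of "one logical type + constraints": int32 is just int constrained
    # to the 32-bit signed range. Names and bounds here are illustrative.
    INT32_MIN, INT32_MAX = -2_147_483_648, 2_147_483_647

    def is_int32(value: int) -> bool:
        return isinstance(value, int) and INT32_MIN <= value <= INT32_MAX

    assert is_int32(2_147_483_647) and not is_int32(2_147_483_648)
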
criccomini commented 1 year ago

Noodling on (2) a bit more, I think there are two approaches to describing schemas:

  1. Arrow's way
  2. CUE's way

CUE relies heavily on type systems, and can model like .. everything.

Arrow is more pragmatic. Defines common stuff, lets chips fall where they may.

I think I favor the latter. I'm going to try to fiddle with Recap to get it to look like Arrow.

cpard commented 1 year ago

Noodling on (2) a bit more, I think there are two approaches to describing schemas:

  1. Arrow's way
  2. CUE's way

CUE relies heavily on type systems, and can model like .. everything.

Arrow is more pragmatic. Defines common stuff, lets chips fall where they may.

I think I favor the latter. I'm going to try to fiddle with Recap to get it to look like Arrow.

I totally agree. I think Arrow's pragmatic approach together with its adoption makes it the best choice for what you try to achieve.

criccomini commented 1 year ago

w.r.t. Arrow, there's still a question over whether to model things as the Schema.fbs types or the language types (e.g. Python).

For example (Schema.fbs):

type = "int"
bits = 32
signed = false

vs. Arrow's Python API:

type = "int32"

The former falls between CUE and more traditional IDLs. The latter is definitely a standard IDL style.

criccomini commented 1 year ago

Here's what types look like with explicit lengths set (e.g. int32):

null
bool
uint8
int8
uint16
int16
uint32
int32
uint64
int64
float16
float32
float64
time32
    unit: SECOND, MILLISECOND
time64
    unit: MICROSECOND, NANOSECOND
timestamp:
    unit: SECOND, MILLISECOND, MICROSECOND, NANOSECOND
    timezone: string
date32
date64
duration
    unit: SECOND, MILLISECOND, MICROSECOND, NANOSECOND
interval # month_day_nano_interval
binary # variable binary <= 2GB
    length: (optional) int # If set, assume fixed-length
binary64 # variable large binary > 2GB
string
decimal
    width: int
    scale: int
list
    size: (optional) int # If set, a fixed sized list is used
    type: <type>
map
    key_type: <type>
    val_type: <type>
struct
    fields: list[field]
field
    name: string
    type: <type>
    nullable: bool
union
    types: list[<type>]
criccomini commented 1 year ago

Here is the Schema.fbs version of what's above:

null
bool
int
    bits: int
    signed: bool
float
    bits: int
time
    unit: SECOND, MILLISECOND, MICROSECOND, NANOSECOND
timestamp:
    unit: SECOND, MILLISECOND, MICROSECOND, NANOSECOND
    timezone: string
date
    bits: int
duration
    unit: SECOND, MILLISECOND, MICROSECOND, NANOSECOND
interval # month_day_nano_interval
binary
    length: long
    variable: bool
string # Unicode UTF-8 encoded
decimal
    width: int
    scale: int
list
    length: (optional) int # If set, a fixed sized list is used
    type: <type>
map
    key_type: <type>
    val_type: <type>
struct
    fields: list[field]
field
    name: string
    type: <type>
    nullable: bool
union
    types: list[<type>]

It's much more compact. I think with sane defaults (e.g. bits = 32 and signed = true for int), it is also quite ergonomic.

cpard commented 1 year ago

Why not have both? Use

type = "int"
bits = 32
signed = false

as the foundation, and then provide standard types like int32 mapped to the above for convenience: some syntactic sugar to make working with the type system easier for developers who use it to implement concrete services, while also keeping the more expressive version that can be used by library developers who want to extend the type system.
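
A minimal sketch of that layering (class and alias names are hypothetical): the parameterized type is the foundation, and the familiar sized names are sugar that expands to it.

    # Hypothetical layering: Int(bits, signed) is the expressive foundation;
    # int32, uint64, etc. are just predefined instances of it.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Int:
        bits: int = 32
        signed: bool = True

    int32  = Int(bits=32, signed=True)   # sugar for everyday schema authors
    uint64 = Int(bits=64, signed=False)  # library authors can still build Int(...) directly

    assert int32 == Int(32, True)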

cpard commented 1 year ago

The more I think about it, the more I believe that when it comes to the design of a type system like recap's, you need to consider two types of users.

  1. The developer who will be using recap as part of their data infrastructure
  2. The maintainer - library developer who maintains the core recap libraries

These two have slightly different needs. (1) needs an experience with more guardrails; the type system is there to guide them in the right direction, removing boilerplate and ensuring that bugs won't happen.

(2) needs more expressivity and a system that can accommodate future requirements; e.g. say you now want to add support for Thrift, and it has some important features that weren't encountered in the previously supported formats. This person would trade guardrails for expressivity.

Imo both are important for the success of the project and if you can address both needs in an elegant way, it would be awesome.

I think Arrow does something similar, but in their case there are very clear boundaries between (1) and (2). (1) only has to interface with the host language of the spec, so the abstractions are created there, while (2) deals more with the core spec, where the system is more expressive.

criccomini commented 1 year ago

The more I think about it, the more I believe that when it comes to the design of a type system like recap's, you need to consider two types of users.

Hah! I have been noodling this as well over the last couple of days.

I think Arrow does something similar, but in their case there are very clear boundaries between (1) and (2). (1) only has to interface with the host language of the spec, so the abstractions are created there, while (2) deals more with the core spec, where the system is more expressive.

100%. The only wrinkle I've been wrestling with has been that I want to introduce some primitive check constraints to the SDL. This is something Arrow doesn't have to wrestle with; checks are encoded as "bits" and "signed" attributes in Integer, for example. In CUE, those would be >= 0 & < INT_MAX (indeed, that's how CUE encodes int32).

I bounce back and forth about whether we want (2) to just use a CUE-like type system or box them in with Arrow types. The upside of CUE-like (or KCL-like) check constraints is that it's pretty easy to add new types as you add new schemas. The downside is that it feels like it makes it harder for maintainers to implement the actual library (they need to implement some boolean logic parser for check constraints). I might need to prototype it out to prove that point.

For (1), for sure they need to deal with concrete types (float32, uint32, etc).

cpard commented 1 year ago

Who's going to be enforcing the constraints? I think an issue to consider here is performance.

CUE focuses mainly on validating configurations, which allows them to have a rich constraint definition mechanism that doesn't have to worry much about the performance of validation.

Now consider the opposite extreme, a firewall. The constraints you want to apply to packets have to be extremely predictable and low latency.

If you plan to validate schemas only, you don't have to worry about perf, but if the plan is to also validate data, then that also has to be explicitly considered as part of the design of the system.

criccomini commented 1 year ago

I think Recap should be agnostic to when validation occurs. I think there are three levels:

  1. Done at compile time via types (e.g. you're using an int32, so you automatically have a min/max).
  2. Done at runtime. This is what JSON schema, protoc-gen-validate, and JSR 380 do.
  3. Done after the fact. These are data quality checks.

As far as expressiveness goes, I think I only want to cover very basic checks. Basic boolean checks plus maybe len and regex. Very similar to JSONPath filter expressions (and CUE).

So, performance should be in line with the runtime checks that the validators I enumerated in (2) do. I think that's doable.

adrianisk commented 1 year ago

Catching up on all of this, good thread

Arrow is more pragmatic. Defines common stuff, lets chips fall where they may. I think I favor the latter. I'm going to try to fiddle with Recap to get it to look like Arrow.

Also in favor of this; it avoids a lot of added complexity and still covers 90% of what people realistically want to do.

Why not have both? Use

type = "int"
bits = 32
signed = false

as the foundation, and then provide standard types like int32 mapped to the above for convenience: some syntactic sugar to make working with the type system easier for developers who use it to implement concrete services, while also keeping the more expressive version that can be used by library developers who want to extend the type system.

I like this as well - similar to the comment above, I think you should design with the 90% use case in mind, and this lets you do that while still preserving flexibility. Someone looking at a large schema file full of type/bits/signed declarations is going to be overwhelmed at first, whereas a simple int32 is concise and familiar. Well-thought-out syntactic sugar can make a big difference in developer experience and readability; it's part of the reason I prefer C# over Java, even though under the hood they're functionally very similar.

Was thinking more on this last night. Two thoughts:

  1. Define coercions for unsupported types explicitly in the spec (e.g. Trino picoseconds are converted to millis).
  2. Provide a transpiler validation suite so transpilers can be validated against the spec.

This seems pretty reasonable to me - behavior of supported types is part of the base spec, behavior of unsupported types either gets added to the spec along with a validation suite, or is undefined. As part of the spec, transpilers are required to warn the user in some way when they've implemented an unsupported type not defined in the spec (maybe fail by default, and require a special flag so there's no way someone can accidentally use it?).

criccomini commented 1 year ago

Awesome. I have a new spec draft I'll post tomorrow. It switches to YAML, refines validators, and adopts Arrow's data structures.

criccomini commented 1 year ago

Ok, I've got a new draft of the spec up:

https://github.com/recap-cloud/recap-schema-spec/blob/main/SPEC.0.1.0-draft.md

Some notes:

  1. It adopts Arrow's dev-facing data model (not its Schema.fbs data model).
  2. It cuts a few things from the Arrow model (long_string, intervals).
  3. I combined binary and large_binary into just bytes with length and variable attributes. This can capture both variable and fixed-length byte arrays, and it can capture both regular and large byte arrays (see the sketch after this list).
  4. It supports arbitrary-precision decimal, not just decimal128 and decimal256. This means Recap's decimal is a superset of Arrow's.
  5. It merges the concept of Arrow's schema and struct into a single struct.
  6. Fields aren't nested (as they are in Arrow). Fields can have a type of struct to nest (Kafka Connect-style).
  7. Eliminated Arrow's dictionary type.
  8. Per @cpard's request, totally eliminated the schema nomenclature. Everything is a type now.
  9. Replaced time32 and time64 with just time. This means Recap's time is a superset of Arrow's; it can represent longer SECOND and MILLISECOND times than Arrow can. Ditto for date32 and date64.
  10. I updated the validators to be constraints and completely reworked them. See the doc for more.
  11. I added an alias field for type aliases. This is how derived types (logical types) are defined. I chose this name separately from name because name represents structure and field names which can be non-unique (a struct and field might have the same name, or nested structs might contain fields with the same name). alias is globally unique. I chose alias and not aliases because it felt more ergonomic for the developer. Multiple aliases for a single type may be defined by re-defining a new type with a new alias over and over (e.g. type: base_type \n alias: alias1 and type: base_type \n alias: alias2).
  12. I made the comparison constraints default to length for string and list types (a list type with a '>=': 1 means the list must have a length >= 1). I'm not sure if this is a good idea or not, but it does shrink the constraint language quite a bit.
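
As an illustration of note (3), a reduced sketch (a hypothetical class, not the spec's actual model) of how a single bytes type with length and variable attributes can stand in for Arrow's binary, large_binary, and fixed-size binary:

    # Hypothetical reduced model of note (3): one bytes type whose two attributes
    # cover fixed vs. variable length and "regular" vs. "large" size bounds.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Bytes:
        length: int            # max length if variable, exact length otherwise
        variable: bool = True

    regular_binary    = Bytes(length=2**31 - 1)            # Arrow binary-like
    large_binary      = Bytes(length=2**63 - 1)            # Arrow large_binary-like
    fixed_size_binary = Bytes(length=16, variable=False)   # e.g. a UUID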

I think the decision that warrants the most discussion is decision (1), using the dev-facing types. The three options are:

a. Define Arrow's dev-facing types (float32, float64, etc.) as standard types.
b. Define Arrow's Schema.fbs types (int(bits), float(bits)) as standard types and dev-facing types as derived types that define the bits (float32 = float(32)).
c. Define CUE's types (int, float) as the standard types and define Arrow's dev-facing types using constraints (int32 >= -2_147_483_648 and <= 2_147_483_647).

I opted for (a) since it felt the most ergonomic for both developers and library maintainers when I was playing with it. It comes at the cost of flexibility--it prohibits certain ints and floats from being modeled except as byte arrays (i.e. decimal). I still need to experiment some more here.

cpard commented 1 year ago

I opted for (a) since it felt the most ergonomic for both developers and library maintainers when I was playing with it. It comes at the cost of flexibility--it prohibits certain ints and floats from being modeled except as byte arrays (i.e. decimal). I still need to experiment some more here.

Can you elaborate a bit more on what cannot be modeled?

criccomini commented 1 year ago

Sure, so the current type system doesn't have a base int and float. It has only intX and floatX. So numbers that lie outside of the bit precision can't be modeled as pure types. They have to be modeled as a derived type of a byte array with some hints as to how to interpret it. The decimal derived type does this. A concrete example would be an integer that is larger than both uint64 and float64 could capture. Another example would be a number that is within a float32's min/max range, but lies outside the precision of the IEEE float spec.
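
A short sketch of that escape hatch (one possible encoding, using Python's arbitrary-precision int purely for illustration): the value is carried as a byte array plus an interpretation hint, the way decimal-style derived types do.

    # One possible encoding for an integer too large for uint64/float64: a
    # two's-complement big-endian byte array that a derived type knows how to read.
    big = 2**80 + 12345                      # does not fit in any 64-bit type

    width = (big.bit_length() + 8) // 8      # bytes needed, with room for a sign bit
    payload = big.to_bytes(width, "big", signed=True)

    assert int.from_bytes(payload, "big", signed=True) == big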

criccomini commented 1 year ago

I am currently drafting an alternative that uses int and float as base types and intX and floatX are derived types (like CUE). I'll post it shortly so we can compare.

criccomini commented 1 year ago

Ok, here are two different versions of the spec:

This is the more "pure" version that has CUE-like base types and Arrow-dev-like derived types:

https://github.com/recap-cloud/recap-schema-spec/blob/ec7a912e859871021006362ad53d779fa0da5971/SPEC.0.1.0-draft.md

This is a more constrained version that has Arrow-dev-like base types:

https://github.com/recap-cloud/recap-schema-spec/blob/14a0e9442304d8108a5c23b9cd66b39325a1d726/SPEC.0.1.0-draft.md

criccomini commented 1 year ago

And, even MORE CUE-like:

https://github.com/recap-cloud/recap-schema-spec/blob/8171771882feafb0bb1d5edd293324ac271356c2/SPEC.0.1.0-draft.md

I removed the length and variable attributes from bytes, since those can be expressed as constraints. The type system is pretty compact now.

cpard commented 1 year ago

And, even MORE CUE-like:

https://github.com/recap-cloud/recap-schema-spec/blob/8171771882feafb0bb1d5edd293324ac271356c2/SPEC.0.1.0-draft.md

I removed the length and variable attributes from bytes, since those can be expressed as constraints. The type system is pretty compact now.

I really like the shape it's taking! Great job!

criccomini commented 1 year ago

Thanks! I'm working on a Python MVP right now to validate that it's ergonomic for transpilers.

criccomini commented 1 year ago

Ok, so I spent some time in Python today trying to implement the spec.

Here's what the types look like:

https://github.com/recap-cloud/recap-schema-python/blob/main/recap/schema/__init__.py

And here's what an Avro transpiler might look like:

https://github.com/recap-cloud/recap-schema-python/blob/main/recap/schema/avro.py

It's by no means complete, and it's definitely got some bugs and TODOs. Still, it was an informative exercise. It does seem like it'll work.

The main pain-point in the implementation was, unsurprisingly, having to deal with the constraints. I only implemented the <= and >= length constraints, which do appear to work. It's not the easiest thing in the world, especially in Python's type system.

cpard commented 1 year ago

The main pain-point in the implementation was, unsurprisingly, having to deal with the constraints. I only implemented the <= and >= length constraints, which do appear to work. It's not the easiest thing in the world, especially in Python's type system.

This is a schema transpiler only, right? The output is going to be a valid Avro schema? How is this going to be used by the user? I'm wondering if the transpiling phase is just a step of the overall workflow, where after that the recap tooling can also provide integration with the rest of the Avro tooling (e.g. using the schema to generate code in a target language).

Regarding the constraints, yes it's kind of ugly and feels brittle. I was wondering if having more abstraction in place would help here.

Constraints, in the end, are boolean expressions chained with and/or predicates. Maybe constraints at the schema level could be represented as a boolean expression that has to evaluate to true, with a small DSL for these expressions specifically.

I think Databricks and the Delta Table spec have one of the richest constraint systems I've seen. In this case SQL is the constraint DSL. I do think that what Delta/Databricks has is overkill and not the best experience, but it might give some ideas.

criccomini commented 1 year ago

This is a schema transpiler only, right?

Yes. Avro -> Recap -> Avro.

The output is going to be a valid Avro schema?

from_avro's output is a Recap in-memory class hierarchy. to_avro is a valid Avro schema.

using the schema to generate code in a target language

Yep, that's what I'm doing.

I was wondering if having more abstraction in place would help here.

Yea 100%. I'm (re)learning the AST stuff on the fly, so my impl is certainly not the best. I had a fever dream last night and am re-working things a bit to make the typing more pure.

Maybe constraints at the schema level could be represented as a boolean expression that has to evaluate to true, with a small DSL for these expressions specifically.

Yea, this is what I was trying to avoid, because it seems nasty for every library maintainer (me) to have to implement. But now I'm discovering NOT implementing it is even worse. :P Quickly discovering exactly why CUE has the operators it has.

I think Databricks and the Delta Table spec have one of the richest constraint systems I've seen

I think this is actually a sign of a failure. A rich constraint system (or at least, one with a lot of operators) is a sign of a less "pure" type system. Compare that to something like CUE, which is much more compact but incredibly expressive. The trick is making something like a CUE type system approachable to mere mortals like me.

I'm experimenting right now with eliminating the required, in, !in, and unique constraints, and leaning on the type system instead. It makes library implementation a lot easier, I think.

Still tinkering on exactly the right balance between user and library. For example, if I eliminate in, then in becomes an == of a union type (in [1,2,3] is the same as union[int(==1), int(==2), int(==3)])... Not as friendly to the user, but easier on the library maintainer.

criccomini commented 1 year ago

So, one thought is to decouple the in-memory AST from the Recap YAML/TOML/<whatever>. This would allow library maintainers (me) to implement a pretty pure type system without impacting the devs that are defining TOML/YAML files. It would also mean I could save implementing the on-disk format for later, and focus on the transpiling implementation.

I had been thinking about things this way anyway, but not making it explicit.

criccomini commented 1 year ago

... if you squint, this is kinda what Arrow does with their Schema.fbs vs. language APIs. Their Schema.fbs is just a much more constrained type system than I think we want.

criccomini commented 1 year ago

(I'm also willing to admit that I've been locked in a garage talking to myself for 2 months. Maybe we should just go back to explicit types--int32, string64--and be done with it. I just worry we still end up having to implement boolean expressions to deal with constraints, which is something I did hope to keep as part of the spec.)

cpard commented 1 year ago

(I'm also willing to admit that I've been locked in a garage talking to myself for 2 months. Maybe we should just go back to explicit types--int32, string64--and be done with it. I just worry we still end up having to implement boolean expressions to deal with constraints, which is something I did hope to keep as part of the spec.)

hahaha, or maybe I should come and take you out of the garage and go for a beer! That stuff usually accelerates development in the end!

Jokes aside, I do agree that the Databricks approach is a sign of failure and that the right type system can improve the experience A LOT. Although Databricks also has to deal with the SQL reality, which is another thing.

criccomini commented 1 year ago

Okay! I've implemented a pretty pure type system:

https://github.com/recap-cloud/recap-schema-python/blob/main/recap/schema/__init__.py

It supports all of the types described in the spec, and it supports the following constraints:

<=: Data must be less than or equal to a value.
<: Data must be less than a value.
>=: Data must be greater than or equal to a value.
>: Data must be greater than a value.
==: Data must be equal to a value.
!=: Data must not be equal to a value.
!pattern: Data must not match an RE2-compliant regular expression.
pattern: Data must match an RE2-compliant regular expression.
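
As an illustration, a tiny evaluator for constraints of this shape (hypothetical, not the actual implementation): a value satisfies a type if every attached constraint holds.

    # Hypothetical evaluator: constraints are (operator, operand) pairs and a value
    # must satisfy all of them. Mirrors the operator list above.
    import operator, re

    OPS = {"<=": operator.le, "<": operator.lt, ">=": operator.ge, ">": operator.gt,
           "==": operator.eq, "!=": operator.ne}

    def satisfies(value, constraints) -> bool:
        for op, operand in constraints:
            if op == "pattern":
                ok = re.fullmatch(operand, value) is not None
            elif op == "!pattern":
                ok = re.fullmatch(operand, value) is None
            else:
                ok = OPS[op](value, operand)
            if not ok:
                return False
        return True

    assert satisfies(42, [(">=", 0), ("<=", 100)])
    assert not satisfies("abc", [("pattern", r"[0-9]+")])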

It passes some very basic smoke tests for converting to and from Avro.

There are for sure some bugs in the type system, as I haven't implemented any tests on it, but it does seem to work in the REPL for happy-path cases.

(You might notice that this looks very similar to CUE's operators. It's not by accident. The only things missing are a couple of the functions that CUE supports--length and such.)

criccomini commented 1 year ago

Ok, more progress.

I've implemented a fairly full-featured from_avro that parses Avro schemas into Recap's AST. It is still missing a couple features (notably, default support), but it passes a few tests.

I even managed to implement enums using the type system:

        case EnumSchema():
            return schema.Union(
                types=[
                    schema.String64(constraints=[schema.Equal(symbol)])
                    for symbol in avro_schema.symbols
                ]
            )

(If it's an Avro EnumSchema then use a union of constant string types--one for each Avro Enum symbol).

Still a ton to do, but I'm growing more confident that this approach is going to work. 😄

If I can do Avro and something database-ish, I think we can move forward with the spec.

criccomini commented 1 year ago

If y'all want to poke around, the types are here:

https://github.com/recap-cloud/recap-schema-python/blob/main/recap/schema/__init__.py

And the Avro transpilers are here:

https://github.com/recap-cloud/recap-schema-python/blob/main/recap/schema/avro.py

Side note: transpiler or converter? from_avro converts Avro to Recap. to_avro converts Recap to Avro. What's the right terminology for that? Converter? Reader? Transpiler?

cpard commented 1 year ago

Side note: transpiler or converter? from_avro converts Avro to Recap. to_avro converts Recap to Avro. What's the right terminology for that? Converter? Reader? Transpiler?

I would go with converter instead of transpiler. I'll play around with the code hopefully tomorrow. Good job!

criccomini commented 1 year ago

Okay! I've moved the Python implementation over to Recap:

https://github.com/recap-cloud/recap/pull/208

See the notes in the commit for where things stand.

I'm going to move this awesome thread over to https://github.com/recap-cloud/recap and close this repo out.

As for the text-based spec, I'm going to leave that for another day. All I really need for Recap right now is the in-memory model. I also made the Recap base Type extend Pydantic's BaseModel, so we can automatically serde schemas using Pydantic. :)
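
As a closing illustration, a minimal sketch of what "serde via Pydantic" buys (the model and fields below are made up, not Recap's actual classes; Pydantic v1-style methods):

    # Hypothetical sketch: any type that extends BaseModel can be dumped to and
    # loaded from plain dicts (and JSON) with no extra serialization code.
    from pydantic import BaseModel

    class Int(BaseModel):
        bits: int = 32
        signed: bool = True

    as_dict  = Int(bits=64).dict()       # {'bits': 64, 'signed': True}
    restored = Int.parse_obj(as_dict)    # back to an Int instance
    assert restored == Int(bits=64, signed=True)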