beacon-biosignals / Legolas.jl

Tables.jl-friendly mechanisms for constructing, reading, writing, and validating Arrow tables against extensible, versioned, user-specified schemas.

support pluggable/arbitrary (de)serialization formats/targets (CSV, JSON, YAML, etc.) #34

Open jrevels opened 2 years ago

jrevels commented 2 years ago

From the FAQ:

Why does Legolas.jl support Arrow as a (de)serialization target, but not, say, JSON?

Technically, Legolas.jl's core Row/Schema functionality is totally agnostic to (de)serialization and could be useful for anybody who wants to generate new Tables.AbstractRow types.

Otherwise, with regards to (de)serialization-specific functionality, Beacon has put effort into ensuring Legolas.jl works well with Arrow.jl "by default" simply because we're heavy users of the Arrow format. There's nothing stopping users from composing the package with JSON3.jl or other packages.

Legolas rows are valid Tables.jl AbstractRows, so they can already be fed into any package that (de)serializes Tables.jl tables.

The only "Legolas-specific" bit one might want to be captured is the schema metadata (e.g. the fully qualified schema string), in order to enable consumers to load tables into the appropriate schema. Arrow has a convenient custom table-level metadata mechanism we use for this currently; note that regardless of what we support as a result of this issue, we could still specialize on Arrow and continue to use its custom table-level metadata instead of some worse generic fallback.

You can imagine defining a little hook in Legolas' API that enables authors to define how to (de)serialize schema metadata to any given file format.
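
A minimal sketch of what such a hook could look like, assuming a format-tag dispatch design (none of these names are actual Legolas API, and the metadata key is just the one Legolas currently writes for Arrow):

using Arrow

# Hypothetical hook: dispatch on a format tag so authors can define how schema
# metadata travels alongside the table content for each target format.
abstract type SerializationTarget end
struct ArrowTarget <: SerializationTarget end

# Fallback: a format with no native metadata mechanism must opt in explicitly.
write_with_schema(io, target::SerializationTarget, table, schema::String) =
    error("no schema metadata strategy defined for $(typeof(target))")

# Arrow already has table-level custom metadata, so its method is nearly free:
write_with_schema(io, ::ArrowTarget, table, schema::String) =
    Arrow.write(io, table; metadata = ["legolas_schema_qualified" => schema])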

However, it'd be nice to not ever really need to build up a library of "supported formats" and instead have a generic approach. There are three ways I can imagine doing this that are fully agnostic, and one additional way that is fully agnostic to the format but still specialized on the storage system:

  1. Simply write the Legolas metadata as a newline-terminated magic string/header before the table content, regardless of format. This is a little bit annoying for Legolas-unaware consumers of Legolas-written data (e.g. an arbitrary JSON reader might break), but at least it'd be pretty discoverable (see the sketch after this list).

  2. Always encode the Legolas metadata as the first row in the table; e.g. (a=[1,2,3], b=[5,6,7]) would be written out like (a=["schema@1",1,2,3], b=[nothing,5,6,7]). This solves the portability problem of the previous approach, but would mean that we couldn't guarantee homogeneous column eltypes for the whole table for Legolas-unaware consumers that don't know to "skip" the first row.

  3. Just add a whole column to the table that contains the schema string duplicated for each row. This solves the previous two problems, but is a bit wasteful; maybe not too bad, though, given that the bloat is constant w.r.t. row size and easily compressible.

  4. If your target storage system supports a notion of object metadata separate from object content (e.g. S3), you can just always store the Legolas metadata there. IMO this approach is the most convenient one if you only care about supporting specific storage systems.
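
For concreteness, here's a minimal sketch of option 1 over CSV (write_with_magic/read_with_magic are hypothetical helpers, not Legolas API):

using CSV

# One newline-terminated magic line carrying the schema string, then the payload.
const MAGIC = "#legolas: "

function write_with_magic(io::IO, table, schema::String)
    println(io, MAGIC, schema)
    CSV.write(io, table)
end

function read_with_magic(io::IO)
    header = readline(io)
    startswith(header, MAGIC) || error("missing Legolas magic header")
    schema = header[(length(MAGIC) + 1):end]
    return schema, CSV.File(io)
end

Conveniently, many CSV readers can be told to skip such a line via their comment option, which softens the portability concern for this particular format.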

(ref also https://github.com/beacon-biosignals/Legolas.jl/issues/1, though that's kind of orthogonal)

ericphanson commented 2 years ago

One way to interpret/implement Option 3 is to say all @row-generated rows always have two columns, version and schema. That could be kind of nice in that one doesn't need to be careful to use Legolas.write to make sure the metadata is there (since it's just a column like any other), and it means a row is aware of its own schema, so it could make sense to validate a row alone without reference to an external schema. Likewise, Legolas.write wouldn't need a schema as a second argument; it could check that each row has schema and version columns that all agree with each other, and then validate the resulting schema against the table.

jrevels commented 2 years ago

FWIW Beacon's internal systems basically implement option 3 already to enable storing different row extensions in the same table; for that we use 3 columns: schema_qualified, schema_name, and schema_version.

schema_qualified obviously already has schema_name and schema_version baked into it; we just happen to break the latter two out as optimizations.

If we were to add this as part of the specification, I think I'd only want to include the qualified string, since it contains all the necessary information (including the parent information, which schema_name/schema_version leave out).
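
A rough sketch of that approach as a generic Tables.jl transformation (the helper is illustrative, not Beacon's actual implementation):

using Tables

# Duplicate the schema information across three extra columns; the result is
# itself a valid Tables.jl table, and the repeated values compress well.
function with_schema_columns(table, qualified::String, name::String, version::Int)
    cols = Tables.columntable(table)
    n = length(first(cols))  # row count (assumes at least one column)
    return merge(cols, (schema_qualified = fill(qualified, n),
                        schema_name = fill(name, n),
                        schema_version = fill(version, n)))
end

# e.g. with_schema_columns((a = [1, 2],), "foo.bar@1", "foo.bar", 1)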

ericphanson commented 2 years ago

I added TOML serialization support to an internal Beacon package. I just had one top-level key, schema, containing the fully qualified schema name, and then an array of tables called row, with one entry per row. So it looks something like:

schema = "myschema@1"

[[row]]
col1 = "hi"
col2 = 5
[[row]]
col1 = "bye"
col2 = 6

Seems to work well enough as a simple plain-text format, assuming you're OK with the very limited type support. Having top-level keys is nice for supporting the metadata.
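
A sketch of reading/writing that layout via the TOML stdlib (helper names are illustrative):

using TOML, Tables

function write_schema_toml(io::IO, table, schema::String)
    rows = [Dict(String(name) => Tables.getcolumn(row, name)
                 for name in Tables.columnnames(row))
            for row in Tables.rows(table)]
    TOML.print(io, Dict("schema" => schema, "row" => rows))
end

function read_schema_toml(io::IO)
    parsed = TOML.parse(read(io, String))
    return parsed["schema"], parsed["row"]  # rows come back as a Vector of Dicts
end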

I could imagine something similar working for JSON as well, and I guess YAML. Not sure about CSV; there I think the first-row approach makes more sense, since skipping the first row is something many CSV readers already know how to do (I haven't seen that option in JSON or TOML readers, on the other hand).

jrevels commented 2 years ago

Some thoughts while working on #54

ararslan commented 2 years ago

I haven't actually thought this through for more than about a minute, but would it be possible to upstream the foundation of Legolas, i.e. versioned schemas for Tables.jl-compliant rows, into Tables.jl?

There is also now a function metadata (see https://github.com/JuliaData/DataAPI.jl/pull/48) that will provide a generic means of storing and retrieving table- and column-level metadata, and will apparently replace Arrow's getmetadata function. That should make it easier to generically work with schema-related information.
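
For example, with DataFrames.jl as one implementer of that interface (the key name here is just illustrative):

using DataFrames

df = DataFrame(a = [1, 2, 3])

# attach the schema string as table-level metadata:
metadata!(df, "legolas_schema_qualified", "foo.bar@1"; style = :note)

# any DataAPI-aware consumer can then retrieve it generically:
metadata(df, "legolas_schema_qualified")  # == "foo.bar@1"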

jrevels commented 2 years ago

Ha, funny you say that - I kinda see Legolas as "incubating" at Beacon for now, but my fuzzy hope is that it eventually can be donated to JuliaData under a more generic/discoverable name (TableSchemas.jl or something)

EDIT: I kinda like DataRecords.jl but that's almost too generic. Maybe RecordLens.jl or RecordOptics.jl (given the lens/optics-like nature of Legolas-generated records)?

The reason not to do that immediately, IMO, is that there are still a few important open issues (like this one) that might (or might not) entail breaking changes, and I don't see much reason to go through the churn of moving/promoting it in the wider community before we figure some of those out and battle-test our solutions at Beacon.

I hadn't really considered moving this stuff into Tables.jl itself, but would be very happy if it evolves to the state where folks would find that worthwhile - I think Julia's package ecosystem would benefit from a bit more centralization lol. But my default assumption is that folks would prefer a separate package, at least to start. Maybe this is a good opportunity for the middle ground of a subpackage (e.g. Arrow.jl vs. ArrowTypes.jl).

jrevels commented 1 year ago

xref https://github.com/JuliaLang/julia/pull/47695, which could be useful here as well by obviating the need for special separate LegolasJSON / LegolasArrow packages.
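
Hypothetically, Legolas' Project.toml could then declare e.g. a JSON3 extension via that mechanism (the LegolasJSON3Ext module name is made up):

[weakdeps]
JSON3 = "0f8b85d8-7281-11e9-16c2-39a750bddbf1"

[extensions]
LegolasJSON3Ext = "JSON3"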


Another thought, brought on by some internal Beacon stuff: often, a given (de)serialization process is informed both by the target serialization format (e.g. Arrow, JSON) and by relevant application-specific semantics (e.g. "sure, we expect you to serialize this as a string to JSON, but when you communicate w/ service A, that string should follow this particular structure; when you communicate w/ service B, it should follow that structure"). Right now, the solution for this is for users to apply such pre/post-(de)serialization transformations "manually" (i.e. just pre/post-process your input/output however you want), which I think is usually the right way to go.

However, there might be some circumstances where this simple approach is problematic. For example, you might need to convert a given schema version's required field to an intermediate type that has the necessary overloaded serialization behaviors w.r.t. your target format (e.g. your own custom toarrow definition), but that intermediate field type might not be allowed by the schema version (the schema version's author might be completely unaware of your use case). In such a case, you can't actually construct the desired Legolas.AbstractRecord in composition with the desired transformation, so you end up either needing to define your own schema version (boilerplate-y/undesirable) or hand-rolling an alternative to the built-in Legolas machinery for propagating metadata.

This implies that Legolas should expose some functional pre/post-(de)serialization hooks that are orthogonal to choice of (de)serialization format.
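
One possible shape for those hooks, sketched with made-up names:

# Identity by default, overloadable per record type and/or target format.
pretransform(record, format) = record   # applied just before serialization
posttransform(record, format) = record  # applied just after deserialization

# A Legolas.write-like entry point would then thread them through, e.g.:
#   serialize(io, format, (pretransform(r, format) for r in records))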

Note that if we supported such hooks + pluggable target formats, we'd definitely be in a regime where some diagrams describing (de)serialization flow would be useful.


This also somewhat relates to another idea: different Legolas-aware systems may implement/support different subsets of semantics (e.g. schema version registration, extension, etc.), which may cause these different systems to favor different strategies for (de)serializing Legolas metadata. Regardless, we probably should just standardize a uniform "lowest common denominator" strategy for each supported target format (as we've done for Arrow), but we do have another possible option of more loosely defining a mere target interface instead ("a Legolas-aware system must define a strategy for emitting/retrieving Legolas schema metadata from supported serialization formats"). Probably not very useful, but just mentioning it.