QuartzLibrary / schema_analysis

Universal Schema Analysis
Apache License 2.0
24 stars 1 forks source link

Support for preserving field order in the generated schema #2

Open icsaszar opened 3 years ago

icsaszar commented 3 years ago

This crate is already super useful as is, but I think it would be even better if the generated schema had the fields in the same order as in the file.

For example if the file contains:

{
"Id": 1,
"Username": "test",
"BetaUser": true,
"AccessedAccount": false
}

the generated schema (chose typescript because it's the most concise) currently looks like:

export interface Root {
    AccessedAccount: boolean;
    BetaUser: boolean;
    Id: number;
    Username: string;
}

with the fields in sorted order. With preserve order, it would look like this:

export interface Root {
    Id: number;
    Username: string;
    BetaUser: boolean;
    AccessedAccount: boolean;
}

serde_json has a feature called preserve_order and when it is enabled it will use indexmap::IndexMap instead of a BTreeMap: https://github.com/serde-rs/json/blob/df1fb717badea8bda80f7e104d80265da0686166/src/map.rs#L15

serde_yaml has this behavior by default by using linked_hash_map: https://github.com/dtolnay/serde-yaml/blob/4684ca06ad5092ff337ffc7805d5388c86ed4cf3/src/mapping.rs#L13

QuartzLibrary commented 3 years ago

Hi icsaszar,

Glad you are enjoying the crate!

Order preservation is indeed something I have considered, it did not make it in mostly because it is not very high priority for me and I focused on getting the crate up to shape with the maps provided by std.

A order-preserving solution is absolutely a useful thing to have, but sorting is also handy for when there are a lot of fields.

I'll have to figure out which one should be the default and how to include them. Ideally it would not be a cfg flag, as that would make it harder to have the website support both at the same time and force the user to decide beforehand (possibly leading to having to re-run an expensive analysis).

So, the solutions I see are:

I'm leaning toward the last option, as it both preserves the most information and keeps things separate. After the fact schema manipulation is cheap anyway.

Other things to consider before implementing it:

If anyone has any ideas/comments or wants to express a preference on the behaviour of the crate feel free to jump in!

icsaszar commented 3 years ago

I agree there's a lot of aspects to this problem.

What if you defined a new type OrderPreservingSchema that wraps Schema and implements all the same traits as InferredSchema (like Coalesce).

The problem is that Schema needs to be configurable too to know how it should order the fields. Two solutions come to my mind:

  1. add an OrderedStruct member that would be backed by an order preserving map
    • simple to implement and adds a lot of flexibility (maybe some format has the concept or ordered vs unordered maps), but problematic for the consumer of the values because they have to deal with both Struct and OrderedStruct cases
      pub enum Schema {
      // ...
      Struct {
      fields: BTreeMap<String, Field>,
      },
      OrderedStruct {
      fields: IndexMap<String, Field>,
      },
      }
  2. implement a map type that can be configured at runtime to preserve the order of fields and exposes a common interface for both cases.
    • tedious to implement because there needs to a wrapper for the two maps that exposes a common interface, but the most ergonomic for the consumer because then can use the common interface of the wrapper type for the common processing logic and pattern match to get the specifics if the desire:
      
      enum Map<K, V> {
      Preserving(IndexMap<K, V>)
      Ordered(BTreeMap<K, V>)
      }

pub enum Schema { // ... Struct { fields: Map<String, Field>, }, }



Regarding the edge cases, first of all I don't really think preserving order really makes sense for data where the fields can be in any order (like if you used `HashMap` to serialize the data). 

For duplicated fields I would honestly keep all instances (with the goal that the schema should reflect the structure as closely as possible), which brings its own problems. 

For optional (missing) fields ideally there would be some way to configure the deserialization to: 
- return an error because not all files as "consistent" (all fields are present and in the same order)
- return the schema with the optional field (but again only in the case when it's in the same place i.e. the fields are in the same order)

I should also note that these assumptions are made with mostly json in mind. There will probably files in other formats (like yaml) that wouldn't fit the assumptions of this model.
QuartzLibrary commented 3 years ago
  1. add an OrderedStruct member that would be backed by an order preserving map [...] (maybe some format has the concept or ordered vs unordered maps)

We wouldn't be able to distinguish between ordered/unordered formats a-priori, because Serde doesn't distinguish them. Serde simply give you a list of keys that may or may not be ordered, so the selection would have to be strictly at the Context level and user-provided.

More generally, I don't like the idea of adding variants because it forces you to deal with them everywhere whether you are using one or the other. The second option sidesteps this problem, but it would still require a ordering/sorting aware user to remember the option they used and correctly pattern match on it.


For duplicated fields I would honestly keep all instances (with the goal that the schema should reflect the structure as closely as possible), which brings its own problems.

I understand why this might be acceptable or even preferred in most formats because duplicate keys are usually forbidden anyway, but unfortunately supporting xml pretty much requires otherwise. See below for more info.

For optional (missing) fields ideally there would be some way to configure the deserialization to:

  • return an error because not all files as "consistent" (all fields are present and in the same order)
  • return the schema with the optional field (but again only in the case when it's in the same place i.e. the fields are in the same order)

Failing if the fields are not in order seems excessive, as mismatched ordering is pretty common. I guess this would work if there was an explicit 'strict ordering' opt-in somehow, but then it definitely could not be the default behaviour. Maybe tagging out of place fields would be enough? Custom struct aggregators seem like a better option to me though.


The most important format to keep in mind for this is xml, because unlike most other common formats it behaves oddly. In particular each tag is a struct where inner tags are key-value pairs and text content has a special $value key.

This is relevant because:

In both cases we would lose the 'memory usage scales with schema size not data size' feature (because the schema size itself would scale with the data size).


If it doesn't cause a dramatic slow down I think I would go for the single case that retains the most information (ordered), and add sorting as an extra step on top (the maps are pretty easy to sort after all).

I'm not convinced anything more than basic 'best effort' ordering is needed though. After all, keeping track of ordering across values would add non-insignificant complexity (mostly edge cases), and anything more than that could be implemented with a custom aggregator inferring an order from the list of fields it gets (that list does no de-duplication so all info is preserved) and then using that in the sorting function after the fact (which would allow arbitrary comparisons). This would effectively encapsulate the problem in a separate aggregator making it easier to test and have multiple (possibly esoteric) sorting options.


Thank you for sharing your thoughts! I'll keep all this in mind when this bumps up in priority (right now I'm busy using the crate rather than extending it :D )