Open icsaszar opened 3 years ago
Hi icsaszar,
Glad you are enjoying the crate!
Order preservation is indeed something I have considered, it did not make it in mostly because it is not very high priority for me and I focused on getting the crate up to shape with the maps provided by std
.
A order-preserving solution is absolutely a useful thing to have, but sorting is also handy for when there are a lot of fields.
I'll have to figure out which one should be the default and how to include them. Ideally it would not be a cfg flag, as that would make it harder to have the website support both at the same time and force the user to decide beforehand (possibly leading to having to re-run an expensive analysis).
So, the solutions I see are:
I'm leaning toward the last option, as it both preserves the most information and keeps things separate. After the fact schema manipulation is cheap anyway.
Other things to consider before implementing it:
Id
and Username
go if it wasn't in the first example where those two were picked up?
Ideally the crate would be smart enough to place it in the middle of the two, but what if the fields are not consistently ordered? There are a lot of possible edge cases.
The two options here are to either just append to the end new fields, or to insert them at the right-ish location when they first appear.serde_json
uses a BTreeMap
by default makes me think that the performance hit is significant.
Then again, we don't need to create a new map for every object instance, only once for each shape, so the price might be minimal. This will require testing though (especially for extreme cases like formatted text in xml files messing with the schema), and I don't have proper benchmarks setup at the moment. If anyone has any ideas/comments or wants to express a preference on the behaviour of the crate feel free to jump in!
I agree there's a lot of aspects to this problem.
What if you defined a new type OrderPreservingSchema
that wraps Schema
and implements all the same traits as InferredSchema
(like Coalesce
).
The problem is that Schema
needs to be configurable too to know how it should order the fields. Two solutions come to my mind:
OrderedStruct
member that would be backed by an order preserving map
Struct
and OrderedStruct
cases
pub enum Schema {
// ...
Struct {
fields: BTreeMap<String, Field>,
},
OrderedStruct {
fields: IndexMap<String, Field>,
},
}
enum Map<K, V> {
Preserving(IndexMap<K, V>)
Ordered(BTreeMap<K, V>)
}
pub enum Schema { // ... Struct { fields: Map<String, Field>, }, }
Regarding the edge cases, first of all I don't really think preserving order really makes sense for data where the fields can be in any order (like if you used `HashMap` to serialize the data).
For duplicated fields I would honestly keep all instances (with the goal that the schema should reflect the structure as closely as possible), which brings its own problems.
For optional (missing) fields ideally there would be some way to configure the deserialization to:
- return an error because not all files as "consistent" (all fields are present and in the same order)
- return the schema with the optional field (but again only in the case when it's in the same place i.e. the fields are in the same order)
I should also note that these assumptions are made with mostly json in mind. There will probably files in other formats (like yaml) that wouldn't fit the assumptions of this model.
- add an OrderedStruct member that would be backed by an order preserving map [...] (maybe some format has the concept or ordered vs unordered maps)
We wouldn't be able to distinguish between ordered/unordered formats a-priori, because Serde doesn't distinguish them. Serde simply give you a list of keys that may or may not be ordered, so the selection would have to be strictly at the Context
level and user-provided.
More generally, I don't like the idea of adding variants because it forces you to deal with them everywhere whether you are using one or the other. The second option sidesteps this problem, but it would still require a ordering/sorting aware user to remember the option they used and correctly pattern match on it.
For duplicated fields I would honestly keep all instances (with the goal that the schema should reflect the structure as closely as possible), which brings its own problems.
I understand why this might be acceptable or even preferred in most formats because duplicate keys are usually forbidden anyway, but unfortunately supporting xml pretty much requires otherwise. See below for more info.
For optional (missing) fields ideally there would be some way to configure the deserialization to:
- return an error because not all files as "consistent" (all fields are present and in the same order)
- return the schema with the optional field (but again only in the case when it's in the same place i.e. the fields are in the same order)
Failing if the fields are not in order seems excessive, as mismatched ordering is pretty common. I guess this would work if there was an explicit 'strict ordering' opt-in somehow, but then it definitely could not be the default behaviour. Maybe tagging out of place fields would be enough? Custom struct aggregators seem like a better option to me though.
The most important format to keep in mind for this is xml, because unlike most other common formats it behaves oddly. In particular each tag is a struct where inner tags are key-value pairs and text content has a special $value
key.
This is relevant because:
arrays/lists in xml are represented simply as a key appearing more than once (in quick-xml serde anyway). We mostly would want to collapse these into a single one as having 100s of array positions would cause us to have 100s of separate schemas. This means that keeping duplicate fields separate is not really a viable option.
sometimes you get mixed text/formatting tags in xml files.
Right now those are sorted and collapsed (so all <p>
tags are 'merged' and such) which is what we want to happen, as schema analysis on formatted text doesn't make much sense and keeping all of them around and ordered might esacerbate the problem of those sections blowing the schema to huge sizes. This is a real problem as I've already produced 0.5GB schemas with merged tags (not merging them would have meant a schema pretty much the size of the input which was 2.7GB compressed).
In both cases we would lose the 'memory usage scales with schema size not data size' feature (because the schema size itself would scale with the data size).
If it doesn't cause a dramatic slow down I think I would go for the single case that retains the most information (ordered), and add sorting as an extra step on top (the maps are pretty easy to sort after all).
I'm not convinced anything more than basic 'best effort' ordering is needed though. After all, keeping track of ordering across values would add non-insignificant complexity (mostly edge cases), and anything more than that could be implemented with a custom aggregator inferring an order from the list of fields it gets (that list does no de-duplication so all info is preserved) and then using that in the sorting function after the fact (which would allow arbitrary comparisons). This would effectively encapsulate the problem in a separate aggregator making it easier to test and have multiple (possibly esoteric) sorting options.
Thank you for sharing your thoughts! I'll keep all this in mind when this bumps up in priority (right now I'm busy using the crate rather than extending it :D )
This crate is already super useful as is, but I think it would be even better if the generated schema had the fields in the same order as in the file.
For example if the file contains:
the generated schema (chose typescript because it's the most concise) currently looks like:
with the fields in sorted order. With preserve order, it would look like this:
serde_json
has a feature calledpreserve_order
and when it is enabled it will useindexmap::IndexMap
instead of aBTreeMap
: https://github.com/serde-rs/json/blob/df1fb717badea8bda80f7e104d80265da0686166/src/map.rs#L15serde_yaml
has this behavior by default by usinglinked_hash_map
: https://github.com/dtolnay/serde-yaml/blob/4684ca06ad5092ff337ffc7805d5388c86ed4cf3/src/mapping.rs#L13