Closed mccanne closed 3 years ago
Don't forget to update zeek-compat.md to reflect that zson files can be read and converted to zeek.
As I work on ingest, I was thinking maybe we could consider unifying inferred ndjson types with zson. Meaning that for the same input ndjson file, `zq -i ndjson` and `zq -i zson` result in the same zng. (Thus `-i ndjson` would become redundant, and we'd also avoid the autodetect ambiguity that would arise otherwise.)
I think the only delta that we'd have to address is that json numbers are currently turned into zng float64 by the inferred ndjson reader, whereas (based on examples from the spec) zson parses integer literals into zng int and float literals into zng floats.
I do see one possible downside, which is that it might be surprising to end up with different (int vs float) types for the same field coming from json. But for those who really want all json numbers to turn into one type, you could maybe use generic ingest to achieve that.
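To make the number-typing delta concrete, here is a rough sketch of the two policies side by side. It is not zq code: `inferNumber` and `typeName` are hypothetical helpers, and using Go's `int64` and `float64` to stand in for the zng int and float types is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// inferNumber is a hypothetical helper, not part of zq.
// allFloat=true models the inferred-ndjson behavior described above
// (every JSON number becomes a float64).
// allFloat=false models the zson-style behavior (integer literals become
// an int, everything else a float).
func inferNumber(lit string, allFloat bool) (interface{}, error) {
	n := json.Number(lit)
	// A literal with no fraction or exponent is treated as an integer.
	if !allFloat && !strings.ContainsAny(lit, ".eE") {
		if i, err := n.Int64(); err == nil {
			return i, nil
		}
	}
	f, err := n.Float64()
	return f, err
}

// typeName reports which Go type a value ended up with.
func typeName(v interface{}) string {
	switch v.(type) {
	case int64:
		return "int64"
	case float64:
		return "float64"
	}
	return "unknown"
}

func main() {
	for _, lit := range []string{"1", "1.0", "1e3"} {
		a, _ := inferNumber(lit, true)
		b, _ := inferNumber(lit, false)
		fmt.Printf("%s: all-float=%s int-or-float=%s\n", lit, typeName(a), typeName(b))
	}
}
```

With the second policy, a field that shows up as `1` in one record and `1.5` in another ends up with two different types, which is the surprise mentioned above.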
Actually I suppose that `zq -i zson` doesn't interpret "." in field names as indicating a sub-record, the way `zq -i ndjson` does. So there's at least that delta...
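For reference, the dotted-name behavior amounts to something like the sketch below. This is only an illustration on plain Go maps, not how either reader is implemented; `unflatten` is a hypothetical name, and conflicts (e.g. both `a` and `a.b` present) are not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// unflatten is a hypothetical helper: it turns keys like "id.orig_h"
// into nested records, {"id": {"orig_h": ...}}.
// Conflicting keys (a scalar "a" alongside "a.b") are not handled here;
// whichever is visited last wins.
func unflatten(flat map[string]interface{}) map[string]interface{} {
	out := make(map[string]interface{})
	for name, val := range flat {
		parts := strings.Split(name, ".")
		rec := out
		// Walk (or create) one sub-record per leading path component.
		for _, p := range parts[:len(parts)-1] {
			child, ok := rec[p].(map[string]interface{})
			if !ok {
				child = make(map[string]interface{})
				rec[p] = child
			}
			rec = child
		}
		// The last component is the field name inside the innermost record.
		rec[parts[len(parts)-1]] = val
	}
	return out
}

func main() {
	flat := map[string]interface{}{
		"id.orig_h": "10.0.0.1",
		"id.orig_p": 80,
		"ts":        1.5,
	}
	fmt.Println(unflatten(flat))
}
```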
Interesting idea!
@henridf interestingly, I figured the zson parser would be slow, as it parses the source input into an AST, then does a semantic pass over the AST to resolve types (generating another tree data structure), then does a zcode build to produce the final zng record. Turns out zq is quite a bit faster at parsing zson than json. After thinking for a sec, I realize the Go pattern of using hash tables to represent each JSON object could be the culprit here. Anyway, with a couple of flags to the zson reader (i.e., do a column sort for a stable order, unflatten dotted names, treat all numbers as float64, ...), I think it would be fruitful to take this path for the reshaper work. p.s. I think the unflatten step can fit in really easily as part of the semantic pass.
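As a sketch of what the "column sort for a stable order" flag could mean: Go map iteration order is randomized, so a JSON object decoded into a map has no stable column order unless the keys are sorted first. The `column` type and `sortedColumns` function below are hypothetical and only illustrate the idea, not the zson reader itself.

```go
package main

import (
	"fmt"
	"sort"
)

// column is a hypothetical name/value pair standing in for a record field.
type column struct {
	Name  string
	Value interface{}
}

// sortedColumns returns the fields of obj ordered by name, so the same
// set of fields always yields the same column order.
func sortedColumns(obj map[string]interface{}) []column {
	names := make([]string, 0, len(obj))
	for name := range obj {
		names = append(names, name)
	}
	sort.Strings(names)
	cols := make([]column, 0, len(names))
	for _, name := range names {
		cols = append(cols, column{Name: name, Value: obj[name]})
	}
	return cols
}

func main() {
	obj := map[string]interface{}{"ts": 1.5, "id": "x", "bytes": 42}
	for _, c := range sortedColumns(obj) {
		fmt.Printf("%s=%v\n", c.Name, c.Value)
	}
}
```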
Now that we have a zson writer, we need the read side.
The reader will be more flexible than what the writer generates, e.g., allowing ergonomic shorthand such as dropping field names when you have a type decorator on a record, and so forth.