brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/

zson reader #1679

Closed · mccanne closed 3 years ago

mccanne commented 3 years ago

Now that we have a zson writer, we need the read side.

The reader will be more flexible than what the writer generates, e.g., allowing ergonomic shorthand such as dropping field names when a record has a type decorator, and so forth.
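
To illustrate the idea, here's a sketch of what such shorthand might look like; this is hypothetical syntax, not something the spec has settled on:

```
{x:1,y:2}                    # full form, as the writer emits it
{1,2} ({x:int64,y:int64})    # hypothetical shorthand: field names
                             # recovered from the record's type decorator
```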

mccanne commented 3 years ago

Don't forget to update zeek-compat.md to reflect that zson files can be read and converted to zeek.

henridf commented 3 years ago

As I work on ingest, I was thinking we could consider unifying inferred ndjson types with zson, meaning that for the same input ndjson file, zq -i ndjson and zq -i zson would produce the same zng. (Thus -i ndjson would become redundant, and we'd also avoid the autodetect ambiguity that would otherwise arise.)

I think the only delta we'd have to address is that json numbers are currently turned into zng float64 by the inferred ndjson reader, whereas (based on examples from the spec) zson parses integer literals into zng int64 and float literals into zng float64.
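
Concretely, if I'm reading the spec examples right (where a trailing dot marks a float64 literal):

```
{"n":1}   # json input: the ndjson reader infers n as float64
{n:1}     # the same record as zson: integer literal, parses as int64
{n:1.}    # zson float literal, parses as float64
```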

I do see one possible downside, which is that it might be surprising to end up with different types (int vs float) for the same field coming from json. But anyone who really wants all json numbers to become a single type could maybe use generic ingest to achieve that.
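
For instance, a hypothetical cast applied at read time (a sketch, not a committed interface):

```
zq "put n := float64(n)" input.ndjson
```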

henridf commented 3 years ago

Actually, I suppose that zq -i zson doesn't interpret "." in field names as indicating a sub-record, like zq -i ndjson does. So there's at least that delta...
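
For example, given the same bytes as input (assuming zson allows quoted field names):

```
{"a.b":1}   # ndjson reader: unflattens to the nested record {a:{b:1.}}
{"a.b":1}   # zson reader: one top-level field literally named "a.b"
```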

mccanne commented 3 years ago

Interesting idea!

mccanne commented 3 years ago

@henridf interestingly, I figured the zson parser would be slow since it parses the source input into an AST, then makes a semantic pass over the AST to resolve types (generating another tree data structure), then does a zcode build to produce the final zng record. It turns out zq is quite a bit faster at parsing zson than json. After thinking about it for a second, I realized the Go pattern of representing each JSON object as a hash table could be the culprit here.

Anyway, with a couple of flags on the zson reader (i.e., do a column sort for a stable order, unflatten dotted names, treat all numbers as float64, ...), I think it would be fruitful to take this path for the reshaper work.

p.s. I think the unflatten step can fit in really easily as part of the semantic pass.
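
To make the unflatten idea concrete, here's a minimal sketch in plain Go over generic maps rather than the actual AST and zcode types (the function name and representation are hypothetical, just to show the shape of the pass):

```go
package main

import (
	"fmt"
	"strings"
)

// unflatten rewrites dotted field names into nested records, e.g.
// {"a.b": 1, "a.c": 2} becomes {"a": {"b": 1, "c": 2}}. A real
// implementation would run over the zson AST during the semantic
// pass and would need to handle name conflicts; plain maps are
// used here only to illustrate the transformation.
func unflatten(rec map[string]interface{}) map[string]interface{} {
	out := make(map[string]interface{})
	for name, val := range rec {
		fields := strings.Split(name, ".")
		m := out
		// Walk (or create) the chain of sub-records for all but
		// the last path element.
		for _, f := range fields[:len(fields)-1] {
			child, ok := m[f].(map[string]interface{})
			if !ok {
				child = make(map[string]interface{})
				m[f] = child
			}
			m = child
		}
		m[fields[len(fields)-1]] = val
	}
	return out
}

func main() {
	fmt.Println(unflatten(map[string]interface{}{"a.b": 1, "a.c": 2, "d": 3}))
}
```

Since Go maps are unordered, a sketch like this also shows why the column-sort flag matters: without a stable order, the same input could yield differently ordered (and hence differently typed) zng records.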