benhoyt / goawk

A POSIX-compliant AWK interpreter written in Go, with CSV support
https://benhoyt.com/writings/goawk/
MIT License

Add JSON Lines support #152

Open dloss opened 2 years ago

dloss commented 2 years ago

The JSON Lines text format (aka JSONL or newline-delimited JSON) has one JSON object per line. It's often used for structured log files or as a well-specified alternative to CSV.

Here are some ideas for how the JSON Lines format could be supported in GoAWK. To be honest, I'm not completely sure this is a good idea, but I've found it interesting to think about. This write-up captures some of my thoughts.

I can imagine different levels of sophistication. We could start simple and then in later versions support more complex input data and ways to interact with it.

One JSON array of scalars per line

["Name", "Session", "Score", "Completed"]
["Gilbert", "2013", 24, true]
["Alexa", "2013", 29, true]
["May", "2012B", 14, null]
["Deloise", "2012A", 19, true] 

Suggestions:

Questions:

One JSON object per line, with pairs of keys and scalar values

This is used by the Graylog Extended Log Format (GELF).

{"version":"1.1", "host":"example.org", "short_message": "A log message", "facility":"test", "_foo":"bar"}
{"version":"1.1", "host":"test.example.org", "short_message": "Another msg", "facility":"test", "_foo":"baz"}

Users wanting to parse Logfmt messages (like myself, see #149) should be able to convert their data into this format quite easily.

Suggestions:

Nested data

{"one": 1, "four": [1,2,3,4], "five": {"alpha": ["fo", "fum"], "beta": {"hey": "How's tricks?"}}}
{"one": 1, "four": [4], "five": {"alpha": ["fa", "fim"], "beta": {"hi": "How's tracks?"}}}

Suggestions:

Questions:

benhoyt commented 2 years ago

Thanks! I do intend to do a deep-dive into this, but just a few initial thoughts.

I hadn't considered your first example of "JSON array per line", just because "JSON object per line" is much more common. But that's perfectly valid and reasonable as a strongly typed CSV (well, really more like "slightly typed CSV"). I think JSON true should map to AWK 1 and false to AWK 0. As for JSON null, probably AWK null (what variables are initialized to, but basically acts as "" and 0 depending on context).

What would it do with non-scalar values? In other words, if an array or object was nested inside? Error? Just yield the JSON string? Ignore it? Replace with some placeholder like ""? I suppose for v1 we could say non-scalar values are undefined, and yield "" for now, with the possibility of extending it later.
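As a rough illustration of the mapping described above, here is a minimal sketch in Go. The helper name `fieldsFromArrayLine` is hypothetical and not part of GoAWK, and the uninitialized AWK value is approximated as an empty string. It turns one "JSON array per line" record into field strings: `true` becomes `1`, `false` becomes `0`, and `null` and any nested array or object become `""`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// fieldsFromArrayLine decodes one JSON array line into AWK-style field strings.
// Hypothetical helper for illustration only; it is not GoAWK's API.
func fieldsFromArrayLine(line string) ([]string, error) {
	dec := json.NewDecoder(strings.NewReader(line))
	dec.UseNumber() // keep numbers as their original text, so 24 stays "24"
	var values []any
	if err := dec.Decode(&values); err != nil {
		return nil, err
	}
	fields := make([]string, len(values))
	for i, v := range values {
		switch v := v.(type) {
		case string:
			fields[i] = v
		case json.Number:
			fields[i] = v.String()
		case bool:
			if v {
				fields[i] = "1"
			} else {
				fields[i] = "0"
			}
		default:
			// JSON null, nested arrays, and nested objects all yield ""
			// (the "undefined for v1" behaviour suggested above).
			fields[i] = ""
		}
	}
	return fields, nil
}

func main() {
	fields, err := fieldsFromArrayLine(`["May", "2012B", 14, null]`)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", fields)
}
```

A line that is not a JSON array (for example, an object) makes `Decode` return an error, which a real implementation would have to decide how to report.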

I don't think Unicode causes problems. Everything's just UTF-8 in GoAWK.

And then "JSON object per line" maps very well to the GoAWK-specific @"field" syntax, as you say. Again, there's the problem of nested, non-scalar items. The @"foo.bar" or @"foo.bar[5]" type of syntax is tempting, but it would change the "row storage model" quite a bit -- not sure if that's an issue. Yes, jsonpath and jmespath seem significantly more complicated than we'd want here; just plain JavaScript .key and [index] notation would be enough. Though again, for v1 we could say non-scalar values are undefined, with the possibility of extending it later.

What would $1 and $2 mean in "JSON object per line" mode? (In fact, would it be a different mode than "JSON array per line", or would that be automatic?) With Go's JSON decoder to a map[string]any, it doesn't record key order. Go's encoding/json doesn't export a scanner, so we might have to build our own if we wanted key order. Then again, maybe for v1 we just disallow $n.

Not sure we need to escape double quotes in returned JSON strings if we end up doing that. Just yield the JSON-encoded string. Escaping is only an issue for string literals.

Yeah, the AWK $1 vs JavaScript [0] thing is interesting. I think for the @"foo[0]" notation it should be 0-based, given that it will be a subset of JavaScript notation and that's 0-based. A bit confusing either way.

Thanks for your thoughts on this. More another time!

gedw99 commented 2 years ago

Hey

https://github.com/tomnomnom/gron is related, in that it is a Go tool that makes JSON work with grep.

It looks like a potential base for GoAWK to support JSON?

In the example, fgrep is used. There is a basic Go implementation of grep here: https://github.com/u-root/u-root/blob/v0.10.0/cmds/core/grep/grep.go

fgrep is, as I understand it, deprecated anyway.

janxkoci commented 4 months ago

Hey, just an FYI that Miller (written in Go) also supports JSONL (and JSON), so maybe you can check the code there. The author notes that JSON parsing is generally slower than the other supported formats.