brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/

tidy up zjson #1233

Closed mccanne closed 4 years ago

mccanne commented 4 years ago

There have been recent discussions about changing zjson to make it more ergonomic.

The zng data format has richly typed records as well as a deterministic column order, whereas json objects do not have a deterministic column order or typing beyond strings, float64s, booleans, objects, arrays, and null. Thus, if you want to transmit zng data in json and preserve its type structure, you need to encode all this in a layer above json.

A different question is whether we should have the option to not impose the complexity of zng on a client, i.e., just throw the type information out. This is trivial to implement: currently the search endpoint returns zjson for both "zjson" and "ndjson" requests. We can change the ndjson output to use the ndjson writer instead of the zjson writer. Then a user could get vanilla ndjson without the richness of zng. So, we can basically already support this with a trivial change, as sketched below.
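
For illustration, requesting one format or the other from a client might look like the following sketch; the endpoint URL, port, and request body shape here are assumptions for the example, not the actual zqd API:

// Hypothetical sketch: ask the search endpoint for plain ndjson instead of zjson.
// The URL, port, and request body are illustrative assumptions only.
async function search(query: string, format: "zjson" | "ndjson"): Promise<string> {
  const resp = await fetch("http://localhost:9867/search", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ proc: query, format }),
  });
  return resp.text(); // newline-delimited json either way
}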

Regarding the zjson design, the current approach roughly follows the underlying zng model of embedding type information in the stream: type definitions declare arbitrarily complex and nested data types, and values reference that type information with small-integer type identifiers. In this approach, each ndjson object represents a typed value, and if its type definition hasn't been previously seen, the definition is included in the json object the first time the type is encountered.

For example, consider this zng:

#0:record[a:string,b:record[x:int32,y:ip]]
0:[hello;[2;127.0.0.1;]]
0:[world;[4;192.168.1.1;]]
0:[goodnight;[4;192.168.1.2;]]
0:[gracie;[4;192.168.1.3;]]

This encodes in ndjson (with zq, but not with zqd presently):

{"a":"hello","b":{"x":2,"y":"127.0.0.1"}}
{"a":"world","b":{"x":4,"y":"192.168.1.1"}}
{"a":"goodnight","b":{"x":4,"y":"192.168.1.2"}}
{"a": "gracie","b":{"x":4,"y":"192.168.1.3"}}

i.e., you no longer know that x was an integer (it is now a javascript number) or that y is an IP address. Certain apps or users may want the simplicity of this, so let's just support it.

What we're trying to do, though, as a strategic direction for the company is to differentiate from the json world (and log analytics systems) and make our app's ux aware of rich data types and structure. i.e., zeek experts get to see the zeek data presented in the column order they are used to and see familiar structures like IP address, network ports, sets of IP addresses, vectors of values, and so forth (and I would argue these attributes of structured data presentation for semi-structured data will have impact far beyond zeek). So, zjson to the rescue:

{
  "id": 24,
  "type": [
    { "name": "a", "type": "string" },
    { "name": "b", "type": [ { "name": "x", "type": "int32" }, { "name": "y", "type": "ip" } ] }
  ],
  "values": [ "hello", [ "2", "127.0.0.1" ] ]
}
{ "id": 24, "values": [ "world", [ "4", "192.168.1.1" ] ] }
{ "id": 24, "values": [ "goodnight", [ "4", "192.168.1.2" ] ] }
{ "id": 24, "values": [ "gracie", [ "4", "192.168.1.3" ] ] }

This is admittedly more complex than plain json, but that's because json and javascript don't support rich typing, so we have to bite the bullet one way or another (unless we want to abandon our aspirations and retreat to simple javascript types in our app).
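
To make the bookkeeping concrete, here is a minimal typescript sketch (not zealot's actual implementation) of a client-side reader that caches each type definition by its id and applies it to subsequent values:

// Minimal sketch of a zjson reader: cache each "type" by "id" and
// reuse it for later lines that carry only "id" and "values".
type Column = { name: string; type: string | Column[] };
type ZjsonLine = { id: number; type?: Column[]; values: unknown[] };

const types = new Map<number, Column[]>();

function decode(line: ZjsonLine): { columns: Column[]; values: unknown[] } {
  if (line.type) types.set(line.id, line.type); // first appearance defines the type
  const columns = types.get(line.id);
  if (!columns) throw new Error(`unknown type id ${line.id}`);
  return { columns, values: line.values };
}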

One proposal is that zjson should change so that the types are zipped into the values, making each line of ndjson fully self-describing. This could look something like this:

 [  { "name": "a",  "type": "string", "value":  "hello"}, 
      {name: "b", "type":"record",  value: [ {"name": "x", "type": "int32", "value": "2" }, 
                                                                    { "name": "y", "type": "ip", "value": "127.0.0.1" ]  ] 
 [  { "name": "a",  "type": "string", "value":  "world"}, 
      {name: "b", "type":"record",  value: [ {"name": "x", "type": "int32", "value": "4" }, 
                                                                    { "name": "y", "type": "ip", "value": "192.168.1.1" ]  ] 
 [  { "name": "a",  "type": "string", "value":  "goodnight"}, 
      {name: "b", "type":"record",  value: [ {"name": "x", "type": "int32", "value": "4" }, 
                                                                    { "name": "y", "type": "ip", "value": "1192.168.1.2" ]  ] 
 [  { "name": "a",  "type": "string", "value":  "graciie"}, 
      {name: "b", "type":"record",  value: [ {"name": "x", "type": "int32", "value": "4" }, 
                                                                    { "name": "y", "type": "ip", "value": "1192.168.1.3" ]  ] 

Perhaps there is a better approach than this that conveys the data as "tidy data", suitable for existing tooling/apps that expect richly structured tidy data in json format? I'm not sure how to head in this direction.

Another observation is that a trivial javascript function in the client could zip zjson into the latter form, as in the sketch below. Whether the backend supports the zipped format or the client does the zipping is pretty trivial in terms of implementation.
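
Such a zipping function might look like this sketch, reusing the Column shape from the reader sketch above (names are illustrative, not part of any shipped API):

// Sketch: zip a zjson column list together with its parallel values array,
// producing the self-describing form proposed above.
type Zipped = { name: string; type: string; value: unknown };

function zip(columns: Column[], values: unknown[]): Zipped[] {
  return columns.map((col, i) => {
    if (Array.isArray(col.type)) {
      // nested record: recurse into the corresponding nested values
      return { name: col.name, type: "record", value: zip(col.type, values[i] as unknown[]) };
    }
    return { name: col.name, type: col.type, value: values[i] };
  });
}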

So whether the client zips or the backend zips, the data is informationally equivalent and we're really talking about the ergonomics and ease of use of the API. Perhaps we do both on the backend? The two formats are so close together that it would be pretty trivial to support both.

Another point is that if we are strategically attempting to make our client understand richly structured data (at the intersection of semi-structured json and schema-rigid tables), then I would think it would be important that the client have abstractions that understand the record type as a first-class entity and the grouping of records by type, etc. As the client gets more complicated and more aware of the richly structured data types, it seems like this would be helpful. And if so, and if we switched to the zipped format only, then there would need to be logic to detect the type signature of each zipped record so that records could be type-collated in this fashion, as sketched below.
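
Detecting that signature could be as simple as recursively serializing the names and types of a zipped record, as in this sketch (reusing the Zipped shape from above):

// Sketch: derive a canonical type signature from a zipped record so
// that records can be collated by type even without explicit type ids.
function signature(rec: Zipped[]): string {
  return rec
    .map((f) =>
      f.type === "record"
        ? `${f.name}:record[${signature(f.value as Zipped[])}]`
        : `${f.name}:${f.type}`
    )
    .join(",");
}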

Whether or not we change this approach, I noticed there is one glaring problem in the current zjson design/implementation: type definitions for zng records are nicely decomposed into recursive definitions (as above) but type definitions for zng arrays, zng sets, and zng unions must be parsed from the zng type string. This was just a poor implementation decision and reflects our lack of clarity a year ago when this code was written. Also, it appears that type aliases are simply dropped in zjson (aliases are important as people add type extensions with semantics defined outside the scope of zng and want client support for such things, as in the "logical types" of parquet and avro).

So the problem is with this:

#0:record[a:string,b:array[record[x:int32,y:int32]]]
0:[hello;[[2;3;]]]
0:[world;[[4;5;][6;7;]]]

which is currently encoded in zjson as follows:

{
  "id": 25,
  "type": [
    {
      "name": "a",
      "type": "string"
    },
    {
      "name": "b",
      "type": "array[record[x:int32,y:int32]]"
    }
  ],
  "values": [ "hello", [  [ "2", "3" ]  ]  ]
}
{
  "id": 25,
  "values": [  "world", [ [ "4", "5" ],  [ "6", "7" ] ] ]
}

Here, the type string array[record[x:int32,y:int32]] would have to be parsed and decoded by the client. Yuck! We got this design right for the record type above but not for array, set, and union. A fix could be to represent these container types with a JSON object instead of a string, as follows:

{
  "id": 25,
  "type": { "type": "record", "fields": [
    { "name": "a", "type": "string" },
    { "name": "b",
      "type": { "type": "array", "of": { "type": "record", "fields": [
        { "name": "x", "type": "int32" }, { "name": "y", "type": "int32" } ] } } }
  ] },
  ...
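
With types represented as objects, the client never parses a type string; it can walk the structure directly. A sketch of what such a walk might look like, under the assumption that array/set/union use an "of" field as proposed above:

// Sketch: recursively walk the proposed object-based type representation.
// A type is either a primitive name (a string) or a structured object.
type ZType =
  | string
  | { type: "record"; fields: { name: string; type: ZType }[] }
  | { type: "array" | "set"; of: ZType }
  | { type: "union"; of: ZType[] };

function describe(t: ZType): string {
  if (typeof t === "string") return t;
  switch (t.type) {
    case "record":
      return `record[${t.fields.map((f) => `${f.name}:${describe(f.type)}`).join(",")}]`;
    case "array":
    case "set":
      return `${t.type}[${describe(t.of)}]`;
    case "union":
      return `union[${t.of.map(describe).join(",")}]`;
  }
}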

The question also came up as to whether we should have a single request-response API endpoint without streaming ndjson, though this seems like an orthogonal point. The advantages of streaming are:

That said, for ergonomics again, it could make sense to have a simple request/response search endpoint (in addition to retaining the streaming search endpoint) that wraps all the results, warnings, stats, etc. into a single json response, making it easy to do simple queries with existing tooling, as sketched below. This request/response endpoint could orthogonally offer zipped or regular json encoding as discussed above. The same behavior could be implemented on the client side behind a client API (e.g., inside of zealot or inside of a zq python module).
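
As a sketch, the wrapped response might be described by something like the following typescript interface, reusing the ZjsonLine shape from the earlier sketch (all field names here are illustrative, not a committed API):

// Sketch of a wrapped, non-streaming search response (names illustrative).
interface SearchResponse {
  records: ZjsonLine[]; // all results, with type definitions included inline
  warnings: string[];   // any warnings emitted while running the query
  stats: Record<string, number>; // e.g. bytes read, records matched
}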

mccanne commented 4 years ago

@jameskerr if zealot had a zng.record class then it wouldn't matter whether the data came zipped up or not, and you could just query methods on the record to get at fields and types, do introspection, etc. You could also ask zealot to give you zng.records or vanilla javascript objects following the ndjson format if the client didn't care about the richness of zng. (Of course the brim client cares about the rich data types, but maybe other users of the zealot library wouldn't.)
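
A sketch of what such a class might offer, reusing the Column shape from the earlier sketch (hypothetical, not zealot's actual API):

// Hypothetical sketch of a zng.record abstraction for zealot.
class ZngRecord {
  constructor(private columns: Column[], private values: unknown[]) {}

  // look up a top-level field's value by name
  get(name: string): unknown {
    const i = this.columns.findIndex((c) => c.name === name);
    return i >= 0 ? this.values[i] : undefined;
  }

  // introspect a field's type
  typeOf(name: string): string | Column[] | undefined {
    return this.columns.find((c) => c.name === name)?.type;
  }

  // flatten to a vanilla javascript object, dropping the type information
  toObject(): Record<string, unknown> {
    return Object.fromEntries(this.columns.map((c, i) => [c.name, this.values[i]]));
  }
}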

mccanne commented 4 years ago

I haven't heard any comments or follow-up on this. My proposal is to create new tasks for the following:

jameskerr commented 4 years ago

@mccanne sorry for the radio silence. Thanks for revisiting the zjson format. I think your proposals all make sense. I forgot about the array/set/union encoding issues, so I'm glad those were rediscovered and issues were made.

I think the current zjson model is great for all the reasons you've mentioned above. These new ergonomic changes will make the data even more portable.

As a user, if I want rich types, streaming responses, and no-bloat payloads, I download a library (zealot, python, zapi) and I'm on my way. If I want a quick and dirty fetch to do an ad hoc chart on https://observablehq.com/, codepen, or curl | jq, then I am able to do that too. I dig it.

mccanne commented 4 years ago

@jameskerr Cool! Thanks for closing the loop here.

philrz commented 4 years ago

I've confirmed with @mccanne that this issue has served its purpose because it led to the creation of #1273 and the issues below it. Closing this one.