brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License
1.39k stars 64 forks source link

Read non-adjacent record fields in Zeek TSV #1217

Closed philrz closed 4 years ago

philrz commented 4 years ago

A community user was attempting to read a Zeek TSV log that had been generated by a Corelight Sensor. Sample contents:

#separator \x09
#set_separator  ,
#empty_field    (empty)
#unset_field    -
#path   conn
#open   2020-08-24-00-25-01
#fields ts  uid id.orig_h   id.orig_p   id.resp_h   id.resp_p   proto   service duration    orig_bytes  resp_bytes  conn_state  local_orig  local_resp  missed_bytes    history orig_pkts   orig_ip_bytes   resp_pkts   resp_ip_bytes   tunnel_parents  orig_cc resp_cc id.orig_h_name.src  id.orig_h_name.vals id.resp_h_name.src  id.resp_h_name.vals
#types  time    string  addr    port    addr    port    enum    string  interval    count   count   string  bool    bool    count   string  count   count   count   count   set[string] string  string  string  set[string] string  set[string]
1598243094.015046   CWjxkd3jpmxuvN21uj  10.124.2.117    61927   10.70.70.70 8080    tcp -   0.002716    0   77  SF  F   F   0   FdfR    3   120 2   157 -   -   -   -   -   SSL_SNI oneclient.sfx.ms,bats.video.yahoo.com,ctldl.windowsupdate.com,tapestry.tapad.com,www.gstatic.com,www.google.com:443,c.clicktale.net,eb2.3lift.com:443,13-237-209-96.expertcity.com:443,pr-bh.ybp.yahoo.com:443,clientservices.googleapis.com:443,js-sec.indexww.com:443,collect.tealiumiq.com,www.pacast.com,oneclient.sfx.ms:443,clientservices.googleapis.com,bats.video.yahoo.com:443,www.youtube.com

As of zq commit 16d510a, this triggers the following error:

$ zq -t -i zeek sample.tsv 
sample.tsv: line 9: duplicate field id

The root cause involves zq's internal reconstruction of the Zeek record data type into the hierarchical format as it existed when the data originated inside Zeek. One of the assumptions built into the current functionality is that all sub-fields of a record are expected to be adjacent, such as is traditionally seen with id.orig_h, d.orig_p, id.resp_h, and id.resp_p. The zq reader is therefore not expecting to see the fields starting rightward from id.orig_h_name.src separated by columns representing other fields.

In an internal discussion, the Brim development team established it would be feasible to enhance the reader to be accepting of non-adjacent record fields. However, there's not a way to easily accomplish this while still preserving the column order with "split" record fields in the original data. In other words, if this Zeek TSV were read via the proposed enhanced zq, turned internally into ZNG, and then written it back out again as Zeek TSV, the initial/final Zeek TSV representations will differ. The likely implementation will have the "separated" record fields made adjacent to the record fields seen "leftmost" in the column order. Therefore the above sample when written back out from zq as Zeek TSV may look like:

#separator \x09
#set_separator  ,
#empty_field    (empty)
#unset_field    -
#path   conn
#open   2020-08-24-00-25-01
#fields ts  uid id.orig_h   id.orig_p   id.resp_h   id.resp_p   id.orig_h_name.src  id.orig_h_name.vals id.resp_h_name.src  id.resp_h_name.vals proto   service duration    orig_bytes  resp_bytes  conn_state  local_orig  local_resp  missed_bytes    history orig_pkts   orig_ip_bytes   resp_pkts   resp_ip_bytes   tunnel_parents  orig_cc resp_cc
#types  time    string  addr    port    addr    port    string  set[string] string  set[string] enum    string  interval    count   count   string  bool    bool    count   string  count   count   count   count   set[string] string  string
1598243094.015046   CWjxkd3jpmxuvN21uj  10.124.2.117    61927   10.70.70.70 8080    -   -   SSL_SNI oneclient.sfx.ms,bats.video.yahoo.com,ctldl.windowsupdate.com,tapestry.tapad.com,www.gstatic.com,www.google.com:443,c.clicktale.net,eb2.3lift.com:443,13-237-209-96.expertcity.com:443,pr-bh.ybp.yahoo.com:443,clientservices.googleapis.com:443,js-sec.indexww.com:443,collect.tealiumiq.com,www.pacast.com,oneclient.sfx.ms:443,clientservices.googleapis.com,bats.video.yahoo.com:443,www.youtube.com    tcp -   0.002716    0   77  SF  F   F   0   FdfR    3   120 2   157 -   -   -

Incidentally, the way I was able to create that output is via:

# cat sample.tsv | /usr/local/zeek/bin/zeek-cut -C ts uid id.orig_h id.orig_p id.resp_h id.resp_p id.orig_h_name.src id.orig_h_name.vals id.resp_h_name.src id.resp_h_name.vals proto service duration orig_bytes resp_bytes conn_state local_orig local_resp missed_bytes history orig_pkts orig_ip_bytes resp_pkts resp_ip_bytes tunnel_parents orig_cc resp_cc > sample-fixed.tsv

And that representation is happily accepted by zq, with the re-assembly of the hierarchical record enabling shorthand like:

$ zq -t 'cut id' sample-fixed.tsv 
#0:record[id:record[orig_h:ip,orig_p:port,resp_h:ip,resp_p:port,orig_h_name:record[src:bstring,vals:set[bstring]],resp_h_name:record[src:bstring,vals:set[bstring]]]]
0:[[10.124.2.117;61927;10.70.70.70;8080;[-;-;][SSL_SNI;[www.pacast.com;c.clicktale.net;www.gstatic.com;www.youtube.com;oneclient.sfx.ms;eb2.3lift.com:443;tapestry.tapad.com;www.google.com:443;bats.video.yahoo.com;oneclient.sfx.ms:443;collect.tealiumiq.com;js-sec.indexww.com:443;ctldl.windowsupdate.com;pr-bh.ybp.yahoo.com:443;bats.video.yahoo.com:443;clientservices.googleapis.com;13-237-209-96.expertcity.com:443;clientservices.googleapis.com:443;]]]]
philrz commented 4 years ago

Verified in zq commit b2c8356.

Using the same example that failed before, we see the sub-fields that make up the id record now all appear together anchored alongside the leftmost appearance of an id.* field.

$ zq -t -i zeek sample.tsv
#zenum=string
#0:record[_path:string,ts:time,uid:bstring,id:record[orig_h:ip,orig_p:port,resp_h:ip,resp_p:port,orig_h_name:record[src:bstring,vals:set[bstring]],resp_h_name:record[src:bstring,vals:set[bstring]]],proto:zenum,service:bstring,duration:duration,orig_bytes:uint64,resp_bytes:uint64,conn_state:bstring,local_orig:bool,local_resp:bool,missed_bytes:uint64,history:bstring,orig_pkts:uint64,orig_ip_bytes:uint64,resp_pkts:uint64,resp_ip_bytes:uint64,tunnel_parents:set[bstring],orig_cc:bstring,resp_cc:bstring]
0:[conn;1598243094.015046;CWjxkd3jpmxuvN21uj;[10.124.2.117;61927;10.70.70.70;8080;[-;-;][SSL_SNI;[www.pacast.com;c.clicktale.net;www.gstatic.com;www.youtube.com;oneclient.sfx.ms;eb2.3lift.com:443;tapestry.tapad.com;www.google.com:443;bats.video.yahoo.com;oneclient.sfx.ms:443;collect.tealiumiq.com;js-sec.indexww.com:443;ctldl.windowsupdate.com;pr-bh.ybp.yahoo.com:443;bats.video.yahoo.com:443;clientservices.googleapis.com;13-237-209-96.expertcity.com:443;clientservices.googleapis.com:443;]]]tcp;-;0.002716;0;77;SF;F;F;0;FdfR;3;120;2;157;-;-;-;]

This also proves handy (and readable) in NDJSON.

$ zq -f ndjson -i zeek "cut id" sample.tsv | jq .
{
  "id": {
    "orig_h": "10.124.2.117",
    "orig_h_name": {
      "src": null,
      "vals": null
    },
    "orig_p": 61927,
    "resp_h": "10.70.70.70",
    "resp_h_name": {
      "src": "SSL_SNI",
      "vals": [
        "www.pacast.com",
        "c.clicktale.net",
        "www.gstatic.com",
        "www.youtube.com",
        "oneclient.sfx.ms",
        "eb2.3lift.com:443",
        "tapestry.tapad.com",
        "www.google.com:443",
        "bats.video.yahoo.com",
        "oneclient.sfx.ms:443",
        "collect.tealiumiq.com",
        "js-sec.indexww.com:443",
        "ctldl.windowsupdate.com",
        "pr-bh.ybp.yahoo.com:443",
        "bats.video.yahoo.com:443",
        "clientservices.googleapis.com",
        "13-237-209-96.expertcity.com:443",
        "clientservices.googleapis.com:443"
      ]
    },
    "resp_p": 8080
  }
}

Thanks @nwt!