brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License
1.39k stars 64 forks source link

Return as many named fields as possible from records that may not contain some of them #852

Closed philrz closed 4 years ago

philrz commented 4 years ago

New processor functionality is needed to allow the Brim "Export to ZNG" option to work as the user would expect when the Columns chooser has been invoked. The need for this was first identified when verifying brim/814.

Consider the following query result in Brim v0.10.0 after having imported this 2018 wrccdc pcap.

image.png

Starting from this diverse mix of http and dns events, let's say I click the Columns chooser and check off a mix of fields that appear in both event types (ts, uid) and fields that appear in only one event type (uri from the http events, and query from the dns events).

image.png

If I click Export, the results.zng that's produced contains zero events. This is a side-effect of how the app currently implements the Export function: The cut processor is used to ask zqd to trim the query response to contain only the named columns. However, the way cut is currently implemented, a record is only returned if all the named fields are present. This can be seen by dropping to the command line and executing equivalent zq by hand on the all.zng that was generated by the import.

$ zq -t "cut ts,uid,uri,query" all.zng 
Cut fields ts,uid,uri,query not present together in input

So the high-level requirement is that zqd needs to provide some variation on the current cut behavior that returns ZNG that could be re-imported to produce the same results the user was looking at when they clicked Export. It seems like this could be done via a new option on cut (which makes sense to me, but I'm open to suggestion) or a whole new processor. I'm not advocating for a change in the default behavior of cut (though I'd not necessarily rule it out).

While writing this up, I'm reminded that FWIW, zeek-cut is more forgiving. Turning my query result into Zeek TSV and doing the equivalent cut operation, I get all the diversity:

$ zq -f zeek "ocsp.pki.goog" all.zng > all.zeek
$ cat all.zeek | /usr/local/zeek-3.1.3/bin/zeek-cut -C ts uid uri query
#separator \x09
#set_separator  ,
#empty_field    (empty)
#unset_field    -
#path   http
#fields ts  uid uri 
#types  time    string  string  
1521835139.718680   CUkXuv2aZs5NpuHng1  /GTSGIAG3   
#path   dns
#fields ts  uid     query
#types  time    string      string
1521835139.673223   CkiLMN1OGJBFlZdZZc      ocsp.pki.goog
1521835139.673219   CkiLMN1OGJBFlZdZZc      ocsp.pki.goog
#path   http
#fields ts  uid uri 
#types  time    string  string  
1521835105.567982   CMt4qp2hgQEMhqmmj7  /GTSGIAG3   
1521835105.177261   C5YqLp10QwzPgYH5Aa  /GTSGIAG3   
1521835104.054822   CCZseg1hl12M0nc7p6  /GTSGIAG3   
1521835103.652957   C5YqLp10QwzPgYH5Aa  /GTSGIAG3   
1521835103.607992   CCZseg1hl12M0nc7p6  /GTSGIAG3   
#path   dns
#fields ts  uid     query
#types  time    string      string
1521835103.596102   CVLaYjq85ar0vTxPa       ocsp.pki.goog
1521835103.596097   CVLaYjq85ar0vTxPa       ocsp.pki.goog

This speaks to another implementation detail I've pre-discussed with @henridf: How would would the ZNG handle the fields not present? For instance, would the descriptor for the dns event contain a uri field that gets populated with a null value, and likewise would the descriptor for the http event contain a query field that's populated with null? If the implementation can handle it, I would argue the answer to this is "no". Like we just saw with zeek-cut it would seem ideal if this new cutting functionality could effectively trim the descriptor of each record to cover only the set of named fields that it contains.

philrz commented 4 years ago

Incidentally, while writing this up I noticed that zq can't actually read back the output from zeek-cut that's shown above. I think this could be seen as a zeek-cut bug, so I've opened https://github.com/zeek/zeek/issues/997 to see what the Zeek devs/community make of it. If they make an argument defending the current behavior and don't intend to change it, I'll open a separate zq issue to see if we can adapt to it.

henridf commented 4 years ago

In an offline discussion, decision was made to modify cut to output any record with a (non-zero) subset of the cut fields. This should address the issue noted above, and was considered to be preferable behavior for regular zql use as well.

philrz commented 4 years ago

Verified in Brim commit e7b06f3 talking to zqd commit 4e6680f. Per the attached video, I can now select the same diverse set of columns, Export to ZNG, re-import, and see data where there'd been nothing before.

Verify.zip

A couple things that came up for me during verification:

  1. I couldn't bring back the column headers after re-importing the ZNG. I've opened https://github.com/brimsec/brim/issues/873 to pursue that topic separately.

  2. At the command line, I noticed that the new cut still gives a warning if the requested field(s) return no data at all, but no warning if it's a mix of present/absent data.

$ zq -t "cut foo" weird.log.gz
Cut field foo not present in input

$ zq -t "cut foo,bar" weird.log.gz
Cut fields foo,bar not present together in input

$ zq -t "cut foo,name" weird.log.gz
#0:record[name:bstring]
0:[TCP_ack_underflow_or_misorder;]
0:[truncated_header;]
...

I sense a warning may still be justified in this final case, e.g. to help call the user's attention to a possible typo. I've got a question pending to @nwt on this topic, and other opinions are welcomed.