brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

Add facility to run Z scripts at ingest #1833

Closed henridf closed 3 years ago

henridf commented 3 years ago

Add a facility to run Z scripts on ingested data before it lands. This should also apply to the pcap ingest path. (Suricata pcap is covered by #1871.)

philrz commented 3 years ago

While expectations have been set in #1870 that the functionality is limited, I've verified what we've got thus far in zqd commit 4dff9e9.

I took the approach of trying to "shape" an EVE JSON log from a standalone Suricata v5.0.3. While the following Z is not a 100% match for the functionality currently achieved through the legacy JSON typing system that's invoked during Brim pcap import, it gets fairly close:

$ cat shape-suricata.zs 
event_type="alert"
| put src_ip=src_ip:ip, dest_ip=dest_ip:ip
| put ts=iso(replace(timestamp, "-0700", "-07:00"))
| cut ts,event_type,src_ip,src_port,dest_ip,dest_port,vlan,proto,app_proto,alert,flow_id,pcap_cnt,tx_id,icmp_code,icmp_type,community_id
| fuse

I found a couple of unrelated bugs while working on this:

  1. Because of #1907, I manually removed a single appearance of "ja3s":{} from my eve.json before attempting to import it. I was not able to find another way to work around the issue.
  2. The replace() call above works around #1905, which is nice validation of some of the flexibility that's possible with the new "shaper" approach.
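
As an aside, the replace() call above only rewrites the literal -0700 offset. A minimal sketch of the same idea generalized to any numeric UTC offset, written in shell with sed rather than Z, might look like the following. The function name normalize_offset is hypothetical, and this would run as preprocessing outside of zqd; it says nothing about how #1905 itself should be fixed.

```shell
# Hypothetical helper (not part of zq/zapi): insert the missing colon into a
# trailing numeric UTC offset, e.g. "-0700" becomes "-07:00".
# Timestamps that already have a colon in the offset, or end in "Z", pass through unchanged.
normalize_offset() {
  printf '%s\n' "$1" | sed -E 's/([+-][0-9]{2})([0-9]{2})$/\1:\2/'
}
```

For example, normalize_offset '2020-04-27T08:32:12.123456-0700' would produce the same timestamp with a -07:00 suffix, which is the form the replace() call generates for this one offset.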

I've put the modified EVE JSON file up at https://storage.googleapis.com/brimsec/issues/zq-1833/eve.json.gz in case it might be useful in future testing.

To import:

$ zapi -s eve postpath -z "$(cat shape-suricata.zs)" eve.json.gz 
100.0% 31.53MB/31.53MB
posted 31.53MB in 31.97698114s

Once imported, I was able to click View > Reload in my Brim app that had spawned this zqd and see the Space and its contents. We can see some of the benefits of the shaping in the form of the populated time picker.

[Screenshot: Brim showing the populated time picker]

The same is true of other functionality that relies on correct data typing, such as doing a CIDR match on ip types.

[Screenshot: Brim showing a CIDR match on ip-typed fields]

One thing that briefly threw me for a loop was the -z option. I'd previously been exposed to the zq -z option, where the parameter is expected to be a filename. While it's true that zapi help postpath speaks of a "Z shaper script", I misread that as the filename of a Z script, so I was scratching my head for a bit when this acted like a successful import (no error messages) but produced a zero-length all.zng:

$ zapi -s eve postpath -z shape-suricata.zs eve.json.gz 
100.0% 31.53MB/31.53MB
posted 31.53MB in 32.078547798s

Now, with the benefit of 20/20 hindsight, I know that it was interpreting shape-suricata.zs as a syntactically correct bare search for the string literal "shape-suricata.zs", which matched nothing in the imported JSON and hence produced no output. This mismatch in -z behavior between the tools feels like an accident waiting to happen. Since I expect shaper configs to be multi-line Z much of the time, I'm inclined to think the -z for zapi should also point to a filename. But I also know there are other ways to address this from a CLI design perspective, e.g. how curl -d takes inline POST data by default and reads from a file if the value is preceded by @. So I'll leave this open as a possible design discussion topic for the team.
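
To make the curl-style option concrete, here is a rough sketch of the convention as a shell wrapper. It is only an illustration: resolve_z_arg is a hypothetical helper name, and zapi does not interpret a leading @ itself as far as this thread shows.

```shell
# Hypothetical helper (not part of zapi): resolve a -z style value the way
# curl -d treats its argument. A leading "@" means "read the Z from this file";
# anything else is passed through as inline Z.
resolve_z_arg() {
  arg=$1
  case "$arg" in
    @*)
      file=${arg#@}   # strip the leading "@" to get the path
      if [ ! -r "$file" ]; then
        # Fail loudly instead of silently treating the name as a search term
        echo "cannot read Z script: $file" >&2
        return 1
      fi
      cat "$file"
      ;;
    *)
      printf '%s\n' "$arg"
      ;;
  esac
}
```

With something like this in place, the import above could be written as zapi -s eve postpath -z "$(resolve_z_arg @shape-suricata.zs)" eve.json.gz, and a mistyped or missing filename would produce an error rather than a successful-looking import with empty output.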

All that said, so far, so good. Thanks @henridf!