brimdata / zed

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

Update shaper for Zeek v6.2.0 #5106

Closed by philrz 5 months ago

philrz commented 5 months ago

tl;dr

This is a significant update to the Zeek integration docs, in particular bringing the shaper script current with Zeek v6.2.0. A rendered version is available at the staging site linked under Details below.

Details

The changes in this PR are the long overdue "material changes" foretold in #4694. In addition to bringing the type definitions up to date with current GA Zeek release v6.2.0, I've made the shaper more compact and also started using Zed's error type to surface problems encountered during shaping. I think this all provides a solid working example of many Zed concepts coming together to solve a challenging problem. Now that build-zeek has me in the habit of keeping up with GA Zeek releases, I'm hopeful that I'll be able to keep this shaping doc current as the log types gradually change and avoid having to do grand overhauls like this going forward.

In addition to the changes to the Zeek shaper itself and how it's described, I've made some general improvements to the Zeek integration docs to fix typos, add links, and bring them more current with the evolving state of the other Zed docs. In some cases this actually involved removing some text since we've got better coverage of topics in their proper homes, e.g., we now have the detailed Shaping and Type Fusion doc whereas in the past the Zeek shaper doc effectively served as the most comprehensive doc about shaping. There's still a little redundancy in the Zeek shaper doc because I figured it was helpful to present the concepts in context like a "user guide" rather than sending the reader on a scavenger hunt through reference materials, though of course I still link off to all the relevant functions/operators/etc.

If reviewers would like to see the rendered docs rather than trying to pick through the diffs here, I've pushed a built copy of the docs site based on this branch to a personal staging site at https://6616d57a0260af2ee74d1a3e--spiffy-gnome-8f2834.netlify.app/docs/next/integrations/zeek.

How it was done

Here are some notes-to-self on how I arrived at these changes, as I expect they may come in handy the next time I do this.

While its output is no longer directly usable in Zed tooling, the print-types.zeek script from our (now archived) Zeek repo remains useful for assessing the default fields/types output by a particular Zeek release. To gather these for the old/new endpoints of this exercise I ran this on Zeek v4.1.1:

$ ZEEK_ALLOW_INIT_ERRORS=1 zeek print-types.zeek local | tail -n +2 | jq -S | python3 -m json.tool > types-4.1.1.json

And on Zeek v6.2.0:

$ ZEEK_ALLOW_INIT_ERRORS=1 zeek print-types.zeek local | jq -S | python3 -m json.tool > types-6.2.0.json

Then check for differences:

$ diff -y types-4.1.1.json types-6.2.0.json > types-diff.txt
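Side-by-side output can get wide for deeply nested JSON, so a narrower view that shows only the differing lines may be easier to scan for new log types. The helper below is a minimal sketch, not something from the PR: the function name `changed_lines` is hypothetical, and it relies only on plain `diff` and `grep`.

```shell
#!/bin/sh
# Hypothetical helper (not part of the PR): print only the lines that differ
# between two type dumps, dropping the unchanged context.
changed_lines() {
  # Plain diff prefixes lines only in the first file with "<" and lines only
  # in the second file with ">". grep exits non-zero when nothing matches,
  # so "|| true" keeps an identical pair of files from looking like a failure.
  diff "$1" "$2" | grep '^[<>]' || true
}
```

Usage would look like `changed_lines types-4.1.1.json types-6.2.0.json > types-changed.txt`, where the output file name is just an example.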

If an entirely new log type showed up in the diff (e.g., ldap in this case) or an existing log type was overhauled significantly, I copied the lines that define the descriptor array for that log type from types-6.2.0.json to a separate file, then ran that file through this pipeline in a script cleanup-type.sh:

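#!/bin/sh
# Usage: ./cleanup-type.sh <file holding one log type's descriptor array>
jq -c . "$1" |
  # Map Zeek-specific type names onto the names used in the shaper.
  sed 's/"bstring"/"string"/g' |
  sed 's/set\[bstring\]/|[string]|/g' |
  sed 's/array\[bstring\]/[string]/g' |
  sed 's/array\[uint64\]/[uint64]/g' |
  sed 's/array\[float64\]/[float64]/g' |
  # Collapse each {"name":...,"type":...} object into name:type.
  sed 's/{"name":"//g' |
  sed 's/","type":"/:/g' |
  sed 's/"},/,/g' |
  # Replace the nested id record with a reference to conn_id.
  sed 's/id","type":\[orig_h:ip,orig_p:port,resp_h:ip,resp_p:port"}\]}/id:conn_id/g' |
  # Turn the enclosing JSON array brackets into record braces.
  sed 's/^\[/{/' | sed 's/"}\]$/}/'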

For example:

$ cat ldap.json 
        [
            {
                "name": "_path",
                "type": "string"
            },
            {
                "name": "ts",
                "type": "time"
            },
            {
                "name": "uid",
                "type": "bstring"
            },
            {
                "name": "id",
                "type": [
                    {
                        "name": "orig_h",
                        "type": "ip"
                    },
                    {
                        "name": "orig_p",
                        "type": "port"
                    },
                    {
                        "name": "resp_h",
                        "type": "ip"
                    },
                    {
                        "name": "resp_p",
                        "type": "port"
                    }
                ]
            },
            {
                "name": "message_id",
                "type": "int64"
            },
            {
                "name": "version",
                "type": "int64"
            },
            {
                "name": "opcode",
                "type": "bstring"
            },
            {
                "name": "result",
                "type": "bstring"
            },
            {
                "name": "diagnostic_message",
                "type": "bstring"
            },
            {
                "name": "object",
                "type": "bstring"
            },
            {
                "name": "argument",
                "type": "bstring"
            },
            {
                "name": "_write_ts",
                "type": "time"
            }
        ]

$ ./cleanup-type.sh ldap.json 
{_path:string,ts:time,uid:string,id:conn_id,message_id:int64,version:int64,opcode:string,result:string,diagnostic_message:string,object:string,argument:string,_write_ts:time}
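The ldap example has no set or array fields, so it doesn't exercise those rules in the script. The sketch below applies the same sed rules to a small descriptor that does. It is illustrative only: the function name `clean_compact` and the field names `tags` and `names` are made up, and the jq step is skipped, so the input is assumed to already be compact single-line JSON.

```shell
#!/bin/sh
# Sketch only: the sed rules from cleanup-type.sh without the jq step, so
# the input on stdin must already be compact (single-line) JSON.
clean_compact() {
  sed 's/"bstring"/"string"/g' |
  sed 's/set\[bstring\]/|[string]|/g' |
  sed 's/array\[bstring\]/[string]/g' |
  sed 's/array\[uint64\]/[uint64]/g' |
  sed 's/array\[float64\]/[float64]/g' |
  sed 's/{"name":"//g' |
  sed 's/","type":"/:/g' |
  sed 's/"},/,/g' |
  sed 's/^\[/{/' | sed 's/"}\]$/}/'
}

# Hypothetical descriptor: "tags" and "names" are made-up field names.
echo '[{"name":"tags","type":"set[bstring]"},{"name":"names","type":"array[bstring]"}]' |
  clean_compact
# prints: {tags:|[string]|,names:[string]}
```

The conn_id rule is left out here because the sample input has no id field. Any other nested record would pass through with its JSON punctuation intact, which is a quick visual cue that a rule is missing.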

Because of the known limitation of print-types.zeek described in https://github.com/brimdata/zeek/issues/15, I also manually eyeballed the type definition for Zeek's openflow log and confirmed nothing had changed since the last shaper update.

philrz commented 5 months ago

Thanks @nwt! I'll accept the approval despite the light coverage on the shaper part. @mattnibs has been looking that over with an eye toward possible language improvements that might make it more self-documenting. But the old shaper was so out-of-date and I'm pretty confident the new one offers better functionality and error handling, so I'm keen to get this merged so I can point users at it sooner and we can keep making improvements over time. I also know I'm on the hook to probably be the exclusive supporter of this stuff since you guys rightfully have other things on your minds besides Zeek stuff. 😉