brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

define and implement a zng table/map type #1317

Closed mccanne closed 4 years ago

mccanne commented 4 years ago

A table/map type would be like a set, but its contents would be key-value pairs in which only the keys need to be unique, and the canonical order would be based on key order.

The type definition would be Go-like and look like this:

map[<any>]<any>

Given the zcode model, this could be used as an associative array for any zng data type.
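
To make the proposed semantics concrete, here is a minimal Python sketch of a map with unique keys and a canonical key order. It is only an illustration, not the zng implementation: `make_map` is a hypothetical helper, rejecting duplicate keys is one possible policy, and using the key type's natural sort order as the canonical order is an assumption.

```python
import ipaddress

def make_map(pairs):
    """Build a map from (key, value) pairs with unique keys in canonical order."""
    out = {}
    for key, value in pairs:
        if key in out:
            # Assumption: duplicate keys are rejected rather than overwritten.
            raise ValueError(f"duplicate key: {key}")
        out[key] = value
    # Assumption: canonical order is the natural sort order of the key type.
    return dict(sorted(out.items(), key=lambda kv: kv[0]))

feed = make_map([
    (ipaddress.ip_address("103.106.250.53"), {"info": "Compromised IP"}),
    (ipaddress.ip_address("101.187.97.173"), {"info": "Feodo Block IP"}),
])
print([str(k) for k in feed])  # ['101.187.97.173', '103.106.250.53']
```

Because the pairs are sorted on insertion, two maps built from the same pairs in different orders compare and serialize identically, which is the point of having a canonical order.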

This wouldn't be useful for very large, arbitrary key-value maps within a zng stream, as in Hadoop, because the map would need to fit comfortably in memory given the sorted-key constraint and the zng.Record encoding. That said, this type could be a very useful gadget for manipulating record values in various ways at small scale.

On the other hand, large-scale variants of this data type could be useful in the runtime as objects that procs refer to but that come from elsewhere, e.g., a first-class external object that is available to a zql query but populated externally, such as a map used to join threat-intel data to a zng stream.

e.g., something like this:

live-search | put intel=config.intel.map[id.orig_h] | filter intel.badguy=true | alert "${$id.orig_h}: ${intel.msg}"

where config.intel.map is a map of type map[ip]record[badguy:bool,info:string] and config refers to some external configuration for how this map appears in the runtime. This data structure could also be a cache of database lookups by key, e.g., retrieving the intel data from an online service and caching each result in the runtime table a la a DNS lookup cache. Of course, the database or map could be stored locally if performance is an issue (though the lookups could be easily parallelized).
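
The lookup-cache idea can be sketched in Python as follows. This is a rough illustration under stated assumptions, not a design for the runtime: `CachedIntel`, `alerts`, and `fake_service` are hypothetical names, the intel entry is made up, and the alert text uses the `info` field from the record type given above.

```python
from typing import Callable, Optional

class CachedIntel:
    """Map-like lookup that fetches misses from an external source and caches them."""

    def __init__(self, fetch: Callable[[str], Optional[dict]]):
        self._fetch = fetch   # hypothetical call to an online intel service
        self._cache = {}

    def get(self, key):
        if key not in self._cache:
            # Misses are cached too, so each key is fetched at most once.
            self._cache[key] = self._fetch(key)
        return self._cache[key]

def alerts(records, intel):
    """Yield an alert string for every record whose origin IP is flagged."""
    for rec in records:
        ip = rec["id.orig_h"]
        hit = intel.get(ip)
        if hit and hit.get("badguy"):
            yield f"{ip}: {hit['info']}"

# Stand-in for the external service, with a made-up entry.
calls = []
def fake_service(ip):
    calls.append(ip)
    return {"192.0.2.1": {"badguy": True, "info": "made-up intel entry"}}.get(ip)

records = [{"id.orig_h": "192.0.2.1"},
           {"id.orig_h": "198.51.100.7"},
           {"id.orig_h": "192.0.2.1"}]
result = list(alerts(records, CachedIntel(fake_service)))
print(result)  # two alerts for 192.0.2.1
print(calls)   # each distinct IP was fetched only once
```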

philrz commented 4 years ago

Verified in zq commit 18045ab.

The map typedef is now described in the ZNG specification.

To show it in action, I created a simple example similar to the threat intel one in the issue description, using a couple of entries from each of two Emerging Threats feeds:

$ cat map-validate.tzng 
#0:record[feed:map[ip,record[info:string]]]
0:[[103.106.250.53;[Compromised IP;]103.106.83.97;[Compromised IP;]101.187.97.173;[Feodo Block IP;]103.106.236.83;[Feodo Block IP;]]]

If the intel data were in the ZNG stream as shown here, a lookup could confirm a hit and return the message detail, which a yet-to-be-written downstream processor could use to send alert notifications:

$ zq -t 'put intel=feed[103.106.83.97] | cut intel' map-validate.tzng
#0:record[intel:record[info:string]]
0:[[Compromised IP;]]

If the IP is not a key in the map, an error is generated. Such errors could be filtered out so that nothing reaches the downstream processor:

$ zq -t 'put intel=feed[1.2.3.4] | cut intel' map-validate.tzng
#0:record[intel:error]
0:[key not found in map: 1.2.3.4;]
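
The hit and miss behavior shown above can be mimicked in a few lines of Python. This is only a sketch of the idea of treating a miss as an error value and dropping it before any downstream step; the `Error` class and the `lookup` and `hits` helpers are hypothetical and say nothing about how zq implements this.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Error:
    """Stand-in for an error value produced by a failed map lookup."""
    message: str

# Same entries as map-validate.tzng.
FEED = {
    "103.106.250.53": {"info": "Compromised IP"},
    "103.106.83.97": {"info": "Compromised IP"},
    "101.187.97.173": {"info": "Feodo Block IP"},
    "103.106.236.83": {"info": "Feodo Block IP"},
}

def lookup(feed, key):
    """Return the mapped value, or an Error value when the key is absent."""
    if key in feed:
        return feed[key]
    return Error(f"key not found in map: {key}")

def hits(keys, feed=FEED):
    """Yield (key, intel) only for keys found in the map; errors are dropped."""
    for key in keys:
        intel = lookup(feed, key)
        if not isinstance(intel, Error):
            yield key, intel

print(list(hits(["103.106.83.97", "1.2.3.4"])))
# [('103.106.83.97', {'info': 'Compromised IP'})]
```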

Of course, as noted in the issue description above, it probably would not be practical for such intel data to be part of the input stream. However, maps like these could exist as external data sources, e.g. static items that could be dragged into Brim, separate ZNG files stored in a data lake, etc. We'd just need ways in ZQL to call out to these rather than relying exclusively on what's in the input stream. For now, the example above gives the essence of how such maps could be accessed.

Maps can't be output in Zeek TSV, which matches the behavior of Zeek's own logger.

$ zq -f zeek map-validate.tzng 
type map[ip,record[info:string]]: type cannot be represented in zeek format

Maps can be output as NDJSON, though their "lookup" nature cannot be represented by this alone.

$ zq -f ndjson map-validate.tzng | jq .
{
  "feed": [
    "101.187.97.173",
    {
      "info": "Feodo Block IP"
    },
    "103.106.83.97",
    {
      "info": "Compromised IP"
    },
    "103.106.236.83",
    {
      "info": "Feodo Block IP"
    },
    "103.106.250.53",
    {
      "info": "Compromised IP"
    }
  ]
}
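
A consumer that already knows `feed` is a map can rebuild a dictionary from the flattened list by pairing alternating elements. The Python sketch below assumes exactly the alternating key, value layout shown above; `pairs_to_dict` is a hypothetical helper, and nothing in the NDJSON itself says that the array is a map, so that knowledge has to come from outside.

```python
import json

def pairs_to_dict(flat):
    """Turn [k1, v1, k2, v2, ...] into {k1: v1, k2: v2, ...}."""
    if len(flat) % 2:
        raise ValueError("expected an even number of elements")
    return dict(zip(flat[0::2], flat[1::2]))

# Two of the four entries from the output above, as one NDJSON line.
line = ('{"feed": ["101.187.97.173", {"info": "Feodo Block IP"}, '
        '"103.106.83.97", {"info": "Compromised IP"}]}')
record = json.loads(line)
feed = pairs_to_dict(record["feed"])
print(feed["103.106.83.97"])  # {'info': 'Compromised IP'}
```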

With ZJSON, by contrast, all the detail needed to restore the original ZNG representation is present. This would allow us, for example, to one day create a Python client that restores a map into Python's dictionary data structure.

$ zq -f zjson map-validate.tzng | jq .
{
  "id": 25,
  "schema": {
    "of": [
      {
        "name": "feed",
        "of": [
          "ip",
          {
            "of": [
              {
                "name": "info",
                "type": "string"
              }
            ],
            "type": "record"
          }
        ],
        "type": "map"
      }
    ],
    "type": "record"
  },
  "values": [
    [
      "101.187.97.173",
      [
        "Feodo Block IP"
      ],
      "103.106.83.97",
      [
        "Compromised IP"
      ],
      "103.106.236.83",
      [
        "Feodo Block IP"
      ],
      "103.106.250.53",
      [
        "Compromised IP"
      ]
    ]
  ]
}
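
As a rough idea of what such a Python client could do, the sketch below walks the schema and values together and rebuilds nested dictionaries. It is written only against the shape of the output shown above and is not a ZJSON implementation: `decode` and `field_type` are hypothetical helpers, only the `record` and `map` complex types are handled, and every primitive other than `ip` is passed through unchanged.

```python
import ipaddress

def field_type(field):
    """A field carries either a primitive type name or a complex type with 'of'."""
    if "of" in field:
        return {"type": field["type"], "of": field["of"]}
    return field["type"]

def decode(typ, value):
    """Decode one value using its type description."""
    if isinstance(typ, str):  # primitive type name
        return ipaddress.ip_address(value) if typ == "ip" else value
    kind, of = typ["type"], typ["of"]
    if kind == "record":      # values are positional, one per field
        return {f["name"]: decode(field_type(f), v) for f, v in zip(of, value)}
    if kind == "map":         # flat list: key, value, key, value, ...
        key_t, val_t = of
        return {decode(key_t, k): decode(val_t, v)
                for k, v in zip(value[0::2], value[1::2])}
    raise ValueError(f"unhandled type: {kind}")

# Trimmed copy of the object above (two of the four map entries).
msg = {
    "id": 25,
    "schema": {"type": "record", "of": [
        {"name": "feed", "type": "map", "of": [
            "ip",
            {"type": "record", "of": [{"name": "info", "type": "string"}]},
        ]},
    ]},
    "values": [[
        "101.187.97.173", ["Feodo Block IP"],
        "103.106.83.97", ["Compromised IP"],
    ]],
}

record = decode(msg["schema"], msg["values"])
print(record["feed"][ipaddress.ip_address("103.106.83.97")])
# {'info': 'Compromised IP'}
```

Unlike the NDJSON case, the schema tells the decoder that `feed` is a map and that its keys are IP addresses, so no outside knowledge is needed.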

While discussing this IP-address-based threat intel example, I noticed that much of the threat intel data at Emerging Threats is subnet-based. If the keys in this map were subnets and I wanted to check whether the IP address of a single flow matched the feed, CIDR match logic would need to be available. #1504 is open to track this.
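
For illustration, here is what a CIDR match against subnet keys could look like in Python using the standard `ipaddress` module. This is a sketch of the general idea and not what #1504 proposes: `cidr_lookup` is a hypothetical helper, the subnets and messages are made up from documentation address ranges, and preferring the most specific (longest-prefix) subnet is an assumption.

```python
import ipaddress

def cidr_lookup(subnet_map, addr):
    """Return the value for the most specific subnet containing addr, or None."""
    ip = ipaddress.ip_address(addr)
    matches = [(net, val) for net, val in subnet_map.items() if ip in net]
    if not matches:
        return None
    # Assumption: when several subnets match, the longest prefix wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# Made-up feed keyed by subnet.
subnet_feed = {
    ipaddress.ip_network("192.0.2.0/24"): {"info": "made-up broad block"},
    ipaddress.ip_network("192.0.2.128/25"): {"info": "made-up specific block"},
}

print(cidr_lookup(subnet_feed, "192.0.2.200"))  # {'info': 'made-up specific block'}
print(cidr_lookup(subnet_feed, "192.0.2.5"))    # {'info': 'made-up broad block'}
print(cidr_lookup(subnet_feed, "203.0.113.9"))  # None
```

The linear scan is fine for a handful of subnets; a large feed would more likely call for a prefix tree.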