Multi-line include/exclude in pipeline mode

johnhtodd commented 9 months ago

Is your feature request related to a problem? Please describe.

It would be useful to support multiple lines in the matching syntax, with an "OR" implied between lines that share the same key.

Describe the solution you'd like

Currently, in the pipeline mode branch, there is syntax like this:

  - name: tag-queries
    dnsmessage:
      matching:
        include:
          dnstap.operation: "CLIENT_Q*."
          dns.qname: "^.*\\.google\\.com$"
        greater-than:
          dns.length: 50
      policy: "drop-unmatched"
    transforms:
      atags:
        tags: [ "TAG-QUERIES:tag-queries" ]
    routes: [ match-queries ]

It would be very useful to have additional matching performed without jamming it all on one line, so something like this:

  - name: tag-queries
    dnsmessage:
      matching:
        include:
          dnstap.operation: "CLIENT_Q*."
          dns.qname: "^.*\\.google\\.com$"
          dns.qname: "^.*\\.youtube\\.com$"
          dns.qname: "^.*\\.gmail\\.com$"
        greater-than:
          dns.length: 50
      policy: "drop-unmatched"
    transforms:
      atags:
        tags: [ "TAG-QUERIES:tag-queries" ]
    routes: [ match-queries ]

Similarly with "exclude:" lines (no example shown). This wouldn't be limited to "dns.qname"; it would apply to any matched component of the packet.

Describe alternatives you've considered

Making a giant unmanageable regexp on one line is... possible. But terrifying. If I have only a few matching statements, it would be great to just put a few lines in.

It also would be ideal if files were supported in matching lines, so very long lists of include/exclude filters could be ingested from an external source. So: dns.qname: file:/var/collector/names-to-include.txt ... but this seems like a separate feature request. :-)

dmachard commented 9 months ago

A list of regexes can easily be supported with a minor update (this is the easier option to implement):

dnsmessage:
  matching:
    include:
      dns.qtype: [ "TXT", "MX" ]
      dns.qname:
        - "^.*\\.apple\\.com$"
        - "^.*\\.google\\.com$"
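
As a rough illustration of the implied "OR" over a list of patterns for one key, here is a minimal Go sketch. It is not the go-dnscollector implementation; the function names compilePatterns and matchAny are hypothetical, and only the standard regexp package is assumed.

```go
package main

import "regexp"

// compilePatterns compiles every pattern once, up front, so that an
// invalid regex is reported when the config is loaded instead of on
// each DNS message. (Hypothetical helper, not a go-dnscollector API.)
func compilePatterns(patterns []string) ([]*regexp.Regexp, error) {
	compiled := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, err
		}
		compiled = append(compiled, re)
	}
	return compiled, nil
}

// matchAny reports whether value matches at least one of the compiled
// patterns, which is the implied "OR" between list entries.
func matchAny(compiled []*regexp.Regexp, value string) bool {
	for _, re := range compiled {
		if re.MatchString(value) {
			return true
		}
	}
	return false
}
```

With the two dns.qname patterns from the config above, matchAny would accept a qname that matches either entry and reject one that matches neither.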

Here is an adaptation of the configuration that supports files in a generic way:

dnsmessage:
  matching:
    include:
      dns.opcode: 0
      dns.length:
        greater-than: 50
      dns.qname:
        file-list: "./testsdata/filtering_keep_domains_regex.txt"
        file-kind: "domain_list"
    exclude:
      dns.qtype: [ "TXT", "MX" ]
  policy: "drop-unmatched"

This logic has been implemented in the pipeline branch.

johnhtodd commented 9 months ago

This is good - I'll look at it on Tuesday when I'm back from travel. Thank you for the quick code changes!

I'm not quite clear why the "file-kind" definition is required. Wouldn't the match depend on what context the matching file is loaded into? Why would there need to be any parsing of any kind? I can see how matching can be applied to qname, resource records, EDNS data, geoIP data, TLD data, qtype... pretty much any field.

I'm very interested in how matching can apply to tags, because tag management deeper in the processing chain (on different machines, centrally located) seems to me to be a critical part of how go-dnscollector arrays interact with each other. Otherwise, we are left using (argh!) port numbers as indicators of intent, which makes me sad.

I made my example a bit more generic to perhaps allow for expansion in the future.

Your example is this:

      dns.qname:
        file-list: "./testsdata/filtering_keep_domains_regex.txt"
        file-kind: "domain_list"

My example thinking looks more like this:

      dns.qname:
        match-source: "file:./testsdata/filtering_keep_domains_regex.txt"

because maybe the future could have something like this:

      dns.qname:
        match-source: "https://filters.example.com/testsdata/filtering_keep_domains_regex.txt"
        match-source-refresh: 86400

...and it is possible to imagine future plug-in methods like "script:" or "sftp:" or "axfr:" for developers who want to be adventurous.
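
To make the scheme-prefixed "match-source" idea concrete, here is a minimal Go sketch of how such a value could be split into a scheme and a location. Everything in it is an assumption for illustration: the matchSource type, the parseMatchSource function, and the choice to accept only "file", "http" and "https" are hypothetical and do not describe existing go-dnscollector code.

```go
package main

import (
	"fmt"
	"strings"
)

// matchSource is a hypothetical parsed form of a "match-source" value.
type matchSource struct {
	Scheme   string // e.g. "file" or "https"
	Location string // file path, or the full URL for http/https
}

// parseMatchSource splits a value such as "file:./list.txt" at the first
// colon and dispatches on the scheme. Unknown schemes are rejected, which
// is where future plug-ins ("script:", "sftp:", ...) would hook in.
func parseMatchSource(raw string) (matchSource, error) {
	scheme, rest, found := strings.Cut(raw, ":")
	if !found || scheme == "" || rest == "" {
		return matchSource{}, fmt.Errorf("match-source %q has no scheme", raw)
	}
	switch scheme {
	case "file":
		// Keep only the path after the "file:" prefix.
		return matchSource{Scheme: scheme, Location: rest}, nil
	case "http", "https":
		// Keep the whole URL so it can be fetched as-is.
		return matchSource{Scheme: scheme, Location: raw}, nil
	default:
		return matchSource{}, fmt.Errorf("unsupported match-source scheme %q", scheme)
	}
}
```

A refresh interval such as "match-source-refresh" would only make sense for remote schemes, so a loader built on this could ignore it for "file".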

dmachard commented 9 months ago

> I'm not quite clear why the "file-kind" definition is required. Wouldn't the match depend on what context the matching file is loaded into? Why would there need to be any parsing of any kind? I can see how matching can be applied to qname, resource records, EDNS data, geoIP data, TLD data, qtype... pretty much any field.

It can be necessary to know the type of content:

If the source list contains IPs, I need to know that so I can preload them into the specific internal Go data type.
If the source contains a list of regexes, I need to know that so I can compile each regex before starting.
If the source list contains just basic strings without regexes, we need to know that so we can avoid using regex matching.
etc...
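
The three cases above can be sketched in Go as follows. This is only an illustration of why the content type matters at load time, not the actual go-dnscollector code: the newMatcher function and the kind names "ip", "regex" and "string" are hypothetical.

```go
package main

import (
	"fmt"
	"net/netip"
	"regexp"
)

// matcher reports whether a field value is covered by the loaded list.
type matcher func(value string) bool

// newMatcher preprocesses the list entries according to kind.
// The kind names "ip", "regex" and "string" are hypothetical.
func newMatcher(kind string, entries []string) (matcher, error) {
	switch kind {
	case "ip":
		// Parse each address once and store it in a set.
		set := make(map[netip.Addr]struct{}, len(entries))
		for _, e := range entries {
			addr, err := netip.ParseAddr(e)
			if err != nil {
				return nil, fmt.Errorf("bad IP %q: %w", e, err)
			}
			set[addr] = struct{}{}
		}
		return func(v string) bool {
			addr, err := netip.ParseAddr(v)
			if err != nil {
				return false
			}
			_, ok := set[addr]
			return ok
		}, nil
	case "regex":
		// Compile each regex before starting, so errors surface early.
		res := make([]*regexp.Regexp, 0, len(entries))
		for _, e := range entries {
			re, err := regexp.Compile(e)
			if err != nil {
				return nil, fmt.Errorf("bad regex %q: %w", e, err)
			}
			res = append(res, re)
		}
		return func(v string) bool {
			for _, re := range res {
				if re.MatchString(v) {
					return true
				}
			}
			return false
		}, nil
	case "string":
		// Plain strings need no regex engine: an exact set lookup is enough.
		set := make(map[string]struct{}, len(entries))
		for _, e := range entries {
			set[e] = struct{}{}
		}
		return func(v string) bool {
			_, ok := set[v]
			return ok
		}, nil
	default:
		return nil, fmt.Errorf("unknown list kind %q", kind)
	}
}
```

Without a declared kind, the loader would have to guess between these code paths, for example by trying to parse every entry as an IP first.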

Otherwise, your match-source approach with plug-ins is better :)

dmachard / go-dnscollector

Multi-line include/exclude in pipeline mode #508