brimdata / brimcap

Convert pcap files into richly-typed ZNG summary logs (Zeek, Suricata, and more)
BSD 3-Clause "New" or "Revised" License

Option to disable tail'ed processing of an analyzer's logs #44

Open philrz opened 3 years ago

philrz commented 3 years ago

Repro is with Brimcap commit 1fa5fc4 and https://archive.wrccdc.org/pcaps/2018/wrccdc.2018-03-23.010014000000000.pcap.gz (uncompressed) as my test data.

In my verification steps in #16 (comment), I first tried this unsuccessful approach to work around https://redmine.openinfosecfoundation.org/issues/4106, mistakenly thinking that all I needed to do was leave behind only valid logs to be subject to Zed processing.

$ cat /tmp/mysuricata 
#!/bin/bash
# Run Suricata against the pcap streamed in on stdin
suricata -r /dev/stdin
# Re-serialize eve.json with jq, which keeps only the last occurrence
# of any repeated key within an object
cat eve.json | jq -c . > deduped-eve.json
# Delete every generated file except the deduped output
shopt -s extglob
rm !("deduped-eve.json")

$ brimcap analyze -Z -config variant.yml ~/pcap/wrccdc.pcap > wrccdc.zson
{"type":"error","error":"duplicate field subject"}

@mattnibs explained to me what went wrong here. The "ztail" functionality in Brimcap starts performing Zed processing on the logs generated by the analyzer processes even before those processes are finished, since this allows users to potentially perform early querying on partial output. Because of this, Brimcap ended up choking on the partially-built eve.json (which contains the duplicate field names) before my wrapper script had a chance to delete it.

This led me to learn about and start using the globs parameter in the Brimcap config YAML so that ztail would tail only the deduped-eve.json file, and I was all set. However, having gone through that experience, I now recognize it would still be convenient to have a way to disable the ztail behavior entirely when processing an analyzer's generated logs, for two reasons I can think of:

  1. Whereas the post-processing I was doing here with jq produced output that could still be "tailed", some kinds of post-processing may not (e.g. they might rely on making an entire pass through a generated log once the complete output is present)
  2. Some users may know they don't want to query partial results, and therefore would rather not burn CPU cycles on incremental Zed processing and instead just wait until all logs have finished being output
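As a side note on reason 1, one way to sidestep the tailing problem without any new Brimcap option is a write-then-rename pattern: post-process into a temp file and rename it to the glob-matched name only once it's complete, so ztail never sees a partially-written file. A minimal sketch of the pattern (using `sort` as a stand-in for a whole-file pass such as the jq dedup above):

```shell
set -e
printf 'b\na\n' > eve.json       # stand-in for a finished analyzer log
sort eve.json > .post.tmp        # whole-file pass (jq -c dedup in the real case)
mv .post.tmp deduped-eve.json    # rename is atomic on the same filesystem, so
                                 # the glob-matched name only ever appears complete
cat deduped-eve.json             # prints "a" then "b"
```

Because the temp file's name never matches the configured glob, this is safe even while ztail is running.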
mattnibs commented 3 years ago

@philrz I'm not quite clear on what this ticket calls for. Is this a mode for an analyzer process that would wait to start reading records until the process has successfully exited?

philrz commented 3 years ago

@mattnibs: Yes, that was the essence.

Rereading the text now, my filing of this issue was in some ways a reflection of Brimcap's newness and of me not yet being completely familiar with its bells & whistles. Revisiting it now that Brimcap has been around longer and we've documented it more fully, I don't see it as urgent. Perhaps most importantly, the Custom Brimcap Configuration article covers a couple of key points:

  1. It states that Brimcap assumes an analyzer "writes to log outputs only by appending", so anyone who has read & absorbed the article should not be surprised by the "tailing" behavior.
  2. The NetFlow example shows how the globs parameter can be used to isolate files that have been post-processed and hence avoid the ones that are unsafe to tail while the analyzer is still running.
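For reference, the globs workaround looks roughly like the following variant.yml. The `globs` parameter is the one described above; the surrounding structure (`analyzers:`, `cmd:`) is my recollection of the config schema, so treat this as a sketch and consult the Custom Brimcap Configuration article for the authoritative format:

```yaml
analyzers:
  # Run the wrapper script; ztail reads only files matching "globs",
  # so the partially-written eve.json is never touched.
  - cmd: /tmp/mysuricata
    globs: ["deduped-eve.json"]
```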

As long as best practices are followed, it seems users could accomplish pretty much whatever they need without this option. Granted, if I use my imagination, I could see a future where it would still be handy. For instance, there are formats like Parquet that (as I understand it) can't be read until they're fully written. However, Brimcap doesn't have a way to directly import these formats right now (#80), so it's kind of moot.

If it's ok, I think I'll drop the MVP1 marker off this one but keep it open in the Deep Freeze so it's easy to find if a use case does surface again.