I revived this config while testing the compaction currently performed by the lake manager (#3923). This made me wish the JSON Zeek data could actually be properly shaped so I'd have real ts timestamps, etc. I came up with this revised config, which currently seems to be working OK with the reference shaper for Zeek.
$ cat fluentd-shaped.conf
<source>
  @type tail
  path /usr/local/var/logs/current/json_streaming_*
  follow_inodes true
  pos_file /Users/phil/work/fluentd/fluentd.pos
  tag zeek
  <parse>
    @type json
  </parse>
</source>

<match zeek>
  @type exec_filter
  command zq -z -I shaper.zed '| quiet(this)' -
  tag shaped
  <format>
    @type json
  </format>
  <parse>
    @type none
  </parse>
  <buffer>
    flush_interval 1s
  </buffer>
</match>

<match shaped>
  @type http
  endpoint http://127.0.0.1:9988/pool/zeek/branch/main
  content_type application/x-zson
  <format>
    @type single_value
  </format>
</match>
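As a side note, the zq stage can be sanity-checked outside Fluentd by feeding it a few NDJSON lines directly (file name illustrative; any of the json_streaming_* files would do):

$ head -5 /usr/local/var/logs/current/json_streaming_conn.log | zq -z -I shaper.zed '| quiet(this)' -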
Things to note:
The quiet(this) deals with what appear to be some new log types added in more recent Zeek releases that aren't yet covered by the reference shaper. The way the shaper is currently written, attempts to shape these just become error(missing). I'll plan to add these new logs to the reference shaper and also improve the error messages to help alert the user about what precisely went wrong.
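In the meantime, one way to spot the log types that are falling through is to swap the quiet() for a filter that keeps only the error values (a sketch using Zed's kind() function; file glob as in the config above):

$ zq -z -I shaper.zed '| kind(this)=="error"' /usr/local/var/logs/current/json_streaming_*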
I'd have preferred transporting the shaped data as ZNG rather than ZSON, but Fluentd seems fussy about things being text lines. I may tinker more with binary ZNG the next time I revisit this.
The buffering config may need more work. When playing around with just sending shaped data to stdout rather than posting to the Zed lake, it seemed like it was waiting excessively (forever?) for events to accumulate before sending output. I'm not sure that flush_interval 1s is even having much effect right now, but my live data does seem to be getting posted periodically, so I'm OK for now.
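If the flush behavior ever needs to be pinned down, the buffer section takes more explicit settings. A sketch with illustrative values (see Fluentd's buffer-section docs for the full parameter list):

<buffer>
  flush_mode interval
  flush_interval 1s
  chunk_limit_size 1m
  flush_thread_count 2
</buffer>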
This looks awesome! +1 for ZNG support <3
https://zed.brimdata.io/docs/commands/zq#performance-comparisons
I wonder what would happen if you replaced https://github.com/corelight/json-streaming-logs/blob/master/scripts/main.zeek#L127 with filt$config["tsv"] = "T";
https://github.com/zeek/zeek/blob/master/scripts/base/frameworks/logging/writers/ascii.zeek#L12
@mrbluecoat: Glad to hear you think it sounds encouraging! This material actually hasn't been updated in a while and the Zed lake has evolved a bit in that time (and I imagine Fluentd might have changed a bit too). I've been starting to more formally document these kinds of topics in the "Integrations" area of the Zed docs and would be keen to do that with this Fluentd stuff if it's something you could see putting to use. However, as I'm doing that, I'd like to make sure that whatever it covers is relevant to your goals. I imagine GitHub Issue comments might not be the best forum for laying out those details, so could you ping me through other channels so we could discuss? If you like Slack you could send yourself an invite to our community Slack workspace and message me @Phil, or send an email to support@brimdata.io. Hope to hear from you!
Will do, thanks
As anticipated in my previous comment here, I did end up writing an article expanding and formalizing the config originally sketched out in this issue. The article can be found at https://zed.brimdata.io/docs/next/integrations/fluentd and will be tagged with the docs version for the next GA Zed release, which is expected soon. The user who chimed in here in previous comments followed up on Slack and reported that they followed the article and successfully observed live data import. Therefore I'm going to go ahead and close this issue, with the expectation that the article will continue to be improved over time as users find and follow it and share their experiences.
In an exercise similar to #3151 with Logstash, I've run a test with external agent Fluentd to push "live" data continuously to a Zed lake.
tl;dr - As Fluentd is a more modern tool than Logstash, in the end it did indeed seem to play more nicely with Zed in a minimal out-of-the-box config.
The test was performed with a Zed lake running behind Zui insiders 0.30.1-142 (which is Zed commit 101d358), Fluentd 1.15.3, and Zeek 5.1.1 (with the json-streaming-logs package) sniffing the coincidental live local traffic on my wireless interface during a typical work-from-home day.

The fluentd.conf I ultimately settled on:

Then to start Fluentd:
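With a gem-based install, that amounts to something like the following sketch (the ulimit value is illustrative, following the Fluentd "Before Installation" guidance discussed next):

$ ulimit -n 65536
$ fluentd -c fluentd.conf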
The ulimit guidance came from the "Before Installation" docs and did end up being significant, as the default setting on my MacBook was 256, and in an early run Fluentd did indeed complain of running out of file descriptors.

The only minor speedbump I hit along the way was needing to specify
content_type application/json. At first, after reading the relevant docs, I thought the defaults would work, since the doc explains that json_array is false by default, and hence it posts NDJSON (which zed serve accepts) and sets the Content-Type to application/x-ndjson. However, this caused an HTTP 400 error: the Zed backend handles both regular JSON and NDJSON through the same reader, and that reader is wired up only to the Content-Type application/json, so I had to set that content_type explicitly.

Other than that, the defaults are way more friendly than what we saw with Logstash. As #3151 gets into, Logstash's default behavior was to either send a single JSON object per post or be configured to send a JSON array of objects. The former is known to perform poorly (#4266), and with the latter the posts end up in the pool as whole arrays and hence currently need to be post-processed in some way if they're going to be treated as individual Zed records. By comparison, Fluentd's default behavior not only sends NDJSON but also does so at a
flush_interval of 60s, which means more records per commit and hence better performance. There are also a lot of tuning parameters on that page, indicating users have a lot more flexibility at their disposal if these defaults are undesirable.

Having let it run for a couple of hours,
7523 Zeek events have accumulated in my pool, and here's a look at the metadata that shows object size/count:
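That view comes from the lake's per-object metadata; a rough command-line equivalent would be a meta-query along these lines (a sketch; pool and branch names as used in this test, syntax per the Zed lake docs of that era):

$ zed query -Z "from zeek@main:objects"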
Applying some crude math, here's the average data object size/count:

Obviously that size is still way below the default 500 MB object size, and hence we'd still be waiting a long time to see benefit from compaction. However, compared to the single-object commits discussed in #3151 and #4266, the data as stored here is much more forgiving when accessed in queries, e.g., counting is still very quick:
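As a sketch of that kind of query against this lake (pool name as above):

$ zed query "from zeek | count()"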