Vince-Cercury opened this issue 5 years ago
Which version of Fluentd and Ruby do you use? Do you see the same errors without k8s?
Hey Okkez!
Dockerfile for the images used is here: https://github.com/splunk/fluent-plugin-splunk-hec/blob/master/docker/Dockerfile
We use Ruby 2.5.1-slim, concat 2.3.0, and fluentd 1.4.0 in our latest build. We don't believe we saw this with ruby 2.5.1-slim, fluentd 1.2.x, and concat 2.2.2, which we used in the previous version of our images.
Thanks for having a look!
Is there anything further I can collect to assist with this?
@matthewmodestino Thanks. I need a reproducible scenario to debug this issue; so far I don't have any idea how to resolve it...
Have you had a chance to test the plugin against fluentd 1.4?
What is your preferred testing method?
If I provide steps using our docker container, is that ok? Or do you prefer testing some other way?
I have tested this plugin with a very simple configuration and logs on Fluentd 1.4.x.
I prefer a bare-metal environment for debugging, but reproducible steps using Docker containers are also fine.
Hi @okkez,
Apologies for the delay.
I have pushed test configs to my repo, which I hope will make it easy for you to have a look at this.
I used Docker for Mac, but these configs should run in any Docker environment. You can clone the repo:
git clone https://github.com/matthewmodestino/container_workshop.git
Or just grab the splunk_docker and fluentd_docker folders from the repo:
https://github.com/matthewmodestino/container_workshop/tree/master/splunk_docker
https://github.com/matthewmodestino/container_workshop/tree/master/fluentd_docker
Once you have pulled down the folders (the following assumes you keep the folder naming), run this from within the splunk_docker folder:
docker-compose -f 1so_hec.yaml up -d
Give the Splunk container a minute or two to start up; this just ensures fluentd has a valid backend to send to.
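One way to check that it is ready (assuming the container name so1 used in the compose file) is to wait for its status to flip from (health: starting) to (healthy):
docker ps --filter name=so1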
From within the fluentd_docker folder, deploy the fluentd container:
docker-compose -f fluent_hec.yaml up -d
The fluentd_docker folder has a config directory that contains the fluentd configuration files and is mounted into the container when it runs. We can update configs here and then delete and redeploy the container for easy testing.
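For example, after editing a config, the retest cycle is just:
docker-compose -f fluent_hec.yaml down
docker-compose -f fluent_hec.yaml up -d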
The output.conf file contains the following concat plugin config, which is applied to any Docker json driver logs we pick up:
<filter tail.containers.**>
@type concat
key log
timeout_label @SPLUNK
stream_identity_key stream
multiline_start_regexp /^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}/
flush_interval 5s
</filter>
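As a hypothetical illustration of how that regexp drives the grouping: any line beginning with a YYYY-MM-DD HH:MM:SS timestamp starts a new event, and lines that don't match are folded into the previous event. Given these (invented) log lines:
2019-05-21 17:20:01 ERROR request failed
java.lang.RuntimeException: boom
    at com.example.App.main(App.java:10)
2019-05-21 17:20:02 INFO next request
the two untimestamped lines are concatenated onto the first event, and the last line starts a new one.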
You should now have the following 2 containers running:
docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
dca738a36569 splunk/fluentd-hec:1.1.0 "/fluentd/entrypoint…" 4 seconds ago Up 3 seconds fluent-hec
36ce93169d2d splunk/splunk:latest "/sbin/entrypoint.sh…" 26 seconds ago Up 24 seconds (health: starting) 4001/tcp, 8065/tcp, 8191/tcp, 9887/tcp, 9997/tcp, 0.0.0.0:9999->8000/tcp, 0.0.0.0:9088->8088/tcp, 0.0.0.0:9089->8089/tcp so1
Now check the logs of the fluentd container:
docker logs -f dca738a36569
At the very top of the log we dump the running configs and versions:
<system>
log_level info
root_dir "/tmp/fluentd"
</system>
<source>
@id containers.log
@type tail
@label @SPLUNK
tag "tail.containers.*"
path "/var/lib/docker/containers/*/*-json.log"
pos_file "/var/log/splunk-fluentd-containers.log.pos"
path_key "source"
read_from_head true
<parse>
@type "json"
time_key "time"
time_type string
time_format "%Y-%m-%dT%H:%M:%S.%NZ"
localtime false
</parse>
</source>
<source>
@id config.v2
@type tail
@label @SPLUNK
tag "tail.config.*"
path "/var/lib/docker/containers/*/config.v2.json"
pos_file "/var/log/splunk-fluentd-docker-config.log.pos"
path_key "source"
read_from_head true
<parse>
@type "json"
</parse>
</source>
<source>
@id fluentd-monitor-agent
@type monitor_agent
@label @SPLUNK
tag "monitor_agent"
</source>
<label @SPLUNK>
<filter tail.containers.**>
@type jq_transformer
jq ".record | .container_id = (.source | capture(\"(?<container_id>[0-9a-f]{12})\") | .container_id)"
</filter>
<filter tail.containers.**>
@type concat
key "log"
timeout_label "@SPLUNK"
stream_identity_key "stream"
multiline_start_regexp "/^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}/"
flush_interval 5s
</filter>
<filter tail.config.**>
@type jq_transformer
jq ".record.sourcetype = \"docker:config\" | .record"
</filter>
<match **>
@type splunk_hec
protocol https
hec_host "localhost"
hec_port 9088
hec_token "00000000-0000-0000-0000-000000000000"
host "docker"
source_key "source"
sourcetype_key "sourcetype"
insecure_ssl true
<fields>
container_id
</fields>
<buffer>
@type "memory"
chunk_limit_records 10000
chunk_limit_size 200m
flush_interval 5s
flush_thread_count 1
overflow_action block
retry_max_times 3
total_limit_size 600m
</buffer>
<format monitor_agent>
@type "json"
</format>
<format tail.config.**>
@type "json"
</format>
<format>
@type "single_value"
message_key "log"
add_newline false
</format>
</match>
</label>
</ROOT>
2019-05-21 17:16:35 +0000 [info]: starting fluentd-1.4.0 pid=6 ruby="2.5.1"
2019-05-21 17:16:35 +0000 [info]: spawn command to main: cmdline=["/usr/local/bin/ruby", "-Eascii-8bit:ascii-8bit", "/usr/local/bundle/bin/fluentd", "-c", "/fluentd/etc/fluent.conf", "--under-supervisor"]
2019-05-21 17:16:35 +0000 [info]: gem 'fluent-plugin-concat' version '2.3.0'
2019-05-21 17:16:35 +0000 [info]: gem 'fluent-plugin-jq' version '0.5.1'
2019-05-21 17:16:35 +0000 [info]: gem 'fluent-plugin-prometheus' version '1.3.0'
2019-05-21 17:16:35 +0000 [info]: gem 'fluent-plugin-splunk-hec' version '1.1.0'
2019-05-21 17:16:35 +0000 [info]: gem 'fluent-plugin-systemd' version '1.0.2'
2019-05-21 17:16:35 +0000 [info]: gem 'fluentd' version '1.4.0'
2019-05-21 17:16:35 +0000 [info]: adding filter in @SPLUNK pattern="tail.containers.**" type="jq_transformer"
2019-05-21 17:16:35 +0000 [info]: adding filter in @SPLUNK pattern="tail.containers.**" type="concat"
2019-05-21 17:16:35 +0000 [info]: adding filter in @SPLUNK pattern="tail.config.**" type="jq_transformer"
2019-05-21 17:16:35 +0000 [info]: adding match in @SPLUNK pattern="**" type="splunk_hec"
2019-05-21 17:16:35 +0000 [info]: adding source type="tail"
2019-05-21 17:16:35 +0000 [warn]: #0 [containers.log] 'encoding' and 'from_encoding' are same encoding. No effect
2019-05-21 17:16:35 +0000 [info]: adding source type="tail"
2019-05-21 17:16:35 +0000 [warn]: #0 [config.v2] 'encoding' and 'from_encoding' are same encoding. No effect
2019-05-21 17:16:35 +0000 [info]: adding source type="monitor_agent"
2019-05-21 17:16:35 +0000 [info]: #0 starting fluentd worker pid=14 ppid=6 worker=0
You should see that the logs we picked up from the fluentd container itself start triggering the concat rule and spitting out errors:
2019-05-21 17:18:53 +0000 [info]: #0 Timeout flush: tail.containers.var.lib.docker.containers.dca738a36569ceb04fa6c21bc0cf39a4fbd95ab661e140cedc4c5b423f8cddc0.dca738a36569ceb04fa6c21bc0cf39a4fbd95ab661e140cedc4c5b423f8cddc0-json.log:stdout
2019-05-21 17:18:58 +0000 [warn]: #0 dump an error event: error_class=ThreadError error="deadlock; recursive locking" location="/usr/local/bundle/gems/fluent-plugin-concat-2.3.0/lib/fluent/plugin/filter_concat.rb:144:in `synchronize'" tag="tail.containers.var.lib.docker.containers.dca738a36569ceb04fa6c21bc0cf39a4fbd95ab661e140cedc4c5b423f8cddc0.dca738a36569ceb04fa6c21bc0cf39a4fbd95ab661e140cedc4c5b423f8cddc0-json.log" time=2019-05-21 17:18:58.562890167 +0000 record={"log"=>"2019-05-21 17:18:53 +0000 [info]: #0 Timeout flush: tail.containers.var.lib.docker.containers.dca738a36569ceb04fa6c21bc0cf39a4fbd95ab661e140cedc4c5b423f8cddc0.dca738a36569ceb04fa6c21bc0cf39a4fbd95ab661e140cedc4c5b423f8cddc0-json.log:stdout\n", "stream"=>"stdout", "source"=>"/var/lib/docker/containers/dca738a36569ceb04fa6c21bc0cf39a4fbd95ab661e140cedc4c5b423f8cddc0/dca738a36569ceb04fa6c21bc0cf39a4fbd95ab661e140cedc4c5b423f8cddc0-json.log", "container_id"=>"dca738a36569"}
Let me know if there is something specific you would like me to check in the fluentd container itself.
To clean up these resources, from within your fluentd_docker folder:
docker-compose -f fluent_hec.yaml down
Then, from within your splunk_docker folder:
docker-compose -f 1so_hec.yaml down
Clean up docker volumes:
docker volume prune
Sorry for the delay. I missed this issue.
Try the following diff:
diff --git a/fluentd_docker/config/output.conf b/fluentd_docker/config/output.conf
index 9029af3..531748f 100644
--- a/fluentd_docker/config/output.conf
+++ b/fluentd_docker/config/output.conf
@@ -1,10 +1,9 @@
-<label @SPLUNK>
- # extract index fields and sourcetype for container logs
+<label @TRANSFORM1>
+ # extract index fields and sourcetype for container logs
<filter tail.containers.**>
@type jq_transformer
jq '.record | .container_id = (.source | capture("(?<container_id>[0-9a-f]{12})") | .container_id)'
</filter>
-
<filter tail.containers.**>
@type concat
key log
@@ -13,14 +12,26 @@
multiline_start_regexp /^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}/
flush_interval 5s
</filter>
+ <match>
+ @type relabel
+ @label @SPLUNK
+ </match>
+</label>
- # = filters for docker config =
+<label @TRANSFORM2>
+ # = filters for docker config =
<filter tail.config.**>
@type jq_transformer
jq ".record.sourcetype = \"docker:config\" | .record"
</filter>
-
- # = output =
+ <match>
+ @type relabel
+ @label @SPLUNK
+ </match>
+</label>
+
+<label @SPLUNK>
+ # = output =
<match **>
@type splunk_hec
protocol https
diff --git a/fluentd_docker/config/source.containers.conf b/fluentd_docker/config/source.containers.conf
index 636c6ac..af4a05d 100644
--- a/fluentd_docker/config/source.containers.conf
+++ b/fluentd_docker/config/source.containers.conf
@@ -18,7 +18,7 @@
<source>
@id containers.log
@type tail
- @label @SPLUNK
+ @label @TRANSFORM1
tag tail.containers.*
path /var/lib/docker/containers/*/*-json.log
pos_file /var/log/splunk-fluentd-containers.log.pos
diff --git a/fluentd_docker/config/source.docker.conf b/fluentd_docker/config/source.docker.conf
index e1c5246..43f43fe 100644
--- a/fluentd_docker/config/source.docker.conf
+++ b/fluentd_docker/config/source.docker.conf
@@ -2,7 +2,7 @@
<source>
@id config.v2
@type tail
- @label @SPLUNK
+ @label @TRANSFORM2
tag tail.config.*
path /var/lib/docker/containers/*/config.v2.json
pos_file /var/log/splunk-fluentd-docker-config.log.pos
In the original configuration, the concat plugin configuration includes timeout_label @SPLUNK, which re-emits timed-out events back into <label @SPLUNK>.
At that point, events re-emitted via the timeout and events emitted via the normal flow conflict, and that causes the deadlock reported in this issue.
In the diff above, I set the proper label per <source> so that the appropriate filters are applied to each.
The concat filter should be applied only once.
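To make the routing change concrete, here is a minimal sketch of the two patterns (label names illustrative; concat and output parameters trimmed for brevity):
# Problematic: timeout_label points back into the same label that hosts the
# concat filter, so timed-out events re-enter it and race with the normal
# emit path ("deadlock; recursive locking").
<label @SPLUNK>
  <filter tail.containers.**>
    @type concat
    key log
    flush_interval 5s
    timeout_label @SPLUNK
  </filter>
  <match **>
    @type splunk_hec
  </match>
</label>
# Fixed: concat runs in its own label; the normal flow is relabeled onward and
# timed-out events jump straight to @SPLUNK, so each event traverses the
# concat filter exactly once.
<label @CONCAT>
  <filter tail.containers.**>
    @type concat
    key log
    flush_interval 5s
    timeout_label @SPLUNK
  </filter>
  <match **>
    @type relabel
    @label @SPLUNK
  </match>
</label>
<label @SPLUNK>
  <match **>
    @type splunk_hec
  </match>
</label>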
Thanks so much for checking. I will review and advise.
I can also confirm that avoiding re-emission into the current label fixes the problem. The use case here is Docker's partial_message: concatenating logs that were split up by standard Docker buffering.
If anyone is interested, this is my current configuration snippet, working well with docker-ce 18.06.1, fluentd 1.7.0, and fluent-plugin-concat 2.4.0:
<source>
@type forward
# Some Docker logs can be split to 16 KB chunks, concat them first
@label @CONCAT_PARTIAL
port 24224
bind 0.0.0.0
</source>
<label @CONCAT_PARTIAL>
<filter **>
@type concat
key log
stream_identity_key container_id
partial_key partial_message
partial_value true
keep_partial_key false
separator ""
use_first_timestamp true
flush_interval 5
timeout_label @START
</filter>
# Send all logs to @START label, whether the concat attempt timed out or succeeded
<match **>
@type relabel
@label @START
</match>
</label>
<label @START>
# the rest of log processing pipeline
</label>
@peter-slovak I have a log (about 59 KB) that Docker truncates into four separate partial logs, and those chunks arrive interleaved with other regular logs. How do the key log and partial_key partial_message settings determine when a partial log ends, so the chunks can be appended together?
<filter **>
@type concat
key log
partial_key partial_message
partial_value true
keep_partial_key false
separator ""
use_first_timestamp true
flush_interval 5
timeout_label @START
</filter>
Thanks.
@bamb00 As I understand it, Docker flags partial logs in its internal structure and then exposes that flag to the particular logging driver/plugin. In the case of fluentd, the flag becomes the partial_message log attribute.
Since these logs are line-oriented, as soon as we encounter an \n in the log we know it's the last chunk. See https://github.com/moby/moby/issues/34620#issuecomment-325912345 for a more detailed explanation by people who know a lot more about this :)
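For illustration only (record contents invented here), a long line split at the 16 KB boundary arrives roughly as a sequence of records like the following, and concat glues them back together per container_id until it sees the final, unflagged chunk ending in \n:
{"log":"first 16 KB of the line...","partial_message":"true","container_id":"dca738a36569","source":"stdout"}
{"log":"...middle chunk...","partial_message":"true","container_id":"dca738a36569","source":"stdout"}
{"log":"...final chunk\n","container_id":"dca738a36569","source":"stdout"}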
Problem
If I use multiline detection, it works, but I get "deadlock; recursive locking" errors in the fluentd logs (the concat plugin is used as part of Splunk Connect for Kubernetes: https://github.com/splunk/splunk-connect-for-kubernetes)
Steps to replicate
Log:
Config:
Expected Behavior
It works (the multi-line log appears as a single event in Splunk), but I see many errors in the fluentd logs; I believe one error every 5 seconds for every log file:
Your environment
See original ticket https://github.com/splunk/splunk-connect-for-kubernetes/issues/111