I have never seen this error, and I don't think fluent-plugin-bigquery itself causes it. If BigQuery itself is unstable, the flush threads of fluent-plugin-bigquery can end up waiting a long time, which may be what causes this error. Could you give me your configuration and plugin version?
fluent-plugin-bigquery v1.0.0 with this commit manually patched in: https://github.com/kaizenplatform/fluent-plugin-bigquery/pull/137/commits/eaec2d13972312297aabdd317b02850b8d1a37cc
With Fluentd v0.14 (td-agent 3).
The failure happens whether using multiple workers or just one. If I remove the customer_id buffer parameter, it runs fine.
I have 5 eventType handlers like these; one streams, the others load. If the flush interval is set higher, the system fails faster. It eventually locks up, and the forwarders see the aggregator (config below) as down.
<match prod.push.eventType>
  @type copy
  <store>
    @type bigquery
    @id eventType
    auth_method application_default
    project project
    dataset dataset
    table eventType_${customer_id}$%Y%m%d
    ignore_unknown_values true
    schema_path /etc/td-agent/eventType.json
    <buffer customer_id, time>
      @type file
      timekey 1d
      flush_interval 10s
      flush_thread_count 4
      total_limit_size 10GB
      timekey_use_utc
    </buffer>
    <secondary bqguid>
      @type file # or forward
      path /var/log/td-agent/buffer/failed/bigquery.buffer.eventType.${customer_id}.*.log
    </secondary>
  </store>
  <store>
    @type flowcounter
    tag prod.metric.eventType
    unit minute
    aggregate all
  </store>
</match>
The 4 load types are identical:
<match prod.push.eventType>
  @type copy
  <store>
    @type bigquery
    @id eventTypeBQ
    auth_method application_default
    method load
    project project
    dataset dataset
    table table_${customer_id}$%Y%m%d
    #fetch_schema true
    schema_path /etc/td-agent/table.json
    max_bad_records 5
    ignore_unknown_values true
    slow_flush_log_threshold 600s
    <buffer customer_id, time>
      @type file
      timekey 1d
      flush_interval 1200s
      flush_at_shutdown true
      flush_thread_count 2
      timekey_use_utc
    </buffer>
    <secondary bqguid>
      @type file # or forward
      path /var/log/td-agent/buffer/failed/bigquery.buffer.table.${customer_id}.*.log
    </secondary>
  </store>
  <store>
    @type flowcounter
    tag prod.metric.eventType
    unit minute
    aggregate all
  </store>
</match>
The forwarder looks like this. The aggregator seizes up and fails regardless of the flush_interval and thread settings.
# Log Forwarding
<match prod.**>
  type forward
  <server>
    host x.x.x.x
    port 24225
    weight 60
  </server>
  <server>
    host x.x.x.y
    port 24225
    weight 60
  </server>
  <server>
    host x.x.x.z
    port 24225
    weight 60
  </server>
  # use udp for heartbeat
  heartbeat_type udp
  flush_interval .1s
  buffer_chunk_limit 999k
  buffer_queue_limit 60240
  # use multi-threading to send buffered data in parallel
  num_threads 32
  expire_dns_cache 600
  # use file buffer to buffer events on disks.
  buffer_type file
  buffer_path /var/log/td-agent/buffer/forward
  # in case buffer becomes full, have local backup
  <secondary>
    type file
    path /var/log/td-agent/buffer/secondary
    compress gzip
  </secondary>
</match>
To provide some more info: I changed messages to carry the customer_id in the tag and to route based on that, so the tag, rather than the message data alone, is used for processing, as sketched below. The issue has since gone away. It seems the deadlock occurs when messages are forwarded undifferentiated and have to be split at the aggregator level.
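For illustration, a minimal sketch of that kind of tag-based routing. The commenter does not say how the tag rewriting was done; using fluent-plugin-rewrite-tag-filter and the exact rule below are my assumptions, as is picking the customer id back out with ${tag[3]}:

# Assumption: lift customer_id out of the record into the tag,
# e.g. prod.push.eventType -> prod.push.eventType.1234
<match prod.push.eventType>
  @type rewrite_tag_filter
  <rule>
    key customer_id
    pattern /^(.+)$/
    tag ${tag}.$1
  </rule>
</match>

# Route and chunk by tag instead of by a record field
<match prod.push.eventType.*>
  @type bigquery
  auth_method application_default
  project project
  dataset dataset
  # tag parts are 0-indexed, so ${tag[3]} is the customer id
  table eventType_${tag[3]}$%Y%m%d
  schema_path /etc/td-agent/eventType.json
  <buffer tag, time>
    @type file
    timekey 1d
  </buffer>
</match>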
I believe I've found the underlying source of the thread issues; I'm just not sure whether it is a fluentd issue or a plugin issue. What I've found is that if there is a lot of reorganization of forwarded chunks, fluentd becomes unstable and has many issues. When all my chunks are aligned to 999k, everything runs very well. I tried reducing the forwarded chunks to 100k on a few test servers and the system became unstable; changing back to 999k all around restored stability. It seems something in the re-chunking logic is causing deadlocks or other issues.
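As a concrete sketch of the alignment described above: buffer_chunk_limit (v0.12-style, forwarder) and chunk_limit_size (v0.14-style buffer section, aggregator) are real parameters, but pairing them like this is my reading of the report, not something it states explicitly:

# Forwarder: size of chunks produced before forwarding
<match prod.**>
  type forward
  buffer_chunk_limit 999k
  <server>
    host x.x.x.x
    port 24225
  </server>
</match>

# Aggregator: keep the chunk limit at the same size so forwarded
# chunks are not split or merged on arrival
# (this <buffer> section goes inside the bigquery <store> blocks above)
<buffer customer_id, time>
  @type file
  chunk_limit_size 999k
</buffer>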
Hmm.. Thanks for your report!
I believe this may have been the source of this issue: https://github.com/fluent/fluentd/issues/1549
We had a similar issue, but fluentd-0.14.22 resolved it. Thanks!
I'm seeing errors in fluent-plugin-bigquery that seem similar to deadlock issues that have been reported in general with Fluentd v0.14. However, since the errors specifically reference fluent-plugin-bigquery, I wanted to drop them in here. Things to know: restart takes a long time and finally dies with these errors.