lukas-vlcek opened 8 years ago
The situation seems to get much worse as the number of files Fluentd needs to tail increases.
For example, if we use the following ENV variables for the test:
export NMESSAGES=20
export NPROJECTS=99
(^^ this means Fluentd will be tailing 100 log files each containing only 20 records)
We end up with the following indices state:
# curl http://10.40.2.198:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open this-is-project-84.1.2016.06.21 5 1 4 0 6.9kb 6.9kb
yellow open this-is-project-93.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-02.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-95.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-29.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-27.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-83.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-19.1.2016.06.21 5 1 3 0 6.7kb 6.7kb
yellow open this-is-project-97.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-20.1.2016.06.21 5 1 4 0 6.9kb 6.9kb
yellow open this-is-project-86.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-82.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-35.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-45.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-78.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-96.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-30.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-64.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-34.1.2016.06.21 5 1 20 0 32.6kb 32.6kb
yellow open this-is-project-36.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-17.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-04.1.2016.06.21 5 1 12 0 19.7kb 19.7kb
yellow open this-is-project-01.1.2016.06.21 5 1 13 0 20kb 20kb
yellow open this-is-project-03.1.2016.06.21 5 1 4 0 6.9kb 6.9kb
yellow open this-is-project-18.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-31.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-49.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-74.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-32.1.2016.06.21 5 1 8 0 13.3kb 13.3kb
yellow open this-is-project-81.1.2016.06.21 5 1 20 0 32.5kb 32.5kb
yellow open this-is-project-89.1.2016.06.21 5 1 3 0 6.7kb 6.7kb
yellow open this-is-project-55.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-14.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-54.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-24.1.2016.06.21 5 1 20 0 32.5kb 32.5kb
yellow open this-is-project-63.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-13.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-58.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-47.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-79.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-50.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-65.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-91.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-11.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-40.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-61.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-08.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-06.1.2016.06.21 5 1 20 0 32.5kb 32.5kb
yellow open this-is-project-57.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-51.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-37.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-66.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-69.1.2016.06.21 5 1 4 0 6.9kb 6.9kb
yellow open this-is-project-12.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-41.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-77.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-33.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-15.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-09.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-23.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-44.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-72.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-21.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-07.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-25.1.2016.06.21 5 1 20 0 32.5kb 32.5kb
yellow open this-is-project-48.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-98.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-68.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-43.1.2016.06.21 5 1 20 0 32.5kb 32.5kb
yellow open this-is-project-22.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-85.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-76.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-94.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-59.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-70.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-05.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-38.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-90.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-60.1.2016.06.21 5 1 0 0 575b 575b
yellow open .operations.2016.06.21 5 1 20 0 23.8kb 23.8kb
yellow open this-is-project-28.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-16.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-88.1.2016.06.21 5 1 5 0 7.1kb 7.1kb
yellow open this-is-project-75.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-53.1.2016.06.21 5 1 20 0 32.5kb 32.5kb
yellow open this-is-project-73.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-71.1.2016.06.21 5 1 20 0 32.6kb 32.6kb
yellow open this-is-project-92.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-42.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-67.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-52.1.2016.06.21 5 1 16 0 26.1kb 26.1kb
yellow open this-is-project-26.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-46.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-80.1.2016.06.21 5 1 9 0 13.6kb 13.6kb
yellow open this-is-project-39.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-99.1.2016.06.21 5 1 4 0 6.9kb 6.9kb
yellow open this-is-project-56.1.2016.06.21 5 1 4 0 6.9kb 6.9kb
yellow open this-is-project-87.1.2016.06.21 5 1 0 0 575b 575b
yellow open this-is-project-62.1.2016.06.21 5 1 11 0 19.5kb 19.5kb
yellow open this-is-project-10.1.2016.06.21 5 1 0 0 575b 575b
Only a few indices have any documents, and just a few contain the expected 20 documents.
We are getting bulk.rejected = 429 (!):
# curl http://10.40.2.198:9200/_cat/thread_pool?v
host ip bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected
c681d5122401 172.17.0.2 0 0 429 0 0 0 0 0 0
Just for the record: if we investigate the Fluentd plugin status (via the REST API, which is the subject of PR #4), we can see that the elasticsearch output does not hold any data in buffers:
{
  "plugin_id": "object:1099830",
  "plugin_category": "output",
  "type": "elasticsearch_dynamic",
  "config": {
    "@type": "elasticsearch_dynamic",
    "host": "viaq-elasticsearch",
    "port": "9200",
    "scheme": "http",
    "index_name": "${record['kubernetes_namespace_name']}.${record['kubernetes_namespace_id']}.${Time.at(time).getutc.strftime(@logstash_dateformat)}",
    "client_key": "",
    "client_cert": "",
    "ca_file": "",
    "flush_interval": "5s",
    "max_retry_wait": "300",
    "disable_retry_limit": ""
  },
  "output_plugin": true,
  "buffer_queue_length": 0,
  "buffer_total_queued_size": 0,
  "retry_count": 0
}
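For reference, this per-plugin status is the kind of data Fluentd's standard monitor_agent input exposes; assuming that source is enabled on its default port 24220 (an assumption here - the exact REST endpoint in this setup is the subject of PR #4), it can be queried with something like:
# curl http://localhost:24220/api/plugins.json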
We might be suffering from the default thread pool queue size settings: https://www.elastic.co/guide/en/elasticsearch/reference/1.5/modules-threadpool.html
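If we want to verify that, the bulk queue size can be raised for a test run; a minimal sketch, assuming this ES version still accepts thread pool settings via the cluster update settings API (the value 500 is arbitrary for the experiment - a larger queue only absorbs short bursts, it does not add indexing capacity):
# curl -XPUT http://10.40.2.198:9200/_cluster/settings -d '{ "transient" : { "threadpool.bulk.queue_size" : 500 } }'
The same setting could also be set statically as threadpool.bulk.queue_size in elasticsearch.yml.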
We are running into the same issue when using rsyslog instead of Fluentd.
For the case:
export NMESSAGES=20
export NPROJECTS=99
We are getting bulk.rejected = 264.
Actually, as explained in the official guide, if a bulk request fails due to rejection it is not considered an error on the ES side, but it should be a signal to the client:
"Rejections are not errors: they just mean you should try again later."
This means we need to check that all clients (fluentd and rsyslog ATM) handle the relevant HTTP response codes correctly. See here, which could be a good starting point.
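One wrinkle to keep in mind: a bulk rejection usually does not show up as a non-200 status for the whole HTTP request - the bulk response typically comes back as 200 with "errors": true and a per-item "status": 429 in the items array, so clients have to inspect the response body, not only the HTTP code. A rough manual check (bulk-payload.json is just a placeholder for any bulk request body):
# curl -s -XPOST 'http://10.40.2.198:9200/_bulk' --data-binary @bulk-payload.json | grep -o '"status":429' | wc -l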
In the case of rsyslog and omelasticsearch we need to configure errorfile (which is optional ATM). rsyslog should write all errored requests to this file (which should also cover rejected bulk requests - but I am not sure whether this specific use case is tested in the rsyslog code!). My understanding is that there is no mechanism for resubmitting data from this error file; it is the responsibility of the user to set up some process that periodically checks this file and investigates/decides what to do.
See http://www.rsyslog.com/doc/v8-stable/configuration/modules/omelasticsearch.html and https://github.com/rsyslog/rsyslog/issues/104. Also check https://github.com/rsyslog/rsyslog/pull/246 for possible improvements/changes in newer versions of rsyslog (i.e., depending on the version of rsyslog we use, we should be able to get better support for resubmitting errored requests).
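A minimal sketch of what that could look like (parameter names as per the omelasticsearch doc above; the server name and error file path are illustrative for this setup):
module(load="omelasticsearch")
# bulkmode batches records into _bulk requests; errorfile collects requests that failed or were rejected
action(type="omelasticsearch"
       server="viaq-elasticsearch"
       serverport="9200"
       bulkmode="on"
       errorfile="/var/log/omelasticsearch-error.log")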
Note that starting with ES 2.2 there should be a configurable out-of-the-box bulk error retry mechanism that can self-heal from some of the issues we see today; see https://github.com/elastic/elasticsearch/issues/14620 and https://github.com/elastic/elasticsearch/pull/14829 (see also https://www.elastic.co/guide/en/elasticsearch/reference/2.3/release-notes-2.2.0.html#enhancement-2.2.0 under "Java API"). However, this seems to be an improvement only for short peaks; we still need to watch for cases where ES cannot keep up with sustained high bulk indexing traffic.
In the case of the fluentd elasticsearch plugin, it seems any sent data associated with an HTTP response code other than 200 is discarded; see https://github.com/uken/fluent-plugin-elasticsearch/issues/105
Is this discarding issue still present in fluent-plugin-elasticsearch v1.11.1 and v2.1.1?
Probably not - we haven't tested with that.
Summary
Missing log messages in Elasticsearch indices when running the openshift-test.sh script.
Details
Note: the fix from PR #5 needs to be applied/merged first.
Assume the following ENV variables:
When the test openshift-test.sh is executed, it fails (times out) on verification of the expected records in the index for the last project (i.e. project-09). Specifically:
Further investigation reveals that this index is missing some records (note the index this-is-project-09.1.2016.06.21 contains only 88 documents instead of 110):
Note the size of the source log files in the /tmp/tmp.pKiGhuIsLw/data/docker folder below; all are equal (which is expected and means they all contain the same number of log messages):
Further, we can see that there was a possible issue in pushing the data to Elasticsearch (there was 1 rejected bulk request - bulk.rejected):
However, the Elasticsearch log does not show any errors. We can see that after the cluster is started, the expected indices were created and the mapping was updated as documents were indexed. That is all: