msb-dev opened 4 years ago
I can confirm this is happening to me too. I'm running docker.elastic.co/logstash/logstash-oss:7.5.2, and I'm seeing the localhost:9600/ call take just a few milliseconds sometimes, and sometimes over 10 seconds.
Since I use the / call as a sort of health check for Logstash, this causes the service to appear to flap constantly between healthy and unhealthy. The system isn't overloaded in terms of CPU, memory, or I/O, so I have no idea why this happens.
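In case it helps anyone else seeing the same flapping, a more forgiving health check keeps the occasional slow response from marking the container unhealthy. This is only a sketch, assuming the container runs under Docker Compose and that curl is available inside the image; the intervals are examples, not a recommendation:

```yaml
# docker-compose.yml sketch: generous healthcheck timeout so a 10s+ response
# from the monitoring API does not flip the container to unhealthy
services:
  logstash:
    image: docker.elastic.co/logstash/logstash-oss:7.5.2
    healthcheck:
      test: ["CMD-SHELL", "curl -fsS http://localhost:9600/ > /dev/null || exit 1"]
      interval: 30s
      timeout: 30s        # default is much lower; slow responses were tripping it
      retries: 5
      start_period: 60s   # give Logstash time to start before counting failures
```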
The same behavior with 7.17.9
Steps to reproduce
Sometimes we observe that Logstash's monitoring API on port 9600 is very slow to respond to GET requests. Typically it responds in ~10 ms, but sometimes we see it take as long as 30 seconds. The responses seem normal when they finally do return, with "status": "green".
We can't post our Logstash configuration here, but it has the following characteristics, in case this triggers any ideas:
- We are using default values for `pipeline.batch.size` and `pipeline.workers`.
- We use the following Kubernetes resource values:
- Nothing else is querying the monitoring API, but we do ship monitoring events to ES via the `xpack.monitoring` configuration.
- We are not sure this is always the case, but it seems like we might see the slow behaviour when the pipeline is at maximum capacity and there are hundreds of thousands of events in the persistent queue. Could these characteristics affect the monitoring API?
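Roughly, the relevant parts of our logstash.yml look like the sketch below. This is not our literal configuration; the Elasticsearch host is a placeholder and the defaults are only shown as comments:

```yaml
# logstash.yml sketch of the settings described above
# Defaults are kept for the pipeline tuning knobs:
# pipeline.workers: <number of CPU cores by default>
# pipeline.batch.size: 125

queue.type: persisted               # persistent queue holding the event backlog
xpack.monitoring.enabled: true      # ship monitoring events to Elasticsearch
xpack.monitoring.elasticsearch.hosts: ["http://elasticsearch:9200"]  # placeholder host
```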
This issue is causing problems because we are deploying Logstash to Kubernetes with Logstash's Helm chart. The chart creates a liveness (and readiness) probe that queries the monitoring endpoint and waits for a 200. When the monitoring API becomes slow, this trips the probe's default timeout of 1 second and the pod is killed. This can happen often enough that the backlog of events becomes unmanageable.
This is possibly causing the error reported in https://github.com/helm/charts/issues/9996. That issue could be fixed by increasing timeouts, but it would be better to address the underlying slowness of the monitoring API.
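As a stop-gap we are considering overriding the probes in the chart's values.yaml along these lines. The field names are assumed from the chart's defaults and the numbers are only examples; it works around the symptom but the underlying slowness still needs fixing:

```yaml
# values.yaml sketch: relax the probe timeouts so slow monitoring API
# responses do not kill the pod
livenessProbe:
  httpGet:
    path: /
    port: 9600
  initialDelaySeconds: 300
  periodSeconds: 10
  timeoutSeconds: 30      # default is 1s, which the slow responses trip
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /
    port: 9600
  periodSeconds: 10
  timeoutSeconds: 30
  failureThreshold: 3
```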