elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash

Slow monitoring API #11442

Open msb-dev opened 4 years ago

msb-dev commented 4 years ago

Steps to reproduce

Sometimes we observe that Logstash's monitoring API on port 9600 is very slow to respond to GET requests. Typically it responds in ~10 ms, but sometimes we see it take as long as 30 seconds. The responses look normal when they finally do return, with "status": "green".

We can't post our Logstash configuration here, but it has the following characteristics, in case this triggers any ideas...

We are using default values for pipeline.batch.size and pipeline.workers. We use the following Kubernetes resource values:

resources:
  limits:
    cpu: "2"
    memory: 2G
  requests:
    cpu: "1"
    memory: 1500M
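
For reference, the defaults we are relying on are equivalent to pinning these values explicitly in logstash.yml (a sketch; pipeline.workers defaults to the number of CPU cores, so 2 here assumes the JVM sees our CPU limit):

pipeline.workers: 2        # default: one worker per CPU core; 2 assumes the cgroup limit is detected
pipeline.batch.size: 125   # default batch size
pipeline.batch.delay: 50   # default batch delay, in milliseconds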

Nothing else is querying the monitoring API, but we do ship monitoring events to Elasticsearch via the xpack.monitoring configuration.
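
Concretely, that is the legacy internal collection in logstash.yml, along these lines (the host value below is illustrative, not our real endpoint):

xpack.monitoring.enabled: true
xpack.monitoring.elasticsearch.hosts: ["https://elasticsearch.example.internal:9200"]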

We are not sure this is always the case, but it seems like we might see the slow behaviour when the pipeline is at maximum capacity and there are hundreds of thousands of events in the persistent queue. Could these characteristics affect the monitoring API?
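
For context, the persistent queue is enabled with settings along these lines (sizes are illustrative, not our real values):

queue.type: persisted    # on-disk persistent queue instead of the in-memory default
queue.max_bytes: 1024mb  # capacity cap; 1024mb is the Logstash default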

This issue is causing problems because we run Logstash in Kubernetes via Logstash's Helm chart. The chart creates a liveness (and readiness) probe that queries the monitoring endpoint and waits for a 200. When the monitoring API becomes slow, this trips the probe's default timeout of 1 second and the pod is killed. This can happen often enough that the backlog of events becomes unmanageable.

This is possibly causing the error reported in https://github.com/helm/charts/issues/9996. That issue could be worked around by increasing the probe timeouts (see the sketch below), but it would be better to address the underlying slowness of the monitoring API.
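
As a stopgap, a probe that tolerates the observed worst case avoids the restarts. A minimal sketch of the kind of probe spec involved (standard Kubernetes probe fields; the exact values.yaml key for overriding it depends on the chart version):

livenessProbe:
  httpGet:
    path: /
    port: 9600
  timeoutSeconds: 30    # default is 1s, far below the ~30s worst case we observe
  periodSeconds: 10
  failureThreshold: 3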

jakubgs commented 4 years ago

I can confirm this is happening to me too. I'm running docker.elastic.co/logstash/logstash-oss:7.5.2, and I'm seeing the localhost:9600/ call take just milliseconds sometimes, and sometimes over 10 s.

Since I'm using the / call as a sort of healthcheck for Logstash, this causes the service to appear to flip constantly between healthy and unhealthy. The system isn't overloaded in terms of CPU, memory, or I/O, so I have no idea why this happens.
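
For anyone using the endpoint as a container healthcheck, raising the timeout above the observed worst case at least stops the flapping. In docker-compose terms, something like this (a sketch; assumes curl is available inside the image):

healthcheck:
  test: ["CMD-SHELL", "curl -sf http://localhost:9600/ || exit 1"]
  interval: 30s
  timeout: 15s    # responses can exceed 10s here, so keep this above that
  retries: 3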

EugeneRomanenko commented 10 months ago

The same behavior with 7.17.9