elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash

Slow monitoring API #11442

Open msb-dev opened 4 years ago

msb-dev commented 4 years ago

Steps to reproduce

Sometimes we observe that Logstash's monitoring API on port 9600 is very slow to respond to GET requests. Typically it responds in ~10 ms, but sometimes we see it take as long as 30 seconds. The responses look normal when they finally do return, with "status": "green".

We can't post our Logstash configuration here, but it has the following characteristics, in case this triggers any ideas...

We are using default values for pipeline.batch.size and pipeline.workers. We use the following Kubernetes resource values:

resources:
  limits:
    cpu: "2"
    memory: 2G
  requests:
    cpu: "1"
    memory: 1500M
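
For reference, the defaults we are relying on are equivalent to pinning these values explicitly in logstash.yml (a sketch; pipeline.workers defaults to the number of CPU cores, so 2 here assumes the JVM sees our CPU limit):

pipeline.workers: 2        # default: one worker per CPU core; 2 assumes the cgroup limit is detected
pipeline.batch.size: 125   # default batch size
pipeline.batch.delay: 50   # default batch delay, in milliseconds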

Nothing else is querying the monitoring API, but we do ship monitoring events to Elasticsearch via the xpack.monitoring configuration.
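
Concretely, that is the legacy internal collection in logstash.yml, along these lines (the host value below is illustrative, not our real endpoint):

xpack.monitoring.enabled: true
xpack.monitoring.elasticsearch.hosts: ["https://elasticsearch.example.internal:9200"]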

We are not sure this is always the case, but it seems like we might see the slow behaviour when the pipeline is at maximum capacity and there are hundreds of thousands of events in the persistent queue. Could these characteristics affect the monitoring API?
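
For context, the persistent queue is enabled with settings along these lines (sizes are illustrative, not our real values):

queue.type: persisted    # on-disk persistent queue instead of the in-memory default
queue.max_bytes: 1024mb  # capacity cap; 1024mb is the Logstash default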

This issue is causing problems because we run Logstash in Kubernetes via Logstash's Helm chart. The chart creates a liveness (and readiness) probe that queries the monitoring endpoint and waits for a 200. When the monitoring API becomes slow, this trips the probe's default timeout of 1 second and the pod is killed. This can happen often enough that the backlog of events becomes unmanageable.

This is possibly causing the error reported in https://github.com/helm/charts/issues/9996. That issue could be worked around by increasing the probe timeouts (see the sketch below), but it would be better to address the underlying slowness of the monitoring API.
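
As a stopgap, a probe that tolerates the observed worst case avoids the restarts. A minimal sketch of the kind of probe spec involved (standard Kubernetes probe fields; the exact values.yaml key for overriding it depends on the chart version):

livenessProbe:
  httpGet:
    path: /
    port: 9600
  timeoutSeconds: 30    # default is 1s, far below the ~30s worst case we observe
  periodSeconds: 10
  failureThreshold: 3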

jakubgs commented 4 years ago

I can confirm this is happening to me too. I'm running docker.elastic.co/logstash/logstash-oss:7.5.2, and I'm seeing the localhost:9600/ call take just milliseconds sometimes, and sometimes over 10 s.

Since I'm using the / call as a sort of healthcheck for Logstash, this causes the service to appear to flip constantly between healthy and unhealthy. The system isn't overloaded in terms of CPU, memory, or I/O, so I have no idea why this happens.
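
For anyone using the endpoint as a container healthcheck, raising the timeout above the observed worst case at least stops the flapping. In docker-compose terms, something like this (a sketch; assumes curl is available inside the image):

healthcheck:
  test: ["CMD-SHELL", "curl -sf http://localhost:9600/ || exit 1"]
  interval: 30s
  timeout: 15s    # responses can exceed 10s here, so keep this above that
  retries: 3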

EugeneRomanenko commented 10 months ago

The same behavior with 7.17.9