apache / solr-operator

Official Kubernetes operator for Apache Solr
https://solr.apache.org/operator
Apache License 2.0

Prometheus exporter high memory usage #658

Open stibi opened 7 months ago

stibi commented 7 months ago

Hello, we're having trouble with the Solr exporter: it's very memory-hungry, needing around ~6G of RAM, which is a lot, and I can't figure out why.

Could you give me a hint?

It's pretty much a default setup on SolrCloud 9.3.0, with nothing particularly custom in the exporter deployment:

apiVersion: solr.apache.org/v1beta1
kind: SolrPrometheusExporter
metadata:
  name: solr-exporter
spec:
  customKubeOptions:
    podOptions:
      resources:
        requests:
          cpu: 500m
          memory: 3072Mi
        limits:
          cpu: 2000m
          memory: 6912Mi
      envVars:
        - name: JAVA_HEAP
          value: 6000m
  solrReference:
    cloud:
      name: "solr-cloud"
  numThreads: 6
[Screenshot 2023-11-21 at 13:23:19]
radu-gheorghe commented 7 months ago

I think this tells it to allocate 6GB:

      envVars:
        - name: JAVA_HEAP
          value: 6000m

I assume it can do with much less than 6000m. Try a 10th of that and see how it goes.
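
For example, in the spec above it would just be something like this (600m here is only an illustrative starting point, not a recommendation; adjust based on what you observe):

      envVars:
        - name: JAVA_HEAP
          value: 600m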

stibi commented 7 months ago

Ah … I thought that was its maximum value, not that it would actually allocate that much … that makes sense now. I was confused by another problem, where the exporter was in a crash loop all the time; I solved that by tuning the liveness probe a bit. Fiddling with the heap size was one of my attempts to fix that.
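
For context, the probe tuning was just a matter of loosening it via podOptions, roughly like this (a sketch, assuming the operator exposes livenessProbe under customKubeOptions.podOptions; the exact values are illustrative):

  customKubeOptions:
    podOptions:
      livenessProbe:
        initialDelaySeconds: 60
        periodSeconds: 10
        failureThreshold: 10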

Thanks, I think it will be quite OK with the default heap size value; I'll try that in a moment.

stibi commented 7 months ago

Ouch, so maybe I wasn't so wrong about it after all ... I removed the JAVA_HEAP env var, but the exporter started failing with java.lang.OutOfMemoryError: Java heap space. Here we go, full circle :D

So I had to put JAVA_HEAP back to find out how much heap space it actually needs, and the number is 5G. With that much heap, the exporter runs without errors. But it takes quite some time to collect all the metrics; isn't that weird?

INFO  - 2023-11-22 09:53:39.225; org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Completed metrics collection
INFO  - 2023-11-22 09:54:39.226; org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Beginning metrics collection
INFO  - 2023-11-22 09:55:15.506; org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Completed metrics collection
INFO  - 2023-11-22 09:56:15.506; org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Beginning metrics collection
INFO  - 2023-11-22 09:56:53.088; org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Completed metrics collection
INFO  - 2023-11-22 09:57:53.088; org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Beginning metrics collection
INFO  - 2023-11-22 09:58:29.369; org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Completed metrics collection
INFO  - 2023-11-22 09:59:29.369; org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Beginning metrics collection
INFO  - 2023-11-22 10:00:06.842; org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Completed metrics collection
INFO  - 2023-11-22 10:01:06.842; org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Beginning metrics collection
INFO  - 2023-11-22 10:01:41.788; org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Completed metrics collection
INFO  - 2023-11-22 10:02:41.788; org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Beginning metrics collection
INFO  - 2023-11-22 10:03:22.174; org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Completed metrics collection
INFO  - 2023-11-22 10:04:22.174; org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Beginning metrics collection
INFO  - 2023-11-22 10:04:57.249; org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Completed metrics collection
INFO  - 2023-11-22 10:05:57.250; org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Beginning metrics collection

I was able to take a heap dump using the jattach utility (awesome that it's packaged with the container image, thanks for that!), but I guess I don't really know how to read it properly ... it says that the heap size is only 23549096 B ... which is 23.549096 MB? That's not so much.

[Screenshot 2023-11-22 at 11:12:25]
radu-gheorghe commented 7 months ago

Yep, that's 23MB. It's weird that it takes a while to collect metrics. Is that a symptom (e.g. the exporter is stuck in GC and doesn't have spare CPU to collect the metrics) or a cause (e.g. you have a ton of shards in the cluster, so collecting them takes a while and eats heap)?

Maybe G1 is falling behind with garbage collection? You can verify this hypothesis by setting the GC_TUNE env var to -XX:+UseG1GC -XX:GCTimeRatio=2. Unless you have a ton of shards, I'd expect something like JAVA_HEAP=1g to be enough. Or maybe we're both missing something...
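
In the exporter spec that would look roughly like this (just a sketch; the values are the ones from the suggestion above, tweak as needed):

      envVars:
        - name: JAVA_HEAP
          value: 1g
        - name: GC_TUNE
          value: "-XX:+UseG1GC -XX:GCTimeRatio=2"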

stibi commented 7 months ago

The cluster is not big at all, I think: 1 shard, 2 replicas, ~8753202 documents, taking ~22GB of memory ...

Thanks for the hints, I'll take a look at the Java metrics and how GC performs.

radu-gheorghe commented 7 months ago

You're welcome.

If you need something to monitor GC/JVM metrics (and Solr metrics, for that matter), we have a tool that you might find useful.