Aiven-Open / prometheus-exporter-plugin-for-opensearch

Prometheus exporter plugin for OpenSearch & OpenSearch Mixin
Apache License 2.0
Null Pointer Exception #213

Closed Palulukas closed 1 year ago

Palulukas commented 1 year ago

Hello,

I am struggling to get my OpenSearch instances scraped. Currently I only receive null pointer exceptions when I try to access the metrics manually. The endpoints do show as "up" in the Prometheus agent, though, so IP and DNS are working correctly. I configured OpenSearch as described in the tutorial.

Here is the job config used by the Prometheus agent inside prometheus.yml:

  - job_name: opensearch
    scrape_interval: 30s
    metrics_path: "/_prometheus/metrics"
    static_configs:
    - targets:
      - fqdn1:9200
      - fqdn2:9201
      - fqdn3:9200
      - ...
    basic_auth:
      username: 'user'
      password: 'pass' 
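
For anyone reproducing this, a manual request equivalent to the scrape job above might be built as in the following Python sketch. It assumes HTTPS (TLS is enabled on the cluster) and reuses the placeholder host and credentials from the config; `build_metrics_request` is a hypothetical helper name, not something provided by the plugin.

```python
import base64
import urllib.request


def build_metrics_request(host, username, password, scheme="https"):
    """Build a GET request for the metrics endpoint using HTTP basic auth."""
    # Same path as metrics_path in the scrape job.
    url = f"{scheme}://{host}/_prometheus/metrics"
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


if __name__ == "__main__":
    # Placeholder target and credentials taken from the job config (assumptions).
    req = build_metrics_request("fqdn1:9200", "user", "pass")
    print(req.full_url)
    # To actually send it: urllib.request.urlopen(req, timeout=10)
```

With a self-signed certificate, sending the request would additionally need an `ssl` context that trusts the cluster's CA.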

And the scrape result (for example fqdn1:9200) is:

{"error":{"root_cause":[{"type":"null_pointer_exception","reason":"Cannot invoke \"String.length()\" because \"candidate\" is null"}],"type":"null_pointer_exception","reason":"Cannot invoke \"String.length()\" because \"candidate\" is null"},"status":500}
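
To tell such an error body apart from a normal plain-text metrics response, a small parser along these lines could be used. This is only a sketch: `summarize_error` is a hypothetical helper, and the body is the one quoted above.

```python
import json

# Response body copied verbatim from the failing scrape above.
RAW = r'{"error":{"root_cause":[{"type":"null_pointer_exception","reason":"Cannot invoke \"String.length()\" because \"candidate\" is null"}],"type":"null_pointer_exception","reason":"Cannot invoke \"String.length()\" because \"candidate\" is null"},"status":500}'


def summarize_error(body):
    """Return (status, type, reason) for a JSON error body, or None otherwise."""
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        return None  # a successful scrape returns plain-text metrics, not JSON
    if not isinstance(doc, dict) or not isinstance(doc.get("error"), dict):
        return None
    error = doc["error"]
    # Prefer the first root cause; fall back to the top-level error object.
    first = (error.get("root_cause") or [error])[0]
    return doc.get("status"), first.get("type"), first.get("reason")


if __name__ == "__main__":
    print(summarize_error(RAW))
```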

We use an OpenSearch cluster with 8 nodes, version 2.8.0, running in Docker containers. The exporter plugin is installed on all nodes and TLS is enabled. Security is also enabled and a role was created (also tested with "all access").

Here is the created role:

PUT _plugins/_security/api/roles/metric_reader
{
    "cluster_permissions": [
      "cluster:monitor/prometheus/metrics",
      "cluster:monitor/health",
      "cluster:monitor/state",
      "cluster:monitor/nodes/info",
      "cluster:monitor/nodes/stats"
    ],
    "index_permissions": [
      {
        "index_patterns": [
          "*"
        ],
        "dls": "",
        "fls": [],
        "masked_fields": [],
        "allowed_actions": [
          "indices:monitor/stats"
        ]
      }
    ],
    "tenant_permissions": []
}

and GET _cat/plugins shows the plugin installed on all nodes.

Is there something I need to check with my config or is there a bug in the exporter?

Regards

lukas-vlcek commented 1 year ago

Hi,

do you have any logs from OpenSearch node(s)?

Palulukas commented 1 year ago

Hello lukas,

The log files for one node are ~400 MB. Shall I provide them in this issue or by DM?

Regards

lukas-vlcek commented 1 year ago

What I am looking for is just a little bit of context before the exception. I would search the log file for the term "candidate"; that can help quickly find the relevant part of the log. If you have this part of the log file, please share it here in the comments (just a few lines of the log before the error should be enough).
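
For a log of that size, one way to pull out only the lines leading up to each match is a streaming search like the sketch below. The function name, the default of five context lines, and the log path in the usage example are assumptions for illustration.

```python
from collections import deque


def lines_before_matches(lines, term="candidate", before=5):
    """Yield (line_number, context) for each line containing `term`.

    `context` holds up to `before` preceding lines followed by the match,
    so the whole file never has to be kept in memory.
    """
    window = deque(maxlen=before)
    for number, line in enumerate(lines, start=1):
        if term in line:
            yield number, list(window) + [line]
        window.append(line)


if __name__ == "__main__":
    # Hypothetical log path; adjust to the node's actual log location.
    with open("opensearch.log", encoding="utf-8", errors="replace") as log:
        for number, context in lines_before_matches(log):
            print(f"--- match at line {number} ---")
            print("".join(context), end="")
```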

IMO the error suggests that the issue is in the security layer. Either it is not set up correctly or there is an issue in the plugin. I will try to recreate it.

Palulukas commented 1 year ago

Hello lukas,

we were able to fix the problem. Sorry for the inconvenience, but at first the error didn't look like a permission problem to us. For future reference, this is what we did:

We created a user "svc-prometheus" for Prometheus to use for scraping, as stated above. We then changed its role membership from "metric_reader" (as given in the original post) to "metric_reader, security_rest_api" so that it can access the REST API directly. Side note: curiously, this did not work with the role "all access".

This fixed our null pointer exception. Please find a snippet of the anonymized log file attached below for documentation purposes.
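
For readers who want to script the role change, the sketch below builds one role-mapping request per role for the Security plugin's `rolesmapping` endpoint. It is an illustration under assumptions: the thread does not say how the membership was actually changed, the role names are copied as written above, and `role_mapping_requests` is a hypothetical helper. Note that a PUT to this endpoint replaces any existing mapping for that role.

```python
import json


def role_mapping_requests(user, roles):
    """Return (method, path, body) triples that map `user` to each role."""
    requests = []
    for role in roles:
        path = f"_plugins/_security/api/rolesmapping/{role}"
        # Only the `users` list is set here; existing backend_roles or hosts
        # in the mapping would be overwritten by a PUT (assumption to check).
        requests.append(("PUT", path, {"users": [user]}))
    return requests


if __name__ == "__main__":
    for method, path, body in role_mapping_requests(
        "svc-prometheus", ["metric_reader", "security_rest_api"]
    ):
        print(method, path)
        print(json.dumps(body, indent=2))
```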

Regards

opensearch_errorlog_snippet.txt