Aiven-Open / prometheus-exporter-plugin-for-opensearch

Prometheus exporter plugin for OpenSearch & OpenSearch Mixin
Apache License 2.0

metrics endpoint returns 500 after upgrading to 2.13 #266

Closed: geckiss closed this issue 2 months ago

geckiss commented 2 months ago

Hello,

we were using this plugin for some time without any issue. However, after this week's upgrade to 2.13, the metrics endpoint returns the following:

{"error":{"root_cause":[{"type":"null_pointer_exception","reason":"Cannot invoke \"org.opensearch.index.shard.DocsStats.getCount()\" because the return value of \"org.opensearch.action.admin.indices.stats.CommonStats.getDocs()\" is null"}],"type":"null_pointer_exception","reason":"Cannot invoke \"org.opensearch.index.shard.DocsStats.getCount()\" because the return value of \"org.opensearch.action.admin.indices.stats.CommonStats.getDocs()\" is null"},"status":500}

I can see this message in the logs every few seconds (so Prometheus is probably trying to scrape?):

[2024-04-12T13:30:18,515][WARN ][r.suppressed             ] [opensearch-cluster-master-1] path: /_prometheus/metrics, params: {}
java.lang.NullPointerException: Cannot invoke "org.opensearch.index.shard.DocsStats.getCount()" because the return value of "org.opensearch.action.admin.indices.stats.CommonStats.getDocs()" is null
    at org.compuscene.metrics.prometheus.PrometheusMetricsCollector.updatePerIndexContextMetrics(PrometheusMetricsCollector.java:470) ~[?:?]
    at org.compuscene.metrics.prometheus.PrometheusMetricsCollector.updatePerIndexMetrics(PrometheusMetricsCollector.java:463) ~[?:?]
    at org.compuscene.metrics.prometheus.PrometheusMetricsCollector.updateMetrics(PrometheusMetricsCollector.java:954) ~[?:?]
    at org.opensearch.rest.prometheus.RestPrometheusMetricsAction$1.buildResponse(RestPrometheusMetricsAction.java:131) ~[?:?]
    at org.opensearch.rest.prometheus.RestPrometheusMetricsAction$1.buildResponse(RestPrometheusMetricsAction.java:107) ~[?:?]
    at org.opensearch.rest.action.RestResponseListener.processResponse(RestResponseListener.java:52) ~[opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.rest.action.RestActionListener.onResponse(RestActionListener.java:64) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:113) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:107) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.action.TransportNodePrometheusMetricsAction$AsyncAction.gatherRequests(TransportNodePrometheusMetricsAction.java:152) [prometheus-exporter-2.13.0.0.jar:2.13.0.0]
    at org.opensearch.action.TransportNodePrometheusMetricsAction$AsyncAction$1.onResponse(TransportNodePrometheusMetricsAction.java:161) [prometheus-exporter-2.13.0.0.jar:2.13.0.0]
    at org.opensearch.action.TransportNodePrometheusMetricsAction$AsyncAction$1.onResponse(TransportNodePrometheusMetricsAction.java:157) [prometheus-exporter-2.13.0.0.jar:2.13.0.0]
    at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:113) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:107) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:183) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.core.action.ActionListener$2.onResponse(ActionListener.java:108) [opensearch-core-2.13.0.jar:2.13.0]
    at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:355) [opensearch-core-2.13.0.jar:2.13.0]
    at org.opensearch.action.admin.cluster.state.TransportClusterStateAction.clusterManagerOperation(TransportClusterStateAction.java:136) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.action.admin.cluster.state.TransportClusterStateAction.clusterManagerOperation(TransportClusterStateAction.java:66) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction.masterOperation(TransportClusterManagerNodeAction.java:177) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction.clusterManagerOperation(TransportClusterManagerNodeAction.java:186) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction.lambda$executeOnLocalNode$6(TransportClusterManagerNodeAction.java:428) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction.executeOnLocalNode(TransportClusterManagerNodeAction.java:430) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.handleResponse(TransportClusterManagerNodeAction.java:399) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.handleResponse(TransportClusterManagerNodeAction.java:389) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleResponse(SecurityInterceptor.java:398) [opensearch-security-2.13.0.0.jar:2.13.0.0]
    at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1505) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:420) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.InboundHandler.handleResponse(InboundHandler.java:412) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:172) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:127) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:770) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:175) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:150) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:115) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:95) [transport-netty4-client-2.13.0.jar:2.13.0]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) [netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) [netty-handler-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) [netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475) [netty-handler-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1338) [netty-handler-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1387) [netty-handler-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529) [netty-codec-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468) [netty-codec-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) [netty-codec-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) [netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) [netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) [netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) [netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) [netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) [netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [netty-common-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.107.Final.jar:4.1.107.Final]
    at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]

Grafana dashboard is empty.

I'm trying to reach the endpoint at https://opensearch-instance:9200/_prometheus/metrics. No other configuration options for this plugin are in use. The plugin is installed in the following way (k8s pod):

containers:
- args:
    - >-
      /usr/share/opensearch/bin/opensearch-plugin install
      https://github.com/aiven/prometheus-exporter-plugin-for-opensearch/releases/download/2.13.0.0/prometheus-exporter-2.13.0.0.zip
      && /usr/share/opensearch/opensearch-docker-entrypoint.sh
      opensearch
  command:
    - /bin/sh
    - '-c'
lukas-vlcek commented 2 months ago

Yes, Prometheus scrapes the /_prometheus/metrics endpoint every few seconds by default. Is there a chance that some of the indices included in the metrics are empty (without any documents)?

geckiss commented 2 months ago

Yes, some indices are empty.

lukas-vlcek commented 2 months ago

I think I can see how this can happen.

Internally, the Prometheus plugin calls up to four different APIs to collect individual responses:

  1. cluster health
  2. local NodeInfo
  3. cluster stats
  4. indices stats (if detailed information about indices is requested).

It makes calls to those APIs sequentially, one at a time, but to populate the Prometheus metric catalog it sometimes needs to combine data from those individual responses. When it comes to indices, there is a chance that the situation changes between the first call and the last one (for example, an index is created or deleted). This could theoretically lead to issues like this one. Basically, this can be a "timing" issue (but that would still be a bug and should be fixed).
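
To make the combination step concrete, here is a simplified conceptual sketch, not the plugin's actual code: the class and method names below are made up for illustration, while ClusterHealthResponse, IndicesStatsResponse and their getIndices() accessors are the standard OpenSearch response types.

import java.util.Map;

import org.opensearch.action.admin.cluster.health.ClusterHealthResponse;
import org.opensearch.action.admin.indices.stats.IndexStats;
import org.opensearch.action.admin.indices.stats.IndicesStatsResponse;
import org.opensearch.cluster.health.ClusterIndexHealth;

// Conceptual sketch only: the set of indices returned by the cluster health call
// (made first) can differ from the set returned by the indices stats call (made last).
class SequentialCollectionSketch {

    void combine(ClusterHealthResponse health, IndicesStatsResponse indicesStats) {
        Map<String, ClusterIndexHealth> perIndexHealth = health.getIndices();
        Map<String, IndexStats> perIndexStats = indicesStats.getIndices();

        for (String indexName : perIndexHealth.keySet()) {
            IndexStats stats = perIndexStats.get(indexName);
            if (stats == null) {
                // The index was created or deleted between the two calls, or its stats
                // are not available yet, so it must be skipped (or handled) here.
                continue;
            }
            // ... derive per-index metrics from both responses ...
        }
    }
}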

However, what I see from the stack trace is that it successfully retrieved information about the total number of shards, but right after that it failed to retrieve information about primary shards only for the same index (it passed line 462 but failed on line 463):

https://github.com/Aiven-Open/prometheus-exporter-plugin-for-opensearch/blob/9fff4ce68f8bf6d5e04da55cbe01b0c43e7c0e95/src/main/java/org/compuscene/metrics/prometheus/PrometheusMetricsCollector.java#L461-L463

That sounds like the IndexStats object did not have populated information about primary shards. Hmm... maybe it was a freshly created index and no shard had been allocated yet? Or maybe it is a broken index in some invalid state? It is hard to say at this point.

As a short-term workaround I can put in place some safeguards to prevent the exception from breaking the flow, although the statistics may then be incomplete in Prometheus (the plugin will log the name of the affected index at the WARN level for investigation).
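
For illustration, a minimal sketch of the kind of safeguard meant here; the class and method names are made up and this is not the final implementation, only the IndexStats/CommonStats/DocsStats accessors from the stack trace are real.

import org.opensearch.action.admin.indices.stats.CommonStats;
import org.opensearch.action.admin.indices.stats.IndexStats;
import org.opensearch.index.shard.DocsStats;

// Sketch of a null-safe guard around the per-index docs stats lookup that currently
// throws the NPE; the real change would live in PrometheusMetricsCollector.
class DocsStatsGuardSketch {

    void updatePerIndexDocMetrics(IndexStats indexStats) {
        CommonStats primaries = indexStats.getPrimaries();
        DocsStats docs = (primaries == null) ? null : primaries.getDocs();
        if (docs == null) {
            // Skip this index instead of failing the whole /_prometheus/metrics response;
            // log its name at WARN level so it can be investigated later, e.g.:
            // logger.warn("Docs stats not available for index [{}]", indexStats.getIndex());
            return;
        }
        long docCount = docs.getCount();
        long deletedCount = docs.getDeleted();
        // ... register docCount / deletedCount with the Prometheus metric catalog ...
    }
}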

Maybe you can help more?

If you can recreate this issue, it would help if you could share the output of the following two HTTP REST calls (you should be able to run them via the OpenSearch Dashboards console):

If you are using any of the plugin's index-specific configurations, you should include them in the URL parameters as well.

For example, when I start a cluster with a single testing index-ABC, I see the following output. In your case, can you find an index where the responses of these two requests are "not in sync"(*)? For example, is the first response missing the "shards" section, or is the second response missing the "primaries" section?

(*) It is hard to define what "not in sync" specifically means here at the moment.

# http://localhost:9200/_cluster/health?pretty&level=shards&local=true

{
  "cluster_name" : "runTask",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 1,
  "active_shards" : 1,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 50.0,
  "indices" : {
    "index-ABC" : {    #<-- testing index
      "status" : "yellow",
      "number_of_shards" : 1,
      "number_of_replicas" : 1,
      "active_primary_shards" : 1,
      "active_shards" : 1,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 1,
      "shards" : {
        "0" : {
          "status" : "yellow",
          "primary_active" : true,
          "active_shards" : 1,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 1
        }
      }
    }
  }
}
# http://localhost:9200/_all/_stats?pretty&filter_path=indices.*.*.docs

{
  "indices" : {
    "index-ABC" : {    #<-- testing index
      "primaries" : {
        "docs" : {
          "count" : 1,
          "deleted" : 0
        }
      },
      "total" : {
        "docs" : {
          "count" : 1,
          "deleted" : 0
        }
      }
    }
  }
}
lukas-vlcek commented 2 months ago

@geckiss Will you be able to test a fix if I prepare it in a new branch? Are you able to build the project with ./gradlew build? It would help me a lot if you could test it before I merge any modifications to the main branch and do a minor release. That would help us make sure the provided fix is actually useful.

Psych0meter commented 2 months ago

Hello @lukas-vlcek ,

I have the same issue, and I'm able to build the project in a test environment, so I can do some tests if it's still required.

lukas-vlcek commented 2 months ago

@Psych0meter Great! Just a quick question: can you confirm that you did not face this issue before the upgrade to OpenSearch 2.13 and the corresponding version of the Prometheus plugin?

Psych0meter commented 2 months ago

@lukas-vlcek I have only been using the plugin since 2.12.0, but I didn't face any issues before.

lukas-vlcek commented 2 months ago

@Psych0meter Thank you for the response.

Then it seems this issue was not occurring in 2.12. Given that the plugin version for OpenSearch 2.13 was only a maintenance release (i.e. the code did not change except for the OpenSearch dependency version), this could be interpreted as an issue caused by a change in OpenSearch itself. In that case I will prepare a minor plugin release with a workaround (so that you can keep using the plugin without its functionality being interrupted by the exception) and then try to find the root cause in OpenSearch.

Do you think you can share a little bit more about your environment? Are you running the OpenSearch cluster on K8s? Are you using the OpenSearch K8s operator? Any other plugins besides the Prometheus exporter?

Psych0meter commented 2 months ago

@lukas-vlcek OpenSearch is running on Debian 12 virtual machines, with 1 cluster_manager and 2 data nodes, without any plugins besides the default ones and the Prometheus exporter.

lukas-vlcek commented 2 months ago

@Psych0meter Would you say there is an intensive process involving indices going on in the cluster, for example frequent index management operations? Or is the set of indices more static and settled? How many indices are there? What is the replication scheme of the indices, e.g. 1 primary and several replica shards?

Psych0meter commented 2 months ago

@lukas-vlcek I have around 30 data streams that create indices, and around 360 indices in total (200 created by data streams). Each data stream index is configured with 1 primary and 1 replica shard for now. Some indices are used more heavily than others (firewall logs, for example), and as of today I have an average input of 300 events per second. I have also configured an index state management policy (for testing purposes) that changes the replica count after 1 day and deletes the index after 7 days.

lukas-vlcek commented 2 months ago

@Psych0meter That is very useful information, thanks for sharing. How often are you getting this exception? Are you getting it with every scrape cycle (I think it is every 1m by default)?

Psych0meter commented 2 months ago

@lukas-vlcek I have a "dev" and a "prod" environment. In dev, OpenSearch is not used as heavily as in prod. I first tried to uninstall/reinstall the plugin, and since the restart I have seen only one error, on one node, in the dev environment. I tried the same on the prod environment, but unfortunately it didn't help: I get the error every minute and the metrics are always unavailable.

lukas-vlcek commented 2 months ago

@Psych0meter I prepared a workaround in this repository/branch: https://github.com/lukas-vlcek/prometheus-exporter-plugin-for-opensearch/tree/catch_NPE_2.13.0.0

The workaround is quite simple, but it should give us a chance to:

  1. capture the rest of the metrics
  2. possibly get some clue about what is going on with the indices that are causing the NPE

To see the logs in the log file you will need to make sure that the WARN level is enabled for the org.compuscene.metrics.prometheus.PrometheusMetricsCollector class. There are several ways to enable this; see the Logs documentation.

To build the plugin you need to clone the repo:

git clone git@github.com:lukas-vlcek/prometheus-exporter-plugin-for-opensearch.git
# navigate to the repo folder and checkout the branch with the workaround
git checkout catch_NPE_2.13.0.0

Then build it using Gradle with Java 17 or 21 (you can ignore the JavaDoc warnings for now):

./gradlew clean build

After that you will find the plugin ZIP in the ./build/distributions/ folder:

file ./build/distributions/prometheus-exporter-2.13.0.0.zip 
./build/distributions/prometheus-exporter-2.13.0.0.zip: Zip archive data, at least v2.0 to extract, compression method=deflate

If you do not want to build the plugin, let me know and I can upload the ZIP file somewhere.

lukas-vlcek commented 2 months ago

@Psych0meter If you can replicate the problem at least once on your dev environment then that should be enough.

Psych0meter commented 2 months ago

@lukas-vlcek I've tried to reproduce the problem, without success... The only things that have changed in the environment since Friday are that the OS and installed packages (outside OpenSearch) were updated and the servers rebooted. Since then, no more errors, even in production.

Here are the updated packages, in case it helps:

Start-Date: 2024-04-19  08:51:56
Requested-By: xxx
Upgrade: udev:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), libnss-myhostname:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), systemd-timesyncd:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), tzdata:amd64 (2023c-5+deb12u1, 2024a-0+deb12u1), libpam-systemd:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), tar:amd64 (1.34+dfsg-1.2, 1.34+dfsg-1.2+deb12u1), libsystemd0:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), libnss-systemd:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), systemd:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), libudev1:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), usr-is-merged:amd64 (35, 37~deb12u1), systemd-resolved:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), base-files:amd64 (12.4+deb12u4, 12.4+deb12u5), monitoring-plugins-basic:amd64 (2.3.3-5+deb12u1, 2.3.3-5+deb12u2), libcryptsetup12:amd64 (2:2.6.1-4~deb12u1, 2:2.6.1-4~deb12u2), mariadb-common:amd64 (1:10.11.4-1~deb12u1, 1:10.11.6-0+deb12u1), libnss-resolve:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), libmariadb3:amd64 (1:10.11.4-1~deb12u1, 1:10.11.6-0+deb12u1), libsystemd-shared:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), monitoring-plugins-common:amd64 (2.3.3-5+deb12u1, 2.3.3-5+deb12u2), systemd-sysv:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), libgnutls30:amd64 (3.7.9-2+deb12u1, 3.7.9-2+deb12u2), needrestart:amd64 (3.6-4, 3.6-4+deb12u1), postfix:amd64 (3.7.9-0+deb12u1, 3.7.10-0+deb12u1)
Remove: libevent-2.1-7:amd64 (2.1.12-stable-8), libunbound8:amd64 (1.17.1-2+deb12u2), libgnutls-dane0:amd64 (3.7.9-2+deb12u1)
End-Date: 2024-04-19  08:52:22
lukas-vlcek commented 2 months ago

@Psych0meter thanks a lot for your feedback.

I am inclined to close this ticket and let others reopen it (cc @geckiss) if they can reproduce the problem, or at least provide more detailed logs from the branch that I prepared.