Closed: geckiss closed this issue 6 months ago.
Yes, Prometheus is by default scraping the /_prometheus/metrics endpoint every few seconds.
Is there a chance that some of the indices that are included in the metrics are empty (without any documents)?
Yes, some indices are empty.
I think I can see how this can happen.
Internally, the Prometheus plugin calls up to four different APIs to collect individual responses:
It makes calls to those APIs sequentially, one at a time. But to populate the Prometheus metric catalog it sometimes needs to combine data from those individual responses. And when it comes to indices, there is a chance that the situation changes between the first call and the last call (for example, an index is created or deleted in the meantime). This could theoretically lead to issues like this one. Basically, this can be a "timing" issue (but that still means it would be a bug and should be fixed).
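To illustrate the kind of timing issue I mean, here is a minimal schematic Java sketch (the collection types and index names are simplified placeholders I made up for illustration, not the plugin's actual code): the set of indices known from the first response can differ from the set covered by a later response, so combining the two can hit a missing entry.
import java.util.Map;
import java.util.Set;

// Schematic illustration only: two "API responses" fetched at different times.
public class TimingIssueSketch {
    public static void main(String[] args) {
        // Indices known from the first call (time T1)
        Set<String> indicesFromFirstCall = Set.of("index-ABC", "index-XYZ");

        // Stats from the last call (time T2): "index-XYZ" was deleted in between,
        // and "index-NEW" was created, so the two views are out of sync.
        Map<String, Long> docCountsFromLastCall = Map.of("index-ABC", 42L, "index-NEW", 0L);

        for (String index : indicesFromFirstCall) {
            // Combining the responses: the lookup can come back empty
            // for an index that existed only during the first call.
            Long docCount = docCountsFromLastCall.get(index);
            if (docCount == null) {
                System.out.println("No stats for " + index + " -> potential NullPointerException");
            }
        }
    }
}
In the real plugin the responses are full OpenSearch API objects, but the failure mode is the same: a lookup that silently returns null.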
However, what I see from the stack trace is that it successfully retrieved information about the total number of shards, but right after that it failed to retrieve information about primary shards for the same index (it passed line 462 but failed on line 463):
That sounds like the IndexStats object did not have populated information about primary shards. Hmm... maybe it was a freshly created index and no shard had been allocated yet? Or maybe it is a broken index in some invalid state? It is hard to say at this point.
As a short-term workaround I can put some safeguards in place to prevent the exception from breaking the flow, although the statistics can then be incomplete in Prometheus (the plugin will log the name of the affected index at the WARN level for investigation).
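Roughly, the safeguard I have in mind would look like the following sketch. It is based on OpenSearch's IndexStats/CommonStats API as I understand it and is not the plugin's exact code; treat the class and method names as assumptions.
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.opensearch.action.admin.indices.stats.CommonStats;
import org.opensearch.action.admin.indices.stats.IndexStats;

// Minimal sketch of the planned safeguard; the surrounding collector code is omitted.
class PrimaryStatsSafeguardSketch {
    private static final Logger logger = LogManager.getLogger(PrimaryStatsSafeguardSketch.class);

    void collectPrimaryDocCount(IndexStats indexStats) {
        CommonStats primaries = indexStats.getPrimaries();
        if (primaries == null || primaries.getDocs() == null) {
            // Skip this index instead of throwing a NullPointerException;
            // its metric will simply be missing in this scrape cycle.
            logger.warn("Primary shard stats not available for index [{}], skipping", indexStats.getIndex());
            return;
        }
        long docCount = primaries.getDocs().getCount();
        // ... register docCount in the Prometheus metric catalog ...
    }
}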
If you can recreate this issue, it would help if you could share the output of the following two HTTP REST calls (you should be able to run these via the OpenSearch Dashboards console):
GET /_cluster/health?pretty&level=shards&local=true
GET /_all/_stats?pretty&filter_path=indices.*.*.docs
If you are using any of the plugin's index-specific configurations, then you should include those in the URL parameters as well.
For example, when I start a cluster with a single testing index index-ABC, I can see the following output. In your case, can you find an index where the responses of these two requests are "not in sync" (*)? For example, is the first response missing the "shards" section, or is the second response missing the "primaries" section?
(*) It is hard to define what "not in sync" specifically means here at the moment.
# http://localhost:9200/_cluster/health?pretty&level=shards&local=true
{
  "cluster_name" : "runTask",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 1,
  "active_shards" : 1,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 50.0,
  "indices" : {
    "index-ABC" : { #<-- testing index
      "status" : "yellow",
      "number_of_shards" : 1,
      "number_of_replicas" : 1,
      "active_primary_shards" : 1,
      "active_shards" : 1,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 1,
      "shards" : {
        "0" : {
          "status" : "yellow",
          "primary_active" : true,
          "active_shards" : 1,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 1
        }
      }
    }
  }
}
# http://localhost:9200/_all/_stats?pretty&filter_path=indices.*.*.docs
{
  "indices" : {
    "index-ABC" : { #<-- testing index
      "primaries" : {
        "docs" : {
          "count" : 1,
          "deleted" : 0
        }
      },
      "total" : {
        "docs" : {
          "count" : 1,
          "deleted" : 0
        }
      }
    }
  }
}
@geckiss Will you be able to test a fix if I prepare it in a new branch? Are you able to ./gradlew build the project?
It would help me a lot if you could test it before I merge any modifications to the main branch and do a minor release. That would help us make sure that the provided fixes are useful.
Hello @lukas-vlcek,
I have the same issue, and I'm able to build the project in a test environment, so I can do some tests if it's still required.
@Psych0meter great! Just a quick question: can you confirm that you did not face this issue before the upgrade to OpenSearch version 2.13 and the respective version of the Prometheus plugin?
@lukas-vlcek I have only been using the plugin since 2.12.0, but I didn't face any issues before.
@Psych0meter Thank you for the response.
Then it seems like this issue was not occurring in 2.12. Given that the plugin version for OpenSearch 2.13 was only a maintenance release (i.e. the code did not change except for the OpenSearch dependency version), this could be interpreted as an issue caused by a change in OpenSearch itself. In that case I will prepare a minor plugin release with a workaround (so that you can keep using this plugin without its functionality being interrupted by the exception) and then I will try to find the root cause in OpenSearch.
Do you think you can share a little bit more about your environment? Are you running the OpenSearch cluster on K8s? Are you using the OpenSearch K8s operator? Any other plugins besides the Prometheus exporter?
@lukas-vlcek OpenSearch is running on Debian 12 virtual machines, with 1 cluster_manager and 2 data nodes, without any plugins other than the default ones and the Prometheus exporter.
@Psych0meter Would you say that there is an intensive process involving indices going on in the cluster? For example, frequent index management operations? Or is the set of indices more static and settled? How many indices are there? What is the replication scheme of the indices, e.g. 1 primary and several replica shards?
@lukas-vlcek I have around 30 data streams that create indexes, and around 360 indices (200 created by data streams). Each data stream index is configured with 1 primary and 1 replica shard for now. Some indices are used more than others, with firewall logs for example, and as of today I have an average input of 300 events per second. I have also configured an index state management policy (for test purposes) that changes the replica count after 1 day and deletes the index after 7 days.
@Psych0meter That is very useful information. Thanks for sharing. How often are you getting this exception then? Are you getting it with every scrape cycle (I think it is every 1m by default)?
@lukas-vlcek I have a "dev" and a "prod" environment. In dev, Opensearch is not as used as the prod one. I first tried to uninstall / reinstall the plugin, and I got only one error since the restart, on one node, in the dev environment. I tried the same on the prod environment, but unfortunately it's not working, I have the error every minute, and the metrics are always unavailable
@Psych0meter I prepared a workaround in this repository/branch: https://github.com/lukas-vlcek/prometheus-exporter-plugin-for-opensearch/tree/catch_NPE_2.13.0.0
The workaround is quite simple, but it should give us a chance to:
To see the logs in the log file you will need to make sure that the org.compuscene.metrics.prometheus.PrometheusMetricsCollector class has the WARN level enabled.
There are several ways to enable logs; see the Logs documentation.
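For instance (this is just one of those ways, and an assumption on my side rather than a quote from that documentation), the logger level can usually be set dynamically through the cluster settings API:
PUT /_cluster/settings
{
  "transient" : {
    "logger.org.compuscene.metrics.prometheus.PrometheusMetricsCollector" : "WARN"
  }
}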
To build the plugin you need to clone the repo:
git clone git@github.com:lukas-vlcek/prometheus-exporter-plugin-for-opensearch.git
# navigate to the repo folder and checkout the branch with the workaround
git checkout catch_NPE_2.13.0.0
And then build using Gradle and Java 17 or 21 (ignore JavaDoc warnings for now):
./gradlew clean build
After that you will find the plugin ZIP in the ./build/distributions/ folder.
file ./build/distributions/prometheus-exporter-2.13.0.0.zip
./build/distributions/prometheus-exporter-2.13.0.0.zip: Zip archive data, at least v2.0 to extract, compression method=deflate
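If it helps, swapping the rebuilt plugin in on a node usually looks roughly like this (a sketch only; the installed plugin name and path are assumptions on my side, check bin/opensearch-plugin list, and the node needs a restart afterwards):
# remove the currently installed plugin and install the rebuilt ZIP (repeat on every node, then restart it)
bin/opensearch-plugin remove prometheus-exporter
bin/opensearch-plugin install file:///full/path/to/build/distributions/prometheus-exporter-2.13.0.0.zip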
If you do not want to build the plugin, let me know and I can upload the ZIP file somewhere.
@Psych0meter If you can replicate the problem at least once on your dev environment then that should be enough.
@lukas-vlcek I've tried to reproduce the problem, without success... The only thing that has changed on the environment since Friday is that the OS and installed packages (outside OpenSearch) were updated, and the servers were rebooted. Since then, no more errors, even in production.
Here are the updated packages, if it can help:
Start-Date: 2024-04-19 08:51:56
Requested-By: xxx
Upgrade: udev:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), libnss-myhostname:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), systemd-timesyncd:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), tzdata:amd64 (2023c-5+deb12u1, 2024a-0+deb12u1), libpam-systemd:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), tar:amd64 (1.34+dfsg-1.2, 1.34+dfsg-1.2+deb12u1), libsystemd0:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), libnss-systemd:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), systemd:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), libudev1:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), usr-is-merged:amd64 (35, 37~deb12u1), systemd-resolved:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), base-files:amd64 (12.4+deb12u4, 12.4+deb12u5), monitoring-plugins-basic:amd64 (2.3.3-5+deb12u1, 2.3.3-5+deb12u2), libcryptsetup12:amd64 (2:2.6.1-4~deb12u1, 2:2.6.1-4~deb12u2), mariadb-common:amd64 (1:10.11.4-1~deb12u1, 1:10.11.6-0+deb12u1), libnss-resolve:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), libmariadb3:amd64 (1:10.11.4-1~deb12u1, 1:10.11.6-0+deb12u1), libsystemd-shared:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), monitoring-plugins-common:amd64 (2.3.3-5+deb12u1, 2.3.3-5+deb12u2), systemd-sysv:amd64 (252.19-1~deb12u1, 252.22-1~deb12u1), libgnutls30:amd64 (3.7.9-2+deb12u1, 3.7.9-2+deb12u2), needrestart:amd64 (3.6-4, 3.6-4+deb12u1), postfix:amd64 (3.7.9-0+deb12u1, 3.7.10-0+deb12u1)
Remove: libevent-2.1-7:amd64 (2.1.12-stable-8), libunbound8:amd64 (1.17.1-2+deb12u2), libgnutls-dane0:amd64 (3.7.9-2+deb12u1)
End-Date: 2024-04-19 08:52:22
@Psych0meter thanks a lot for your feedback.
I am inclined to close this ticket and let others reopen it (cc @geckiss) if they can reproduce the problem, or at least provide more detailed logs from the branch that I prepared.
Hello,
we were using this plugin for some time without any issues. However, after this week's upgrade to 2.13, the metrics endpoint returns the following:
I can see the message in the logs every few seconds (so Prometheus is probably trying to scrape?):
The Grafana dashboard is empty.
I'm trying to reach the endpoint at https://opensearch-instance:9200/_prometheus/metrics. No other config option regarding this plugin was used. The plugin is installed in the following way (k8s pod):