Open east4ming opened 4 months ago
@east4ming Hi, thanks for reporting. Before I investigate further I have some questions:
- What action caused the cluster health state to change? Was it index creation?
- And what is the Prometheus scraping interval?
In other words is it possible that Prometheus scraped the metrics right after the index was created but before even the primary shards were allocated (thus the state would be red 🔴 )? This can happen for a very short period of time but if that is the moment Prometheus scrapes the metrics then the next update of metric will come with the next scraping cycle.
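The timing question above can be made concrete: whether Prometheus observes a short-lived red window depends entirely on whether a scrape lands inside it. A minimal sketch (the timestamps and window lengths are hypothetical, not taken from the reporter's cluster):

```python
def scrapes_inside(window_start, window_end, interval, first_scrape=0.0):
    """Return the scrape timestamps that fall inside [window_start, window_end)."""
    hits = []
    t = first_scrape
    while t < window_end:
        if t >= window_start:
            hits.append(t)
        t += interval
    return hits

# A 10-second red window right after index creation, with a 60 s scrape interval:
print(scrapes_inside(5, 15, 60))   # scrape at t=0 misses the window -> []
print(scrapes_inside(55, 65, 60))  # scrape at t=60 lands inside -> [60]
```

If a scrape does land inside the window, the metric stays red until the next scrape cycle, which is exactly the effect described above.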
Thank you for your quick reply. Answers:
Q: What action caused the cluster health state to change? Was it index creation? A: You're right. The status goes red while an index close operation is in progress, followed by an index delete.
Q: And what is the Prometheus scraping interval? A: See the YAML below: scrape_interval: 1m and scrape_timeout: 30s. (Most of this configuration follows our ES configuration, scrape_interval and the like included, with a few things modified to fit OpenSearch — and ES never produced a status-red false positive.)
```yaml
global:
  evaluation_interval: 1m
  scrape_interval: 1m
  scrape_timeout: 10s
scrape_configs:
  - job_name: 'log_opensearch'
    scrape_timeout: 30s
    static_configs:
      - targets:
          - 192.168.1.1:9200
    metrics_path: "/_prometheus/metrics"
    basic_auth:
      username: 'xxxx'
      password: 'xxxxxxxx'
    metric_relabel_configs:
      - action: keep
        source_labels: [__name__]
        regex: opensearch_circuitbreaker_tripped_count|opensearch_cluster_datanodes_number|opensearch_cluster_nodes_number|opensearch_cluster_pending_tasks_number|opensearch_cluster_shards_active_percent|opensearch_cluster_shards_number|opensearch_cluster_status|opensearch_cluster_task_max_waiting_time_seconds|opensearch_fs_io_total_read_bytes|opensearch_fs_io_total_write_bytes|opensearch_fs_path_free_bytes|opensearch_fs_path_total_bytes|opensearch_index_fielddata_evictions_count|opensearch_index_flush_total_count|opensearch_index_flush_total_time_seconds|opensearch_index_indexing_delete_current_number|opensearch_index_indexing_index_count|opensearch_index_indexing_index_current_number|opensearch_index_indexing_index_failed_count|opensearch_index_indexing_index_time_seconds|opensearch_index_merges_current_size_bytes|opensearch_index_merges_total_docs_count|opensearch_index_merges_total_stopped_time_seconds|opensearch_index_merges_total_throttled_time_seconds|opensearch_index_merges_total_time_seconds|opensearch_index_querycache_evictions_count|opensearch_index_querycache_hit_count|opensearch_index_querycache_memory_size_bytes|opensearch_index_querycache_miss_number|opensearch_index_refresh_total_count|opensearch_index_refresh_total_time_seconds|opensearch_index_requestcache_evictions_count|opensearch_index_requestcache_hit_count|opensearch_index_requestcache_memory_size_bytes|opensearch_index_requestcache_miss_count|opensearch_index_search_fetch_count|opensearch_index_search_fetch_current_number|opensearch_index_search_fetch_time_seconds|opensearch_index_search_query_count|opensearch_index_search_query_current_number|opensearch_index_search_query_time_seconds|opensearch_index_search_scroll_count|opensearch_index_search_scroll_current_number|opensearch_index_search_scroll_time_seconds|opensearch_index_segments_memory_bytes|opensearch_index_segments_number|opensearch_index_shards_number|opensearch_index_store_size_bytes|opensearch_index_translog_operations_number|opensearch_indices_indexing_index_count|opensearch_indices_store_size_bytes|opensearch_ingest_total_count|opensearch_ingest_total_failed_count|opensearch_ingest_total_time_seconds|opensearch_jvm_bufferpool_number|opensearch_jvm_bufferpool_total_capacity_bytes|opensearch_jvm_bufferpool_used_bytes|opensearch_jvm_gc_collection_count|opensearch_jvm_gc_collection_time_seconds|opensearch_jvm_mem_heap_committed_bytes|opensearch_jvm_mem_heap_used_bytes|opensearch_jvm_mem_nonheap_committed_bytes|opensearch_jvm_mem_nonheap_used_bytes|opensearch_jvm_threads_number|opensearch_jvm_uptime_seconds|opensearch_os_cpu_percent|opensearch_os_mem_used_percent|opensearch_os_swap_free_bytes|opensearch_os_swap_used_bytes|opensearch_threadpool_tasks_number|opensearch_threadpool_threads_number|opensearch_transport_rx_bytes_count|opensearch_transport_server_open_number|opensearch_transport_tx_bytes_count|up|opensearch_os_cpu_percent
```
And here is my alert rule (I thought you might want to know):
```yaml
- alert: opensearchClusterRed
  expr: opensearch_cluster_status == 2
  for: 0m
  labels:
    severity: emergency
  annotations:
    summary: opensearch Cluster Red (instance {{ $labels.instance }}, node {{ $labels.node }})
    description: "Elastic Cluster Red status\n  VALUE = {{ $value }}"
```
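For context on why the rule compares against 2: the alert expression implies the exporter encodes cluster health numerically, with red mapping to 2. A minimal sketch of that assumed green=0 / yellow=1 / red=2 encoding:

```python
# Assumed encoding, consistent with the alert expression above: green=0, yellow=1, red=2
STATUS_VALUE = {"green": 0, "yellow": 1, "red": 2}

def alert_fires(status: str) -> bool:
    """Mirror of the PromQL condition: opensearch_cluster_status == 2."""
    return STATUS_VALUE[status] == 2

print(alert_fires("yellow"))  # False
print(alert_fires("red"))     # True
```

Note that with `for: 0m` the alert fires on the first scrape that sees red; raising `for` to a couple of scrape cycles would ride out a transient red window, at the cost of delaying a genuine alert.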
Thanks for more details.
> Status red when the index close is in progress, and then index delete.

Would you mind sharing a bit more information about this process, please?
- Are you closing indices using their full names, or are you using wildcard patterns as well?
- Following that, are you deleting the indices that have been closed right before that? Or are the `close` and `delete` operations more independent?
- Based on your scrape interval (1m), it seems Prometheus scraped the target several times and still got the red status. So what exactly happened during those 4 minutes that we can see in the chart above?

I am trying to see if we can recreate the sequence of steps to reliably replicate this issue; that is why I ask all these questions.
Thanks a lot! Lukáš
Hi Lukáš,
Answers:
- Q: Are you closing indices using their full ID, or are you using wildcard patterns as well? A: We use a pattern.
- Q: Following that, are you deleting the indices that have been closed right before that? Or are the close and delete operations more independent? A: We delete the closed indices.
- Q: Based on your scrape interval (1m), it seems Prometheus scraped the target several times and still got the red status. So what exactly happened during those 4 minutes that we can see in the chart above? You closed a single index and then deleted it? A: We needed to close 194 indices and delete them, so the red status went on for a while.
Thanks
@east4ming Thanks for the details.
I know that a red state can happen (for a short period) when a new index is created but based on your explanation this is not the case because you are actually closing and then deleting indices. I do not think there is any reason why the cluster should become red in this scenario.
Q: Just for clarity — you do make sure the index `close` operation finishes (i.e. yields a success/ack response, not a timeout or any error) before the `delete` operation is called, right?
Yes. Once the close operation finishes, the delete operation starts. These operations are issued by the opensearch-curator tool.
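The sequence described here — delete only after close is acknowledged — can be sketched as a small gate. This is an illustrative stand-in (the `client` object and its `close`/`delete` methods are hypothetical), not curator's actual code:

```python
class CloseNotAcknowledged(RuntimeError):
    pass

def close_then_delete(client, pattern):
    """Close indices matching `pattern`; delete them only if the close was acknowledged."""
    resp = client.close(pattern)
    if not resp.get("acknowledged"):
        # Stop here on timeout/error so we never delete indices that failed to close.
        raise CloseNotAcknowledged(f"close of {pattern!r} was not acknowledged")
    return client.delete(pattern)
```

Under this sketch, a delete can never race ahead of an unacknowledged close, which matches the reporter's description of curator's behavior.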
Background
We switched from ES to OpenSearch not long ago.
Recently, we ran into the following case:
The prometheus-exporter-plugin-for-opensearch reports the cluster status as red, but querying the cluster status directly (via the OpenSearch API) always shows yellow (it never transitions to red).
See the following figure for details:
Note: 07:00 - 07:05 UTC corresponds to 15:00 - 15:05 in UTC+8. The two pictures above cover the same time span.
Details
`curl -XGET -u username:password localhost:9200/_cat/health?v`
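The `?v` flag makes `_cat/health` print a header row, so the output is easy to check programmatically when comparing the API's view against the exporter's metric. A small sketch (the sample output below is illustrative, not from the reporter's cluster):

```python
def parse_cat_health(text):
    """Parse the two-line output of GET /_cat/health?v into a dict."""
    header, row = [line.split() for line in text.strip().splitlines()[:2]]
    return dict(zip(header, row))

sample = (
    "epoch      timestamp cluster status node.total node.data shards pri relo init unassign\n"
    "1670000000 07:03:20  log     yellow 3          3         120    60  0    0    4\n"
)
health = parse_cat_health(sample)
print(health["status"])  # yellow
```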
Other
If you need more details, please reply and I'll attach them in due course.
Thank you, sir.