Aiven-Open / prometheus-exporter-plugin-for-opensearch

Prometheus exporter plugin for OpenSearch & OpenSearch Mixin
Apache License 2.0

[Bug]? prometheus-exporter-plugin-for-opensearch shows the cluster status red, but the background view cluster status is always yellow. #278

Open east4ming opened 4 months ago

east4ming commented 4 months ago

Background

We switched from Elasticsearch (ES) to OpenSearch not long ago.

Recently we ran into the following situation:

The prometheus-exporter-plugin-for-opensearch reports the cluster status as red, but querying the cluster status directly via the OpenSearch API always returns yellow (it never transitions to red).

See the two screenshots for details (captions):

  • Screenshot 1: `opensearch_cluster_status` from prometheus-exporter-plugin-for-opensearch, in the Prometheus UI
  • Screenshot 2: OpenSearch status via `curl` against the OpenSearch API

Note: 07:00 - 07:05 UTC corresponds to 15:00 - 15:05 UTC+8. Both screenshots cover the same time window.

Details

  1. Our OpenSearch is a single node, so the status should always be yellow.
  2. OpenSearch version: 2.12.0
  3. prometheus-exporter version: 2.12.0.0
  4. Command used to check cluster status via the OpenSearch API: `curl -XGET -u username:password 'localhost:9200/_cat/health?v'`
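For reference, the `status` column can be pulled out of the tabular `_cat/health?v` response like this. A minimal sketch; the sample response below is illustrative, not actual output from the cluster discussed in this issue:

```python
# Parse the tabular output of the _cat/health?v endpoint and extract the
# "status" column. The sample below is an illustrative response, not real
# output from the cluster in this issue.
sample = (
    "epoch      timestamp cluster    status node.total node.data shards pri relo init unassign\n"
    "1718000000 07:00:00  opensearch yellow 1          1         10     10  0    0    5\n"
)

def cat_health_status(cat_output: str) -> str:
    """Return the value of the 'status' column from _cat/health?v output."""
    header, row = [line.split() for line in cat_output.strip().splitlines()]
    return row[header.index("status")]

print(cat_health_status(sample))  # yellow
```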

Other

If you need more details, please reply and I will attach them.

Thank you.

lukas-vlcek commented 4 months ago

@east4ming Hi, thanks for reporting. Before I investigate further I have some questions:

  • What action caused the cluster health state to change? Was it index creation?
  • And what is the Prometheus scraping interval?

In other words, is it possible that Prometheus scraped the metrics right after the index was created but before even the primary shards were allocated (thus the state would be red 🔴)? This can happen for a very short period of time, but if that is the moment Prometheus scrapes the metrics, then the metric will only be updated on the next scraping cycle.
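The race described above can be illustrated with a toy timeline: if a scrape lands inside a brief red window, Prometheus keeps reporting that value until the next scrape. The timestamps and the 2-second red window are hypothetical, chosen only for illustration:

```python
# Toy timeline of the race: the cluster is red only briefly (e.g. while
# primary shards of a new index are still unassigned), but a scrape landing
# inside that window makes Prometheus report "red" until the next cycle.
# The index-creation time (t=60s) and 2s allocation delay are hypothetical.

def status_at(t: float) -> str:
    """Cluster status at time t (seconds) in this hypothetical scenario."""
    return "red" if 60 <= t < 62 else "yellow"

scrape_interval = 60  # matches scrape_interval: 1m
scrapes = {t: status_at(t) for t in range(0, 301, scrape_interval)}
print(scrapes)  # {0: 'yellow', 60: 'red', 120: 'yellow', ...}
```

The scrape at t=60s captures the transient red state even though the cluster is yellow for all but two seconds of the five-minute span.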

east4ming commented 4 months ago

> @east4ming Hi, thanks for reporting. Before I investigate further I have some questions:
>
>   • What action caused the cluster health state to change? Was it index creation?
>   • And what is the Prometheus scraping interval?
>
> In other words is it possible that Prometheus scraped the metrics right after the index was created but before even the primary shards were allocated (thus the state would be red 🔴 )? This can happen for a very short period of time but if that is the moment Prometheus scrapes the metrics then the next update of metric will come with the next scraping cycle.

Thank you for your quick reply. Answers:

Q: What action caused the cluster health state to change? Was it index creation?
A: You're right. The status turns red while the index close is in progress, and the indices are deleted afterwards.

Q: And what is the Prometheus scraping interval?
A: See the YAML below: `scrape_interval: 1m` and `scrape_timeout: 30s`. (Most of this configuration follows our ES configuration, with some changes to fit OpenSearch, and ES never produced a red-status false positive.)

```yaml
global:
  evaluation_interval: 1m
  scrape_interval: 1m
  scrape_timeout: 10s
scrape_configs:
- job_name: 'log_opensearch'
  scrape_timeout: 30s
  static_configs:
      - targets:
        - 192.168.1.1:9200
  metrics_path: "/_prometheus/metrics"
  basic_auth:
    username: 'xxxx'
    password: 'xxxxxxxx'
  metric_relabel_configs:
  - action: keep
    regex: opensearch_circuitbreaker_tripped_count|opensearch_cluster_datanodes_number|opensearch_cluster_nodes_number|opensearch_cluster_pending_tasks_number|opensearch_cluster_shards_active_percent|opensearch_cluster_shards_number|opensearch_cluster_status|opensearch_cluster_task_max_waiting_time_seconds|opensearch_fs_io_total_read_bytes|opensearch_fs_io_total_write_bytes|opensearch_fs_path_free_bytes|opensearch_fs_path_total_bytes|opensearch_index_fielddata_evictions_count|opensearch_index_flush_total_count|opensearch_index_flush_total_time_seconds|opensearch_index_indexing_delete_current_number|opensearch_index_indexing_index_count|opensearch_index_indexing_index_current_number|opensearch_index_indexing_index_failed_count|opensearch_index_indexing_index_time_seconds|opensearch_index_merges_current_size_bytes|opensearch_index_merges_total_docs_count|opensearch_index_merges_total_stopped_time_seconds|opensearch_index_merges_total_throttled_time_seconds|opensearch_index_merges_total_time_seconds|opensearch_index_querycache_evictions_count|opensearch_index_querycache_hit_count|opensearch_index_querycache_memory_size_bytes|opensearch_index_querycache_miss_number|opensearch_index_refresh_total_count|opensearch_index_refresh_total_time_seconds|opensearch_index_requestcache_evictions_count|opensearch_index_requestcache_hit_count|opensearch_index_requestcache_memory_size_bytes|opensearch_index_requestcache_miss_count|opensearch_index_search_fetch_count|opensearch_index_search_fetch_current_number|opensearch_index_search_fetch_time_seconds|opensearch_index_search_query_count|opensearch_index_search_query_current_number|opensearch_index_search_query_time_seconds|opensearch_index_search_scroll_count|opensearch_index_search_scroll_current_number|opensearch_index_search_scroll_time_seconds|opensearch_index_segments_memory_bytes|opensearch_index_segments_number|opensearch_index_shards_number|opensearch_index_store_size_bytes|opensearch_index_translog_operations_number|opensearch_indices_indexing_index_count|opensearch_indices_store_size_bytes|opensearch_ingest_total_count|opensearch_ingest_total_failed_count|opensearch_ingest_total_time_seconds|opensearch_jvm_bufferpool_number|opensearch_jvm_bufferpool_total_capacity_bytes|opensearch_jvm_bufferpool_used_bytes|opensearch_jvm_gc_collection_count|opensearch_jvm_gc_collection_time_seconds|opensearch_jvm_mem_heap_committed_bytes|opensearch_jvm_mem_heap_used_bytes|opensearch_jvm_mem_nonheap_committed_bytes|opensearch_jvm_mem_nonheap_used_bytes|opensearch_jvm_threads_number|opensearch_jvm_uptime_seconds|opensearch_os_cpu_percent|opensearch_os_mem_used_percent|opensearch_os_swap_free_bytes|opensearch_os_swap_used_bytes|opensearch_threadpool_tasks_number|opensearch_threadpool_threads_number|opensearch_transport_rx_bytes_count|opensearch_transport_server_open_number|opensearch_transport_tx_bytes_count|up|opensearch_os_cpu_percent
```
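A quick sanity check on the timing: with this scrape interval, a red status lasting the roughly five minutes visible in the chart (07:00 - 07:05 UTC) spans several consecutive scrapes, so it cannot be explained by a single unlucky sample at a transient moment:

```python
# With scrape_interval: 1m, a ~5-minute red window (07:00-07:05 in the
# chart) covers several consecutive scrapes, not one transient sample.
scrape_interval_s = 60
red_window_s = 5 * 60
samples_in_window = red_window_s // scrape_interval_s
print(samples_in_window)  # 5
```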

And here is my alert rule (I thought you might want to know):

```yaml
- alert: opensearchClusterRed
  expr: opensearch_cluster_status == 2
  for: 0m
  labels:
    severity: emergency
  annotations:
    summary: opensearch Cluster Red (instance {{ $labels.instance }}, node {{ $labels.node }})
    description: "Elastic Cluster Red status\n  VALUE = {{ $value }}"
```

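The rule relies on the exporter's numeric encoding of cluster status. The `== 2` comparison for red is confirmed by the alert itself; the green = 0 and yellow = 1 values below follow the conventional health ordering and should be verified against the exporter's documentation:

```python
# Numeric status encoding assumed by the alert rule: red is exported as 2
# (confirmed by the PromQL expression above). green = 0 and yellow = 1
# follow the conventional health ordering; verify against the exporter docs.
STATUS_VALUE = {"green": 0, "yellow": 1, "red": 2}

def cluster_red_alert_fires(status_value: int) -> bool:
    """Mirror of the PromQL expression: opensearch_cluster_status == 2."""
    return status_value == 2

print(cluster_red_alert_fires(STATUS_VALUE["yellow"]))  # False
print(cluster_red_alert_fires(STATUS_VALUE["red"]))     # True
```

With `for: 0m`, a single scrape that observes the value 2 is enough to fire the alert.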
lukas-vlcek commented 4 months ago

Thanks for more details.

> Status red when the index close is in progress, and then index delete.

Would you mind sharing a bit more information about this process please?

I am trying to see if we can put together a sequence of steps that reliably reproduces this issue; that is why I am asking all these questions.

Thanks a lot! Lukáš

east4ming commented 4 months ago

Hi Lukáš,

Answers:

  • Q: Are you closing indices using their full ID, or are you using wildcard patterns as well?
    A: We use patterns.
  • Q: Following that, are you deleting the indices that have just been closed, or are the close and delete operations more independent?
    A: We delete the closed indices.
  • Q: Based on your scrape interval (1m), it seems Prometheus scraped the target several times and still got the red status. So what exactly happened during those 4 minutes visible in the chart above? Did you close a single index and then delete it?
    A: We needed to close 194 indices and delete them, so the red status persisted for a while.

Thanks

lukas-vlcek commented 4 months ago

@east4ming Thanks for the details.

I know that a red state can happen (for a short period) when a new index is created but based on your explanation this is not the case because you are actually closing and then deleting indices. I do not think there is any reason why the cluster should become red in this scenario.

Q: Just for clarity: you do make sure the index close operation finishes (i.e. yields a success/ack response, not a timeout or any error) before the delete operation is called, right?

east4ming commented 4 months ago

Yes, once the close operation finishes, the delete operation starts. These operations are performed by the opensearch-curator tool.
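The sequencing discussed above can be sketched as a guard that only issues the delete once the close call returns an acknowledged response. The `client` object below is a hypothetical stand-in for whatever HTTP client or curator action runs the calls; it is not the opensearch-curator API:

```python
# Sketch of the close-then-delete sequencing discussed in this thread:
# delete an index pattern only after the close call is acknowledged.
# `client` is a hypothetical stand-in for an HTTP client or curator
# action; it is not the opensearch-curator API.

def close_then_delete(client, pattern: str) -> bool:
    """Close indices matching `pattern`; delete only on an acknowledged close."""
    resp = client.close(pattern)
    if not resp.get("acknowledged"):
        return False  # close timed out or failed; do not delete
    return client.delete(pattern).get("acknowledged", False)
```

In a fake-client test, this returns True only when both calls acknowledge, and never issues the delete after a failed close.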