[Alerting][ILM] Alert for index moving into ilm.step:ERROR

yaronp68 commented 6 years ago

Describe the feature:

Automated notification when an index's ILM Explain output moves into an error state (symptom: step:ERROR, usually with step_info:!null). Errors surface against the index's ILM Explain and not against the ILM policy itself (updated from the original description).

Describe a specific use case for the feature:

An ILM policy is a critical background process for managing data in a way that ensures query performance and may save storage and hardware costs or prevent running out of storage space. Failure of such a policy is therefore of interest to the cluster admin or index owner. This is (like ILM policies themselves) relevant for clusters that have a constant flow of inbound data (e.g. logs, metrics, other types of high-volume events).

Describe possible implementations:

There are two general Elastic ways to automate notifications about different cluster states. The thread below covers unofficial methods for manually implementing this with Watcher (the recommended product when this issue was filed) and Alerting (the product recommended now). Re-purposing this issue to request an Elastic prebuilt alert or an official documentation recommendation.

cjcenizal commented 5 years ago

@yaronp68 could you add some details about this feature? Could you also check that these are the correct labels? I’m not sure how this relates to Watcher or Stack Monitoring.

mwasilew2 commented 4 years ago

For anyone else who comes across this issue, you can use HTTP input in a watch: https://www.elastic.co/guide/en/elasticsearch/reference/current/input-http.html#_calling_elasticsearch_apis

You're looking for "step":"ERROR" in the response, so it's roughly something like this:

PUT _watcher/watch/test
{
  "trigger" : { "schedule" : { "interval" : "1m" } },
  "input": {
    "http": {
      "request": {
        "host": "localhost",
        "port": 9243,
        "path": "/*/_ilm/explain",
        "scheme": "https"
      }
    }
  },
  "condition" : {
    "script" :
    """
      return ctx.payload.message.indexOf('ERROR') != -1
    """
  },
  "actions" : {
    "my_log" : {
      "logging" : {
        "text" : "Found ILM errors."
      }
    }
  }
}

Of course you can also use any other monitoring tool capable of querying an API, e.g. blackbox_exporter with an http probe + regex.
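For the non-Watcher route, the same check can be sketched in a few lines of Python. This is only a minimal sketch: the function name `indices_in_error` and the sample data are made up for illustration, and it assumes the response has already been fetched from `/*/_ilm/explain` and parsed into a dict with the usual `indices` map.

```python
def indices_in_error(explain_response):
    """Return {index_name: step_info} for managed indices whose ILM step is ERROR."""
    failed = {}
    for name, info in explain_response.get("indices", {}).items():
        if info.get("managed") and info.get("step") == "ERROR":
            failed[name] = info.get("step_info")
    return failed


# Hypothetical sample, shaped like an ILM explain response.
sample = {
    "indices": {
        "logs-000001": {
            "index": "logs-000001",
            "managed": True,
            "policy": "timeseries_policy",
            "step": "ERROR",
            "failed_step": "check-rollover-ready",
            "step_info": {"type": "illegal_argument_exception", "reason": "example reason"},
        },
        "logs-000002": {"index": "logs-000002", "managed": True, "step": "complete"},
        "unmanaged-index": {"index": "unmanaged-index", "managed": False},
    }
}

print(indices_in_error(sample))
```

Checking the parsed `step` field instead of searching the raw body for the string ERROR avoids false positives from index names or messages that happen to contain that word.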

toddferg commented 4 years ago

Here are some more example watchers:

One Cluster:

POST _watcher/watch/_execute
{
  "ignore_condition": true,
  "watch": {
    "trigger": {
      "schedule": {
        "interval": "10m"
      }
    },
    "input": {
      "http": {
        "request": {
          "scheme": "https",
          "host": "<host>",
          "port": 9200,
          "path": "*/_ilm/explain",
          "auth": {
            "basic": {
            "username": "<user>",
            "password": "<pass>"
            }
          }
        }
      }
    },
    "transform": {
      "script": """
      ctx.payload.indices.values().stream()
        .filter(ilmInfo -> ilmInfo.managed && ilmInfo.step == "ERROR")
        .collect(Collectors.toList());
      """
    },
    "condition": {
      "script": {
        "source": "return !ctx.payload.empty()"
      }
    },
    "actions" : {
      "test_log" : {
        "logging" : {
          "text" : "Failed: {{ctx.payload}}"
        }
      }
    }
  }
}

Multiple clusters:

POST _watcher/watch/_execute
{
  "ignore_condition": true,
  "watch": {
    "trigger": {
      "schedule": {
        "interval": "10m"
      }
    },
    "input": {
      "chain": {
        "inputs": [
          {
            "asm": {
              "http": {
                "request": {
                  "scheme": "https",
                  "host": "elk1.thehost.local",
                  "port": 9200,
                  "method": "get",
                  "path": "/_all/_ilm/explain",
                  "params": {},
                  "headers": {},
                  "auth": {
                    "basic": {
                      "username": "username",
                      "password": "password"
                    }
                  }
                }
              }
            }
          },
          {
            "boo": {
              "http": {
                "request": {
                  "scheme": "https",
                  "host": "elk1.thehost.local",
                  "port": 9200,
                  "method": "get",
                  "path": "/_all/_ilm/explain",
                  "params": {},
                  "headers": {},
                  "auth": {
                    "basic": {
                      "username": "username",
                      "password": "password"
                    }
                  }
                }
              }
            }
          },
          {
            "bah": {
              "http": {
                "request": {
                  "scheme": "https",
                  "host": "elk1.thehost.local",
                  "port": 9200,
                  "method": "get",
                  "path": "/_all/_ilm/explain",
                  "params": {},
                  "headers": {},
                  "auth": {
                    "basic": {
                      "username": "username",
                      "password": "password"
                    }
                  }
                }
              }
            }
          }
        ]
      }
    },
    "transform": {
      "script": """
      def test = [];
      HashMap level = new HashMap();
      ctx.payload.entrySet().stream().forEach(e -> {
        e.value.indices.forEach((t, index) -> {
         if (index.step == "ERROR") {
           index.server = e.key;
           test.add(index);
         }
         return true;
        });
       return true;
      });
      return test;
      """
    },
    "actions": {
      "log": {
        "logging": {
          "text": """
          ILM Error state. Please investigate:
          <table>
              <tr> 
                <th>Server</th> 
                <th>Index</th> 
                <th>Policy</th>
                <th>Failed_step</th>
              </tr>
              {{#ctx.payload._value}}
                  <tr>
                      <th>{{server}}</th>
                      <th>{{index}}</th>
                      <th>{{policy}}</th>
                      <th>{{failed_step}}</th>
                  </tr>
              {{/ctx.payload._value}}
          </table>
         """
        }
      }
    }
  }
}

bczifra commented 4 years ago

@toddferg's watch from above isn't quite right. The condition is being ignored because of "ignore_condition": true, and if you enable the condition, it causes a scripting error. This version, tested and working as of 7.5.1, also removes the "indices" and "_headers" fields to clean up the payload:

POST _watcher/watch/_execute
{
  "watch": {
    "trigger": {
      "schedule": {
        "interval": "10m"
      }
    },
    "input": {
      "http": {
        "request": {
          "scheme": "https",
          "host": "$host",
          "port": $port,
          "path": "*/_ilm/explain",
          "auth": {
            "basic": {
            "username": "elastic",
            "password": "$password"
            }
          }
        }
      }
    },
    "condition": {
      "script": {
        "source": """
        ctx.payload.ilm_errors = ctx.payload.indices.values().stream()
          .filter(ilmInfo -> ilmInfo.managed && ilmInfo.step == "ERROR")
          .collect(Collectors.toList());
        ctx.payload.remove("_headers");
        ctx.payload.remove("indices");
        return ctx.payload.ilm_errors.length > 0
        """
      }
    }, 
    "actions" : {
      "test_log" : {
        "logging" : {
          "text" : "ILM Errors: {{ctx.payload.ilm_errors}}"
        }
      }
    }
  }
}

toddferg commented 3 years ago

Thanks @bczifra

connosco2011 commented 1 year ago

Hi @bczifra!

I'm trying your last update on 7.17.1 and I get an empty ILM Errors in the action. Any suggestions? When I simulate the watch I can see all the info under ilm_errors, but I have tried different ways to get at it without luck.

bczifra commented 1 year ago

@connosco2011 I don't understand what issue you are running into, but in 7.17.1 the ILM Explain Lifecycle API has an only_errors property. The above example should be adapted to take advantage of that new API property.
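As a rough illustration of the `only_errors` parameter mentioned above, here is a small Python sketch that builds the request URL. The helper name `explain_url` and the base URL are assumptions for the example; only the `/_ilm/explain` path and the `only_errors` query parameter come from the API itself.

```python
from urllib.parse import quote, urlencode


def explain_url(base_url, index_pattern="*", only_errors=True):
    """Build an ILM explain URL, optionally restricted to indices in an error state."""
    # Keep '*' and ',' unescaped so wildcard and comma-separated patterns survive.
    path = f"{base_url.rstrip('/')}/{quote(index_pattern, safe='*,')}/_ilm/explain"
    params = {"only_errors": "true"} if only_errors else {}
    return f"{path}?{urlencode(params)}" if params else path


# Hypothetical cluster address.
print(explain_url("https://localhost:9200"))
```

With `only_errors` set, the server does the filtering, so the watch (or any external script) no longer needs to test `step == "ERROR"` itself.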

connosco2011 commented 1 year ago

@bczifra Sorry, I hadn't checked this new option. I will try to work with it and give you feedback. Thanks a lot for your fast answer.

connosco2011 commented 1 year ago

{
  "trigger": {
    "schedule": {
      "interval": "10m"
    }
  },
  "input": {
    "http": {
      "request": {
        "scheme": "https",
        "host": "xxxxx",
        "port": 9200,
        "method": "get",
        "path": "*/_ilm/explain",
        "params": {
          "only_errors": "",
          "format": "json"
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": """
        ctx.payload.ilm_errors = ctx.payload.indices.values().stream().collect(Collectors.toList());
        ctx.payload.remove("_headers");
        ctx.payload.remove("indices");
        return ctx.payload.ilm_errors.length > 0
        """,
      "lang": "painless"
    }
  },
  "actions": {
    "test_log": {
      "logging": {
        "level": "info",
        "text": """ILM Errors: {{#ctx.payload.ilm_errors}}
                   Index Name: {{index}} Reason: {{#step_info}}{{reason}}
                   {{/step_info}}{{/ctx.payload.ilm_errors}}"""
      }
    }
  }
}

@bczifra This is working with 7.17. I'm having issues trying to export the same result that I have in test_log to an index, so that each result is indexed as a separate document.
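One way to get one document per failing index, outside of Watcher, is to turn the explain response into a `_bulk` request body. The sketch below only builds that body and does not send it; the function name `to_bulk_body`, the target index name `ilm-errors`, and the choice of fields are all assumptions for illustration, and the input is assumed to be a response already filtered with `only_errors`.

```python
import json


def to_bulk_body(explain_response, target_index):
    """Build an NDJSON _bulk body with one document per index in the response."""
    lines = []
    for info in explain_response.get("indices", {}).values():
        doc = {
            "index": info.get("index"),
            "policy": info.get("policy"),
            "failed_step": info.get("failed_step"),
            "reason": (info.get("step_info") or {}).get("reason"),
        }
        # Each document is preceded by its own action line.
        lines.append(json.dumps({"index": {"_index": target_index}}))
        lines.append(json.dumps(doc))
    # The _bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


# Hypothetical sample with a single failing index.
sample = {
    "indices": {
        "logs-000001": {
            "index": "logs-000001",
            "policy": "timeseries_policy",
            "failed_step": "check-rollover-ready",
            "step_info": {"reason": "example reason"},
        }
    }
}

print(to_bulk_body(sample, "ilm-errors"))
```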

stefnestor commented 1 year ago

👋🏼 heya, Dev+Product!

Time's passed and I'm wondering if y'all could instead consider officially implementing this use case request as a Kibana Stack Monitoring Alert (based on Alerting not Watcher) looking back against either

ES cluster logs, searched with Lucene for "Moving to ERROR step" AND "org.elasticsearch.xpack.ilm.IndexLifecycleRunner"

[2023-08-29T21:42:13,857][ERROR][org.elasticsearch.xpack.ilm.IndexLifecycleRunner] [instance-0000000000] policy [timeseries_policy] for index [kibana_sample_data_logs] failed on step [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]. Moving to ERROR step
    java.lang.IllegalArgumentException: index.lifecycle.rollover_alias [times] does not point to index [kibana_sample_data_logs]

ilm-history* for any document in step:ERROR, triggering either every time it occurs or, if possible, only the first time an index enters the step. Would require es#99030. V7.17 example
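To show what the log-based option has to work with, here is a minimal Python sketch that pulls the policy, index and failed step out of a log line like the one quoted above. The regular expression is written against that single sample line only, so treat it as an assumption about the message format and not a guaranteed contract; `parse_ilm_error` is a made-up helper name.

```python
import json
import re

# Matches the message part of an IndexLifecycleRunner "Moving to ERROR step" log line.
LOG_PATTERN = re.compile(
    r"policy \[(?P<policy>[^\]]+)\] for index \[(?P<index>[^\]]+)\] "
    r"failed on step \[(?P<step>\{.*?\})\]\. Moving to ERROR step"
)


def parse_ilm_error(line):
    """Return policy, index and step details from a log line, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        "policy": match["policy"],
        "index": match["index"],
        "step": json.loads(match["step"]),  # the step key is logged as JSON
    }


line = (
    '[2023-08-29T21:42:13,857][ERROR][org.elasticsearch.xpack.ilm.IndexLifecycleRunner] '
    '[instance-0000000000] policy [timeseries_policy] for index [kibana_sample_data_logs] '
    'failed on step [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]. '
    'Moving to ERROR step'
)
print(parse_ilm_error(line))
```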

elasticmachine commented 1 year ago

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

elastic / kibana

[Alerting][ILM] Alert for index moving into ilm.step:ERROR #21023