Open yaronp68 opened 6 years ago
@yaronp68 could you add some details about this feature? Could you also check that these are the correct labels? I'm not sure how this relates to Watcher or Stack Monitoring.
For anyone else who comes across this issue, you can use HTTP input in a watch: https://www.elastic.co/guide/en/elasticsearch/reference/current/input-http.html#_calling_elasticsearch_apis
You're looking for "step":"ERROR" in the response, so it's roughly something like this:
PUT _watcher/watch/test
{
"trigger" : { "schedule" : { "interval" : "1m" } },
"input": {
"http": {
"request": {
"host": "localhost",
"port": 9243,
"path": "/*/_ilm/explain",
"scheme": "https"
}
}
},
"condition" : {
"script" :
"""
return ctx.payload.message.indexOf('ERROR') != -1
"""
},
"actions" : {
"my_log" : {
"logging" : {
"text" : "Found ILM errors."
}
}
}
}
Of course you can also use any other monitoring tool capable of querying an API, e.g. blackbox_exporter with an http probe + regex.
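For an external check along those lines, the core logic is a filter over the `indices` map of the `_ilm/explain` response. Here is a minimal Python sketch, assuming the response has already been fetched and parsed into a dict. The function name `find_ilm_errors` and the sample data are made up for illustration.

```python
def find_ilm_errors(explain_response):
    """Return the names of managed indices whose ILM step is ERROR."""
    indices = explain_response.get("indices", {})
    return sorted(
        name
        for name, info in indices.items()
        if info.get("managed") and info.get("step") == "ERROR"
    )


# Hypothetical, trimmed-down response for illustration only.
sample = {
    "indices": {
        "logs-000001": {"index": "logs-000001", "managed": True, "step": "ERROR"},
        "logs-000002": {"index": "logs-000002", "managed": True, "step": "complete"},
        "scratch": {"index": "scratch", "managed": False},
    }
}

print(find_ilm_errors(sample))
```

A monitoring script could alert whenever the returned list is non-empty.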
Here are some more example watches:
One Cluster:
POST _watcher/watch/_execute
{
"ignore_condition": true,
"watch": {
"trigger": {
"schedule": {
"interval": "10m"
}
},
"input": {
"http": {
"request": {
"scheme": "https",
"host": "<host>",
"port": 9200,
"path": "*/_ilm/explain",
"auth": {
"basic": {
"username": "<user>",
"password": "<pass>"
}
}
}
}
},
"transform": {
"script": """
ctx.payload.indices.values().stream()
.filter(ilmInfo -> ilmInfo.managed && ilmInfo.step == "ERROR")
.collect(Collectors.toList());
"""
},
"condition": {
"script": {
"source": "return !ctx.payload.empty()"
}
},
"actions" : {
"test_log" : {
"logging" : {
"text" : "Failed: {{ctx.payload}}"
}
}
}
}
}
Multiple clusters:
POST _watcher/watch/_execute
{
"ignore_condition": true,
"watch": {
"trigger": {
"schedule": {
"interval": "10m"
}
},
"input": {
"chain": {
"inputs": [
{
"asm": {
"http": {
"request": {
"scheme": "https",
"host": "elk1.thehost.local",
"port": 9200,
"method": "get",
"path": "/_all/_ilm/explain",
"params": {},
"headers": {},
"auth": {
"basic": {
"username": "username",
"password": "password"
}
}
}
}
}
},
{
"boo": {
"http": {
"request": {
"scheme": "https",
"host": "elk1.thehost.local",
"port": 9200,
"method": "get",
"path": "/_all/_ilm/explain",
"params": {},
"headers": {},
"auth": {
"basic": {
"username": "username",
"password": "password"
}
}
}
}
}
},
{
"bah": {
"http": {
"request": {
"scheme": "https",
"host": "elk1.thehost.local",
"port": 9200,
"method": "get",
"path": "/_all/_ilm/explain",
"params": {},
"headers": {},
"auth": {
"basic": {
"username": "username",
"password": "password"
}
}
}
}
}
}
]
}
},
"transform": {
"script": """
def test = [];
HashMap level = new HashMap();
ctx.payload.entrySet().stream().forEach(e -> {
e.value.indices.forEach((t, index) -> {
if (index.step == "ERROR") {
index.server = e.key;
test.add(index);
}
return true;
});
return true;
});
return test;
"""
},
"actions": {
"log": {
"logging": {
"text": """
ILM Error state. Please investigate:
<table>
<tr>
<th>Server</th>
<th>Index</th>
<th>Policy</th>
<th>Failed_step</th>
</tr>
{{#ctx.payload._value}}
<tr>
<th>{{server}}</th>
<th>{{index}}</th>
<th>{{policy}}</th>
<th>{{failed_step}}</th>
</tr>
{{/ctx.payload._value}}
</table>
"""
}
}
}
}
}
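For readers less familiar with Painless, the flattening done by the transform script above can be sketched in Python roughly as follows. This is only an illustration of the logic, not something Watcher runs. The function name `collect_errors` and the sample payload are assumptions, and unlike the script above it copies each entry instead of mutating it.

```python
def collect_errors(chain_payload):
    """Flatten per-cluster explain results into one list of failing indices."""
    errors = []
    for server, response in chain_payload.items():
        for info in response.get("indices", {}).values():
            if info.get("step") == "ERROR":
                # Copy the entry and tag it with the chain input name it came from.
                errors.append({**info, "server": server})
    return errors


# Hypothetical chain payload keyed by input name, for illustration only.
payload = {
    "asm": {
        "indices": {
            "a-1": {
                "index": "a-1",
                "policy": "p",
                "step": "ERROR",
                "failed_step": "check-rollover-ready",
            }
        }
    },
    "boo": {"indices": {"b-1": {"index": "b-1", "policy": "p", "step": "complete"}}},
}

for row in collect_errors(payload):
    print(row["server"], row["index"], row["policy"], row["failed_step"])
```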
@toddferg's watch from above isn't quite right. The condition is being ignored because of "ignore_condition": true, and if you enable the condition, it causes a scripting error. This version, tested and working as of 7.5.1, removes the "indices" and "_headers" fields to clean up the payload:
POST _watcher/watch/_execute
{
"watch": {
"trigger": {
"schedule": {
"interval": "10m"
}
},
"input": {
"http": {
"request": {
"scheme": "https",
"host": "$host",
"port": $port,
"path": "*/_ilm/explain",
"auth": {
"basic": {
"username": "elastic",
"password": "$password"
}
}
}
}
},
"condition": {
"script": {
"source": """
ctx.payload.ilm_errors =ctx.payload.indices.values().stream().filter(ilmInfo -> ilmInfo.managed && ilmInfo.step == "ERROR")
.collect(Collectors.toList());
ctx.payload.remove("_headers");
ctx.payload.remove("indices");
return ctx.payload.ilm_errors.length > 0
"""
}
},
"actions" : {
"test_log" : {
"logging" : {
"text" : "ILM Errors: {{ctx.payload.ilm_errors}}"
}
}
}
}
}
Thanks @bczifra
Hi @bczifra!
I'm trying your last update on 7.17.1 and get an empty "ILM Errors" in the action. Any suggestions? When I simulate the watch I can see all the info under ilm_errors, but I've tried different ways to get at it without luck.
@connosco2011 I don't understand what issue you are running into, but in 7.17.1 the ILM Explain Lifecycle API has an only_errors property. The above example should be adapted to take advantage of that new API property.
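As a rough illustration of what such a request looks like outside Watcher, this Python sketch builds an explain URL with only_errors set as a query parameter. The helper name `explain_url`, the host, and the port are placeholders, and passing the value as `true` is an assumption about how the flag is sent.

```python
from urllib.parse import urlencode, urlunsplit


def explain_url(host, port, pattern="*", only_errors=True):
    """Build an ILM explain URL, optionally restricted to indices in error."""
    query = urlencode({"only_errors": "true"}) if only_errors else ""
    path = f"/{pattern}/_ilm/explain"
    return urlunsplit(("https", f"{host}:{port}", path, query, ""))


# Placeholder host and port, for illustration only.
print(explain_url("localhost", 9200))
```

With the filtering done server-side, the client or watch condition only needs to check whether the returned `indices` map is empty.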
@bczifra sorry, I hadn't checked this new option. I will try to work with it and give you feedback. Thanks a lot for your fast answer.
{
"trigger": {
"schedule": {
"interval": "10m"
}
},
"input": {
"http": {
"request": {
"scheme": "https",
"host": "xxxxx",
"port": 9200,
"method": "get",
"path": "*/_ilm/explain",
"params": {
"only_errors": "",
"format": "json"
}
}
}
},
"condition": {
"script": {
"source": """
ctx.payload.ilm_errors =ctx.payload.indices.values().stream().collect(Collectors.toList());
ctx.payload.remove("_headers");
ctx.payload.remove("indices");
return ctx.payload.ilm_errors.length > 0
""",
"lang": "painless"
}
},
"actions": {
"test_log": {
"logging": {
"level": "info",
"text": """ILM Errors: {{#ctx.payload.ilm_errors}}
Index Name: {{index}} Reason: {{#step_info}}{{reason}}
{{/step_info}}{{/ctx.payload.ilm_errors}}"""
}
}
}
}
@bczifra This is working with 7.17. I'm having issues trying to export the same result that I have in test_log to an index, so that the results are indexed as separate documents.
Heya, Dev+Product!
Time has passed, and I'm wondering if y'all could instead consider officially implementing this use case request as a Kibana Stack Monitoring Alert (based on Alerting, not Watcher) looking back against either of the following:
Log lines matching "Moving to ERROR step" AND "org.elasticsearch.xpack.ilm.IndexLifecycleRunner", for example:
[2023-08-29T21:42:13,857][ERROR][org.elasticsearch.xpack.ilm.IndexLifecycleRunner] [instance-0000000000] policy [timeseries_policy] for index [kibana_sample_data_logs] failed on step [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]. Moving to ERROR step
java.lang.IllegalArgumentException: index.lifecycle.rollover_alias [times] does not point to index [kibana_sample_data_logs]
ilm-history* for any document in step:ERROR
The alert would trigger either every time the error occurs or only the first time the index enters the ERROR step, if possible. Would require es#99030. V7.17 example
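The log-based check amounts to matching the IndexLifecycleRunner message and pulling out the policy, index, and step. Here is a minimal Python sketch, assuming the log line format shown in the sample above. The regular expression and the function name `parse_ilm_error` are illustrative and not part of any Elastic tooling.

```python
import json
import re

# Matches the message portion of the sample IndexLifecycleRunner log line.
PATTERN = re.compile(
    r"policy \[(?P<policy>[^\]]+)\] for index \[(?P<index>[^\]]+)\] "
    r"failed on step \[(?P<step>\{.*?\})\]\. Moving to ERROR step"
)


def parse_ilm_error(line):
    """Return policy, index and step details, or None if the line does not match."""
    if "org.elasticsearch.xpack.ilm.IndexLifecycleRunner" not in line:
        return None
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "policy": match.group("policy"),
        "index": match.group("index"),
        "step": json.loads(match.group("step")),
    }


line = (
    '[2023-08-29T21:42:13,857][ERROR][org.elasticsearch.xpack.ilm.IndexLifecycleRunner] '
    '[instance-0000000000] policy [timeseries_policy] for index [kibana_sample_data_logs] '
    'failed on step [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]. '
    'Moving to ERROR step'
)

print(parse_ilm_error(line))
```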
Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)
Describe the feature:
Automated notification about an index's ILM Explain moving into an error (with symptoms of step:ERROR, usually with step_info:!null). (Errors will surface against the index's ILM Explain and not against the ILM Policy itself; updated from original description.)
Describe a specific use case for the feature:
An ILM policy is a critical background process for managing data in a way that ensures query performance and may save storage and hardware costs or prevent the cluster from running out of storage space. Failure of such a policy is therefore of interest to the cluster admin or index owner. This is (like ILM policies themselves) relevant for clusters that have a constant flow of inbound data (e.g. logs, metrics, other types of high-volume events).
Describe possible implementations:
There are two general Elastic ways to automate notifications about different cluster states. This thread covers unofficial methods for manually implementing Watcher (which was the recommended product when this issue was filed) and Alerting (which is now the recommended product). Re-purposing this issue to request an Elastic Prebuilt Alert or an official documentation recommendation.