elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Provide a way to clear/reset stats #9693

Closed ppf2 closed 8 years ago

ppf2 commented 9 years ago

Currently, a lot of the stats (e.g. node stats) are running totals and do not get reset until the node is restarted. Sometimes it is useful for admins to be able to reset the stats without having to restart the node. For example, an admin dealing with bulk thread pool rejections and experimenting with the number of concurrent bulk clients/threads to see if it decreases the number of bulk rejections would find it handy to reset the running totals as they make changes, instead of having to keep noting down what the running total count was.
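Absent a reset API, the workaround is to track deltas on the client. A minimal sketch of that, assuming a node on localhost:9200; the endpoint and field path follow the node stats API:

```python
# Sketch of the client-side workaround: poll the bulk thread pool
# rejection counter and print per-interval deltas while tuning, since
# the counter itself cannot be reset. Assumes a node on localhost:9200.
import time

import requests

STATS_URL = "http://localhost:9200/_nodes/stats/thread_pool"

def bulk_rejections():
    nodes = requests.get(STATS_URL).json()["nodes"]
    return {node_id: n["thread_pool"]["bulk"]["rejected"]
            for node_id, n in nodes.items()}

baseline = bulk_rejections()
while True:
    time.sleep(30)
    current = bulk_rejections()
    for node_id, total in current.items():
        print(f"{node_id}: +{total - baseline.get(node_id, 0)} "
              f"bulk rejections in the last 30s")
    baseline = current
```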

bleskes commented 9 years ago

+1


clintongormley commented 9 years ago

@bleskes I like the idea, but how would we do it without throwing marvel stats way off?

bleskes commented 9 years ago

I don't think Marvel should stand in the way of making the API more human-friendly where we can. We should solve this on the Marvel side. For this specific issue, Marvel already has protection in some places; others we will have to tackle. All in all it's not too bad, imho, as it is a local hiccup which can happen anyway (node restarts).

PS. This is yet another interesting error case that the coming derivative reducer will need to deal with.

FestivalBobcats commented 9 years ago

Was there any further thought on this? Tracking changes in average query time, for example, is difficult when the process uptime is 10+ days and billions of requests have come through.

I could periodically do a rolling restart (contingent on having ample replication to prevent partial downtime), but that seems like introducing unnecessary potential for chaos.

FestivalBobcats commented 9 years ago

Would it make more sense to present values (like query_total and query_time_in_millis) in two or more forms -- one for the entire process uptime and one within a time window? E.g. x total queries, y within the last 5 minutes?

Edit: okay, the time window is probably impractical if it's "x minutes from now"... maybe have a "transient" total that's periodically refreshed (query_time_in_millis since x minutes ago)?
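That "transient total" can be approximated client-side today. A sketch, assuming a single node on localhost:9200: sample the query_total counter periodically and subtract the oldest sample still inside the window.

```python
# Sketch of a client-side "queries in the last N minutes" window:
# sample the query_total running counter periodically and subtract
# the oldest sample still inside the window. Assumes one node on
# localhost:9200.
import time
from collections import deque

import requests

WINDOW_SECONDS = 300  # five minutes
samples = deque()     # (timestamp, query_total) pairs

def query_total():
    stats = requests.get(
        "http://localhost:9200/_nodes/stats/indices/search").json()
    node = next(iter(stats["nodes"].values()))
    return node["indices"]["search"]["query_total"]

while True:
    now = time.time()
    samples.append((now, query_total()))
    while samples and samples[0][0] < now - WINDOW_SECONDS:
        samples.popleft()  # drop samples that fell out of the window
    print(f"~{samples[-1][1] - samples[0][1]} queries "
          f"in the last {WINDOW_SECONDS}s")
    time.sleep(15)
```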

nik9000 commented 9 years ago

I don't think Marvel should stand in the way of making the API more human-friendly where we can. We should solve this on the Marvel side. For this specific issue, Marvel already has protection in some places.

RRDtool-style bounds checking is probably the way to go here.

nik9000 commented 9 years ago

Was there any further thought on this? Tracking changes in average query time, for example, is difficult when the process uptime is 10+ days and billions of requests have come through.

I know I just mentioned RRDtool - but its job is to do this kind of thing. I'm pretty sure whisper will do the job too.

FestivalBobcats commented 9 years ago

@nik9000 I admittedly may be missing something, but how would RRDtool fix the issue? To clarify, it's simply an issue of a very long running average -- it would be totally fine if I could do a moving average, calculating the value I need based on a neighboring subset of data points.

But since Elasticsearch stats do not have data points over time (which is clearly understandable), the accuracy/sensitivity of the average degrades rather quickly on clusters with heavy traffic.

I'm already aggregating these stats and persisting them over time in Kibana. But just persisting the data points does not change the fact that averages (such as ms taken per query) can only be taken from a diluted total average (total query ms / total queries).

I guess I could do something like this:

  1. When the first "stat" document is indexed, I record the original numbers for process uptime, total queries, and total query ms.
  2. I subtract those original numbers from the values each time I aggregate stats, calculating the average from a "new" baseline (as though the values were 0 when I started monitoring).
  3. I update the starting values for process uptime, total queries, etc. for every node each time it restarts.

Maybe I'm over-thinking this. Math is not my strong suit. But from my perspective, the fix to my problem is (so, so much) easier with a "reset stats" operation.
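A sketch of that three-step plan, assuming localhost:9200; here a drop in jvm.uptime_in_millis (which is part of the node stats response) is used to detect a restart and re-baseline a node:

```python
# Sketch of the baseline plan above: record starting totals once,
# subtract them on every sample, and re-baseline a node whenever its
# JVM uptime goes backwards (i.e. it restarted and its counters reset).
# Assumes localhost:9200; field paths follow the node stats API.
import requests

URL = "http://localhost:9200/_nodes/stats/indices,jvm"

def sample():
    return {
        node_id: {
            "uptime": n["jvm"]["uptime_in_millis"],
            "query_total": n["indices"]["search"]["query_total"],
            "query_ms": n["indices"]["search"]["query_time_in_millis"],
        }
        for node_id, n in requests.get(URL).json()["nodes"].items()
    }

baselines = sample()

def report():
    for node_id, cur in sample().items():
        base = baselines.get(node_id)
        if base is None or cur["uptime"] < base["uptime"]:
            # New node, or a restart: its counters started over at zero.
            base = {"uptime": cur["uptime"], "query_total": 0, "query_ms": 0}
            baselines[node_id] = base
        queries = cur["query_total"] - base["query_total"]
        millis = cur["query_ms"] - base["query_ms"]
        if queries:
            print(f"{node_id}: {millis / queries:.2f} ms/query since baseline")
```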

clintongormley commented 9 years ago

it would be totally fine if I could do a moving average, calculating the value I need based on a neighboring subset of data points.

@FestivalBobcats Coming soon, to a search engine near you: https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-pipeline-movavg-aggregation.html
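For reference, a sketch of how that pipeline aggregation composes when run against self-collected stats snapshots; the "stats" index and the timestamp / query_ms field names are hypothetical:

```python
# Sketch of the moving_avg pipeline aggregation from the linked docs:
# smooth a per-minute average over a five-bucket window. The "stats"
# index and the timestamp / query_ms field names are hypothetical.
import requests

body = {
    "size": 0,
    "aggs": {
        "per_minute": {
            "date_histogram": {"field": "timestamp", "interval": "1m"},
            "aggs": {
                "avg_query_ms": {"avg": {"field": "query_ms"}},
                "smoothed": {
                    "moving_avg": {"buckets_path": "avg_query_ms", "window": 5}
                },
            },
        }
    },
}
resp = requests.post("http://localhost:9200/stats/_search", json=body)
print(resp.json()["aggregations"]["per_minute"]["buckets"])
```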

FestivalBobcats commented 9 years ago

@clintongormley super cool feature -- looking forward to it.

However (apologies for being a nuisance), even with a moving average, if the time-series data point I'm collecting from /_nodes/stats?indices is average_query_ms (or really average_anything), it's calculated from two (potentially huge) running totals -- a total amount of time (query_time_in_millis) and a total count of queries (query_total). Thus, seeing any sharp changes in this value is impossible after a few hours of production use.

This is an issue with several large clusters I'm dealing with. The only solutions I see:

1.) Periodically resetting counters such as query_total and query_time_in_millis. I don't see much of a downside to this other than the actual implementation within Elasticsearch itself -- I'm currently not sure of the feasibility of this.

2.) On each capture of this info, storing the "existing" or "baseline" totals such as query_total (somewhere) in order to eliminate the impact of some previous state of the system in current calculations.

These "baseline" totals would have to be reset when nodes are restarted, introducing some extra complexity.

This is doable, but is a huge pain in my ass -- of course a "reset stat counter" feature may be equally so for the ES team. If it's the only practical way to achieve this for the time being, then I guess it will have to do.

3.) Not depending on Elasticsearch stats for this info... Though for something like average_query_ms, it's unclear to me how I could aggregate this data outside of Elasticsearch. Some requests go through a proxy using HTTP, but for the ones using the TransportClient, I don't have any practical external way of determining the type of operation and its runtime.

Note: all of this should make more sense in the context of me not having access to the application. This is purely devops-level monitoring.

clintongormley commented 9 years ago

@FestivalBobcats we don't provide average_query_ms or similar, only totals, so for any periodic monitoring this shouldn't be an issue, no?

FestivalBobcats commented 9 years ago

@clintongormley yes, the average_query_ms is a calculation on my end.

The problem, in the simplest way I can verbalize, is that those totals become basically worthless after a significant volume of requests.

Yes, I could (and currently do) store values like query_total periodically, and then calculate the difference between those totals at each interval.

So, at either the point of indexing a doc with this average_query_ms, or at the point of searching, I must pull up the doc recorded in the prior interval to know how many queries happened during that interval.

Whatever the case, I get that this is likely a frivolous feature for the community at large, and so I'm already well on my way to a custom solution.

clintongormley commented 9 years ago

@FestivalBobcats Don't forget I was the one who opened this issue in the first place :) My reason for wanting it was a JS diagnostics plugin that just produced a snapshot of the current state, and yes, the calculated averages etc. become less meaningful the longer the system is up.

Like you, I ended up just keeping the first set of stats around in the browser and using them to calculate a delta. I can still see the benefit of being able to reset stats.

That said, for any long term monitoring, storing the absolute values and using aggs to do the calculations is probably the way to go.
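A sketch of that "store the absolute values, let aggs do the math" approach using the derivative pipeline aggregation, which recovers per-interval deltas from a stored running total; the "stats" index and field names are hypothetical:

```python
# Sketch: snapshot query_total into time-series docs, then use a
# derivative pipeline aggregation to turn the running total back into
# per-interval counts. The "stats" index and field names are
# hypothetical.
import requests

body = {
    "size": 0,
    "aggs": {
        "per_minute": {
            "date_histogram": {"field": "timestamp", "interval": "1m"},
            "aggs": {
                "total": {"max": {"field": "query_total"}},
                "queries_per_minute": {"derivative": {"buckets_path": "total"}},
            },
        }
    },
}
resp = requests.post("http://localhost:9200/stats/_search", json=body)
for bucket in resp.json()["aggregations"]["per_minute"]["buckets"]:
    d = bucket.get("queries_per_minute")  # absent for the first bucket
    print(bucket["key_as_string"], d["value"] if d else None)
```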

FestivalBobcats commented 9 years ago

@clintongormley haha, sorry to hijack the spotlight. Like you, I'm still all for a reset stats feature. For the time being though, caching the running totals at each interval (to subtract from the values next interval) has worked well, and wasn't as hard as I thought.

clintongormley commented 8 years ago

Given that we now have moving averages etc., I think this feature request is no longer needed. Closing.

dadoonet commented 8 years ago

Actually, I don't see how moving averages help to clear existing stats. I'm probably missing something.

IMO we should reopen this issue and try to reset current stats (if possible), exactly as after a rolling restart, but without taking the risk of restarting a cluster.

@clintongormley WDYT?

ESamir commented 8 years ago

+1 to add reset stats feature

@clintongormley how will moving averages help here?

hsmithatemma commented 8 years ago

+1 to clearing counters. It would be helpful. Specifically these from /_cat/thread_pool: bulk.active, bulk.queue, bulk.rejected, index.active, index.queue, index.rejected, search.active, search.queue, search.rejected.
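Those columns can at least be pulled selectively today. A sketch, assuming localhost:9200 and the column names above; h= selects columns and v adds a header row:

```python
# Sketch of polling just those counters via the cat API. Assumes
# localhost:9200 and the ES 2.x-era column names quoted above.
import requests

cols = ("host,bulk.active,bulk.queue,bulk.rejected,"
        "index.active,index.queue,index.rejected,"
        "search.active,search.queue,search.rejected")
resp = requests.get(f"http://localhost:9200/_cat/thread_pool?v&h={cols}")
print(resp.text)
```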

ppf2 commented 8 years ago

Reopening, as this doesn't look resolved yet. As part of this request, if we do implement a stats reset, it may also be helpful to record the timestamp of the last time a reset was requested, for those who may be expecting stats from the time of the last restart (and not realizing that the stats were manually reset by someone else, for example). Sometimes for troubleshooting you will actually want to see stats from the last uptime; other times you may want to reset stats without restarting the server to reproduce something.
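Purely to illustrate the proposal (no such endpoint exists): a reset call that also stamps when the reset happened might look something like this, with every name below hypothetical.

```python
# Entirely hypothetical sketch of the proposed API: no such endpoint
# exists in Elasticsearch. It only illustrates the idea that a reset
# should also record when the reset was requested.
import requests

# Hypothetical: ask every node to zero its counters.
requests.post("http://localhost:9200/_nodes/stats/_reset")

# Hypothetical: subsequent stats responses carry the last reset time,
# so readers can tell reset-relative totals from uptime-relative ones:
# {"nodes": {"<node_id>": {"stats_reset_timestamp": 1467000000000, ...}}}
```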

clintongormley commented 8 years ago

I'm -1 on resetting stats and adding this unneeded complexity. Monitoring tools are built to handle ever-increasing totals.

thenom commented 8 years ago

I would currently find a version of this fix handy, as I am trying to set up a watcher on the rejected-requests field in the Marvel data. I am finding it hard to write a condition that triggers usefully on a value that doesn't represent just the last x minutes.

So, for example, on our test setup our bulk rejected count is 112. This might not change for days, so how do I know when to trigger the action? It is fine having a running total, but would it be worth having a field for 'count_since_last_sample'?

clintongormley commented 8 years ago

@thenom a better way to fix this is for watcher to be able to store some history, which can be referred to in subsequent executions. This is on the roadmap.

thenom commented 8 years ago

Brilliant, cheers.

Is this an issue on GitHub I can subscribe to?

ebuildy commented 4 years ago

Good idea. This is a real blocker for basic alerting, if we want an alert as soon as a request is rejected by a thread pool.

HOSTED-POWER commented 2 years ago

Any update in 2022? We could also really use it :)