exceptionless / Exceptionless

Exceptionless application
https://exceptionless.com
Apache License 2.0
2.4k stars 517 forks

Smarter throttling #112

Open romanych opened 9 years ago

romanych commented 9 years ago

We are running multiple products and environments under a single Exceptionless account on the large plan.

Sometimes one of the products in one of the environments starts sending a lot of exception logs, and the system throttles all exceptions from all products. That's a problem.

Also, throttling is not really throttling: it stops receiving any errors until the hour completes.

We would highly appreciate if:

Currently we create temporary accounts to troubleshoot such cases, which happen once every few months.

niemyjski commented 9 years ago

@romanych:

@ejsmith feedback?

ejsmith commented 9 years ago

Yeah, I'm really not sure how this would work. I don't want people to have to configure this and I don't want to have a ton of options. So I'm not sure how it would work. Gotta think about it.

niemyjski commented 9 years ago

Yeah, we need to figure out a better way to handle this.

niemyjski commented 9 years ago

@romanych can you provide some feedback to my questions above?

romanych commented 9 years ago

Almost done with a pretty solid suggestion. Will do my best to send it tomorrow.

romanych commented 9 years ago

@niemyjski, @ejsmith let me share my thoughts in this regard.

There is a maximum number of events/month per account, and this number depends on the pricing plan.

Exceptionless customers (like we are) are mostly interested in two things:

We are using Projects to logically group stacks. So I am really interested to:

These thoughts lead me to propose implementing throttling at the stack level, not the project or account level. By throttling I mean setting a rate at which occurrences can be logged.

To be more precise, I would suggest adding the following properties to a stack:

When a user is viewing a stack that is currently being throttled, the following should be shown:

Exception occurrences are throttled now. Approximate number is ApproxOccurrences(timespan)

I didn't go deep into the Exceptionless event logging system, so I am not suggesting an implementation.

That was the easy part. The hard part is how and when to calculate AcceptRate.

Let's define T as the period at which some job (e.g. Throttler) checks all occurrences and decides what should be done with AcceptRate:

According to my observations, you have a very similar job with T = 1 hour which sets AcceptRate to 0 at the account level.

I would suggest to set T = 15 minutes.

Also I would like to define some functions:

It's easier to pseudo-code the algorithm.

For simplification I can suggest that each stack can get up to 25% of monthly limit.

foreach (account) {
  if (OccurrencesReceived(T, account) > MaxThroughput(T, account)) {
    account.isUnderThrottling = true;
    DecreaseAcceptRate(account);
  } else if (account.isUnderThrottling) {
    TryIncreaseAcceptRate(account);
  }
}

function DecreaseAcceptRate(account) {
  foreach (stack in account.stacks) {
    if (OccurrencesReceived(T, stack) > 0.25 * MaxThroughput(T, stack)) {
      stack.throttled = true;
      stack.acceptRate = 0.25 * MaxThroughput(T, stack) / OccurrencesReceived(T, stack);
    }
  }
}

function TryIncreaseAcceptRate(account) {
  foreach (stack in account.stacks.throttled) {
    stack.acceptRate = MIN(1, 0.25 * MaxThroughput(T, stack) / OccurrencesReceived(T, stack));
  }
}
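For readers who want to experiment with the pseudocode above, here is a minimal runnable sketch in Python. It is an interpretation, not Exceptionless code: the `Stack` and `Account` classes, the field names, and the choice to measure each stack against 25% of the account's per-period throughput are all assumptions. It also adds two things the pseudocode leaves open: a guard for stacks that received nothing in the period, and clearing the throttled flags once rates recover.

```python
from dataclasses import dataclass, field

STACK_SHARE = 0.25  # assumed: each stack may use up to 25% of the account throughput per period T


@dataclass
class Stack:
    received: int              # occurrences received during the last period T
    accept_rate: float = 1.0
    throttled: bool = False


@dataclass
class Account:
    max_throughput: int        # allowed occurrences per period T (assumed to come from the plan)
    stacks: list = field(default_factory=list)
    under_throttling: bool = False


def adjust_accept_rates(account: Account) -> None:
    """Run once per period T for a single account."""
    total = sum(s.received for s in account.stacks)
    budget = STACK_SHARE * account.max_throughput
    if total > account.max_throughput:
        account.under_throttling = True
        for s in account.stacks:
            if s.received > budget:
                s.throttled = True
                s.accept_rate = budget / s.received
    elif account.under_throttling:
        for s in account.stacks:
            if s.throttled:
                # A stack that went quiet gets its full rate back (avoids dividing by zero).
                s.accept_rate = min(1.0, budget / s.received) if s.received else 1.0
                s.throttled = s.accept_rate < 1.0
        account.under_throttling = any(s.throttled for s in account.stacks)
```

With a per-period throughput of 1000, a stack that received 900 occurrences would be cut to an accept rate of 250/900, while a stack that received 200 would be left alone.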

There are some obvious problems:

Nevertheless, I hope that my idea makes sense and that you will get some inspiration from it.

niemyjski commented 9 years ago

Thanks for your great feedback, I'll reread it a few times over the next few days and think about it as well. Just some feedback from the top of my head.

You can see these trends today on the error dashboard. There is a list of the most frequent stacks and you can view trending data over time. Granted, there are things we want to do to improve on this.

Currently we are throttling organization wide. I think the finest grain of control we could reasonably go to would be throttling at the project level. We have an HTTP handler that throttles events based on the API key (project level) without even reading the stream or processing anything. If we did it during the event pipeline, that would add a serious amount of overhead to the system (we would be queueing to disk and queueing the event, then deserializing it and processing it via a job, then running it through a pipeline just to see if that specific instance is throttled).

I like your idea that max throughput should be based on 15-minute periods. It would be nice to have a 1.5x rate per 15-minute period, and if a project hits 75% of the rate in 5 minutes we throttle just that project. Thoughts? I think we need to keep the logic simple, because then it's easy to understand and test, and it's going to be lightning fast (not slow things down).

ejsmith commented 9 years ago

So the problem with this is that the majority of the cost associated with accounts is the bandwidth and initial processing of the request. Our current policy is that we will throttle your account if you're going over your plan, but if you are going significantly over your plan and incurring a lot of overage / cost then we would ask you to upgrade your plan. If we make this change then it's really just encouraging people to not worry about the fact that they are sending us a lot of data and costing us money, i.e. if it doesn't affect them, then why should they bother trying to fix it?

Does this make sense?

ejsmith commented 9 years ago

To your original feedback, I think sending an email letting you know when we throttle your account and including data showing you why it got throttled would be really good.

Also, what we are doing currently is throttling per hour in order to give you more of a sampling of events over the course of time. Maybe we need to change this up to use smaller windows so that there aren't long periods of time with no events.

romanych commented 9 years ago

"If we make this change then it's really just encouraging people to not worry about the fact that they are sending us a lot of data and costing us money. ie. if it doesn't affect them, then why should they bother trying to fix it"

@ejsmith valid statement. So if a stack is causing trouble, the penalty must be more serious than throttling at the stack level.

@niemyjski, would it be possible to have project-level throttling but with a rate limit instead of rejecting all events? I think it would perform quite well and allow better control. For instance, if a project has a sampling rate of 0.1 in a period and 100 errors were accepted, it means that ~1000 events were sent. If 1000 is less than the allowed throughput then the accept rate can be set to 1, otherwise it can be adjusted. I think that is better than starting to collect errors at each T and then stopping once the allowed throughput is exceeded.
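The arithmetic in that example can be sketched as follows. The function names are hypothetical, and the rule used when the estimate exceeds the allowed throughput (scale the rate so the expected number of accepted events matches the allowance) is an assumption; the comment only says the rate "can be adjusted".

```python
def estimate_sent(accepted: int, sampling_rate: float) -> float:
    """Estimate how many events were sent, given how many were accepted at a sampling rate."""
    return accepted / sampling_rate


def next_accept_rate(accepted: int, sampling_rate: float, allowed_throughput: float) -> float:
    """Pick the accept rate for the next period from the last period's numbers."""
    if accepted == 0:
        return 1.0  # nothing arrived, so there is no reason to keep sampling
    sent = estimate_sent(accepted, sampling_rate)
    # Assumed adjustment rule: accept everything if the estimate fits, otherwise scale down.
    return min(1.0, allowed_throughput / sent)
```

For the numbers above, 100 accepted events at a rate of 0.1 give an estimate of about 1000 sent; with an allowed throughput of 500 the next rate would be about 0.5, and with 2000 it would return to 1.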

Also it would be a good idea to decrease the amount of traffic to you (you are paying for it as well). Let me share a case we had yesterday.

One of our components processes a queue and aggregates data into Redis. Redis was down for 30 minutes (memory exceeded, rebooted, went into read-only mode and finally recovered). This component is multi-threaded and deployed to 5 instances. Each thread started generating exceptions.

I would love to hear how you would solve this issue and make sure that some of the exceptions are persisted in Exceptionless.

Going further, I could imagine that in this case we could sample errors on the client side before sending. I don't know where you define the stack, client side or server side, so I don't know whether it would be stack-level or project-level throttling. To me it looks like the realtime config feature you recently delivered can help handle this. What are your thoughts?

niemyjski commented 9 years ago

@ejsmith what do you think?

@all, I think that in this case where your redis queue went down maybe our client side code should be more aggressive at removing duplicates. We have all the data in our handler and could return a custom header for how much of your limit is taken up and then get really aggressive? Thoughts on this? But at the same time.. there is a major issue and things are going to get throttled for an hour (it would probably be fixed during this time). I'm also thinking that having a project level throttling may introduce complexities but might be worth a look??

Yes, you could always write a plugin as well to disable error submission via our client configuration.

ejsmith commented 9 years ago

@romanych we currently throttle you for those exact reasons, so that one event like your Redis server going down doesn't eat up all of your events for the month. I think we can improve this though.

We could send a status code back to the client to tell it that the account is currently throttled (which it already does), but also include a sampling rate that it should use. So then the client would take that sampling rate and if it was 0.1 then it would only send 1 out of 10 events that it gets.

We actually used to calculate stack signatures on the client side, but the problem was that it made the clients too complicated to implement and also the calculation of the signature would get out of sync because people didn't update their clients. So now we try to make the clients as dumb as possible so that it will be really easy for people to implement clients in other platforms.

Here is what I am thinking:

  1. During the throttled period, still accept a sampling rate of errors. This sampling rate would have to be dynamically calculated based on velocity and plan limits. I think this is doable because we keep a counter of the current overage count for the throttling window and we know the plan limit.
  2. Send a throttled status back to the client along with the current sampling rate that the client should apply.
  3. Change throttling to be at the project level and use a setting to control what percentage of events a specific project should use. Percentage can add up to be more than 100% across all projects in an org and by default we would set this to 100% for each project so that it wouldn't affect the default behaviour. Expose this percentage in the manage project UI so that users could override this behaviour and keep a specific project from dominating the account. Maybe label it: "Maximum percentage of plan events this project can use"
  4. Have the clients include the number of events that they have discarded due to throttling since their last submission. The idea is to get the client to discard events during throttling without sending them to us at all, to reduce costs. But the problem is that the user doesn't know how many events are being thrown away due to throttling. If the client sends this, then we can increment our counters by the value and we would then have an accurate representation of the true volume of events and how many of them are being thrown away.
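The client side of points 2 and 4 could look roughly like the sketch below. Nothing here is the actual Exceptionless client API: the class and method names are hypothetical, and how the sampling rate and the discarded count travel over the wire is left out.

```python
import random


class SamplingClient:
    """Applies a server-provided sampling rate and counts what it throws away."""

    def __init__(self, rng=None):
        self.sampling_rate = 1.0   # send everything until the server says otherwise
        self.discarded = 0
        self._rng = rng or random.Random()

    def on_response(self, sampling_rate: float) -> None:
        # Called with the rate from a throttled response; clamp it to [0, 1].
        self.sampling_rate = max(0.0, min(1.0, sampling_rate))

    def should_send(self) -> bool:
        # random() is in [0, 1), so a rate of 1.0 always sends and 0.0 never does.
        if self._rng.random() < self.sampling_rate:
            return True
        self.discarded += 1
        return False

    def take_discarded(self) -> int:
        """Return the discarded count to report with the next submission, then reset it."""
        count, self.discarded = self.discarded, 0
        return count
```

With a rate of 0.1 the client would send roughly 1 out of 10 events, and the server could add the value from `take_discarded()` to its counters to keep the true volume visible.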

I think this would help to still get a sampling of events even while the account is throttled, but I still think it would be extremely likely that 1 type of event would dominate the account. It's hard to balance keeping the clients simple with the need to get a good sampling of the events. Maybe we could do a very simple version of stacking on the client by just hashing the event type and maybe the error types.

Thoughts?

romanych commented 9 years ago

@ejsmith I like your ideas

niemyjski commented 9 years ago

@ejsmith I heard from an end user today, and each time they've had an issue they found they were throttled or had reached their limit when they went and looked in Exceptionless. They understand the limit, but we need to work on sending email notifications when you are throttled.

romanych commented 8 years ago

Do you have any plans to work on it? We were throttled yesterday and were unable to identify why. We were forced to upgrade without any spike showing in the UI.

niemyjski commented 8 years ago

@romanych this is something we need to work on, and we could also use some help implementing it. We've spent the last few months working on performance and stability, and we think we are done working in that area as of this week. Would you mind sending me an in-app message? I'll take a look into your account with you.

niemyjski commented 8 years ago

We may also make this smarter by throttling by product version: https://github.com/exceptionless/Exceptionless/issues/156

niemyjski commented 8 years ago

@stephenwelsh commented on Nov 17, 2015:

In our scenario we have 1000+ clients that install our product (typically a desktop app), and although we have an in-place upgrade capability, it's optional and user driven. Therefore after a while we have a situation where a number of old installations, with issues that have since been resolved, submit irrelevant exception reports. Essentially it is out of our control to upgrade the older instances, however given that newer releases have resolved the issues, the submissions from older releases become less relevant.

Therefore we think it would be appropriate to control the clients' submissions with a project-level setting that enables/disables the client based on its version. For example:

Set the Project Configuration Setting: EnableVersion=4.2

In our application check the ‘EnableVersion’ setting from the Exceptionless client once registered, if the current application version is older (i.e. 4.1) then disable submissions. If the current version is the same or newer then submit.
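The check described here could be sketched like this. It assumes purely numeric dotted versions such as 4.2 and does not show how the `EnableVersion` setting is read from the client; both function names are hypothetical.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted numeric version such as "4.2" into a comparable tuple (4, 2)."""
    return tuple(int(part) for part in text.strip().split("."))


def submissions_enabled(app_version: str, enable_version) -> bool:
    """Submit only when the running version is the same as or newer than EnableVersion."""
    if not enable_version:
        return True  # setting not present: keep the default behaviour and submit
    current, minimum = parse_version(app_version), parse_version(enable_version)
    # Pad with zeros so that "4" and "4.0" compare as equal.
    width = max(len(current), len(minimum))
    current += (0,) * (width - len(current))
    minimum += (0,) * (width - len(minimum))
    return current >= minimum
```

With `EnableVersion=4.2`, version 4.1 would stop submitting, while 4.2 and 4.10 would continue.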

In our situation the version number is enough for us to control which submissions are more/less relevant, however I would imagine there may be other criteria that may be of value to be able to leverage for enabling/disabling of submissions

niemyjski commented 8 years ago

I think we could implement the version:latest functionality and send the latest version down to the client in a header. Then we could get really smart and even allow you to turn off old clients. I know in one of our products we can get hammered with older errors that are no longer relevant.

niemyjski commented 8 years ago

We just talked about this some more and will be updating this issue with specifics but we want to do some kind of sampling per project by sending down a header to the clients.

ejsmith commented 8 years ago

Current thoughts are this.

  1. Add support for an EventRate header that gets returned from the event post API. This would tell the client to limit its events to X per minute. The client would then use sampling to get a dispersed set of events while trying to keep its rate to what it has been told.
  2. Project will have a setting to control what percent of events the project should take up of the org's plan limit.
  3. Server would take the various knowledge that it has and give intelligent event rates back to the clients, using the overall org limit and project percentage as well as knowing how many clients are reporting events for this project. Maybe even know that 1 of the clients is sending the vast majority of the events and limit that one differently than the other clients. Knowing which client is which is probably an issue since clients could be behind a proxy and all coming from the same IP.
  4. Allow the client to send a header value containing a count of the number of events it has thrown away. This number will be incremented on the project so that users can see how many events are being thrown out.
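One way a client might honour such a rate while still spreading its sends across the minute is sketched below. This is only an illustration: the `EventRateSampler` class is hypothetical, and the stride rule (send every Nth event, with N taken from the previous minute's volume) is an assumed sampling strategy, not something decided in this thread. The caller supplies the time in seconds so the logic stays easy to test.

```python
import math


class EventRateSampler:
    """Keeps sends near a per-minute rate by sending every Nth event."""

    def __init__(self, events_per_minute: int):
        if events_per_minute <= 0:
            raise ValueError("events_per_minute must be positive")
        self.events_per_minute = events_per_minute
        self.discarded = 0          # count to report back to the server (point 4)
        self._window_start = None
        self._seen = 0              # events observed in the current minute
        self._sent = 0              # events sent in the current minute
        self._stride = 1            # send every Nth event

    def allow(self, now: float) -> bool:
        if self._window_start is None:
            self._window_start = now
        elif now - self._window_start >= 60:
            # New minute: derive the stride from the volume seen in the previous one.
            self._stride = max(1, math.ceil(self._seen / self.events_per_minute))
            self._window_start, self._seen, self._sent = now, 0, 0
        self._seen += 1
        if self._seen % self._stride == 0 and self._sent < self.events_per_minute:
            self._sent += 1
            return True
        self.discarded += 1
        return False
```

In the first minute of a burst this falls back to a plain cap, since there is no history to derive a stride from; from the second minute on, the accepted events are spaced out instead of all coming from the start of the window.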

stephenwelsh commented 8 years ago

That looks good and general for limiting over-rate submissions, some thoughts:

niemyjski commented 8 years ago

We also have a pull request which will help out quite a bit on client side deduping: https://github.com/exceptionless/Exceptionless.Net/pull/71

ahmet8282 commented 8 years ago

I also would like this. thank you.

niemyjski commented 8 years ago

Merged from #212:

Current throttling calculation: (total monthly events / hours in month) * 5

New throttling calculation: (Events left in month / hours left in month) * 5

This would keep us from throttling accounts at the end of the month that haven't used a lot of their plan up. People feel cheated when we throttle and they still have a lot of events left in the month.
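To make the difference concrete, both formulas can be written as small functions. The function names are hypothetical, the result is assumed to be the per-hour threshold (since throttling is described as hourly elsewhere in this thread), and flooring the remaining hours at 1 is an added guard rather than part of the proposal.

```python
def current_hourly_limit(monthly_events: int, hours_in_month: float) -> float:
    """Current rule: (total monthly events / hours in month) * 5."""
    return monthly_events / hours_in_month * 5


def proposed_hourly_limit(events_left: int, hours_left: float) -> float:
    """Proposed rule: (events left in month / hours left in month) * 5."""
    # Assumed guard: treat the final partial hour as a full hour to avoid dividing by zero.
    return events_left / max(hours_left, 1.0) * 5
```

For a plan of 72,000 events in a 720-hour month, the current rule gives 500 events per hour at any point in the month. Under the proposed rule, an account with 36,000 events left and 72 hours to go would get 2,500 events per hour.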

We can potentially calculate this rate once a day for each org depending on how expensive it is to calculate.