Open romanych opened 9 years ago
@romanych:
@ejsmith feedback?
Yeah, I'm really not sure how this would work. I don't want people to have to configure this and I don't want to have a ton of options. So I'm not sure how it would work. Gotta think about it.
Yeah, we need to figure out a better way to handle this.
@romanych can you provide some feedback to my questions above.
Almost done with a pretty solid suggestion. Will do my best to send it tomorrow.
@niemyjski, @ejsmith let me share my thoughts in this regard.
There is a maximum number of events per month per account, and this number depends on the pricing plan.
Exceptionless customers (like we are) are mostly interested in two things:
We are using Projects to logically group stacks. So I am really interested to:
These thoughts lead me to propose implementing throttling at the stack level, not the project or account level. By throttling I mean setting a rate at which occurrences can be logged.
To be more precise, I would suggest adding the following properties to a stack:
When a user is viewing a stack that is currently being throttled, the following should be shown:
Exception occurrences are throttled now. Approximate number is ApproxOccurrences(timespan)
I didn't go deep into the Exceptionless event logging system, so I'm not suggesting an implementation.
That was the easy part; the hardest part is how and when to calculate AcceptRate.
Let's define T as the period at which some job (e.g. Throttler) checks all occurrences and decides what should be done with AcceptRate.
According to my observations, you have a very similar job with T = 1 hour which sets AcceptRate to 0 at the account level.
I would suggest setting T = 15 minutes.
Also I would like to define some functions:
MaxThroutput(minutes, account) = minutes * account.MONTH_EVENTS_LIMIT / (30 * 24 * 60) - maximum number of events that can be accepted during the given minutes so that the plan limit will not be exceeded
OccurrencesReceived(minutes, scope) - how many events were submitted during the given minutes. Scope can be account, project or stack
OccurrencesSaved(minutes, scope) - how many events were actually saved.
It's easier to pseudo-code the algorithm.
For simplicity, I suggest that each stack can use up to 25% of the monthly limit.
foreach (account) {
    if (OccurrencesReceived(T, account) > MaxThroutput(T, account)) {
        account.isUnderThrottling = true;
        DecreaseAcceptRate(account);
    } else if (account.isUnderThrottling) {
        TryIncreaseAcceptRate(account);
    }
}

// Cap every stack that exceeds its 25% share of the account's budget for T.
function DecreaseAcceptRate(account) {
    foreach (stack in account.stacks) {
        if (OccurrencesReceived(T, stack) > 0.25 * MaxThroutput(T, account)) {
            stack.throttled = true;
            stack.acceptRate = 0.25 * MaxThroutput(T, account) / OccurrencesReceived(T, stack);
        }
    }
}

// Raise the rate of throttled stacks again, never above 1 (accept everything).
function TryIncreaseAcceptRate(account) {
    foreach (stack in account.stacks where stack.throttled) {
        stack.acceptRate = MIN(1, 0.25 * MaxThroutput(T, account) / OccurrencesReceived(T, stack));
    }
}
There are some obvious problems:
Nevertheless, I hope that my idea makes sense and that you will get some inspiration from it.
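For anyone who wants to play with the numbers, here is a minimal Python sketch of the pseudo-code above. It is only an illustration, not Exceptionless code: the `Stack` and `Account` classes, the `run_throttler` name and the guard for a stack that received zero events are assumptions added to make it runnable.

```python
from dataclasses import dataclass, field

MINUTES_PER_MONTH = 30 * 24 * 60  # the 30-day month approximation used above
STACK_SHARE = 0.25                # each stack may use up to 25% of the budget


@dataclass
class Stack:
    received: int              # OccurrencesReceived(T, stack) for the last period
    accept_rate: float = 1.0
    throttled: bool = False


@dataclass
class Account:
    month_events_limit: int
    stacks: list = field(default_factory=list)
    is_under_throttling: bool = False


def max_throughput(minutes: float, account: Account) -> float:
    """Events that can be accepted in `minutes` without exceeding the plan."""
    return minutes * account.month_events_limit / MINUTES_PER_MONTH


def run_throttler(account: Account, period_minutes: float = 15) -> None:
    """One pass of the (hypothetical) Throttler job for a single account."""
    budget = max_throughput(period_minutes, account)
    stack_budget = STACK_SHARE * budget
    received = sum(stack.received for stack in account.stacks)

    if received > budget:
        account.is_under_throttling = True
        for stack in account.stacks:
            if stack.received > stack_budget:
                stack.throttled = True
                stack.accept_rate = stack_budget / stack.received
    elif account.is_under_throttling:
        for stack in account.stacks:
            if stack.throttled:
                # Assumption: a stack that went quiet is fully re-opened.
                if stack.received == 0:
                    stack.accept_rate = 1.0
                else:
                    stack.accept_rate = min(1.0, stack_budget / stack.received)
```

With a plan of 432,000 events/month the 15-minute budget is 150 events, so a stack that submits 300 events in one period gets an accept rate of 37.5 / 300 = 0.125, while a stack that submits 10 events is left alone.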
Thanks for your great feedback, I'll reread it a few times over the next few days and think about it as well. Just some feedback off the top of my head.
You can see these trends today on the error dashboard. There is a list of the most frequent stacks and you can view trending data over time. Granted, there are things we want to do to improve on this.
Currently we are throttling organization-wide. I think the finest grain of control we could reasonably support would be throttling at the project level. We have an HTTP handler that throttles events based on the API key (project level) without even reading the stream or processing anything. If we did it during the event pipeline, that would add a serious amount of overhead to the system (we would be queueing the event to disk, then deserializing it and processing it via a job, then running it through a pipeline just to see whether that specific instance is throttled).
I like your idea that max throughput should be based on 15-minute periods. It would be nice to have a 1.5x rate per 15-minute period, and if a project hits 75% of the rate in 5 minutes we throttle just that project. Thoughts? I think we need to keep the logic simple, because then it's easy to understand and test and it's going to be lightning fast (it won't slow things down).
So the problem with this is that the majority of the cost associated with accounts is the bandwidth and initial processing of the request. Our current policy is that we will throttle your account if you're going over your plan, but if you are going significantly over your plan and incurring a lot of overage / cost then we would ask you to upgrade your plan. If we make this change then it's really just encouraging people to not worry about the fact that they are sending us a lot of data and costing us money. i.e. if it doesn't affect them, then why should they bother trying to fix it?
Does this make sense?
To your original feedback, I think sending an email letting you know when we throttle your account and including data showing you why it got throttled would be really good.
Also, what we are doing currently is throttling per hour in order to give you more of a sampling of events over the course of time. Maybe we need to change this to use smaller windows so that there aren't long periods of time with no events.
If we make this change then it's really just encouraging people to not worry about the fact that they are sending us a lot of data and costing us money. ie. if it doesn't affect them, then why should they bother trying to fix it
@ejsmith valid statement. So if a stack is causing trouble, the penalty must be more serious than throttling at the stack level.
@niemyjski, would it be possible to have project-level throttling, but with a rate limit instead of rejecting all events? I think it would perform quite well and allow better control. For instance, if a project has a sampling rate of 0.1 in a period and 100 errors were accepted, it means that ~1000 events were sent. If 1000 is less than the allowed throughput, the accept rate can be set to 1; otherwise it can be adjusted. I think that is better than starting to collect errors at the beginning of each T and then stopping once the allowed throughput is exceeded.
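A rough sketch of that arithmetic in Python, in case it helps the discussion. The function names are made up for illustration, and choosing `allowed / estimated` as the adjusted rate is an assumption; the comment above only says the rate "can be adjusted".

```python
def estimate_submitted(accepted: int, accept_rate: float) -> float:
    """Estimate how many events were sent, given how many were accepted."""
    if accept_rate <= 0:
        raise ValueError("cannot estimate submissions from an accept rate of 0")
    return accepted / accept_rate


def next_accept_rate(accepted: int, accept_rate: float,
                     allowed_throughput: float) -> float:
    """Pick the accept rate for the next period T."""
    submitted = estimate_submitted(accepted, accept_rate)
    if submitted <= allowed_throughput:
        return 1.0  # traffic fits in the budget again, accept everything
    # Assumption: scale the rate so the expected accepted count hits the budget.
    return allowed_throughput / submitted
```

With the numbers from the example, 100 accepted events at a rate of 0.1 give an estimate of about 1000 submitted events. If the allowed throughput is 1500, the rate goes back to 1; if it is 250, the rate becomes about 0.25.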
It would also be a good idea to decrease the amount of traffic sent to you (you are paying for it as well). Let me share a case we had yesterday.
One of our components processes a queue and aggregates data into Redis. Redis was down for 30 minutes (memory exceeded, rebooted, went into read-only mode and finally recovered). This component is multi-threaded and deployed to 5 instances. Each thread started generating exceptions.
I would love to hear how you would solve the issue and make sure that at least some of the exceptions are persisted in Exceptionless.
Going further, I could imagine that in this case we could sample errors on the client side before sending. I don't know where you define the stack, client side or server side, so I don't know whether it would be stack-level or project-level throttling. To me it looks like the realtime config feature you recently delivered can help handle this. What are your thoughts?
@ejsmith what do you think?
@all, I think that in this case, where your Redis server went down, maybe our client-side code should be more aggressive at removing duplicates. We have all the data in our handler and could return a custom header saying how much of your limit is used up, and then get really aggressive. Thoughts on this? But at the same time, there is a major issue and things are going to get throttled for an hour (it would probably be fixed during this time). I'm also thinking that project-level throttling may introduce complexities, but it might be worth a look.
Yes, you could always write a plugin as well to disable error submission via our client configuration.
@romanych we currently throttle you for those exact reasons, so that 1 event like your Redis server going down doesn't eat up all of your events for the month. I think we can improve this though.
We could send a status code back to the client to tell it that the account is currently throttled (which it already does), but also include a sampling rate that it should use. So then the client would take that sampling rate and if it was 0.1 then it would only send 1 out of 10 events that it gets.
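To make the idea concrete, here is a small Python sketch of what the client side could look like. Everything in it is hypothetical: the `SamplingClient` class is not part of any Exceptionless client, and how the sampling rate would actually be delivered (header, response body) is not decided in this thread. It uses a deterministic counter rather than random sampling so that a rate of 0.1 sends exactly 1 out of every 10 events.

```python
import math


class SamplingClient:
    """Hypothetical client-side sampler; not the Exceptionless client API."""

    def __init__(self) -> None:
        self.sampling_rate = 1.0  # send everything until told otherwise
        self._seen = 0

    def on_server_response(self, sampling_rate: float) -> None:
        # Assumption: the server returns a rate in [0, 1] together with
        # the "throttled" status; the counter restarts with every new rate.
        self.sampling_rate = max(0.0, min(1.0, sampling_rate))
        self._seen = 0

    def should_send(self) -> bool:
        """Return True for roughly `sampling_rate` of all events, evenly spaced."""
        self._seen += 1
        before = math.floor((self._seen - 1) * self.sampling_rate)
        after = math.floor(self._seen * self.sampling_rate)
        return after > before
```

A client would call `should_send()` once per event and drop the event when it returns False. The client stays dumb: it does not need to know anything about stacks, only a single number.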
We actually used to calculate stack signatures on the client side, but the problem was that it made the clients too complicated to implement and also the calculation of the signature would get out of sync because people didn't update their clients. So now we try to make the clients as dumb as possible so that it will be really easy for people to implement clients in other platforms.
Here is what I am thinking:
I think this would help to still get a sampling of events even while the account is throttled, but I still think it would be extremely likely that 1 type of event would dominate the account. It's hard to balance keeping the clients simple with the need to get a good sampling of the events. Maybe we could do a very simple version of stacking on the client by just hashing the event type and maybe the error types.
Thoughts?
@ejsmith I like your ideas
@ejsmith I heard from an end user today, and each time they've had an issue they were throttled or had reached their limit when they went and looked in Exceptionless. They understand the limit, but we need to work on getting email notifications when you are throttled.
Do you have any plans to work on it? We were throttled yesterday and we were unable to identify why. We were forced to upgrade without any spike showing in the UI.
@romanych this is something we need to work on, we could also use some help implementing it. We've spent the last few months working on performance and stability and we think we are done working in that area now as of this week. Would you mind sending me an in app message and I'll take a look into your account with you.
We may also make this smarter by throttling by product version: https://github.com/exceptionless/Exceptionless/issues/156
@stephenwelsh commented on Nov 17, 2015 In our scenario we have 1000+ clients that install our product (typically a desktop app) and although we have an in-place upgrade capability, it's optional and user driven. Therefore after a while we have a situation with a number of old installations with issues that have been resolved submitting irrelevant exception reports. Essentially it is out of our control to upgrade the older instances, however given the newer releases have resolved issues the submissions from older releases become less relevant.
Therefore we think it would be appropriate to control the client's submissions with a project-level setting that enables/disables the client based on its version. For example:
Set the Project Configuration Setting: EnableVersion=4.2
In our application, check the 'EnableVersion' setting from the Exceptionless client once registered; if the current application version is older (i.e. 4.1) then disable submissions. If the current version is the same or newer then submit.
In our situation the version number is enough for us to control which submissions are more/less relevant, however I would imagine there may be other criteria that may be of value to be able to leverage for enabling/disabling of submissions
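As an illustration of the check described above, here is a minimal Python sketch. The `settings` dictionary stands in for whatever the client exposes as project configuration; only the `EnableVersion` key comes from the example, and the helper names and the "no setting means submit" default are assumptions.

```python
def parse_version(text: str) -> tuple:
    """Turn '4.2' into (4, 2) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.strip().split("."))


def submissions_enabled(app_version: str, settings: dict) -> bool:
    """Return True if this application version should submit events."""
    minimum = settings.get("EnableVersion")
    if not minimum:
        return True  # assumption: no setting means nothing is disabled

    current, required = parse_version(app_version), parse_version(minimum)
    # Pad with zeros so that '4.2' and '4.2.0' are treated as equal.
    width = max(len(current), len(required))
    current += (0,) * (width - len(current))
    required += (0,) * (width - len(required))
    return current >= required
```

With `EnableVersion=4.2`, version 4.1 is disabled while 4.2 and 4.10 keep submitting. Comparing the parts as integers matters here, because a plain string comparison would treat "4.10" as older than "4.2".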
I think we could implement the version:latest functionality and send the latest version down to the client in a header. Then we could get really smart and even allow you to turn off old clients. I know in one of our products we can get hammered with older errors that are no longer relevant.
We just talked about this some more and will be updating this issue with specifics but we want to do some kind of sampling per project by sending down a header to the clients.
Current thoughts are this.
That looks good and general for limiting over-rate submissions, some thoughts:
We also have a pull request which will help out quite a bit on client side deduping: https://github.com/exceptionless/Exceptionless.Net/pull/71
I also would like this. thank you.
Merged from #212:
Current throttling calculation: (total monthly events / hours in month) * 5
New throttling calculation: (events left in month / hours left in month) * 5
This would keep us from throttling accounts at the end of the month that haven't used a lot of their plan up. People feel cheated when we throttle and they still have a lot of events left in the month.
We can potentially calculate this rate once a day for each org depending on how expensive it is to calculate.
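To compare the two formulas side by side, here is a small Python sketch. It is only an illustration: the function names are made up, "hours in month" is taken as the calendar month of the given timestamp, and counting the current hour as still remaining is an assumption.

```python
import calendar
from datetime import datetime


def hours_in_month(now: datetime) -> int:
    return calendar.monthrange(now.year, now.month)[1] * 24


def current_hourly_limit(monthly_events: int, now: datetime) -> float:
    """(total monthly events / hours in month) * 5"""
    return monthly_events / hours_in_month(now) * 5


def proposed_hourly_limit(monthly_events: int, events_used: int,
                          now: datetime) -> float:
    """(events left in month / hours left in month) * 5"""
    hours_elapsed = (now.day - 1) * 24 + now.hour
    hours_left = hours_in_month(now) - hours_elapsed  # always at least 1
    events_left = max(0, monthly_events - events_used)
    return events_left / hours_left * 5
```

For a plan with 72,000 events in a 30-day month, the current formula always gives 500 events per hour. On the 25th, with only 12,000 events used, the new formula gives 60,000 / 144 * 5, roughly 2,083 events per hour, so an account that has barely used its plan is far less likely to be throttled late in the month.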
We are running multiple products and environments under a single Exceptionless account on a large plan.
Sometimes one of the products in one of the environments starts sending a lot of exception logs to the system, and the system throttles all exceptions from all products. That's a problem.
Also, the throttling is not really throttling: it stops accepting any errors until the hour completes.
We would highly appreciate it if:
Currently we create temporary accounts for troubleshooting such cases, which happen once every few months.