amplitude / Amplitude-Node

Server-side Node.js SDK for Amplitude
MIT License
67 stars 20 forks source link

Default retry behavior causes rate limited events to be lost and swallows exceptions #86

Open tommedema opened 3 years ago

tommedema commented 3 years ago

This one is hard to explain so I just made a screen recording going over the latest source code:

https://app.usebubbles.com/bpPaAzXS3vWMpZh8tZU3N9/amplitude-retry-behavior-issues

Please let me know your thoughts. We're seeing 429s on our amplitude dashboard and we're looking to catch these 429 so that we can push them to a retry queue for future consumption.

Also, in your readme you say we can create our own retry class with a sendEventsWithRetry method, but then how do we access the underlying transport method?

This would require something like:

this._transport = this._options.transportClass ?? setupDefaultTransport(this._options);

But your setupDefaultTransport function is not exported as part of the npm module as far as we can see

Similarly, to create a retry handler we'll need to implement RetryClass, but RetryClass is not exported:

export class CustomRetryHandler implements RetryClass { // Cannot find name 'RetryClass'

Additionally, it seems like the retry logic is currently ignoring the most common kind of throttling. An example of a response payload for a throttled device is:

{
  "status":"rate_limit",
  "statusCode":429,
  "body":{
    "error":"Too many requests for some devices and users",
    "epsThreshold":15,
    "throttledDevices":{
      "0ddbe854-d786-4aef-bc9e-b8b7c43e5200":66
    },
    "throttledUsers":{
      "604ac1eb9b1b7a00699d006e":66
    },
    "exceededDailyQuotaDevices":{},
    "exceededDailyQuotaUsers":{},
    "throttledEvents":[0,1,2,3,4..]
  }
}

So in addition to looking at exceededDailyQuotaDevices and exceededDailyQuotaUsers, the props throttledDevices and throttledUsers should be considered. We are now going to write custom logic to handle this case in our own retry class, but it's not ideal. Unfortunately, the "plug and play" promise of Amplitude is far from a reality at the moment

tommedema commented 3 years ago

The retry class as a whole is very questionable / naive. For example:

Screen Shot 2021-03-13 at 7 25 24 PM

"Just toss it"? No, this is a case that should never occur-- it is clearly worth raising an exception for. It's a little scary to use something like this in production since this means that companies are going to drop production events without knowing it. There's no way to catch these cases because you're simply swallowing them.

At the current stage I'd advocate putting a big beta flag on this SDK and warn against using it in production

kelvin-lu commented 3 years ago

Thanks @tommedema for the feedback. Sorry about the friction you've experienced using the SDK, we'll be taking it to make improvements.

As for RetryClass, it's exported as part of @amplitude/types but it definitely can be exported as well, as well as the default transport. We can create a BaseRetry class that has utilities to make this easier as well and you can extend it.

No, this is a case that should never occur-- it is clearly worth raising an exception for.

The goals of the lines around error 400 for events send against the HTTP v2 API is to find and remove the events that caused the API to raise an invalid payload response, like this utility that's used to prune out events. This case is for an invalid payload of length 1 - in which case it would not make sense to retry afterwards as it means this event would not ever be successfully uploaded.

Is await client.logEvent() and await client.flush() not returning Response objects with error codes on them? We could also throw them as exceptions, but the errors are not entirely gone since the same response being passed to onEventsError is also returned in sendEventsWithRetry

kelvin-lu commented 3 years ago

Additionally, it seems like the retry logic is currently ignoring the most common kind of throttling.

Right, we don't read from throttledUsers or throttledDevices - what were your plans around using these two? If the user/device is not in the exceededDaily... fields, we retry the events on loop after seconds pause, since the throttle window is a second.

tommedema commented 3 years ago

The goals of the lines around error 400 for events send against the HTTP v2 API is to find and remove the events that caused the API to raise an invalid payload response, like this utility that's used to prune out events. This case is for an invalid payload of length 1 - in which case it would not make sense to retry afterwards as it means this event would not ever be successfully uploaded.

This doesn't address my point. You're talking about an invalid state, yet you're not throwing any exceptions. This requires basic error handling rather than simply swallowing these errors. I don't disagree that you can't retry them, but you shouldn't swallow the errors either. This should throw an exception so the consumer can handle them.

tommedema commented 3 years ago

Is await client.logEvent() and await client.flush() not returning Response objects with error codes on them? We could also throw them as exceptions, but the errors are not entirely gone since the same response being passed to onEventsError is also returned in sendEventsWithRetry

Perhaps I'm not explaining myself correctly. The fact that you return a response doesn't help if that response is not a good representation of what actually happened. For example if I log an event without a userId, this is clearly an application error and I should know about this, yet you simply remove this event in _pruneEvents and then return a success response.

I.e. you say it's successful yet clearly it was not because you just lost an event.

tommedema commented 3 years ago

Right, we don't read from throttledUsers or throttledDevices - what were your plans around using these two? If the user/device is not in the exceededDaily... fields, we retry the events on loop after seconds pause, since the throttle window is a second.

In the docs it says you should retry after 30 seconds in these cases:

You should pause sending events for that user / device for a period of 30 seconds before retrying and continue retrying until you no longer receive a 429 response.

kelvin-lu commented 3 years ago

Perhaps I'm not explaining myself correctly. The fact that you return a response doesn't help if that response is not a good representation of what actually happened. For example if I log an event without a userId, this is clearly an application error and I should know about this, yet you simply remove this event in _pruneEvents and then return a success response.

The client itself should return a SKIPPED response in newer versions, but I agree that this is dangerous without that layer. We will take this out ASAP from _pruneEvents because it can make it very confusing to debug

tommedema commented 3 years ago

Perhaps I'm not explaining myself correctly. The fact that you return a response doesn't help if that response is not a good representation of what actually happened. For example if I log an event without a userId, this is clearly an application error and I should know about this, yet you simply remove this event in _pruneEvents and then return a success response.

The client itself should return a SKIPPED response in newer versions, but I agree that this is dangerous without that layer. We will take this out ASAP from _pruneEvents because it can make it very confusing to debug

Sure, do note that you're doing this in many other places too. For example here, here, here, here, here, here, etc.

We ended up replacing the entire class with something that doesn't retry but at least bubbles up what is going wrong:

https://gist.github.com/tommedema/04de4004ee5b0f17fcb3fd29b2ad92b1

This then allows the consumer to actually retry or throw an application error:

https://gist.github.com/tommedema/0f80ae4c7334e6531c56c598643f0eb7