Open tommedema opened 3 years ago
The retry class as a whole is very questionable / naive. For example:
"Just toss it"? No, this is a case that should never occur-- it is clearly worth raising an exception for. It's a little scary to use something like this in production since this means that companies are going to drop production events without knowing it. There's no way to catch these cases because you're simply swallowing them.
At the current stage I'd advocate putting a big beta flag on this SDK and warn against using it in production
Thanks @tommedema for the feedback. Sorry about the friction you've experienced using the SDK, we'll be taking it to make improvements.
As for RetryClass
, it's exported as part of @amplitude/types
but it definitely can be exported as well, as well as the default transport. We can create a BaseRetry
class that has utilities to make this easier as well and you can extend it.
No, this is a case that should never occur-- it is clearly worth raising an exception for.
The goals of the lines around error 400 for events send against the HTTP v2 API is to find and remove the events that caused the API to raise an invalid payload response, like this utility that's used to prune out events. This case is for an invalid payload of length 1 - in which case it would not make sense to retry afterwards as it means this event would not ever be successfully uploaded.
Is await client.logEvent()
and await client.flush()
not returning Response
objects with error codes on them? We could also throw them as exceptions, but the errors are not entirely gone since the same response
being passed to onEventsError
is also returned in sendEventsWithRetry
Additionally, it seems like the retry logic is currently ignoring the most common kind of throttling.
Right, we don't read from throttledUsers
or throttledDevices
- what were your plans around using these two? If the user/device is not in the exceededDaily...
fields, we retry the events on loop after seconds pause, since the throttle window is a second.
The goals of the lines around error 400 for events send against the HTTP v2 API is to find and remove the events that caused the API to raise an invalid payload response, like this utility that's used to prune out events. This case is for an invalid payload of length 1 - in which case it would not make sense to retry afterwards as it means this event would not ever be successfully uploaded.
This doesn't address my point. You're talking about an invalid state, yet you're not throwing any exceptions. This requires basic error handling rather than simply swallowing these errors. I don't disagree that you can't retry them, but you shouldn't swallow the errors either. This should throw an exception so the consumer can handle them.
Is
await client.logEvent()
andawait client.flush()
not returningResponse
objects with error codes on them? We could also throw them as exceptions, but the errors are not entirely gone since the sameresponse
being passed toonEventsError
is also returned insendEventsWithRetry
Perhaps I'm not explaining myself correctly. The fact that you return a response doesn't help if that response is not a good representation of what actually happened. For example if I log an event without a userId, this is clearly an application error and I should know about this, yet you simply remove this event in _pruneEvents
and then return a success response.
I.e. you say it's successful yet clearly it was not because you just lost an event.
Right, we don't read from
throttledUsers
orthrottledDevices
- what were your plans around using these two? If the user/device is not in theexceededDaily...
fields, we retry the events on loop after seconds pause, since the throttle window is a second.
In the docs it says you should retry after 30 seconds in these cases:
You should pause sending events for that user / device for a period of 30 seconds before retrying and continue retrying until you no longer receive a 429 response.
Perhaps I'm not explaining myself correctly. The fact that you return a response doesn't help if that response is not a good representation of what actually happened. For example if I log an event without a userId, this is clearly an application error and I should know about this, yet you simply remove this event in _pruneEvents and then return a success response.
The client itself should return a SKIPPED
response in newer versions, but I agree that this is dangerous without that layer. We will take this out ASAP from _pruneEvents
because it can make it very confusing to debug
Perhaps I'm not explaining myself correctly. The fact that you return a response doesn't help if that response is not a good representation of what actually happened. For example if I log an event without a userId, this is clearly an application error and I should know about this, yet you simply remove this event in _pruneEvents and then return a success response.
The client itself should return a
SKIPPED
response in newer versions, but I agree that this is dangerous without that layer. We will take this out ASAP from_pruneEvents
because it can make it very confusing to debug
Sure, do note that you're doing this in many other places too. For example here, here, here, here, here, here, etc.
We ended up replacing the entire class with something that doesn't retry but at least bubbles up what is going wrong:
https://gist.github.com/tommedema/04de4004ee5b0f17fcb3fd29b2ad92b1
This then allows the consumer to actually retry or throw an application error:
https://gist.github.com/tommedema/0f80ae4c7334e6531c56c598643f0eb7
This one is hard to explain so I just made a screen recording going over the latest source code:
https://app.usebubbles.com/bpPaAzXS3vWMpZh8tZU3N9/amplitude-retry-behavior-issues
Please let me know your thoughts. We're seeing 429s on our amplitude dashboard and we're looking to catch these 429 so that we can push them to a retry queue for future consumption.
Also, in your readme you say we can create our own retry class with a sendEventsWithRetry method, but then how do we access the underlying transport method?
This would require something like:
But your
setupDefaultTransport
function is not exported as part of the npm module as far as we can seeSimilarly, to create a retry handler we'll need to implement
RetryClass
, butRetryClass
is not exported:Additionally, it seems like the retry logic is currently ignoring the most common kind of throttling. An example of a response payload for a throttled device is:
So in addition to looking at exceededDailyQuotaDevices and exceededDailyQuotaUsers, the props throttledDevices and throttledUsers should be considered. We are now going to write custom logic to handle this case in our own retry class, but it's not ideal. Unfortunately, the "plug and play" promise of Amplitude is far from a reality at the moment