amplitude / Amplitude-Node

Server-side Node.js SDK for Amplitude

Default retry behavior causes infinite event loop on error #87

Open tommedema opened 3 years ago

tommedema commented 3 years ago

There's a flaw in the default retry behavior on line 59: whenever an error occurs, a new call stack is created and we immediately return.

Instead of returning, we should await this._onEventsError here and ensure that _onEventsError returns a promise.

Currently, awaiting amplitude.flush() never lets the Node.js event loop clear in the case of a rate-limit error.

Here's an example integration test without a rate limit, using Jest:

[Screenshot: Jest run of the integration test without a rate limit]

And here's one with a rate limit; notice how Jest never exits:

[Screenshot: Jest run of the integration test with a rate limit; the process never exits]
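
For reference, here's a minimal sketch of what such an integration test could look like, assuming the standard @amplitude/node init / logEvent / flush API; the test body is illustrative, not the test from the screenshots above:

```typescript
import * as Amplitude from '@amplitude/node';

test('flush resolves even when the server rate-limits', async () => {
  const client = Amplitude.init('API_KEY');
  client.logEvent({ event_type: 'Test Event', user_id: 'user-1' });

  // With the current retry behavior, a 429 response keeps retry timers running
  // on a detached call stack, so this await never lets the event loop drain
  // and Jest never exits.
  await client.flush();
});
```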
tommedema commented 3 years ago

FYI, here's an example of how you could resolve this. Note that this is very important: otherwise statements like await amplitude.flush() have no meaning, since the promise would resolve without issues even though Amplitude is still busy flushing.
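
As a rough sketch of that direction (the class and method names below are assumptions modeled on this discussion, not the SDK's actual internals), the retry path could be awaited so the returned promise only resolves once retrying has finished:

```typescript
interface Response {
  status: number;
}

class RetryHandler {
  async sendEventsWithRetry(events: unknown[]): Promise<Response> {
    const response = await this._transportSend(events);
    if (response.status === 429 || response.status >= 500) {
      // Await the retry loop instead of kicking it off on a detached call
      // stack, so callers of flush()/logEvent() only resolve once it's done.
      return this._onEventsError(events, response);
    }
    return response;
  }

  // Hypothetical error hook; the key change is that it returns a Promise that
  // the caller awaits.
  private async _onEventsError(events: unknown[], response: Response): Promise<Response> {
    // ...exponential backoff and re-sending would go here...
    return response;
  }

  private async _transportSend(_events: unknown[]): Promise<Response> {
    // Placeholder transport; a real implementation would POST to the HTTP API.
    return { status: 200 };
  }
}
```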

kelvin-lu commented 3 years ago

Hi @tommedema,

We originally designed this retry class to start retries on a new call stack because of how long the throttling behavior can last: there's a timer pattern that roughly follows exponential backoff, which might be why your Jest tests are failing. In our own tests, we override this to a smaller value so that our tests run in a reasonable amount of time while exercising the throttling behavior.
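
For illustration only, here is the kind of timer pattern being described, with a base delay that tests can shrink; the names are not the SDK's actual internals:

```typescript
// Backoff delays that roughly double per attempt.
function backoffDelays(baseMs: number, attempts: number): number[] {
  return Array.from({ length: attempts }, (_, i) => baseMs * 2 ** i);
}

console.log(backoffDelays(100, 5)); // production-like: [100, 200, 400, 800, 1600]
console.log(backoffDelays(1, 5));   // test override:   [1, 2, 4, 8, 16]
```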

One limitation of returning Promise<Response> is that we can only provide one response for the payload the event was in, even if it was retried multiple times. I agree with you that awaiting the full retry loop and returning the last payload's response from logEvent makes more sense than returning the first.
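
A hedged sketch of those semantics from the caller's side, assuming the standard init / logEvent API, where the awaited value would be the final attempt's response:

```typescript
import * as Amplitude from '@amplitude/node';

async function track(): Promise<void> {
  const client = Amplitude.init('API_KEY');
  const response = await client.logEvent({
    event_type: 'Test Event',
    user_id: 'user-1',
  });
  // Under the proposal, this is the last attempt's response, so the caller can
  // tell whether the event was ultimately accepted or dropped.
  console.log(response);
}
```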

It gets trickier for flush(), because we don't guarantee that all of the events flush at the same time. If your flush includes events from both users x and y, and only x is being throttled, our retry behavior prioritizes getting y's events in so that they aren't held up by x's throttling (in the batch API, the entire payload is rejected if any of the users are exceeding limits).
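
To illustrate that idea, here is a hedged sketch of splitting a batch by throttled users before retrying; throttledUserIds is an assumption about what the error response exposes, not the SDK's actual shape:

```typescript
interface AmplitudeEvent {
  event_type: string;
  user_id: string;
}

// Keep non-throttled users' events moving and hold the throttled users'
// events back for a later retry.
function partitionByThrottledUsers(
  events: AmplitudeEvent[],
  throttledUserIds: Set<string>,
): { deliverNow: AmplitudeEvent[]; retryLater: AmplitudeEvent[] } {
  const deliverNow: AmplitudeEvent[] = [];
  const retryLater: AmplitudeEvent[] = [];
  for (const event of events) {
    (throttledUserIds.has(event.user_id) ? retryLater : deliverNow).push(event);
  }
  return { deliverNow, retryLater };
}
```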