amplitude / experiment-python-server

Amplitude Experiment Python Server SDK
MIT License

[Experiment] Fetch failed error when using fetch_v2 #48

Closed raulgzm closed 3 months ago

raulgzm commented 4 months ago

We are receiving different kinds of errors when we fetch experiments:

What do those errors mean?

Expected Behavior

The fetch_v2() method works properly.
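
For reference, the kind of call involved looks roughly like the sketch below (this is not the reporter's actual code; the deployment key, user fields, and flag key are placeholders):

```python
from amplitude_experiment import Experiment, RemoteEvaluationConfig, User

# Placeholder deployment key and user attributes for illustration only.
experiment = Experiment.initialize_remote('DEPLOYMENT_KEY', RemoteEvaluationConfig())
user = User(user_id='user@example.com', device_id='abcdefg')

variants = experiment.fetch_v2(user)   # remote evaluation over the network
variant = variants.get('my-flag-key')  # None if the flag key is missing
```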

Current Behavior

We receive different kinds of errors, though not on every request.

Possible Solution

We don't know the root cause.

Steps to Reproduce

We don't know how to reproduce it.

I can share Sentry traces and error information with you if you prefer.

Environment

zhukaihan commented 4 months ago

Hey Raul,

Thanks for submitting the issue.

These all look like issues related to network calls. When did this start happening, and does it continue to happen? Are there any special configurations of server_url (e.g. proxies or the EU data center)?

Thanks.

raulgzm commented 4 months ago

Hey Peter,

First of all, thank you so much for your support.

It started almost from the beginning, when we first used the library to integrate Amplitude experiments with our backend service, and it continues to happen. In fact, last night we hit a new issue with a different error:

Fetch failed: Fetch error response: status=502 Bad Gateway

There is no special configuration in front of the server for connections from the server to the outside world. The trace does not indicate anything interesting because the response is just a 502.

What can cause these kinds of errors?

Thank you so much for your help.

zhukaihan commented 4 months ago

Hi Raul,

Thanks for the response and for confirming that it's a persistent error.

We have just identified an issue with one of our CDN vendors, which should be the root cause of the above errors, and we have just stopped routing requests to that CDN. The 502 is a new one; it's possible that it is also related to the CDN. Please let us know if you still see any errors.

May I get a bit more info: when (date/time) did you start using the library, and when did the errors start? And what percentage of the request volume resulted in one of the above errors?

Thanks!

raulgzm commented 4 months ago

Hi Peter,

Thank you for your support. Do you know why we saw the last of these errors 7 hours ago? Maybe the change has not been applied yet?

"[Experiment] Fetch failed: Remote end closed connection without response"

We started using the library 2 months ago; the first error we saw was on 4 June 2024. As for what percentage of the request volume resulted in one of the above errors: I can't give an exact number, but we have seen 80 events in Sentry against a high number of requests in the service. So the percentage is probably low, but the impact is high because we can't use that feature flag properly in our service.

zhukaihan commented 4 months ago

The CDN change was applied within minutes.

To ensure these evaluations are served properly, I would suggest tweaking the retry parameters so that these requests are retried.
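
As a rough sketch of what that retry tuning could look like with RemoteEvaluationConfig (the parameter names and values below are illustrative assumptions; verify them against the SDK version and docs you are using):

```python
from amplitude_experiment import Experiment, RemoteEvaluationConfig

# Illustrative retry settings; not recommendations from this thread.
experiment = Experiment.initialize_remote(
    'DEPLOYMENT_KEY',  # placeholder deployment key
    RemoteEvaluationConfig(
        fetch_timeout_millis=500,             # time budget for the initial fetch
        fetch_retries=3,                      # retry transient 5xx / dropped connections
        fetch_retry_backoff_min_millis=500,   # delay before the first retry
        fetch_retry_backoff_max_millis=2000,  # cap on the backoff delay
        fetch_retry_backoff_scalar=1.5,       # exponential backoff multiplier
        fetch_retry_timeout_millis=500,       # time budget for each retry attempt
    ),
)
```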

It's quite unusual that the endpoint simply closes the connection without any status code or triggering a timeout. The following questions can help us understand patterns and potential causes: How frequently does this error occur? Does the error happen in bursts, or at a consistent rate over the course of a day? How long does it take for requests to fail, i.e. what is the average latency of the failures?

Thanks.

raulgzm commented 4 months ago

Hi Peter,

Let me try to answer those questions with helpful information:

How frequently does this error occur? If you mean the latest error, "Remote end closed connection without response", we have had 157 occurrences in 2 months. The last one was on Jul 23, 2:04 AM, and the first was on Jun 4, 12:21 PM. The error seems to happen every day:

[image: Sentry chart of error occurrences per day]

Does the error happen in bursts, or at a consistent rate over the course of a day? Over the course of a day, as you can see in the previous image, and the hour when the error occurs is always different. It does not seem to follow any pattern.

How long does it take for requests to fail, i.e. the average latency of failures? Judging from the traces, the requests seem to fail immediately. That said, Sentry does not show us the total duration of each trace, so we cannot know the average latency.

How else can I help out?

zhukaihan commented 4 months ago

Thanks for the info! It looks like there are around 2 errors per day at different times, and more on some days. What is the timezone of the times you mentioned? I'm trying to match the numbers up with our metrics. Thanks!

raulgzm commented 4 months ago

Hi Peter!

anytime!

The times are in UTC, as shown in the Sentry screenshot.

zhukaihan commented 4 months ago

The errors happened after we received a huge spike in traffic, and the timing is indeed unpredictable. We have been continuously improving the performance of our service during large spikes and plan to keep doing so. On the SDK side, my suggestion would be to configure retries; some tweaking may be needed to achieve optimal results. Thanks for raising this with us!
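
A minimal sketch of a defensive wrapper around fetch_v2 while retries are being tuned. Whether fetch_v2 raises or returns an empty result after exhausting retries is an assumption here; adjust the handling to the SDK's actual behavior.

```python
def resolve_variant(experiment, user, flag_key, default_value='control'):
    """Fetch a variant, falling back to a safe default on transient failures."""
    try:
        variants = experiment.fetch_v2(user)
    except Exception:
        # Network-level failure (502, closed connection, timeout): use the default.
        return default_value
    variant = variants.get(flag_key)
    return variant.value if variant and variant.value else default_value
```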