clearlydefined / service

The service side of clearlydefined.io

Cloudflare 524 Timeout error during API calls #749

Open pombredanne opened 4 years ago

pombredanne commented 4 years ago

I am frequently getting these:

524 A Timeout Occurred Cloudflare was able to complete a TCP connection to the origin server, but did not receive a timely HTTP response.

For instance, https://api.clearlydefined.io/harvest/pypi/pypi/-/django

What is the benefit of Cloudflare again? And is anyone ever notified of these errors, since they are not backend errors but come from Cloudflare?

wwahammy commented 4 years ago

I had similar problems with this yesterday as well, when generating notice files.

Cloudflare is honestly pretty great for a lot of things. One thing it's not great at is keeping connections open when it doesn't receive an HTTP response quickly (i.e., within 100 seconds).

These long-running tasks probably shouldn't be a single standard REST call. One way of handling it would be a call that begins a long-running task and immediately returns a token, which the client can then use to poll and check whether the task has finished. I'm sure there are other approaches too, such as WebSockets. I can see benefits and disadvantages to each, but I suspect either would involve a not-insignificant amount of work.
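
For illustration, here's a minimal sketch of that token-and-poll pattern. The /notices/jobs endpoints and the response shapes are hypothetical; ClearlyDefined does not currently expose such an API.

// Minimal sketch of the token-and-poll pattern described above.
// The /notices/jobs endpoints and response shapes are hypothetical.
const fetch = require('node-fetch');

async function generateNoticeAsync(coordinates) {
  // Kick off the long-running task; the server returns immediately with a token.
  const start = await fetch('https://api.clearlydefined.io/notices/jobs', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ coordinates })
  });
  const { token } = await start.json();

  // Poll until the server reports the task has finished (or failed).
  while (true) {
    const poll = await fetch(`https://api.clearlydefined.io/notices/jobs/${token}`);
    const status = await poll.json();
    if (status.state === 'done') return status.result;
    if (status.state === 'failed') throw new Error(status.error);
    await new Promise(resolve => setTimeout(resolve, 5000)); // wait 5 seconds between polls
  }
}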

wwahammy commented 4 years ago

cc @jeffmcaffer

wwahammy commented 4 years ago

As requested, this is an attempt to describe a reproducible example of the 524 errors.

To explain: the errors are not consistent. They happen about 25% of the time and seem to be bunched together in time. That is, if a call succeeds, it's likely that other calls soon after will also succeed; if a call fails, it's likely that calls soon after will also fail.

Here's how the error can be recreated, albeit inconsistently:

  1. Clone the Houdini Project repo using git clone https://github.com/houdiniproject/houdini.git
  2. Run yarn to install all the packages
  3. Run npx @houdiniproject/noticeme@^1.1.0-pre1 -f NOTICE-js -i included.json -s 250, which is what the Houdini Project's build runs. Note that the -s parameter is set to 250, which makes noticeme split the dependencies into chunks of 250, make consecutive requests for NOTICE files, and concatenate the results into one NOTICE file. Passing 0 to -s makes a single request with all of the dependencies at once (over 1500 packages), which is more likely to cause the 524 (see the sketch at the end of this comment).

If the 524 error happens, something in the console will mention the 524 status code. Upon a successful call to ClearlyDefined, noticeme will report whether or not the notice file is up to date.
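
For reference, a minimal sketch of the chunk-and-concatenate behavior described in step 3, assuming the notice endpoint accepts a POST with a coordinates array; see noticeme.js for the real request shape and response fields.

// Split the coordinates into chunks of `size`, request a NOTICE for each chunk,
// and concatenate the results. With size = 0 everything goes in a single request,
// which is the case most likely to exceed Cloudflare's 100-second limit.
const fetch = require('node-fetch');

async function chunkedNotice(coordinates, size) {
  const chunks = size > 0
    ? Array.from({ length: Math.ceil(coordinates.length / size) },
        (_, i) => coordinates.slice(i * size, (i + 1) * size))
    : [coordinates];

  let notice = '';
  for (const chunk of chunks) {
    const response = await fetch('https://api.clearlydefined.io/notices', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ coordinates: chunk })
    });
    const body = await response.json();
    notice += body.content; // assumed response field; check noticeme.js for the real shape
  }
  return notice;
}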

jeffmcaffer commented 4 years ago

From @MichaelTsengLZ via Discord. Some investigations:

  1. For https://api.clearlydefined.io/harvest/pypi/pypi/-/django, the service gets around 250k blobs, which makes it super slow. I think this API should use long-polling or WebSockets, which means more discussion.

  2. For the notice API: the conclusion is that it's better to lower the batch size to 50 at this point.

  3. I probably need to change the rate limit setting tomorrow in order to slow down requests.

jeffmcaffer commented 4 years ago

Thanks @MichaelTsengLZ and cc @nellshamrell

The perf numbers are interesting. It feels like something is going on with the backend somehow. Even those P90 numbers are pretty bad for simple (and relatively small) blob access. Is the service getting throttled or something? Feels like that's part of the root cause here.

Some other questions: Re 1, is there really part of this doing a GET on the harvest API? That seems unexpected. Do you know where that's happening? I'd be really interested to hear about the scenario where that API is needed. We are not really expecting folks to be calling it at scale.

Also, @Michael Zeng, when you say "250k blobs", is that one blob that is 250K in size or literally 250K different blobs? If the former, that's big but not that big. If the latter, what the heck?! It would be great to understand the scenario there. Any insight into what's going on? Generating the notices for 1500 components should be roughly 1500+N blob fetches (where the 1500 is for the relevant definitions and N is for the relevant license files, perhaps more than one per component).

Re #2, the mention of "notice API" made me realize that I had misunderstood what noticeme was doing wrt chunking. For whatever reason I thought noticeme was calling the definitions API with different-sized chunks. I see in https://github.com/houdiniproject/noticeme/blob/main/noticeme.js#L41 that it's calling the notice API and then concatenating the files. Interesting. That will work for certain configurations of the notice service but if, for example, it is set to aggregate by license, the chunking will defeat the aggregation.

Chunking to 50 will help in this particular case but we need to figure out the root cause.

wwahammy commented 4 years ago

Thanks all for the help.

As you pointed out, @jeffmcaffer, the chunking is not ideal as it expands the size of the NOTICE file. In my experience, 250-package chunks add about 20-30% to the NOTICE file size. I'm fine with temporarily reducing the chunk size, but I'd really like to get to the point where I don't have to chunk at all.

fossygirl commented 4 years ago

Tagging @MichaelTsengLZ since I think the wrong Michael got tagged in the comments.

MichaelTsengLZ commented 4 years ago

@jeffmcaffer,

  1. I double-checked the Resource health of the Azure blob storage account, and there are throttling warnings. For the short term, @jeffmcaffer, would it be better to change the Azure storage client retry policy from fixed linear to random exponential (see the sketch after this list)? For the long term, we need to dig into what exceeds the resource limit, whether we need a better partition key, and whether we need to request a higher resource limit.

  2. For the /harvest API, I found an average of 2614 calls per day over the last month. Your latter explanation of "250k blobs" is what I observed: there are 250k blobs prefixed with /pypi/pypi/-/django. The /harvest API returns everything prefixed with the given coordinates instead of an exact match. I'm not sure whether that was expected.
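
A minimal sketch of the retry-policy change suggested in item 1, assuming the service uses the legacy azure-storage Node SDK; the actual client setup in the codebase may differ.

const azure = require('azure-storage');

// A fixed linear policy retries a throttled account at a constant interval;
// an exponential policy backs off progressively, giving the account time to recover.
// Constructor arguments: retryCount, retryInterval, minRetryInterval, maxRetryInterval (ms).
const exponentialRetry = new azure.ExponentialRetryPolicyFilter(3, 30000, 3000, 90000);

const blobService = azure
  .createBlobService(process.env.AZURE_STORAGE_CONNECTION_STRING)
  .withFilter(exponentialRetry);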

fossygirl commented 4 years ago

Just wanted to check in on this. @MichaelTsengLZ or @nellshamrell is there any further work on this you folks are planning?

And on the other side of things, @wwahammy or @pombredanne are you folks still experiencing these errors in, say, the last couple weeks?

wwahammy commented 4 years ago

I haven't noticed an issue in the last few weeks. But I also haven't been making as many calls and I haven't been trying a call of all 1600 dependencies either.

nellshamrell commented 4 years ago

@jeffmcaffer and I discussed this yesterday - it is definitely still on my radar.

nellshamrell commented 3 years ago

Digging into this now. Seeing if I can find a way to reproduce it (I'll follow the steps detailed in some of the previous comments) and observe the behavior in both Cloudflare and Azure.

Reference/tips about Cloudflare 524 errors

nellshamrell commented 3 years ago

I haven't been able to reproduce this through the Houdini example, but I can reproduce it using a call to the harvest API

$ curl https://api.clearlydefined.io/harvest/pypi/pypi/-/django
<html>
<head><title>524 Origin Time-out</title></head>
<body bgcolor="white">
<center><h1>524 Origin Time-out</h1></center>
<hr><center>cloudflare-nginx</center>
</body>
</html>

Investigating.

nellshamrell commented 3 years ago

This is what I'm seeing when looking at the Resource health of our ClearlyDefined prod storage account on Azure. Not seeing anything too helpful atm, but will continue digging.

(screenshot: Resource health for the ClearlyDefined prod storage account)

Another interesting thing: I don't see the same error when requesting the same harvest from the dev API

$ curl https://dev-api.clearlydefined.io/harvest/pypi/pypi/-/django
["pypi/pypi/-/django-celery-beat/1.1.1/clearlydefined/1.3.1","pypi/pypi/-/django-celery-beat/1.1.1/licensee/9.13.0","pypi/pypi/-/django-celery-beat/1.1.1/scancode/3.2.2","pypi/pypi/-/django-guardian/2.0.0/clearlydefined/1.3.1","pypi/pypi/-/django-guardian/2.0.0/licensee/9.13.0","pypi/pypi/-/django-guardian/2.0.0/scancode/3.2.2","pypi/pypi/-/django-heroku/0.3.1/clearlydefined/1.3.1","pypi/pypi/-/django-heroku/0.3.1/scancode/3.2.2","pypi/pypi/-/django-nose/1.4.4/clearlydefined/1.3.1","pypi/pypi/-/django-pyodbc-azure/2.0.1.0/clearlydefined/1.3.1","pypi/pypi/-/django-pyodbc-azure/2.1.0.0/clearlydefined/1.3.1","pypi/pypi/-/django-pyodbc-azure/2.1.0.0/licensee/9.13.0","pypi/pypi/-/django-pyodbc-azure/2.1.0.0/scancode/3.2.2","pypi/pypi/-/django-rest-polymorphic/0.1.8/clearlydefined/1.3.1","pypi/pypi/-/django-rest-polymorphic/0.1.8/scancode/3.2.2","pypi/pypi/-/django-rest-swagger/2.2.0/clearlydefined/1.3.1","pypi/pypi/-/django-sslserver/0.20/clearlydefined/1.3.1","pypi/pypi/-/django-sslserver/0.20/fossology/3.6.0","pypi/pypi/-/django-sslserver/0.20/licensee/9.12.1","pypi/pypi/-/django-sslserver/0.20/scancode/3.2.2","pypi/pypi/-/django-user-agents/0.4.0/clearlydefined/1.3.1","pypi/pypi/-/django-user-agents/0.4.0/licensee/9.13.0","pypi/pypi/-/django-user-agents/0.4.0/scancode/3.2.2",
(...)

nellshamrell commented 3 years ago

Ah....

So this returns the error (when we don't specify a version for django):

$ curl https://api.clearlydefined.io/harvest/pypi/pypi/-/django

This DOES NOT return an error (when we specify a version):

curl https://api.clearlydefined.io/harvest/pypi/pypi/-/django/3.1.5

It makes sense that the non-version-specific call would take much longer with so many potential blobs involved (250k, according to the estimate above). I'm not surprised it's timing out.

That is an excellent pointer to the problem with this specific call.

It would be really helpful to know if other people have seen this error recently in other contexts, or if anyone sees it in the future. If you see this error, please add a comment here with: 1) the exact command that was run (and any appropriate context), and 2) the date and time it was run, including your timezone.

nellshamrell commented 3 years ago

I am seeing this occasionally when running my script to update all 800,000 Maven records (detailed in #799). It seems to happen every few thousand calls to GET /definitions with force=true
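
For context, a minimal sketch of the kind of call the update script makes; the Maven coordinates below are placeholders, not the real ones from the script.

const fetch = require('node-fetch');

// Force a recompute of a single definition; a 524 here means Cloudflare gave up
// after ~100 seconds even though the backend may still be working on the request.
async function recompute(coordinates) {
  const res = await fetch(`https://api.clearlydefined.io/definitions/${coordinates}?force=true`);
  if (res.status === 524) throw new Error(`524 timeout while recomputing ${coordinates}`);
  return res.json();
}

// Placeholder coordinates, not taken from the actual script.
recompute('maven/mavencentral/org.apache.commons/commons-lang3/3.12.0');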

nellshamrell commented 3 years ago

I believe the 524 errors correspond with this big spike in failing transactions on the Clearlydefinedprod storage account (where our blob store is)

(screenshot: spike in failed transactions on the clearlydefinedprod storage account)

nellshamrell commented 3 years ago

Possibly related https://docs.microsoft.com/en-us/azure/storage/common/storage-monitoring-diagnosing-troubleshooting?tabs=dotnet#transient-increase-in-PercentThrottlingError

I bet I could replicate this in our dev environment by scripting something to make a few thousand GET requests to the API
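
A minimal sketch of that kind of load script against the dev API; the coordinates, request count, and concurrency below are placeholders.

const fetch = require('node-fetch');

// Fire `total` GET requests at the dev API, `concurrency` at a time,
// and count how many fail (524s or any other non-2xx response).
async function loadTest(total = 3000, concurrency = 20) {
  const url = 'https://dev-api.clearlydefined.io/definitions/pypi/pypi/-/django/3.1.5'; // placeholder coordinates
  let failures = 0;
  for (let sent = 0; sent < total; sent += concurrency) {
    const batch = Array.from({ length: Math.min(concurrency, total - sent) }, () =>
      fetch(url)
        .then(res => { if (!res.ok) failures++; })
        .catch(() => failures++)
    );
    await Promise.all(batch);
  }
  console.log(`${failures} of ${total} requests failed`);
}

loadTest();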

nellshamrell commented 3 years ago

Did some more investigating on this:

It definitely seems like the choke point is our Azure Blob Storage. If I do more than 1000 recompute requests within 5 minutes, the storage times out. I bet there is a limit on READs as well.

Making this my priority for next week.

nellshamrell commented 3 years ago

Got another report of this in production:

(From Discord)

duranbc (Brian Duran) — Yesterday at 2:06 PM Hi All,

I appear to be seeing my first HTTP 524 (Cloudflare?) error.

Trying to make this API call to GET definition info: https://api.clearlydefined.io/definitions/git/github/haproxy/haproxy/ac198b92d461515551b95daae20954b3053ce87e

My http client logged 101142 milliseconds before timeout and error.

Also, the definition itself in the web UI appears "Not Harvested":

https://clearlydefined.io/definitions/git/github/haproxy/haproxy/ac198b92d461515551b95daae20954b3053ce87e

Any thoughts or suggestions?

FYI @Nell Shamrell :slight_smile:

Thanks,

Brian

nellshamrell commented 3 years ago

Here's the request in our app logs

(screenshot: the request in our app logs)

nellshamrell commented 3 years ago

And a trace of what went on after it. Looks like several "definition not available" and "GitHubCurationService.autocurate.notApplicable" messages.

(screenshot: trace showing "definition not available" and "GitHubCurationService.autocurate.notApplicable" messages)

nellshamrell commented 3 years ago

I do not see any exceptions around that time.

Here is a view of the dependencies that were called as a result of the request:

Section 1

(screenshot: dependency calls, section 1)

nellshamrell commented 3 years ago

Section 2

(screenshot: dependency calls, section 2)

Note that the one the arrow is pointing to (a call to clearlydefined as a dependency) lasted for a total of 11 min, 51.12 seconds. That may be our culprit - Cloudflare 524 errors are returned when a request takes longer than 100 seconds to return (source: https://bobcares.com/blog/cloudflare-error-524/)

Digging into that record.

nellshamrell commented 3 years ago

Looks like we were running a find on our MongoDB for 11.9 min, which I would expect to cause a timeout.

(screenshot: MongoDB find lasting 11.9 minutes)

nellshamrell commented 3 years ago

Taking a look at a call I made using

curl https://api.clearlydefined.io/definitions/git/github/haproxy/haproxy/ac198b92d461515551b95daae20954b3053ce87e

which did return successfully.

The clearlydefined dependency call lasted only 90.00 ms

(screenshot: clearlydefined dependency call lasting 90.00 ms)

nellshamrell commented 3 years ago

And the find request to the MongoDB only lasted 90 ms

(screenshot: MongoDB find lasting 90 ms)

Interesting...I wonder why that query took so long yesterday (11 min, 51 sec!) but only 90 ms today. Checking out what was going on in the MongoDB at that time yesterday.

nellshamrell commented 3 years ago

Looks like there was a giant spike of requests to the Mongo database around the time of the error on 4/19.

(screenshot: spike of requests to the Mongo database on 4/19)

nellshamrell commented 3 years ago

(Thanks to @MichaelTsengLZ for pointing me to the above graph!)

nellshamrell commented 3 years ago

Looking at the total request metrics for the past month, I see two big spikes:

(screenshot: total request metrics for the past month, showing two spikes)

Seeing if I can figure out where all those requests were coming from.

nellshamrell commented 3 years ago

My working theory is that this is happening because we are hitting our maximum throughput set for the definitions-paged collection in the clearlydefined mongodb in the clearlydefined-prod cosmosdb account.

(for more information on what throughput is in the context of cosmosdb, check out this article)

When we have a spike in requests (particularly at the scale of the ones above), I also see a spike in throttling, which is what happens when we hit our maximum allowed throughput. The maximum throughput we currently allow for the definitions-paged collection (it autoscales) is 70,000 Request Units per second (RU/s are the currency for throughput). I just increased it to 100,000 RU/s. Let's see whether we continue to see these kinds of spikes and whether we still hit the limit with the increase.

I'm leaving this open for tracking purposes; if anyone else sees 524 errors, please report them here.

wwahammy commented 3 years ago

Just an FYI, I just received a 524 error.

wwahammy commented 3 years ago

I've actually received 524 errors on every attempt (about 5 of them) over the past 10 minutes.

MichaelTsengLZ commented 3 years ago

FYI, there was a ClearlyDefined overload from 5:00 AM to 2:00 PM (PST) today.

nellshamrell commented 3 years ago

Yep, I think that was related to the overload 4 days ago; apologies for that. We are discussing adding a limit to the API call that caused the overload.

wwahammy commented 3 years ago

Just an FYI, I'm getting 524 errors again right now.

MichaelTsengLZ commented 3 years ago

The service was unstable because of the earlier high load. It has now recovered.

nellshamrell commented 3 years ago

@MichaelTsengLZ and I are meeting on Tuesday to discuss a more permanent solution for this

wwahammy commented 3 years ago

> @MichaelTsengLZ and I are meeting on Tuesday to discuss a more permanent solution for this

@nellshamrell anything we can do to help?

nellshamrell commented 3 years ago

Hi @wwahammy! I'm revisiting this issue after a report of 524's today. I saw your comment from 24 days ago and want to say thank you so much for the offer to help! I will definitely reach out if/when there are things I can use your help on related to this issue. ❤️

nellshamrell commented 3 years ago

Latest investigation notes:

After examining a lot of the 524 errors received from CD by an internal Microsoft service, here are my observations:

Overall, there is not an easily identifiable bottleneck in the ClearlyDefined production infrastructure.

When a request takes too long to return (Cloudflare returns the 524 after 100 seconds), it is inconsistent which part of the ClearlyDefined infra causes the delay.

Overall, I think ClearlyDefined is getting more traffic than it can handle. I'd like to lower the rate limit (at the moment it's 1000 API requests per 60 seconds) and see if that has an effect, but it does mean that some users will need to throttle their requests to ClearlyDefined.
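
For reference, a minimal sketch of what lowering the rate limit might look like, assuming the service uses the express-rate-limit middleware; the mount point and exact values in the real codebase may differ.

const express = require('express');
const rateLimit = require('express-rate-limit');

const app = express();

// Current production setting described above: 1000 requests per 60-second window.
// Lowering `max` (500 here is a hypothetical value) forces heavy callers to
// throttle their requests instead of overloading the backend.
app.use(rateLimit({
  windowMs: 60 * 1000, // 60-second window
  max: 500             // hypothetical lowered limit; production currently allows 1000
}));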

nellshamrell commented 3 years ago

Talked to @jeffwilcox about this and he had two good suggestions for investigating the occasional GH API call slowdowns we are seeing.

Investigating both of these now!

wwahammy commented 3 years ago

Just an update: I'm getting these errors again today.

nellshamrell commented 3 years ago

@wwahammy ty for letting us know. I will take a look.

nellshamrell commented 3 years ago

@wwahammy is it possible to give me an example of coordinates you were using in the API call that was receiving the 524 errors?

nellshamrell commented 3 years ago

I am seeing a significant spike in latency from our Azure Blob storage starting 7 hours ago (it appears to have stabilized shortly after that)

wwahammy commented 2 years ago

Hey all, I started getting 524 errors just now. The service has been slow-ish all day but hadn't timed out until now.

fossygirl commented 2 years ago

@MichaelTsengLZ Could you look into this, please?

wwahammy commented 2 years ago

As an FYI, I'm getting 524's right now.

MichaelTsengLZ commented 2 years ago

@wwahammy which APIs are you getting 524's from? Could you give me some examples?