pombredanne opened this issue 4 years ago
I've had similar problems with this as well yesterday when generating notice files.
Cloudflare is honestly pretty great for a lot of things. One thing it's not great at is keeping connections open when it doesn't receive an HTTP response quickly (i.e. within 100 seconds).
These long-running tasks probably shouldn't be a single standard REST call. One way of handling it would be to have a call that begins a long-running task and immediately returns a token, which the client can use to poll and check whether the task has finished. I'm sure there are other approaches too, such as websockets. I can see benefits and disadvantages to each, but I suspect either would involve a not-insignificant amount of work.
cc @jeffmcaffer
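To make the token-and-poll idea above concrete, here is a minimal client-side sketch. The POST /notices/jobs and GET /notices/jobs/:token endpoints are hypothetical (ClearlyDefined does not expose them today), and it assumes Node 18+ for the built-in fetch:

// Sketch only: /notices/jobs and its token/poll shape are hypothetical,
// not part of the current ClearlyDefined API.
const BASE = 'https://api.clearlydefined.io';

async function generateNoticeAsync(coordinates) {
  // Start the long-running job; the server answers immediately with a token.
  const start = await fetch(`${BASE}/notices/jobs`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ coordinates })
  });
  const { token } = await start.json();

  // Poll for the result. Each poll returns quickly, so no single request
  // ever approaches Cloudflare's ~100 second limit.
  for (;;) {
    const poll = await fetch(`${BASE}/notices/jobs/${token}`);
    const status = await poll.json();
    if (status.state === 'done') return status.notice;
    if (status.state === 'failed') throw new Error(status.error);
    await new Promise(resolve => setTimeout(resolve, 5000));
  }
}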
As requested, this is an attempt to describe a reproducible example of the 524 errors.
To explain, the errors are not consistent and happen about 25% of the time and seem to be bunched together in time. What I mean is that if a call succeeds, it's likely that other calls soon after will also succeed. However if a call fails, it's likely that calls soon after will fail.
Here's how the error can be recreated, albeit inconsistently:
git clone https://github.com/houdiniproject/houdini.git
yarn
to install all the packages, then
npx @houdiniproject/noticeme@^1.1.0-pre1 -f NOTICE-js -i included.json -s 250
which is what the Houdini Project's build runs. Note that the -s parameter is set to 250, which makes noticeme split the dependencies into chunks of 250, make consecutive requests for NOTICE files, and concatenate the results into one NOTICE file. If you pass 0 to -s, noticeme makes a single request with all of the dependencies at once (over 1500 packages), which is more likely to cause the 524. If the 524 error happens, something in the console will mention the 524 status code. On a successful call to ClearlyDefined, noticeme will report whether the notice file is up to date or not.
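For reference, the chunking behavior boils down to something like the following simplified sketch (not the actual noticeme source, which is linked later in this thread; it assumes the notice service's POST /notices endpoint and that the response JSON carries a content field, plus Node 18+ for fetch):

// Simplified sketch of -s style chunking; see noticeme.js for the real implementation.
async function fetchNotices(coordinates, chunkSize) {
  // A chunkSize of 0 means "send everything in one request".
  const size = chunkSize > 0 ? chunkSize : coordinates.length;
  const parts = [];
  for (let i = 0; i < coordinates.length; i += size) {
    const chunk = coordinates.slice(i, i + size);
    // One POST per chunk: smaller chunks return faster and are less likely to
    // hit Cloudflare's ~100 second limit, but concatenating the parts can
    // defeat any cross-chunk aggregation the notice service would do.
    const response = await fetch('https://api.clearlydefined.io/notices', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ coordinates: chunk })
    });
    const body = await response.json();
    parts.push(body.content);
  }
  return parts.join('\n');
}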
From @MichaelTsengLZ via Discord Some investigations:
For https://api.clearlydefined.io/harvest/pypi/pypi/-/django, the service fetches around 250k blobs, which makes it super slow. I think this API should use long-polling or a websocket, which means more discussion.
For the notice API, the conclusion is that it's better to lower the batch size to 50 at this point.
I probably need to change the rate limit setting tomorrow in order to slow down requests.
Thanks @MichaelTsengLZ and cc @nellshamrell
The perf numbers are interesting. It feels like something is going on with the backend somehow. Even those P90 numbers are pretty bad for simple (and relatively small) blob access. Is the service getting throttled or something? Feels like that's part of the root cause here.
Some other questions:
Re #1, is there really part of this doing a GET on the harvest API? That seems unexpected. Do you know where that's happening?
I'd be really interested to hear about the scenario where that API is needed. We are not really expecting folks to be calling that at scale.
Also, @Michael Zeng when you say "250k blobs" is that one blob that is 250K in size or literally 250K different blobs? If the former, that's big but not that big. If the latter, what the heck?! It would be great to understand the scenario there. Any insight into what's going on? Generating the notices for 1500 components should be roughly 1500+N blob fetches (where the 1500 is for the relevant definitions and N is for the relevant license files, perhaps more than one per component).
Re #2, the mention of "notice API" made me realize that I had misunderstood what noticeme was doing wrt chunking. For whatever reason I thought that noticeme was calling the definitions API with different sized chunks. I see in https://github.com/houdiniproject/noticeme/blob/main/noticeme.js#L41 that it's calling the notice API and then concatenating the files. Interesting. That will work for certain configurations of the notice service but, for example, if it is set to aggregate by license, the chunking will defeat the aggregation.
Chunking to 50 will help in this particular case but we need to figure out the root cause.
Thanks all for the help.
As you pointed out, @jeffmcaffer, the chunking is not ideal as it expands the size of the NOTICE file. In my experience, chunks of 250 packages add about 20-30% to the NOTICE file size. I'm fine with temporarily reducing the chunk size, but I'd really like to get to the point where I don't have to chunk at all.
Tagging @MichaelTsengLZ since I think the wrong Michael got tagged in the comments.
@jeffmcaffer,
I double-checked the Resource health of the Azure blob storage, and there are throttle warnings. For the short term, @jeffmcaffer, would it be better to change the Azure storage client retry policy from fixed linear to random exponential? For the long term, we need to dig into what is exceeding the resource limit, whether we need a better partition key, and whether we need to request a higher resource limit.
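For what it's worth, with the azure-storage Node SDK that change is roughly the following (a sketch; the retry counts and intervals are illustrative, and the service's actual storage client setup may differ):

const azure = require('azure-storage');

// Before (fixed linear): retries happen at a constant interval.
// const retryPolicy = new azure.LinearRetryPolicyFilter(3, 30000);

// After (randomized exponential): each retry backs off further, which spreads
// retries out and gives a throttled storage account time to recover.
const retryPolicy = new azure.ExponentialRetryPolicyFilter(
  5,     // retryCount
  10000, // retryInterval in ms
  3000,  // minRetryInterval in ms
  90000  // maxRetryInterval in ms
);

const blobService = azure
  .createBlobService(process.env.AZURE_STORAGE_CONNECTION_STRING)
  .withFilter(retryPolicy);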
For the /harvest API, I found an average of 2614 calls per day over the last month. Your latter interpretation of "250k blobs" is what I observed: there are 250k blobs prefixed with /pypi/pypi/-/django. The /harvest API asks for everything prefixed with a coordinate rather than an exact match. I'm not sure whether that was expected.
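For context, the listing that /harvest does for a coordinate amounts to a prefix scan over blob storage, roughly like this (a sketch using the azure-storage SDK; the actual ClearlyDefined service code may differ):

const azure = require('azure-storage');
const blobService = azure.createBlobService(process.env.AZURE_STORAGE_CONNECTION_STRING);

// List every blob whose name starts with the given prefix, following
// continuation tokens. For a broad prefix like pypi/pypi/-/django this walk
// can touch ~250k blobs, which is why the request can blow past 100 seconds.
function listByPrefix(container, prefix, results = [], token = null) {
  return new Promise((resolve, reject) => {
    blobService.listBlobsSegmentedWithPrefix(container, prefix, token, (error, segment) => {
      if (error) return reject(error);
      results.push(...segment.entries.map(entry => entry.name));
      if (segment.continuationToken)
        return resolve(listByPrefix(container, prefix, results, segment.continuationToken));
      resolve(results);
    });
  });
}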
Just wanted to check in on this. @MichaelTsengLZ or @nellshamrell is there any further work on this you folks are planning?
And on the other side of things, @wwahammy or @pombredanne are you folks still experiencing these errors in, say, the last couple weeks?
I haven't noticed an issue in the last few weeks. But I also haven't been making as many calls and I haven't been trying a call of all 1600 dependencies either.
@jeffmcaffer and I discussed this yesterday - it is definitely still on my radar.
Digging into this now. Seeing if I can find a way to reproduce it (I'll follow the steps detailed in some of the previous comments) and observe the behavior in both Cloudflare and Azure.
I haven't been able to reproduce this through the Houdini example, but I can reproduce it using a call to the harvest API
$ curl https://api.clearlydefined.io/harvest/pypi/pypi/-/django
<html>
<head><title>524 Origin Time-out</title></head>
<body bgcolor="white">
<center><h1>524 Origin Time-out</h1></center>
<hr><center>cloudflare-nginx</center>
</body>
</html>
Investigating.
This is what I'm seeing when looking at the Resource health of our ClearlyDefined prod storage account on Azure. Not seeing anything too helpful atm, but will continue digging.
Another interesting thing - I don't see the same error when requesting the same harvest with the dev API:
$ curl https://dev-api.clearlydefined.io/harvest/pypi/pypi/-/django
["pypi/pypi/-/django-celery-beat/1.1.1/clearlydefined/1.3.1","pypi/pypi/-/django-celery-beat/1.1.1/licensee/9.13.0","pypi/pypi/-/django-celery-beat/1.1.1/scancode/3.2.2","pypi/pypi/-/django-guardian/2.0.0/clearlydefined/1.3.1","pypi/pypi/-/django-guardian/2.0.0/licensee/9.13.0","pypi/pypi/-/django-guardian/2.0.0/scancode/3.2.2","pypi/pypi/-/django-heroku/0.3.1/clearlydefined/1.3.1","pypi/pypi/-/django-heroku/0.3.1/scancode/3.2.2","pypi/pypi/-/django-nose/1.4.4/clearlydefined/1.3.1","pypi/pypi/-/django-pyodbc-azure/2.0.1.0/clearlydefined/1.3.1","pypi/pypi/-/django-pyodbc-azure/2.1.0.0/clearlydefined/1.3.1","pypi/pypi/-/django-pyodbc-azure/2.1.0.0/licensee/9.13.0","pypi/pypi/-/django-pyodbc-azure/2.1.0.0/scancode/3.2.2","pypi/pypi/-/django-rest-polymorphic/0.1.8/clearlydefined/1.3.1","pypi/pypi/-/django-rest-polymorphic/0.1.8/scancode/3.2.2","pypi/pypi/-/django-rest-swagger/2.2.0/clearlydefined/1.3.1","pypi/pypi/-/django-sslserver/0.20/clearlydefined/1.3.1","pypi/pypi/-/django-sslserver/0.20/fossology/3.6.0","pypi/pypi/-/django-sslserver/0.20/licensee/9.12.1","pypi/pypi/-/django-sslserver/0.20/scancode/3.2.2","pypi/pypi/-/django-user-agents/0.4.0/clearlydefined/1.3.1","pypi/pypi/-/django-user-agents/0.4.0/licensee/9.13.0","pypi/pypi/-/django-user-agents/0.4.0/scancode/3.2.2",
(...)
Ah....
So this returns the error (when we don't specify a version for django):
$ curl https://api.clearlydefined.io/harvest/pypi/pypi/-/django
This DOES NOT return an error when we specify a version:
curl https://api.clearlydefined.io/harvest/pypi/pypi/-/django/3.1.5
It makes sense that the non-version specific call would take much longer with so many potential blobs (250k, according to the estimate above) involved. I'm not surprised it's timing out.
That is an excellent pointer to the problem with this specific call.
It would be really helpful to know if other people have seen this error recently in other contexts - or if anyone sees it in the future. If you see this error, please add a comment here with: 1) the exact command that was run (and any appropriate context), and 2) the date and time it was run, including your timezone.
I am seeing this occasionally when running my script to update all 800,000 Maven records (detailed in #799). It seems to happen every few thousand calls to GET /definitions with force=true.
I believe the 524 errors correspond with this big spike in failing transactions on the Clearlydefinedprod storage account (where our blob store is).
I bet I could replicate this in our dev environment by scripting something to make a few thousand GET requests to the API.
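Something like this would be enough to drive that kind of load against the dev API (a throwaway sketch; the coordinate list and concurrency are arbitrary, and it assumes Node 18+ for fetch):

// Throwaway load generator for the dev API; do not point this at production.
const BASE = 'https://dev-api.clearlydefined.io';

async function hammerDefinitions(coordinatesList, concurrency = 20) {
  let failures = 0;
  for (let i = 0; i < coordinatesList.length; i += concurrency) {
    const batch = coordinatesList.slice(i, i + concurrency);
    const statuses = await Promise.all(
      batch.map(coordinates =>
        fetch(`${BASE}/definitions/${coordinates}?force=true`).then(r => r.status, () => 0)
      )
    );
    failures += statuses.filter(status => status !== 200).length;
    console.log(`${i + batch.length} requests sent, ${failures} non-200 responses so far`);
  }
}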
Did some more investigating on this:
It definitely seems like the choke point is our Azure Blob Storage. If I do more than 1000 recompute requests within 5 minutes, the storage times out. I am betting there is a limit on READs as well.
Making this my priority for next week.
Got another report of this in production:
(From Discord)
duranbc (Brian Duran) — Yesterday at 2:06 PM
Hi All,
I appear to be seeing my first HTTP 524 (Cloudflare?) error.
Trying to make this API call to GET definition info: https://api.clearlydefined.io/definitions/git/github/haproxy/haproxy/ac198b92d461515551b95daae20954b3053ce87e
My http client logged 101142 milliseconds before timeout and error.
Also, the definition itself in the web UI appears "Not Harvested":
Any thoughts or suggestions?
FYI @Nell Shamrell 🙂
Thanks,
Brian
Here's the request in our app logs
And a trace of what went on after it. Looks like several "definition not available" and "GitHubCurationService.autocurate.notApplicable" messages.
I do not see any exceptions around that time.
Here is a view of the dependencies that were called as a result of the request:
Note that the one the arrow is pointing to (a call to clearlydefined as a dependency) lasted for a total of 11 min, 51.12 seconds. That may be our culprit - Cloudflare 524 errors are returned when a request takes longer than 100 seconds to return (source: https://bobcares.com/blog/cloudflare-error-524/)
Digging into that record.
Looks like we were running a find on our MongoDB for 11.9 min, which I would expect to cause a timeout.
Taking a look at a call I made using
curl https://api.clearlydefined.io/definitions/git/github/haproxy/haproxy/ac198b92d461515551b95daae20954b3053ce87e
which did return successfully.
The clearlydefined dependency call lasted only 90.00 ms
And the find request to the MongoDB only lasted 90 ms
Interesting...I wonder why that query took so long yesterday (11 min, 51 sec!) but only 90 ms today. Checking out what was going on in the MongoDB at that time yesterday.
Looks like there was a giant spike of requests to the Mongo database around the time of the error on 4/19.
(Thanks to @MichaelTsengLZ for pointing me to the above graph!)
Looking at the total request metrics for the past month, I see two big spikes:
Seeing if I can figure out where all those requests were coming from.
My working theory is that this is happening because we are hitting our maximum throughput set for the definitions-paged collection in the clearlydefined mongodb in the clearlydefined-prod cosmosdb account.
(for more information on what throughput is in the context of cosmosdb, check out this article)
When we have a spike in requests (particularly of the scale of the ones above) I also see a spike in throttling, which is what happens when we hit our maximum allowed throughput. The maximum throughput (it autoscales) we currently allow for the definitions-paged collection is 70,000 Request Units (RU/s are the currency for throughput). I just increased it to 100,000 RU/s. Let's see if we continue to see these kinds of spikes and whether we are still hitting the limit even though it is increased.
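For reference, that autoscale max throughput bump can be done in the portal or with the Azure CLI along these lines (the resource group is a placeholder, and the account/database names are taken from the description above, so double-check them against the actual resources):

az cosmosdb mongodb collection throughput update \
  --account-name clearlydefined-prod \
  --resource-group <resource-group> \
  --database-name clearlydefined \
  --name definitions-paged \
  --max-throughput 100000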
I'm leaving this open for tracking purposes - if anyone else sees 524 errors, please report them here.
Just an FYI, I just received a 524 error.
I've actually received 524 errors on every attempt (about 5 of them) over the past 10 minutes.
FYI, there is a ClearlyDefined overload from 5:00 AM to 2:00 PM (PST) today.
Yep, I think that was related to the overload 4 days ago, apologies for that. We are discussing adding a limit to the API call which caused the overload.
Just an FYI, I'm getting 524 errors again right now.
The service was not stable because of the previous high load. It has now recovered.
@MichaelTsengLZ and I are meeting on Tuesday to discuss a more permanent solution for this
@nellshamrell anything we can do to help?
Hi @wwahammy! I'm revisiting this issue after a report of 524's today. I saw your comment from 24 days ago and want to say thank you so much for the offer to help! I will definitely reach out if/when there are things I can use your help on related to this issue. ❤️
Latest investigation notes:
After examining a lot of the 524 errors received from CD by an internal Microsoft service, here are my observations:
Overall, there is not an easily identifiable bottleneck in the ClearlyDefined production infrastructure.
When a request takes > 1 min to return (which is what causes the 524 error from Cloudflare), the part of the ClearlyDefined infra causing the delay varies from request to request.
Overall - I think ClearlyDefined is getting more traffic than it can handle. I'd like to lower the rate limit (at the moment it's 1000 API requests per 60 seconds) and see if that has an effect, but that does mean that some users will need to throttle their requests to ClearlyDefined.
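As a sketch of what lowering the limit could look like, assuming an express-rate-limit style middleware in front of the API (the 500 here is illustrative, not a decided value):

const express = require('express');
const rateLimit = require('express-rate-limit');

const app = express();

// Current setting is roughly 1000 requests per 60 second window; lowering the
// max makes clients back off sooner instead of piling load onto storage.
const apiLimiter = rateLimit({
  windowMs: 60 * 1000, // 60 second window
  max: 500             // illustrative lower value, down from 1000
});

app.use(apiLimiter);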
Talked to @jeffwilcox about this and he had two good suggestions for investigating the occasional GH API call slowdowns we are seeing.
Investigating both of these now!
Just an update: I'm getting these errors again today.
@wwahammy ty for letting us know. I will take a look.
@wwahammy is it possible to give me an example of coordinates you were using in the API call that was receiving the 524 errors?
I am seeing a significant spike in latency from our Azure Blob storage 7 hours ago (it appears to have stabilized shortly after that)
Hey all, I just started getting 524 errors just now. The service has been slow-ish all day but hadn't timed out until now.
@MichaelTsengLZ Could you look into this, please?
As an FYI, I'm getting 524's right now.
@wwahammy which API are you getting 524's from? Could you give me some examples?
I am frequently getting these:
For instance, https://api.clearlydefined.io/harvest/pypi/pypi/-/django
What is the benefit of Cloudflare again? And is anyone ever notified of these errors, given that they are not backend errors but come from Cloudflare?