Closed. devtayls closed this 2 months ago.
This is happening almost 1 of every 5 times for us. Seems like something that might be getting rate limited or just slowed down.
I can't say that I've hit this particular problem. My assumption is that this is related to a hexpm mirror vs a bug in the code, but that's an assumption.
@agundy @devtayls is this still happening for you?
happens to us regularly:
```
Run erlef/setup-beam@v1
Action fetch /builds/otp/ubuntu-22.04/builds.txt failed for mirror https://builds.hex.pm/, with Error: Request timeout: /builds/otp/ubuntu-22.04/builds.txt
Error: Could not fetch /builds/otp/ubuntu-22.04/builds.txt from any hex.pm mirror
```
is there a way to add a retry?
I'd say we're hitting this on about 5-10% of action invocations, multiplied by several jobs in a workflow all using the same action. It's causing a painfully high incidence of spurious CI failures
is there a way to add a retry?
Yes, there is retry functionality in the toolkit we use; we may also need to consider a new/better caching strategy.
Hey all, we actually already retry :) I had to take a second to look at the code; we can increase this number and possibly make it configurable. I'll open up a PR and do some experimentation.
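For reference, a minimal sketch of what tuning that retry count could look like, assuming the relevant knob is the `allowRetries`/`maxRetries` request options of `@actions/http-client`; the user agent, URL, and values here are illustrative, not setup-beam's actual configuration:

```ts
import {HttpClient} from '@actions/http-client'

// Illustrative only: @actions/http-client can retry failed requests when
// allowRetries is set; maxRetries is the knob that could be raised or
// exposed as an action input.
const client = new HttpClient('setup-beam-example', [], {
  allowRetries: true,
  maxRetries: 5 // hypothetical value; could come from an action input
})

async function fetchBuildsTxt(): Promise<string> {
  const res = await client.get(
    'https://builds.hex.pm/builds/otp/ubuntu-22.04/builds.txt'
  )
  if (res.message.statusCode !== 200) {
    throw new Error(`Unexpected status ${res.message.statusCode}`)
  }
  return res.readBody()
}
```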
@ericmj or @wojtekmach do you know of issues going on with hex in regards to where the builds.txt files reside? It may be nothing on your end, and just problems inside github runners.
Additional problem: Their http client interface does not currently retry 408 (request timeout); instead it just throws an error. There seems to be an open issue and PR for this here.
Thus, even if we increase the number of retries in the http client itself, it will make no difference. The only thing we can do, it seems, is wait for this to be merged and released by github, or add a retry ourselves. We might be able to use their retry helper.
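Until then, a hand-rolled wrapper is one way to retry around the errors the client re-throws. This is only a sketch: the helper name, attempt counts, and backoff values are made up, not toolkit code.

```ts
// Hypothetical helper: retry any async operation with exponential backoff,
// covering the cases (socket timeouts, 408s) that the toolkit's built-in
// retries reportedly don't.
async function withRetries<T>(
  op: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 1_000
): Promise<T> {
  let lastError: unknown
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await op()
    } catch (err) {
      lastError = err
      if (attempt < attempts) {
        // Back off 1s, 2s, 4s, ... before the next try.
        await new Promise(resolve =>
          setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1))
        )
      }
    }
  }
  throw lastError
}

// Usage (fetchBuildsTxt being whatever performs the actual request):
// const buildsTxt = await withRetries(() => fetchBuildsTxt(), 5)
```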
Here are the options at our disposal:
@paulo-ferraz-oliveira Have any thoughts on this?
Edit:
Additional note: We may have to add retry logic ourselves. Without going through what tool-cache will and will not retry on, when I looked at the error in the screenshot above and the one in #261, this appears to be a socket timeout vs a 408. The default timeout seems to be 3 minutes; I'll see if I can replicate this to verify that's how long the operation is hanging before throwing an error.
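If it is indeed the socket hanging, the timeout itself may also be tunable. A hedged sketch, assuming the client in play is `@actions/http-client` and that its request options accept a `socketTimeout` in milliseconds (the value below is purely illustrative):

```ts
import {HttpClient} from '@actions/http-client'

// Sketch only: raise the socket timeout above the roughly 3-minute default
// mentioned above, so slow responses get more room before the socket is
// torn down.
const patientClient = new HttpClient('setup-beam-example', [], {
  socketTimeout: 5 * 60 * 1000 // 5 minutes, in milliseconds
})
```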
do you know of issues going on with hex in regards to where the builds.txt files reside? It may be nothing on your end, and just problems inside github runners.
We are not aware of any issues other than that github has issues fetching it.
Additional problem: Their http client interface does not currently retry 408 (request timeout); instead it just throws an error.
This 408 error is interesting, where is it coming from? We do not explicitly return it in our custom CDN code and I cannot find any documentation about Fastly returning it.
Additional problem: Their http client interface does not currently retry 408 (request timeout); instead it just throws an error.
This 408 error is interesting, where is it coming from? We do not explicitly return it in our custom CDN code and I cannot find any documentation about Fastly returning it.
I fired off too fast here; this seems like the socket timing out vs a 408 being returned. I could be wrong, but the http client code in github's actions toolkit reads that way.
As such, there's not much we can do besides retry ourselves.
As an aside, the action could have a built-in cache (if it already does, please ignore this whole comment) for the build artifacts as well as builds.txt, and so it would mostly download stuff from gh infra as opposed to repo.hex.pm, increasing performance and reliability. I suppose builds.txt is ideally fresh to resolve versions so maybe it’d be a fallback to cache unless version matching is strict.
This does not solve the network problem directly; however, for projects with a warm cache, which I think would be a lot of projects, it kind of does.
As an aside, the action could have a built-in cache (if it already does, please ignore this whole comment) for the build artifacts as well as builds.txt, and so it would mostly download stuff from gh infra as opposed to repo.hex.pm, increasing performance and reliability. I suppose builds.txt is ideally fresh to resolve versions so maybe it’d be a fallback to cache unless version matching is strict.
This is where my head is at. There's no build information on GH right now though? But yes, cache builds.txt; if there is a build not in what is cached, then invalidate the cache and attempt to obtain a fresh copy. We cache builds.txt for say 24h (it isn't updated too frequently).
There's no build information on GH right now though?
could you elaborate?
We cache builds.txt for say 24h (it isn't updated too frequently).
Just to be clear I meant to use actions/cache API and quickly looking at the docs it doesn’t have a ttl option.
On second thought, caching builds.txt is maybe not such a great idea. The point is if we already have fresh OTP build there’s no point in hitting builds.txt to resolve versions. So scratch that. The point still stands though, if we have built in cache we make it less likely to hit the network outside gh infra and run into issues.
There's no build information on GH right now though?
could you elaborate?
What I meant is builds.txt files are not kept in the repo for bob, but I haven't looked recently, maybe that changed?
Just to be clear I meant to use actions/cache API and quickly looking at the docs it doesn’t have a ttl option.
On second thought, caching builds.txt is maybe not such a great idea. The point is if we already have fresh OTP build there’s no point in hitting builds.txt to resolve versions. So scratch that. The point still stands though, if we have built in cache we make it less likely to hit the network outside gh infra and run into issues.
Yes, agreed, and on inspection there is no ttl option; we'd simply overwrite the cache if there is no corresponding build there. I also agree with your other point: if we already have the cached artifact, then there's no point in even looking at builds.txt, but yes, it should still be cached in order to mitigate network boo-boos.
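As a rough illustration of that idea (not setup-beam's real code): cache builds.txt via the `@actions/cache` API, reuse it while it still lists the requested build, and fetch a fresh copy otherwise. Since cache entries are immutable per key, "overwriting" is modelled here as saving under a new content-derived key; `fetchBuildsTxt` is a hypothetical callback.

```ts
import * as cache from '@actions/cache'
import * as crypto from 'crypto'
import * as fs from 'fs'

// Hypothetical sketch: fetchBuildsTxt stands in for whatever actually
// downloads builds.txt from the hex.pm mirror.
async function getBuildsTxt(
  requestedVersion: string,
  fetchBuildsTxt: () => Promise<string>
): Promise<string> {
  const path = 'builds.txt'
  const keyPrefix = 'setup-beam-builds-txt-'

  // Restore the most recent cached copy, if any (prefix match on restoreKeys).
  const hit = await cache.restoreCache([path], `${keyPrefix}latest`, [keyPrefix])
  if (hit && fs.existsSync(path)) {
    const cached = fs.readFileSync(path, 'utf8')
    // Reuse the cached copy only if it already knows the requested build.
    if (cached.includes(requestedVersion)) return cached
  }

  // Stale or missing: fetch fresh and save under a content-derived key,
  // because actions/cache entries cannot be overwritten in place.
  const fresh = await fetchBuildsTxt()
  fs.writeFileSync(path, fresh)
  const digest = crypto.createHash('sha256').update(fresh).digest('hex').slice(0, 12)
  try {
    await cache.saveCache([path], `${keyPrefix}${digest}`)
  } catch {
    // Saving an already-reserved key throws; the fresh local copy still works.
  }
  return fresh
}
```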
Right, builds.txt is not on GitHub but on repo.hex.pm. There is a link to it on GitHub.com/hexpm/bob.
@paulo-ferraz-oliveira Have any thoughts on this?
@starbelly, my thoughts on this are (⚠️ unhelpful):
I don't use setup-beam much/at all.
Hello all! This is happening commonly for our projects. It would be really helpful if anything could be done to help with the instability of this action 🙏 Thank you
I'm not sure this is anything unstable about the action, but rather github (as I see it), and perhaps the http-client in the toolkit. Based on the conversations above, what would you like to see? Caching is the only viable course I believe, but it has its own problems, though builds.txt should change very infrequently, assuming that is your main issue.
This action is definitely abnormally unstable. I use many actions and this is the only one that has problems, and I know other Gleam users have the same problem with this action.
I'm running a test locally whereby builds.txt is fetched every 10 seconds; it will fail if the fetch fails. My thoughts here still are that for some reason github has problems fetching from hex sometimes (although I rarely see this myself). The results shall be interesting; I will let this run for, say, 12 hours (every 10 seconds).
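The actual test script isn't shown in the thread, but the probe is roughly this shape (a sketch assuming Node 18+ for the global `fetch`; the URL is the mirror plus the path from the error output above):

```ts
// Fetch builds.txt every 10 seconds and bail out on the first failure,
// approximating the local test described above.
const url = 'https://builds.hex.pm/builds/otp/ubuntu-22.04/builds.txt'

async function probe(): Promise<void> {
  for (;;) {
    const started = Date.now()
    try {
      const res = await fetch(url)
      if (!res.ok) throw new Error(`HTTP ${res.status}`)
      await res.text()
      console.log(`ok in ${Date.now() - started}ms`)
    } catch (err) {
      console.error(`fetch failed after ${Date.now() - started}ms`, err)
      process.exit(1)
    }
    await new Promise(resolve => setTimeout(resolve, 10_000))
  }
}

void probe()
```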
Right, builds.txt is not on GitHub but on repo.hex.pm. There is a link to it on GitHub.com/hexpm/bob.
My tests aside as noted above, which are for my own curiosity, would it be possible to publish builds.txt to github when it's made available on repo.hex.pm? I suspect that everything that hits github is going to have better odds.
Edit: Meant to ping @wojtekmach
I'm running a test locally whereby builds.txt is fetched every 10 seconds; it will fail if the fetch fails. My thoughts here still are that for some reason github has problems fetching from hex sometimes (although I rarely see this myself). The results shall be interesting; I will let this run for, say, 12 hours (every 10 seconds).
This has run for over 12 hours without problems on my machine. This doesn't tell us what goes wrong in github ofc when hitting repo.hex.pm (fastly), but it does give some merit to the notion that the timeout events experienced are isolated to that environment. There are at least 3 possibilities here:
My curiosity is satisfied regardless. I think the best we can do is either cache builds.txt or see if the hexpm team can push those files to github on release.
or see if the hexpm team can push those files to github on release.
could you elaborate? On release of what?
could you elaborate? On release of what?
Whenever builds are updated and builds.txt is updated. It seems, for whatever reason, this is what people see fail the most. The hypothesis here is that if it were on github, then this would cease to be a problem (or be less likely anyway). Yet, that's only a hypothesis.
Got it, thanks. I don't think there is a great place on GitHub to attach these builds to, though. There are hacks like a repo that stores builds in git or a repo that stores builds in "fake" releases (i.e. releases follow OTP releases but the underlying repo doesn't actually change), but neither sounds great.
Got it, thanks. I don't think there is a great place on GitHub to attach these builds to, though. There are hacks like a repo that stores builds in git or a repo that stores builds in "fake" releases (i.e. releases follow OTP releases but the underlying repo doesn't actually change), but neither sounds great.
You will find no disagreement here, just trying to obviate caching 😄 It also doesn't fix the problem at the source. Maybe we just need to knock on github's door.
I think caching is the way to go. If anyone is able to add basic built-in build caching I think it will go a long way. Perhaps it works just on version-type: strict, so we don't need to keep hitting builds.txt. On cold cache we grab the build and builds.txt to check the checksum, cache the build, that's it. We don't even need to cache all the things this action downloads; OTP builds are by far the biggest and top priority. Elixir and Gleam builds are downloaded from GitHub releases anyway.
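A minimal sketch of what that could look like, assuming version-type: strict and the `@actions/cache` API; `downloadBuild` and `verifyChecksum` are hypothetical stand-ins for the action's real download and builds.txt checksum steps, not setup-beam internals.

```ts
import * as cache from '@actions/cache'

// Sketch only: cache the downloaded OTP build keyed by the exact (strict)
// version, so warm runs never leave GitHub's infrastructure and cold runs
// hit repo.hex.pm once.
export async function getOtpBuild(
  otpVersion: string,
  osVersion: string,
  downloadBuild: (dest: string) => Promise<void>,
  verifyChecksum: (file: string) => Promise<void> // checks against builds.txt
): Promise<string> {
  const buildPath = `${process.env.RUNNER_TEMP ?? '/tmp'}/otp-${otpVersion}.tar.gz`
  // An exact version in the key means a new version is simply a new cache
  // entry, so no TTL is needed.
  const key = `setup-beam-otp-${osVersion}-${otpVersion}`

  // Warm cache: restore from GitHub's cache backend and skip the network.
  if (await cache.restoreCache([buildPath], key)) {
    return buildPath
  }

  // Cold cache: download, verify via builds.txt, then save for next time.
  await downloadBuild(buildPath)
  await verifyChecksum(buildPath)
  await cache.saveCache([buildPath], key)
  return buildPath
}
```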
@wojtekmach I suppose so; I think we already stated a PR for this would be welcome.
There's a linked pull request with a potential fix for this, if y'all wanna test it. Instead of erlef/setup-beam@v... you can do btkostner/setup-beam@retry. It'd be important to gather feedback...
The bug
On occasion, while running the action, a request to fetch the underlying Ubuntu image times out. Due to the timeout, the action fails. To resolve the timeouts, we can attempt to increase the request timeout for fetching the Ubuntu image.
Software versions
A list of software versions where the bug is apparent, as detailed as possible:
setup-beam: @v1
ubuntu: ubuntu-22.04
How to replicate
erlef/setup-beam@v1
Expected behaviour
erlef/setup-beam@v1
Additional context