erlef / setup-beam

Set up your BEAM-based GitHub Actions workflow (Erlang, Elixir, Gleam, ...)

Action sporadically times out when fetching Ubuntu #260

Open devtayls opened 5 months ago

devtayls commented 5 months ago

The bug

On occasion, while running the action, a request to fetch the underlying Ubuntu image times out. Because of the timeout, the action fails. To resolve the timeouts, we could attempt to increase the request timeout for fetching the Ubuntu image.
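
For reference, a minimal sketch of what raising that timeout could look like, assuming the action fetches builds.txt with the @actions/http-client package from the GitHub Actions toolkit (the URL and timeout value below are illustrative, not setup-beam's actual code):

import * as httpm from '@actions/http-client'

// Sketch only: raise the socket timeout for the builds.txt request.
// The timeout value and URL here are illustrative.
async function fetchBuildsTxt(): Promise<string> {
  const client = new httpm.HttpClient('setup-beam', [], {
    socketTimeout: 120_000 // milliseconds; larger than the client's default
  })
  const response = await client.get(
    'https://builds.hex.pm/builds/otp/ubuntu-22.04/builds.txt'
  )
  return response.readBody()
}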

Software versions

A list of software versions where the bug is apparent, as detailed as possible:

How to replicate

  1. Run erlef/setup-beam@v1
  2. On occasion, the request to fetch the Ubuntu image times out
  3. On timing out, the action fails.

Expected behaviour

  1. Run erlef/setup-beam@v1
  2. The request to fetch the Ubuntu image does not time out

Additional context

[...]
Run erlef/setup-beam@v1
  with:
    version-type: strict
    version-file: .tool-versions
    github-token: ***
    install-hex: true
    install-rebar: true
    disable_problem_matchers: false
    hexpm-mirrors: https://builds.hex.pm

  env:
    MIX_ENV: test
Parsing version file at .tool-versions
  Consuming erlang at version 26.1.1
  Consuming elixir at version 1.15.5-otp-26
  ... done!
Action fetch /builds/otp/ubuntu-22.04/builds.txt failed for mirror https://builds.hex.pm/, with Error: Request timeout: /builds/otp/ubuntu-22.04/builds.txt
Error: Could not fetch /builds/otp/ubuntu-22.04/builds.txt from any hex.pm mirror

[screenshot of the error from the failed run]

agundy commented 5 months ago

This is happening almost 1 of every 5 times for us. Seems like something that might be getting rate limited or has just slowed down.

starbelly commented 5 months ago

I can't say that I've hit this particular problem. My assumption is that this is related to a hex.pm mirror rather than a bug in the code, but that's an assumption.

@agundy @devtayls is this still happening for you?

paulz commented 5 months ago

happens to us regularly:

Run erlef/setup-beam@v1
Action fetch /builds/otp/ubuntu-22.04/builds.txt failed for mirror https://builds.hex.pm/, with Error: Request timeout: /builds/otp/ubuntu-22.04/builds.txt
Error: Could not fetch /builds/otp/ubuntu-22.04/builds.txt from any hex.pm mirror

is there a way to add a retry?

pdavies commented 5 months ago

I'd say we're hitting this on about 5-10% of action invocations, multiplied by several jobs in a workflow all using the same action. It's causing a painfully high incidence of spurious CI failures.

starbelly commented 5 months ago

is there a way to add a retry?

Yes, there is retry functionality in the toolkit we use; we may also need to consider a new/better caching strategy.

starbelly commented 5 months ago

Hey all, we actually already retry :) I had to take a second to look at the code; we can increase this number and possibly make it configurable. I'll open up a PR and do some experimentation.
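
For illustration, making the retry count configurable might look roughly like the following, assuming a new action input is added for it ("fetch-retries" is an invented name, not an existing setup-beam input):

import * as core from '@actions/core'
import * as httpm from '@actions/http-client'

// Sketch only: read a hypothetical "fetch-retries" input and pass it to the
// toolkit HTTP client, which already supports allowRetries/maxRetries.
const maxRetries = parseInt(core.getInput('fetch-retries') || '3', 10)
const client = new httpm.HttpClient('setup-beam', [], {
  allowRetries: true,
  maxRetries
})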

starbelly commented 5 months ago

@ericmj or @wojtekmach do you know of issues going on with hex in regard to where the builds.txt files reside? It may be nothing on your end, and just problems inside the github runners.

starbelly commented 5 months ago

Additional problem: Their http client interface does not currently retry 408 (request timeout); instead it just throws an error. There seems to be an open issue and PR for this here

Thus, even if we increase the number of retries as part of the http client itself, it will make no difference. The only thing we can do, it seems, is wait for this to be merged and released by github, or add retry ourselves. We might be able to use their retry helper.

Here are the options at our disposal:

  1. Wait for the referenced PR above to be merged and a new cut of their toolkit to follow.
  2. Use tool-cache for fetching builds.txt, as it should handle this via a request utils package (I'm not sure this is exposed for general use though). Need to think through cache implications here though.
  3. Add our own retry functionality; it wouldn't be great (our goal should be to maintain as little javascript as possible), but if we absolutely had to, it is an option. A rough sketch of what that could look like follows below.
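
A hand-rolled retry wrapper could be as small as this sketch (a generic helper written for illustration, not existing setup-beam code):

// Hypothetical helper: retries any async fetch with linear backoff,
// including failure modes (such as socket timeouts) that the toolkit
// client does not retry on its own.
async function fetchWithRetry<T>(
  doFetch: () => Promise<T>,
  attempts = 3,
  backoffMs = 2000
): Promise<T> {
  let lastError: unknown
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await doFetch()
    } catch (err) {
      lastError = err
      if (attempt < attempts) {
        await new Promise(resolve => setTimeout(resolve, backoffMs * attempt))
      }
    }
  }
  throw lastError
}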

@paulo-ferraz-oliveira Have any thoughts on this?

Edit:

Additional note: We may have to add retry logic ourselves. Without going through what tool-cache will and will not retry on, when I looked at the error in the screenshot above and the one in #261, this appears to be a socket timeout rather than a 408. The default timeout seems to be 3 minutes; I'll see if I can replicate this to verify that's how long the operation is hanging before throwing an error.

ericmj commented 5 months ago

do you know of issues going on with hex in regard to where the builds.txt files reside? It may be nothing on your end, and just problems inside the github runners.

We are not aware of any issues other than that github has issues fetching it.

ericmj commented 5 months ago

Additional problem: Their http client interface does not currently retry 408 (request timeout); instead it just throws an error.

This 408 error is interesting, where is it coming from? We do not explicitly return it in our custom CDN code and I cannot find any documentation about Fastly returning it.

starbelly commented 5 months ago

Additional problem: Their http client interface does not currently retry 408 (request timeout); instead it just throws an error.

This 408 error is interesting, where is it coming from? We do not explicitly return it in our custom CDN code and I cannot find any documentation about Fastly returning it.

I fired off too fast here; this seems like the socket timing out rather than a 408 being returned. I could be wrong, but the http client code in github's actions toolkit reads that way.

As such, there's not much we can do besides retry ourselves.

wojtekmach commented 5 months ago

As an aside the action could have built in cache (if it already does please ignore this whole comment) for the build artifacts as well as builds.txt, and so it would mostly download stuff from gh infra as opposed to repo.hex.pm, increasing performance and reliability. I suppose builds.txt is ideally fresh to resolve versions so maybe it’d be a fallback to cache unless version matching is strict.

This does not solve the network problem directly; however, for projects with a warm cache, which I think would be a lot of projects, it kind of does.

starbelly commented 5 months ago

As an aside the action could have built in cache (if it already does please ignore this whole comment) for the build artifacts as well as builds.txt, and so it would mostly download stuff from gh infra as opposed to repo.hex.pm, increasing performance and reliability. I suppose builds.txt is ideally fresh to resolve versions so maybe it’d be a fallback to cache unless version matching is strict.

This is where my head is at. There's no build information on GH right now though? But yes, cache builds.txt; if there is a build not in what is cached, then invalidate the cache and attempt to obtain a fresh copy. We cache builds.txt for, say, 24h (it isn't updated too frequently).

wojtekmach commented 5 months ago

There's no build information on GH right now though?

could you elaborate?

We cache builds.txt for, say, 24h (it isn't updated too frequently).

Just to be clear, I meant to use the actions/cache API, and quickly looking at the docs it doesn't have a ttl option.

On second thought, caching builds.txt is maybe not such a great idea. The point is, if we already have a fresh OTP build there's no point in hitting builds.txt to resolve versions. So scratch that. The point still stands though: if we have a built-in cache we make it less likely to hit the network outside gh infra and run into issues.

starbelly commented 5 months ago

There's no build information on GH right now though?

could you elaborate?

What I meant is that the builds.txt files are not kept in the repo for bob, but I haven't looked recently; maybe that changed?

Just to be clear, I meant to use the actions/cache API, and quickly looking at the docs it doesn't have a ttl option.

On second thought, caching builds.txt is maybe not such a great idea. The point is, if we already have a fresh OTP build there's no point in hitting builds.txt to resolve versions. So scratch that. The point still stands though: if we have a built-in cache we make it less likely to hit the network outside gh infra and run into issues.

Yes, agreed, and on inspection there is no ttl option; we'd simply overwrite the cache if there is no corresponding build there. I also agree with your other point: if we already have the cached artifact, then there's no point in even looking at builds.txt, but yes, it should still be cached in order to mitigate network boo boos.

wojtekmach commented 5 months ago

Right, builds.txt is not on GitHub but on repo.hex.pm. There is a link to it on GitHub.com/hexpm/bob.

paulo-ferraz-oliveira commented 5 months ago

@paulo-ferraz-oliveira Have any thoughts on this?

@starbelly, my thoughts on this are (⚠️ unhelpful):

  1. the GH client is a mess
    1. it throws exceptions that are hard to handle
    2. it treats timeouts as a particular type of error (and IIRC the retry is broken for this - on the client, not setup-beam)
    3. sometimes exceptions are objects that can be handled, sometimes "they're just strings"
  2. a while back I tested some changes on top of the ones Eric did for timeouts and cache, but couldn't improve setup-beam much/at all
  3. we accept pull requests; if this is a hindrance for an individual or a company, we are supportive of them spending the time and energy to fix the problem 👍, especially since, when it happens, those entities have much more context to work with

lpil commented 1 month ago

Hello all! This is happening commonly for our projects. It would be really helpful if anything could be done to help with the instability of this action 🙏 Thank you

starbelly commented 1 month ago

Hello all! This is happening commonly for our projects. It would be really helpful if anything could be done to help with the instability of this action 🙏 Thank you

I'm not sure there is anything unstable about the action itself, but rather github (as I see it), and perhaps the http-client in the toolkit. Based on the conversations above, what would you like to see? Caching is the only viable course, I believe, but it has its own problems, though builds.txt should change very infrequently, assuming that is your main issue.

lpil commented 1 month ago

This action is definitely abnormally unstable. I use many actions and this is the only one that has problems, and I know other Gleam users have the same problem with this action.

starbelly commented 1 month ago

I'm running a test locally whereby builds.txt is fetched every 10 seconds; it will fail if the fetch fails. My thought here is still that, for some reason, github has problems fetching from hex sometimes (although I rarely see this myself). The results shall be interesting; I will let this run for, say, 12 hours (every 10 seconds).
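
For anyone wanting to reproduce it, a rough stand-in for that probe might look like this (assuming Node 18+ with the global fetch; the URL and 10-second interval are as described above):

// Fetch builds.txt every 10 seconds and stop on the first failure.
const url = 'https://builds.hex.pm/builds/otp/ubuntu-22.04/builds.txt'

async function probe(): Promise<void> {
  for (;;) {
    const started = Date.now()
    const response = await fetch(url)
    if (!response.ok) {
      throw new Error(`fetch failed with status ${response.status}`)
    }
    await response.text()
    console.log(`ok in ${Date.now() - started}ms`)
    await new Promise(resolve => setTimeout(resolve, 10_000))
  }
}

probe().catch(err => {
  console.error('probe failed:', err)
  process.exit(1)
})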

starbelly commented 1 month ago

Right, builds.txt is not on GitHub but on repo.hex.pm. There is a link to it on GitHub.com/hexpm/bob.

My tests aside (as noted above, they are for my own curiosity), would it be possible to publish builds.txt to github when it's made available on repo.hex.pm? I suspect that everything that hits github is going to have better odds.

Edit : Meant to ping @wojtekmach

starbelly commented 1 month ago

I'm running a test locally whereby builds.txt is fetched every 10 seconds; it will fail if the fetch fails. My thought here is still that, for some reason, github has problems fetching from hex sometimes (although I rarely see this myself). The results shall be interesting; I will let this run for, say, 12 hours (every 10 seconds).

This has run for over 12 hours without problems on my machine. This doesn't tell us what goes wrong in github ofc when hitting repo.hex.pm (fastly), but it does give some merit to the notion that the timeout events experienced are isolated to that environment. There are at least 3 possibilities here:

  1. Github has network flakiness sometimes with some external networks
  2. The http client has socket and/or protocol related bugs
  3. Fastly doesn't play well with github and it has nothing to do with github's network.

My curiosity is satisfied regardless. I think the best we can do is either cache builds.txt or have the hexpm team push those files to github on release.

wojtekmach commented 1 month ago

or have the hexpm team push those files to github on release.

could you elaborate? On release of what?

starbelly commented 1 month ago

could you elaborate? On release of what?

Whenever builds are updated, and builds.txt is updated. It seems that, for whatever reason, this is what people see fail the most. The hypothesis here is that if it were on github, then this would cease to be a problem (or be less likely anyway). Yet, that's only a hypothesis.

wojtekmach commented 1 month ago

Got it, thanks. I don't think there is a great place on GitHub to attach these builds to, though. There are hacks like a repo that stores builds in git or a repo that stores builds in "fake" releases (i.e. releases follow OTP releases but the underlying repo doesn't actually change), but neither sounds great.

starbelly commented 1 month ago

Got it, thanks. I don't think there is a great place on GitHub to attach these builds to, though. There are hacks like a repo that stores builds in git or a repo that stores builds in "fake" releases (i.e. releases follow OTP releases but the underlying repo doesn't actually change), but neither sounds great.

You will find no disagreement here, just trying to obviate caching 😄 It also doesn't fix the problem at the source. Maybe we just need to knock on github's door.

wojtekmach commented 1 month ago

I think caching is the way to go. If anyone is able to add basic built-in build caching I think it will go a long way. Perhaps it works just on version-type: strict so we don't need to keep hitting builds.txt. On cold cache we grab the build and builds.txt to check the checksum, cache the build, that's it. We don't even need to cache all things this action downloads, OTP builds are by far the biggest and top priority. Elixir and Gleam builds are downloaded from GitHub releases anyway.
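
A rough sketch of that flow, assuming the @actions/cache API (restoreCache/saveCache); the cache key format, install path, and download helper below are illustrative, not setup-beam's actual implementation:

import * as cache from '@actions/cache'

// Stand-in for the action's existing download + checksum verification step.
async function downloadAndVerifyOtp(
  version: string,
  osVersion: string,
  destDir: string
): Promise<void> {
  // ...existing setup-beam download/verify logic would live here...
}

// Sketch only: on a warm cache, restore the unpacked OTP build and skip
// builds.txt entirely; on a cold cache, download, verify, then save it.
async function installOtp(version: string, osVersion: string): Promise<void> {
  const installDir = `${process.env.RUNNER_TEMP ?? '/tmp'}/.setup-beam/otp-${version}`
  const key = `setup-beam-otp-${osVersion}-${version}`

  const hit = await cache.restoreCache([installDir], key)
  if (hit !== undefined) {
    return
  }

  await downloadAndVerifyOtp(version, osVersion, installDir)
  await cache.saveCache([installDir], key)
}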

starbelly commented 1 month ago

@wojtekmach I suppose so; I think we already stated that a PR for this would be welcome.