docker / build-push-action

GitHub Action to build and push Docker images with Buildx
https://github.com/marketplace/actions/build-and-push-docker-images
Apache License 2.0

ghcr.io: push times out unreproducibly #251

Closed junghans closed 3 years ago

junghans commented 3 years ago

From https://github.com/votca/buildenv/runs/1534243887?check_suite_focus=true, I am getting a lot of unreproducible errors when pushing a container:

#15 pushing layers 24.1s done
#15 pushing manifest for registry.gitlab.com/votca/buildenv/ubuntu:18.04
#15 pushing manifest for registry.gitlab.com/votca/buildenv/ubuntu:18.04 1.5s done
#15 pushing layers 631.9s done
#15 ERROR: failed commit on ref "layer-sha256:49507e21bb5d0cee4bce2fee3248063d1eff846974de5da2d70f7edc07af9181": no response
------
 > exporting to image:
------
failed to solve: rpc error: code = Unknown desc = failed commit on ref "layer-sha256:49507e21bb5d0cee4bce2fee3248063d1eff846974de5da2d70f7edc07af9181": no response
Error: buildx call failed with: failed to solve: rpc error: code = Unknown desc = failed commit on ref "layer-sha256:49507e21bb5d0cee4bce2fee3248063d1eff846974de5da2d70f7edc07af9181": no response

Never happened with v1.

crazy-max commented 3 years ago

@junghans Can you change the following step and let me know? Thanks:

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v1
        with:
          driver-opts: image=moby/buildkit:buildx-stable-1
          buildkitd-flags: --debug
junghans commented 3 years ago
2020-12-11T16:24:18.7837214Z #15 ERROR: failed commit on ref "layer-sha256:dbd68bb68d01ef8538e311584c0da7eebdcbab63cefa502035c8e52fb7eae24c": failed to do request: Put https://ghcr.io/v2/votca/buildenv/ubuntu/blobs/upload/ac1e0dc5-5197-4415-a443-fb7411ef858a?digest=sha256%3Adbd68bb68d01ef8538e311584c0da7eebdcbab63cefa502035c8e52fb7eae24c: read tcp 172.17.0.2:49118->140.82.113.33:443: read: connection timed out
2020-12-11T16:24:18.7841220Z ------
2020-12-11T16:24:18.7842055Z  > exporting to image:
2020-12-11T16:24:18.7843039Z ------
2020-12-11T16:24:18.7847131Z failed to solve: rpc error: code = Unknown desc = failed commit on ref "layer-sha256:dbd68bb68d01ef8538e311584c0da7eebdcbab63cefa502035c8e52fb7eae24c": failed to do request: Put https://ghcr.io/v2/votca/buildenv/ubuntu/blobs/upload/ac1e0dc5-5197-4415-a443-fb7411ef858a?digest=sha256%3Adbd68bb68d01ef8538e311584c0da7eebdcbab63cefa502035c8e52fb7eae24c: read tcp 172.17.0.2:49118->140.82.113.33:443: read: connection timed out
2020-12-11T16:24:18.9082160Z ##[error]buildx call failed with: failed to solve: rpc error: code = Unknown desc = failed commit on ref "layer-sha256:dbd68bb68d01ef8538e311584c0da7eebdcbab63cefa502035c8e52fb7eae24c": failed to do request: Put https://ghcr.io/v2/votca/buildenv/ubuntu/blobs/upload/ac1e0dc5-5197-4415-a443-fb7411ef858a?digest=sha256%3Adbd68bb68d01ef8538e311584c0da7eebdcbab63cefa502035c8e52fb7eae24c: read tcp 172.17.0.2:49118->140.82.113.33:443: read: connection timed out

From https://github.com/votca/buildenv/runs/1538301715?check_suite_focus=true

crazy-max commented 3 years ago

@junghans Looks like an issue with GHCR (cc @clarkbw).

clarkbw commented 3 years ago

We'll take a look but someone might not get to it until Monday.

davidski commented 3 years ago

I'm having an issue that may be related. I'm trying to push an image to AWS ECR and receiving the following error:

2020-12-14T04:15:17.0475346Z #9 ERROR: unexpected status: 401 Unauthorized
2020-12-14T04:15:17.0476491Z ------
2020-12-14T04:15:17.0476886Z  > exporting to image:
2020-12-14T04:15:17.0477377Z ------
2020-12-14T04:15:17.0477987Z failed to solve: rpc error: code = Unknown desc = unexpected status: 401 Unauthorized
2020-12-14T04:15:17.0679853Z ##[error]buildx call failed with: failed to solve: rpc error: code = Unknown desc = unexpected status: 401 Unauthorized

I can confirm that the credentials work for pushing to this ECR repo from a local instance. The login-action succeeds without error. Full disclosure, this is my first attempt using this action for pushing to ECR. :wink: Pushing to GHCR has worked smoothly for me, though.

If this doesn't look related, please do let me know and I'll work on creating a clean separate issue for review. Thanks!

crazy-max commented 3 years ago

@davidski Yes please create another bug report as this one is related to GHCR.

markphelps commented 3 years ago

@junghans I took a look on our end in the GHCR logs and it looks like your layer was uploaded successfully.

GHCR received the request at 2020-12-11T16:13:52Z and it looks like it was completed by us at 2020-12-11T16:14:27Z

@crazy-max Is there some kind of 30s timeout in the request that buildx uses when pushing a layer? The presence of no response and read: connection timed out leads me to believe this is a client-side timeout?

crazy-max commented 3 years ago

@markphelps Yes there is a default timeout of 30s but this one looks like a networking issue.

@junghans Can you try with the latest stable release of buildx please? (v0.5.1 atm):

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v1
        with:
          version: latest
          buildkitd-flags: --debug
junghans commented 3 years ago

Different image, but still fails:

2020-12-15T13:52:52.6570268Z #16 [auth] votca/buildenv/fedora:pull,push token for ghcr.io
2020-12-15T13:52:52.6571975Z #16 sha256:e8541c1073feec4e25a6fc2c2959aa1538edc7e6d2dfb710d10730ded1df9309
2020-12-15T13:52:52.6573387Z #16 DONE 0.0s
2020-12-15T13:52:52.8064217Z 
2020-12-15T13:52:52.8066189Z #14 exporting to image
2020-12-15T13:52:52.8067223Z #14 sha256:e8c613e07b0b7ff33893b694f7759a10d42e180f2b4dc349fb57dc6b71dcab00
2020-12-15T14:03:30.9576987Z #14 pushing layers 638.5s done
2020-12-15T14:03:30.9581863Z #14 ERROR: failed commit on ref "layer-sha256:7d57542efffe8baa32825f71430bd49f142a0ac4e6fdc42c77b0e654f0381ac8": failed to do request: Put https://ghcr.io/v2/votca/buildenv/fedora/blobs/upload/e0620cbc-8822-4325-bac6-5d00317c0d1b?digest=sha256%3A7d57542efffe8baa32825f71430bd49f142a0ac4e6fdc42c77b0e654f0381ac8: read tcp 172.17.0.2:44330->140.82.112.33:443: read: connection timed out

(https://github.com/votca/buildenv/runs/1557317848?check_suite_focus=true)

crazy-max commented 3 years ago

@junghans Ok I think I got it, that's because you can only have a single namespace with GHCR. So ghcr.io/votca/buildenv/fedora:latest will not work here. Try instead with ghcr.io/votca/buildenv-fedora:latest. See https://github.com/docker/build-push-action/issues/115#issuecomment-689831379

junghans commented 3 years ago

Huh? It works with build-push-action@v1 and the image shows up here: https://github.com/orgs/votca/packages/container/package/buildenv%2Ffedora

clarkbw commented 3 years ago

So ghcr.io/votca/buildenv/fedora:latest will not work here. Try instead with ghcr.io/votca/buildenv-fedora:latest

Odd, this should work with GHCR. We don't have a limit on namespaces.

crazy-max commented 3 years ago

Odd, this should work with GHCR. We don't have a limit on namespaces.

My bad, there is actually no limit, yes!

junghans commented 3 years ago

Any news on this?

crazy-max commented 3 years ago

@junghans I've run some tests with several buildx- and containerd-related options, and the following combination works for me (version: latest is buildx v0.5.1 atm):

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v1
        with:
          version: latest
          buildkitd-flags: --debug

junghans commented 3 years ago

Still failing: https://github.com/votca/buildenv/actions/runs/430519742, is there anything else you changed?

crazy-max commented 3 years ago

@junghans Nothing special that could change the current behavior actually :thinking:

junghans commented 3 years ago

Could it be a race condition, as all jobs run simultaneously?

crazy-max commented 3 years ago

@junghans I don't think so but can you set max-parallel: 4 in your job's strategy?
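
A minimal sketch of where that setting lives, assuming a matrix build (the job and matrix names are placeholders):

jobs:
  build:
    strategy:
      max-parallel: 4
      matrix:
        distro: [fedora, ubuntu, opensuse]
    runs-on: ubuntu-latest

max-parallel caps how many matrix jobs run concurrently, which in turn limits simultaneous pushes to the registry.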

junghans commented 3 years ago

Actually that worked!?

crazy-max commented 3 years ago

@junghans That's strange, maybe abuse prevention on bandwidth usage on GitHub Container Registry. Your images are quite huge (fedora:intel ~10GB). WDYT @markphelps?

junghans commented 3 years ago

Yeah, the intel compiler is just huge.....

junghans commented 3 years ago

Now it fails even when running sequentially: https://github.com/votca/buildenv/runs/1609439618?check_suite_focus=true

koppor commented 3 years ago

Same issue here with a 2.2GB image. I might be mistaken, but isn't rsync a possible technical solution for the transfer? In a bash world, I would use rsync -avz --progress --partial in a loop, as sketched below (see also https://serverfault.com/a/98750/107832). There are also BSD-licensed implementations available: https://github.com/kristapsdz/openrsync. I think this could be a way to transfer large files reliably and without wasting bandwidth on retries.
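
For illustration only, a hypothetical workflow step showing the retry-with-partial-transfer pattern described above; rsync does not speak the registry protocol, and image.tar and backup-host are placeholders:

      - name: Upload tarball with resumable retries
        run: |
          # --partial keeps partially transferred files, so a retry resumes
          # where it left off instead of re-sending the whole archive.
          until rsync -avz --progress --partial ./image.tar backup-host:/srv/images/; do
            echo "transfer interrupted, retrying..."
            sleep 10
          done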

markphelps commented 3 years ago

@junghans That's strange, maybe abuse prevention on bandwidth usage on GitHub Container Registry. Your images are quite huge (fedora:intel ~10GB). WDYT @markphelps?

There shouldn't be an issue with abuse prevention or bandwidth usage, but I will look into this more, and @koppor's issue as well, tomorrow (just got back from vacation today).

@koppor do you have an example of your push to ghcr failing I can see?

shepmaster commented 3 years ago

@markphelps I’ve a number of them. A recent one

koppor commented 3 years ago

@koppor do you have an example of your push to ghcr failing I can see?

Sure: https://github.com/dante-ev/docker-texlive/pull/18/checks:

Scroll down to line 10517:

------
 > exporting to image:
------
failed to solve: rpc error: code = Unknown desc = failed commit on ref "layer-sha256:7965e1fc3104b399bf7b2dacb209de86494b45aba89cd0b87a813584067f40f1": failed to do request: Put ghcr.io/v2/dante-ev/texlive/blobs/upload/fb0545fd-10da-4b3f-aad0-578ded9514c0?digest=sha256%3A7965e1fc3104b399bf7b2dacb209de86494b45aba89cd0b87a813584067f40f1: read tcp 172.17.0.2:58002->140.82.113.33:443: read: connection timed out
Error: buildx call failed with: failed to solve: rpc error: code = Unknown desc = failed commit on ref "layer-sha256:7965e1fc3104b399bf7b2dacb209de86494b45aba89cd0b87a813584067f40f1": failed to do request: Put ghcr.io/v2/dante-ev/texlive/blobs/upload/fb0545fd-10da-4b3f-aad0-578ded9514c0?digest=sha256%3A7965e1fc3104b399bf7b2dacb209de86494b45aba89cd0b87a813584067f40f1: read tcp 172.17.0.2:58002->140.82.113.33:443: read: connection timed out

A later run succeeded though. Maybe some intermediate layers had already been uploaded successfully at the first try?

https://github.com/dante-ev/docker-texlive/runs/1646525219?check_suite_focus=true:

#19 exporting to image
#19 exporting layers
#19 exporting layers 335.7s done
#19 exporting manifest sha256:0a7d487f9857a0d12dce2e9f8486d25e3767d7522c87f5618725323906555c67 0.0s done
#19 exporting config sha256:79d92fa1bb795807e096b74b1d112b9f9e7608acfde2a2544e7b5b1c2b74c4f1 done
#19 pushing layers
#19 pushing layers 80.8s done
#19 pushing manifest for docker.io/danteev/texlive:latest
#19 pushing manifest for docker.io/danteev/texlive:latest 0.4s done
#19 pushing layers 63.2s done
#19 pushing manifest for ghcr.io/dante-ev/texlive:latest
#19 pushing manifest for ghcr.io/dante-ev/texlive:latest 0.9s done
#19 DONE 481.1s

BTW: Welcome back from holiday.

markphelps commented 3 years ago

Ok, so I did some digging in our logs and have a good idea about what is happening with @koppor's request, and I think it is similar to what I mentioned earlier about @junghans.

Here is the gist of our logs:

  1. Buildx does a HEAD request to see if the blob already exists:
time="2021-01-04T21:05:31Z"  method=HEAD request_id="0404:7452:112C3E:28E5B9:5FF3831B" status=404 url="/v2/dante-ev/texlive/blobs/sha256:7965e1fc3104b399bf7b2dacb209de86494b45aba89cd0b87a813584067f40f1" user_agent=containerd/1.4.0+unknown

It does not, so we return a 404 Not Found.

  2. Buildx initiates the blob upload:
time="2021-01-04T21:05:32Z" content_length=2225142670 method=PUT request_id="0404:7452:112C41:28E5BD:5FF3831C" url="/v2/dante-ev/texlive/blobs/upload/fb0545fd-10da-4b3f-aad0-578ded9514c0?digest=sha256%3A7965e1fc3104b399bf7b2dacb209de86494b45aba89cd0b87a813584067f40f1" user_agent=containerd/1.4.0+unknown

  3. We (GHCR) process the blob and store it. Then we respond with a 201 Created.

These logs are from our edge load balancer:

{"@timestamp":"2021-01-04T21:06:40+0000","method":"PUT","request_id":"0404:7452:112C41:28E5BD:5FF3831C","status":201,"ta":68075,"term_state":"----","ubytes":2225143062,"uri":"/v2/dante-ev/texlive/blobs/upload/fb0545fd-10da-4b3f-aad0-578ded9514c0?digest=sha256%3A7965e1fc3104b399bf7b2dacb209de86494b45aba89cd0b87a813584067f40f1","user_agent":"containerd/1.4.0+unknown"}

You can verify that Steps 2-3 above are the same request via the request_id 0404:7452:112C41:28E5BD:5FF3831C.

"ta":68075 in the above log means time active (from request to response), which is 68075 ms, or 68.075 seconds. This matches up with the logs, as the start of the upload was at 21:05:32 and we responded at 21:06:40, a difference of 68 seconds.

This leads me to believe it is a client timeout issue, per @crazy-max's reply here:

@markphelps Yes there is a default timeout of 30s but this one looks like a networking issue.

@crazy-max, if buildx does not receive a response from the registry server within 30s, does that mean it will cancel the request context? If so, I think this would result in what @koppor, @junghans and @shepmaster are seeing for 'large' image layers, if they take over 30s to upload.

Perhaps we are missing a keep-alive header on our end or some other value to inform buildx not to close the connection for long uploads?
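
For reference, the exchange described above follows the standard registry blob-upload flow (HEAD to check existence, POST to open an upload session, PUT to commit the blob). A rough sketch of probing the same flow by hand from a debug step; $TOKEN, OWNER/IMAGE, DIGEST and layer.tar.gz are all placeholders:

      - name: Probe a blob upload by hand (debugging sketch)
        run: |
          # 1. Does the blob already exist? 200 = yes, 404 = no.
          curl -sI -H "Authorization: Bearer $TOKEN" \
            "https://ghcr.io/v2/OWNER/IMAGE/blobs/sha256:DIGEST"
          # 2. Open an upload session; the Location response header
          #    is the upload URL used in the next step.
          curl -si -X POST -H "Authorization: Bearer $TOKEN" \
            "https://ghcr.io/v2/OWNER/IMAGE/blobs/uploads/"
          # 3. PUT the layer to that Location with the digest appended;
          #    201 Created means the registry stored it, as in GHCR's logs above.
          curl -s -X PUT -H "Authorization: Bearer $TOKEN" \
            -H "Content-Type: application/octet-stream" \
            --data-binary @layer.tar.gz \
            "UPLOAD_LOCATION_FROM_STEP_2?digest=sha256:DIGEST"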

jens-maus commented 3 years ago

After setting up our new ghcr.io registry push using build-push-action, it seems I came across the same issue:

https://github.com/jens-maus/RaspberryMatic/runs/1652419446?check_suite_focus=true

There, my final build-push-action is rejected with a similar 172.17.0.2:36770->140.82.114.34:443: read: connection timed out error message:

Error: buildx call failed with: failed to solve: rpc error: code = Unknown desc = failed commit on ref "layer-sha256:ab755bdbc94398e3a07906395be4cc0c207f103fef71a3289b6af30f356b5ae9": failed to do request: Put https://ghcr.io/v2/jens-maus/raspberrymatic/blobs/upload/6746f9c5-3f26-45d9-b10c-db36c2f0b779?digest=sha256%3Aab755bdbc94398e3a07906395be4cc0c207f103fef71a3289b6af30f356b5ae9: read tcp 172.17.0.2:36770->140.82.114.34:443: read: connection timed out

So what is the current state of affairs regarding this issue? I am a new user of build-push-action and expected it to run out of the box with ghcr.io. This is also the first time I am trying to push to ghcr.io in that workflow.

EDIT: I could reliably reproduce the issue with a dummy workflow: uploading only a small package using build-push-action works: https://github.com/jens-maus/RaspberryMatic/runs/1652792879?check_suite_focus=true#step:11:70. Generating a several-hundred-MB package the exact same way then breaks with the connection timed out message: https://github.com/jens-maus/RaspberryMatic/runs/1652839402?check_suite_focus=true

shepmaster commented 3 years ago

@jens-maus please read the comment immediately preceding yours for the "current state".

clarkbw commented 3 years ago

Perhaps we are missing a keep-alive header on our end or some other value to inform buildx not to close the connection for long uploads?

@thaJeztah any idea about this? ☝️

thaJeztah commented 3 years ago

Hm... not sure; @AkihiroSuda @tonistiigi any idea if that could be the cause?

junghans commented 3 years ago

Any news on this?

Eifoen commented 3 years ago

Encountered the same problem today with my texlive image (which is quite large as well):

https://github.com/Eifoen/alpine-texlive/actions/runs/489938545

Nuru commented 3 years ago

I am having the same (or very similar) issue pushing to docker.io using docker/setup-buildx-action@v1 and docker/build-push-action@v2. It seems the whole thing fails if it takes more than 30 seconds to push a layer (which it shouldn't; the layers are not that big).

Eifoen commented 3 years ago

My job is on a schedule every Sunday, so tonight it did finish successfully (same commit that had failed after the push trigger). Makes me think this problem is related to the current load on ghcr.

turgs commented 3 years ago

I've been having this issue trying to push to GHCR also. I don't understand why, but I added max-parallel as suggested above to the job in my workflow and it suddenly worked. Now I'm not sure if it was just coincidence! :tada:

on:
  push:
    branches: [master]
jobs:
  push_job:
+   strategy:
+     max-parallel: 4
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
Eifoen commented 3 years ago

My job is on a schedule every Sunday, so tonight it did finish successfully (same commit that had failed after the push trigger). Makes me think this problem is related to the current load on ghcr.

Edit: The last couple of scheduled runs worked flawlessly.

junghans commented 3 years ago

This is still broken for me, even with max-parallel: 1, and reproducible on every build. See some example GitHub Actions runs that failed here: https://github.com/votca/buildenv/actions/runs/604675018, https://github.com/votca/buildenv/actions/runs/604311273 and https://github.com/votca/buildenv/actions/runs/604131009. The latest is using https://github.com/votca/buildenv/blob/5cf4b17a66851587355164c458b742f903678edd/.github/workflows/continuous-integration-workflow.yml

@crazy-max @clarkbw any ETA for a solution?

Background: We cannot use build-push-action@v1 for opensuse/tumbleweed:latest anymore due to https://www.reddit.com/r/openSUSE/comments/lm48zo/install_git_in_docker_tumbleweed/, which boils down to a bug in an old libseccomp. However, the GitHub Actions runners run said old version of libseccomp, and hence openSUSE needs to build with --security-opt seccomp:unconfined, which doesn't seem to be possible to inject into build-push-action@v1.

tomy0000000 commented 3 years ago

I'm building many images with a matrix, and having a similar issue when pushing the exported images to ghcr.io, even with max-parallel.

My action runs automatically upon commit, and I'll manually tag the commit to kick off another workflow for pushing the image. Therefore, I'm fairly certain this has got to be an issue on the ghcr.io server side.

Notice that the following three commits run on an identical Dockerfile, and the timeout issue isn't always reproducible.

                        wo/ Push    w/ Push
  max-parallel: 2       Action      Action
  max-parallel: 4       Action      Action
  without max-parallel  Action      Action
markphelps commented 3 years ago

Sorry for the radio silence. We've made some changes in ghcr.io that should hopefully help alleviate these concurrency issues. Would you mind letting me know if your parallel builds succeed?

hiaselhans commented 3 years ago

hi @markphelps

ran into the same problem today uploading a rather big but not huge (~2GB) image to ghcr: https://github.com/airgproducts/llvm/runs/2335603402?check_suite_focus=true

jessfraz commented 3 years ago

As a stop gap, is there any way to have a retry-build-on-error functionality?

crazy-max commented 3 years ago

Hi @jessfraz, is it about this workflow?

tonistiigi commented 3 years ago

As a stop gap, is there any way to have a retry-build-on-error functionality?

There is a push retry in the latest BuildKit: https://github.com/moby/buildkit/blob/master/util/resolver/retryhandler/retry.go#L51

The original "no response" error reported was fixed in https://github.com/containerd/containerd/pull/4724
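
Until a release with those fixes ships, a crude workflow-level stop gap is to retry the push itself; a sketch assuming the image has already been built and loaded into the local daemon (ghcr.io/OWNER/IMAGE:TAG is a placeholder):

      - name: Push with retries
        run: |
          # Retry up to 3 times; layers that already made it to the registry
          # are skipped on later attempts, so retries are cheap.
          for i in 1 2 3; do
            if docker push ghcr.io/OWNER/IMAGE:TAG; then exit 0; fi
            echo "push failed, retrying ($i/3)..."
            sleep 30
          done
          exit 1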

jessfraz commented 3 years ago

Ah sorry, it's a private repo. Also, looking more into this, it is not "no response".

Basically 1 in 50 builds errors out with either a 403 Forbidden or a 503 service error; happy to make a different issue. Maybe a race or something?

tonistiigi commented 3 years ago

A 503 retry handler was added recently. It is in master but not in the latest release yet.

A 403 is weird though and could be a misbehaving registry. Problems with auth should return 401.

crazy-max commented 3 years ago

@tonistiigi

A 403 is weird though and could be a misbehaving registry. Problems with auth should return 401.

Yes agree with that.

@jessfraz Do you use the GITHUB_TOKEN or a PAT to log in to ghcr?
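
For context, logging in to ghcr.io with the built-in token typically looks like the following (a sketch of the usual docker/login-action step, not taken from this thread):

      - name: Login to GitHub Container Registry
        uses: docker/login-action@v1
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}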

markphelps commented 3 years ago

hi @markphelps

ran into the same problem today uploading a rather big but not huge (~2GB) image to ghcr: https://github.com/airgproducts/llvm/runs/2335603402?check_suite_focus=true

@hiaselhans

Upon looking for your request that errored out in your linked workflow run, I see this error:

 #26 ERROR: failed commit on ref "layer-sha256:db2b5fcf76f8fb16af3c060cac85adef71b061210be3618e2738f388f6d85953": failed to do request: Put https://ghcr.io/v2/airgproducts/llvm/blobs/upload/514e25f9-7efa-421b-9a2a-acbabde7bbdd?digest=sha256%3Adb2b5fcf76f8fb16af3c060cac85adef71b061210be3618e2738f388f6d85953: read tcp 172.17.0.2:48072->140.82.114.33:443: read: connection timed out
------

This still makes me think this is a client-side (buildx) timeout issue. I looked in our load balancer logs for the request to upload this layer, and see that it (our server) did in fact respond with a 201 Created status:

query":"?digest=sha256%3Adb2b5fcf76f8fb16af3c060cac85adef71b061210be3618e2738f388f6d85953","recv_timestamp":"2021-04-13T19:08:43+0000","status":201,"ta":19130,"tc":0,"td":0,"term_state":"----","th":0,"ti":0,"tq":0,"tr":19130,"trb":0,"tt":19130,"tw":0,"type":"syslog","ubytes":223072382,"uri":"/v2/airgproducts/llvm/blobs/upload/514e25f9-7efa-421b-9a2a-acbabde7bbdd?digest=sha256%3Adb2b5fcf76f8fb16af3c060cac85adef71b061210be3618e2738f388f6d85953","user":"-","user_agent":"containerd/1.4.0+unknown"}

Note the fields tt: 19130 and tr: 19130, which are the total time in milliseconds elapsed between the accept and the last close, and the total time in milliseconds spent waiting for the server to send a full HTTP response (not counting data), respectively.

So it looks like the server (GHCR) spent 19 seconds before returning the 201 response for that layer upload. This leads me to think there is an issue client side, timing out if it does not receive a response within a certain period of time, or perhaps a networking issue (on which side I do not know yet).

jessfraz commented 3 years ago

Do you use the GITHUB_TOKEN or a PAT to log in to ghcr?

@crazy-max I am using a PAT but can switch to GITHUB_TOKEN since I turned that on as well!