@junghans Can you change the following step and let me know? Thanks:
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v1
  with:
    driver-opts: image=moby/buildkit:buildx-stable-1
    buildkitd-flags: --debug
2020-12-11T16:24:18.7837214Z #15 ERROR: failed commit on ref "layer-sha256:dbd68bb68d01ef8538e311584c0da7eebdcbab63cefa502035c8e52fb7eae24c": failed to do request: Put https://ghcr.io/v2/votca/buildenv/ubuntu/blobs/upload/ac1e0dc5-5197-4415-a443-fb7411ef858a?digest=sha256%3Adbd68bb68d01ef8538e311584c0da7eebdcbab63cefa502035c8e52fb7eae24c: read tcp 172.17.0.2:49118->140.82.113.33:443: read: connection timed out
2020-12-11T16:24:18.7841220Z ------
2020-12-11T16:24:18.7842055Z > exporting to image:
2020-12-11T16:24:18.7843039Z ------
2020-12-11T16:24:18.7847131Z failed to solve: rpc error: code = Unknown desc = failed commit on ref "layer-sha256:dbd68bb68d01ef8538e311584c0da7eebdcbab63cefa502035c8e52fb7eae24c": failed to do request: Put https://ghcr.io/v2/votca/buildenv/ubuntu/blobs/upload/ac1e0dc5-5197-4415-a443-fb7411ef858a?digest=sha256%3Adbd68bb68d01ef8538e311584c0da7eebdcbab63cefa502035c8e52fb7eae24c: read tcp 172.17.0.2:49118->140.82.113.33:443: read: connection timed out
2020-12-11T16:24:18.9082160Z ##[error]buildx call failed with: failed to solve: rpc error: code = Unknown desc = failed commit on ref "layer-sha256:dbd68bb68d01ef8538e311584c0da7eebdcbab63cefa502035c8e52fb7eae24c": failed to do request: Put https://ghcr.io/v2/votca/buildenv/ubuntu/blobs/upload/ac1e0dc5-5197-4415-a443-fb7411ef858a?digest=sha256%3Adbd68bb68d01ef8538e311584c0da7eebdcbab63cefa502035c8e52fb7eae24c: read tcp 172.17.0.2:49118->140.82.113.33:443: read: connection timed out
From https://github.com/votca/buildenv/runs/1538301715?check_suite_focus=true
@junghans Looks like an issue with GHCR (cc @clarkbw).
We'll take a look but someone might not get to it until Monday.
I'm having an issue that may be related. I'm trying to push an image to AWS ECR and receiving the following error:
2020-12-14T04:15:17.0475346Z #9 ERROR: unexpected status: 401 Unauthorized
2020-12-14T04:15:17.0476491Z ------
2020-12-14T04:15:17.0476886Z > exporting to image:
2020-12-14T04:15:17.0477377Z ------
2020-12-14T04:15:17.0477987Z failed to solve: rpc error: code = Unknown desc = unexpected status: 401 Unauthorized
2020-12-14T04:15:17.0679853Z ##[error]buildx call failed with: failed to solve: rpc error: code = Unknown desc = unexpected status: 401 Unauthorized
I can confirm that the credentials work for pushing to this ECR repo from a local instance. The login-action succeeds without error. Full disclosure, this is my first attempt using this action for pushing to ECR. :wink: Pushing to GHCR has worked smoothly for me, though.
If this doesn't look related, please do let me know and I'll work on creating a clean separate issue for review. Thanks!
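For context, a minimal ECR login-and-push setup with these actions looks roughly like the sketch below; the account ID, region, repository and secret names are placeholders, not the reporter's actual values:
      # Sketch only: account ID, region, repository and secret names are placeholders.
      - name: Login to Amazon ECR
        uses: docker/login-action@v1
        with:
          registry: 123456789012.dkr.ecr.us-east-1.amazonaws.com
          username: ${{ secrets.AWS_ACCESS_KEY_ID }}
          password: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - name: Build and push to ECR
        uses: docker/build-push-action@v2
        with:
          push: true
          tags: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest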
@davidski Yes please create another bug report as this one is related to GHCR.
@junghans I took a look on our end in the GHCR logs and it looks like your layer was uploaded successfully.
GHCR received the request at 2020-12-11T16:13:52Z and it looks like it was completed by us at 2020-12-11T16:14:27Z.
@crazy-max Is there some kind of 30s timeout in the request that buildx uses when pushing a layer? The presence of no response and read: connection timed out leads me to believe this is a client-side timeout?
@markphelps Yes there is a default timeout of 30s but this one looks like a networking issue.
@junghans Can you try with the latest stable release of buildx please? (v0.5.1 atm):
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v1
  with:
    version: latest
    buildkitd-flags: --debug
Different image, but still fails:
2020-12-15T13:52:52.6570268Z #16 [auth] votca/buildenv/fedora:pull,push token for ghcr.io
2020-12-15T13:52:52.6571975Z #16 sha256:e8541c1073feec4e25a6fc2c2959aa1538edc7e6d2dfb710d10730ded1df9309
2020-12-15T13:52:52.6573387Z #16 DONE 0.0s
2020-12-15T13:52:52.8064217Z
2020-12-15T13:52:52.8066189Z #14 exporting to image
2020-12-15T13:52:52.8067223Z #14 sha256:e8c613e07b0b7ff33893b694f7759a10d42e180f2b4dc349fb57dc6b71dcab00
2020-12-15T14:03:30.9576987Z #14 pushing layers 638.5s done
2020-12-15T14:03:30.9581863Z #14 ERROR: failed commit on ref "layer-sha256:7d57542efffe8baa32825f71430bd49f142a0ac4e6fdc42c77b0e654f0381ac8": failed to do request: Put https://ghcr.io/v2/votca/buildenv/fedora/blobs/upload/e0620cbc-8822-4325-bac6-5d00317c0d1b?digest=sha256%3A7d57542efffe8baa32825f71430bd49f142a0ac4e6fdc42c77b0e654f0381ac8: read tcp 172.17.0.2:44330->140.82.112.33:443: read: connection timed out
(https://github.com/votca/buildenv/runs/1557317848?check_suite_focus=true)
@junghans Ok I think I got it, that's because you can only have a single namespace with GHCR. So ghcr.io/votca/buildenv/fedora:latest will not work here. Try instead with ghcr.io/votca/buildenv-fedora:latest. See https://github.com/docker/build-push-action/issues/115#issuecomment-689831379
Huh? It works with build-push-action@v1 and the image shows up here: https://github.com/orgs/votca/packages/container/package/buildenv%2Ffedora
So ghcr.io/votca/buildenv/fedora:latest will not work here. Try instead with ghcr.io/votca/buildenv-fedora:latest
Odd, this should work with GHCR. We don't have a limit on namespaces.
Odd, this should work with GHCR. We don't have a limit on namespaces.
My bad, there is no limit actually, yes!
Any news on this?
@junghans I've made some tests with several options linked to buildx and containerd and the following combination works for me (version: latest is buildx v0.5.1 atm):
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v1
  with:
    version: latest
    buildkitd-flags: --debug
Still failing: https://github.com/votca/buildenv/actions/runs/430519742, is there anything else you changed?
@junghans Nothing special that could change the current behavior actually :thinking:
Could it be a race condition, as all jobs run simultaneously?
@junghans I don't think so but can you set max-parallel: 4 in your job's strategy?
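For reference, a rough sketch of where max-parallel sits when jobs come from a build matrix; the matrix entries below are hypothetical stand-ins, not the actual votca matrix:
jobs:
  build:
    strategy:
      # Cap how many matrix jobs (and therefore concurrent pushes) run at once; sketch only.
      max-parallel: 4
      matrix:
        distro: [fedora, ubuntu, opensuse]  # hypothetical entries
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2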
Actually that worked!?
@junghans That's strange, maybe an abuse prevention on bandwidth usage on GitHub Container Registry. Your images are quite huge (fedora:intel ~10GB). WDYT @markphelps?
Yeah, the intel compiler is just huge.....
Now it fails even when running sequentially: https://github.com/votca/buildenv/runs/1609439618?check_suite_focus=true
Same issue here with a 2.2GB image. I might be mistaken, but isn't rsync a possible technical solution for the transfer? In a bash world, I would use rsync -avz --progress --partial in a loop (see also https://serverfault.com/a/98750/107832). There are also BSD-licensed implementations available: https://github.com/kristapsdz/openrsync. I think this could be a way to transfer large files reliably and without wasting bandwidth on retries.
@junghans That's strange, maybe an abuse prevention on bandwidth usage on GitHub Container Registry. Your images are quite huge (fedora:intel ~10GB). WDYT @markphelps?
There shouldn't be an issue with abuse prevention or bandwidth usage, but I will look into this more, and @koppor's issue as well, tomorrow (just got back from vacation today).
@koppor do you have an example of your push to ghcr failing I can see?
@markphelps I’ve a number of them. A recent one
@koppor do you have an example of your push to ghcr failing I can see?
Sure: https://github.com/dante-ev/docker-texlive/pull/18/checks:
Scroll down to line 10517:
------
> exporting to image:
------
failed to solve: rpc error: code = Unknown desc = failed commit on ref "layer-sha256:7965e1fc3104b399bf7b2dacb209de86494b45aba89cd0b87a813584067f40f1": failed to do request: Put ghcr.io/v2/dante-ev/texlive/blobs/upload/fb0545fd-10da-4b3f-aad0-578ded9514c0?digest=sha256%3A7965e1fc3104b399bf7b2dacb209de86494b45aba89cd0b87a813584067f40f1: read tcp 172.17.0.2:58002->140.82.113.33:443: read: connection timed out
Error: buildx call failed with: failed to solve: rpc error: code = Unknown desc = failed commit on ref "layer-sha256:7965e1fc3104b399bf7b2dacb209de86494b45aba89cd0b87a813584067f40f1": failed to do request: Put ghcr.io/v2/dante-ev/texlive/blobs/upload/fb0545fd-10da-4b3f-aad0-578ded9514c0?digest=sha256%3A7965e1fc3104b399bf7b2dacb209de86494b45aba89cd0b87a813584067f40f1: read tcp 172.17.0.2:58002->140.82.113.33:443: read: connection timed out
A later run succeeded, though. Maybe some intermediate layers had already been uploaded successfully on the first try?
https://github.com/dante-ev/docker-texlive/runs/1646525219?check_suite_focus=true:
#19 exporting to image
#19 exporting layers
#19 exporting layers 335.7s done
#19 exporting manifest sha256:0a7d487f9857a0d12dce2e9f8486d25e3767d7522c87f5618725323906555c67 0.0s done
#19 exporting config sha256:79d92fa1bb795807e096b74b1d112b9f9e7608acfde2a2544e7b5b1c2b74c4f1 done
#19 pushing layers
#19 pushing layers 80.8s done
#19 pushing manifest for docker.io/danteev/texlive:latest
#19 pushing manifest for docker.io/danteev/texlive:latest 0.4s done
#19 pushing layers 63.2s done
#19 pushing manifest for ghcr.io/dante-ev/texlive:latest
#19 pushing manifest for ghcr.io/dante-ev/texlive:latest 0.9s done
#19 DONE 481.1s
BTW: Welcome back from holiday.
Ok, so I did some digging in our logs and have a good idea about what is happening with @koppor's request, and I think it is similar to what I mentioned earlier about @junghans.
Here is the gist of our logs:
time="2021-01-04T21:05:31Z" method=HEAD request_id="0404:7452:112C3E:28E5B9:5FF3831B" status=404 url="/v2/dante-ev/texlive/blobs/sha256:7965e1fc3104b399bf7b2dacb209de86494b45aba89cd0b87a813584067f40f1" user_agent=containerd/1.4.0+unknown
It does not, so we return a 404 Not Found
time="2021-01-04T21:05:32Z" content_length=2225142670 method=PUT request_id="0404:7452:112C41:28E5BD:5FF3831C" url="/v2/dante-ev/texlive/blobs/upload/fb0545fd-10da-4b3f-aad0-578ded9514c0?digest=sha256%3A7965e1fc3104b399bf7b2dacb209de86494b45aba89cd0b87a813584067f40f1" user_agent=containerd/1.4.0+unknown
201 Created
. These logs are from our edge load balancer:
{"@timestamp":"2021-01-04T21:06:40+0000””,method":"PUT",”request_id":"0404:7452:112C41:28E5BD:5FF3831C””,status":201,"ta":68075,"term_state":"----","ubytes":2225143062,"uri":"/v2/dante-ev/texlive/blobs/upload/fb0545fd-10da-4b3f-aad0-578ded9514c0?digest=sha256%3A7965e1fc3104b399bf7b2dacb209de86494b45aba89cd0b87a813584067f40f1","user_agent":"containerd/1.4.0+unknown"}
You can verify that Steps 2-3 above are the same request via the request_id 0404:7452:112C41:28E5BD:5FF3831C.
"ta":68075
in the above log means time active
(from request to response) which is 68075 ms
or 68.075 seconds
. This matches up to the logs as the start of the upload was at 21:05:32
and we responded at 21:06:40
which is a difference of 68 seconds
This leads me to believe it is a client timeout issue, per @crazy-max's reply here:
@markphelps Yes there is a default timeout of 30s but this one looks like a networking issue.
@crazy-max, if buildx does not receive a response in 30s from the registry server, does that mean it will cancel the request context? If so, I think this would result in what @koppor, @junghans and @shepmaster are seeing for 'large' image layers if they take over 30s to upload.
Perhaps we are missing a keep-alive header on our end or some other value to inform buildx not to close the connection for long uploads?
After having set up our new ghcr.io registry push using build-push-action, it seems I came across the same issue:
https://github.com/jens-maus/RaspberryMatic/runs/1652419446?check_suite_focus=true
There, my final build-push-action is rejected with a similar 172.17.0.2:36770->140.82.114.34:443: read: connection timed out error message:
Error: buildx call failed with: failed to solve: rpc error: code = Unknown desc = failed commit on ref "layer-sha256:ab755bdbc94398e3a07906395be4cc0c207f103fef71a3289b6af30f356b5ae9": failed to do request: Put https://ghcr.io/v2/jens-maus/raspberrymatic/blobs/upload/6746f9c5-3f26-45d9-b10c-db36c2f0b779?digest=sha256%3Aab755bdbc94398e3a07906395be4cc0c207f103fef71a3289b6af30f356b5ae9: read tcp 172.17.0.2:36770->140.82.114.34:443: read: connection timed out
So what is the current state of affairs regarding this issue? I am a new user of build-push-action and expected it to run out-of-the-box with ghcr.io. And this is the first time I am trying to push to ghcr.io in that workflow.
EDIT: I could perfectly reproduce the issue. I generated a dummy workflow which first tries to upload only a small package using build-push-action, and this works: https://github.com/jens-maus/RaspberryMatic/runs/1652792879?check_suite_focus=true#step:11:70. Then I generated a several-hundred-MB package the exact same way, and there it breaks with the connection timed out message: https://github.com/jens-maus/RaspberryMatic/runs/1652839402?check_suite_focus=true
@jens-maus please read the comment immediately preceding yours for the "current state".
Perhaps we are missing a keep-alive header on our end or some other value to inform buildx not to close the connection for long uploads?
@thaJeztah any idea about this? ☝️
Hm... not sure; @AkihiroSuda @tonistiigi any idea if that could be the cause?
Any news on this?
Encountered the same problem today with my texlive image (which is quite large as well)
https://github.com/Eifoen/alpine-texlive/actions/runs/489938545
I am having the same (or very similar) issues pushing to docker.io using docker/setup-buildx-action@v1 and docker/build-push-action@v2. Seems the whole thing fails if it takes more than 30 seconds to push a layer. (Which it shouldn't, the layers are not that big.)
My job is on a schedule every Sunday, so tonight it did finish successfully (same commit that failed after the push trigger). Makes me think this problem is related to the current load on ghcr.
I've been having this issue trying to push to GHCR also. I don't understand why, but I added max-parallel as suggested above to the job in my workflow and it suddenly worked. Now I'm not sure if it was just coincidence! :tada:
on:
  push:
    branches: [master]

jobs:
  push_job:
+   strategy:
+     max-parallel: 4
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
My job is on a schedule every Sunday, so tonight it did finish successfully (same commit that failed after the push trigger). Makes me think this problem is related to the current load on ghcr.
Edit: The last couple of schedules worked flawlessly.
This is still broken for me even with max-parallel: 1 and reproducible on every build.
See some example GitHub Actions runs that failed here: https://github.com/votca/buildenv/actions/runs/604675018, https://github.com/votca/buildenv/actions/runs/604311273 and https://github.com/votca/buildenv/actions/runs/604131009.
The latest is using https://github.com/votca/buildenv/blob/5cf4b17a66851587355164c458b742f903678edd/.github/workflows/continuous-integration-workflow.yml
@crazy-max @clarkbw any ETA for a solution?
Background: We cannot use build-push-action@v1 for opensuse/tumbleweed:latest anymore due to https://www.reddit.com/r/openSUSE/comments/lm48zo/install_git_in_docker_tumbleweed/, which boils down to a bug in an old libseccomp. However, the GitHub Actions runners run said old version of libseccomp, and hence openSUSE needs to be built with --security-opt seccomp:unconfined, which doesn't seem to be possible to inject in build-push-action@v1.
I'm building many images with a matrix, and having a similar issue when pushing the exported image to ghcr.io, even with max-parallel.
My action runs automatically upon commit, and I'll manually tag the commit to kick off another workflow for pushing the image. Therefore, I'm certain this has to be an issue with the ghcr.io server.
Notice that the following three commits run on an identical Dockerfile, and the timeout issue isn't always reproducible.
|                      | wo/ Push | w/ Push |
|----------------------|----------|---------|
| max-parallel: 2      | Action   | Action  |
| max-parallel: 4      | Action   | Action  |
| without max-parallel | Action   | Action  |
Sorry for the radio silence. We've made some changes in ghcr.io that should hopefully help alleviate these concurrency issues. Would you mind letting me know if your parallel builds succeed?
hi @markphelps, I ran into the same problem today uploading a rather big but not huuge (~2GB) image to ghcr: https://github.com/airgproducts/llvm/runs/2335603402?check_suite_focus=true
As a stopgap, is there any way to have a retry-build-on-error functionality?
Hi @jessfraz, is it about this workflow?
As a stopgap, is there any way to have a retry-build-on-error functionality?
There is a push retry in latest buildkit https://github.com/moby/buildkit/blob/master/util/resolver/retryhandler/retry.go#L51
The original "no response" error reported was fixed in https://github.com/containerd/containerd/pull/4724
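Until that retry ships in a buildkit release, one possible workflow-level stopgap (a sketch, not a feature of build-push-action itself) is to mark the push step as continue-on-error and repeat it only when the first attempt fails; the step id and image tag below are placeholders:
      - name: Build and push
        id: push
        continue-on-error: true
        uses: docker/build-push-action@v2
        with:
          push: true
          tags: ghcr.io/OWNER/IMAGE:latest  # placeholder tag
      - name: Retry push on failure
        # Runs only if the first attempt failed; the builder cache from the
        # first step is reused, so mostly just the push is repeated.
        if: steps.push.outcome == 'failure'
        uses: docker/build-push-action@v2
        with:
          push: true
          tags: ghcr.io/OWNER/IMAGE:latest  # placeholder tag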
Ah sorry, it's a private repo. Also, looking more into this, it is not "no response".
Basically 1 in 50 errors out with either a 403 Forbidden or a 503 service error; happy to make a different issue, maybe a race or something?
503 retry handler was added recently. It is in master but not in the latest release yet.
403 is weird though and could be a misbehaving registry. Problems with auth should return 401.
@tonistiigi
403 is weird though and could be a misbehaving registry. Problems with auth should return 401.
Yes agree with that.
@jessfraz Do you use the GITHUB_TOKEN or a PAT to log in to ghcr?
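For reference, the GITHUB_TOKEN variant of the ghcr.io login typically looks like the sketch below; a PAT setup would swap the password for a personal access token stored as a secret:
      - name: Login to GitHub Container Registry
        uses: docker/login-action@v1
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}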
hi @markphelps, I ran into the same problem today uploading a rather big but not huuge (~2GB) image to ghcr: https://github.com/airgproducts/llvm/runs/2335603402?check_suite_focus=true
@hiaselhans
Upon looking for your request that errored out in your linked workflow run, I see this error:
#26 ERROR: failed commit on ref "layer-sha256:db2b5fcf76f8fb16af3c060cac85adef71b061210be3618e2738f388f6d85953": failed to do request: Put https://ghcr.io/v2/airgproducts/llvm/blobs/upload/514e25f9-7efa-421b-9a2a-acbabde7bbdd?digest=sha256%3Adb2b5fcf76f8fb16af3c060cac85adef71b061210be3618e2738f388f6d85953: read tcp 172.17.0.2:48072->140.82.114.33:443: read: connection timed out
------
This still makes me think this is a client-side (buildx) timeout issue. I looked in our load balancer logs for this request to upload this layer, and see that it (our server) did in fact respond with a 201 Created status:
query":"?digest=sha256%3Adb2b5fcf76f8fb16af3c060cac85adef71b061210be3618e2738f388f6d85953","recv_timestamp":"2021-04-13T19:08:43+0000","status":201,"ta":19130,"tc":0,"td":0,"term_state":"----","th":0,"ti":0,"tq":0,"tr":19130,"trb":0,"tt":19130,"tw":0,"type":"syslog","ubytes":223072382,"uri":"/v2/airgproducts/llvm/blobs/upload/514e25f9-7efa-421b-9a2a-acbabde7bbdd?digest=sha256%3Adb2b5fcf76f8fb16af3c060cac85adef71b061210be3618e2738f388f6d85953","user":"-","user_agent":"containerd/1.4.0+unknown"}
Note the fields tt: 19130 and tr: 19130, which are "total time in milliseconds elapsed between the accept and the last close" and "total time in milliseconds spent waiting for the server to send a full HTTP response, not counting data", respectively.
So it looks like the server (GHCR) spent 19 seconds before returning the 201 response for that layer upload. This leads me to think there is an issue client side, timing out if it does not receive a response within a certain period of time, or perhaps a networking issue (on which side I do not know yet).
Do you use the GITHUB_TOKEN or a PAT to log in to ghcr?
@crazy-max I am using a PAT but can switch to GITHUB_TOKEN since I turned that on as well!
From https://github.com/votca/buildenv/runs/1534243887?check_suite_focus=true, I am getting a lot of unreproducible errors when pushing a container:
Never happened with v1.