Closed: mosheavni closed this issue 1 year ago.
We have the exact same issue, already reported in issue #1627. It seems a fix has already been merged (#1685 #6380) that adds a new flag; we are waiting for a release newer than 1.6.0. I also have a support ticket open with the Microsoft Azure team. They said they found the root cause and a fix is rolling out, but I don't have an ETA for it.
I got feedback from Microsoft last Friday:
RCA: Azure Storage uses multiple front-end nodes within a single storage scale unit to serve multiple storage accounts. As part of regular maintenance of these nodes, we reboot them after draining the existing requests and then put them back into production. As a result of investigating this incident, we have learned that it is possible for storage front-end nodes to be rebooted while requests are still draining. This closes any existing connections to the front-end node, causing a connection reset on the associated pipelines. The precise reason the requests are not drained fully is still under investigation, but it is likely due to faulty feedback logic for when nodes get taken down. Because the load balancer distributes requests evenly across many front-end nodes, clients are unlikely to experience a reset like this if they retry the request. We are still looking into ways we can proactively detect and mitigate these nodes in the short term. Longer term, we will have a permanent fix to prevent this issue from happening.
Resolution
Any update on this issue? I'm experiencing the exact same issue as @mosheavni.
@mehdibenfeguir as a workaround we are using the "--image-fs-extract-retry" flag with Kaniko 1.7.0. I have had no new reports of these connection resets since.
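For reference, a minimal sketch of an executor invocation with that flag; the context, Dockerfile path, destination, and the extra --push-retry for the push side are placeholders for illustration, not values from this thread:

/kaniko/executor \
  --context dir:///workspace \
  --dockerfile /workspace/Dockerfile \
  --destination my-registry.example.com/my-app:latest \
  --image-fs-extract-retry 5 \
  --push-retry 5

Both flags take a number of retries, and they only cover the image filesystem extraction and the push respectively, not every network call the executor makes.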
Do you mean this image: gcr.io/kaniko-project/executor:v1.7.0-debug?
This is the result with --image-fs-extract-retry 5. The argument worked, but each retry gives the same result:
error building image: error building stage: failed to get filesystem from image: read tcp "MY_IP_HERE":34650->"MY_IP_HERE":443: read: connection reset by peer
INFO[0012] Unpacking rootfs as cmd RUN mkdir -p /t && cp -r /twgl/common-service/.gradle /t requires it.
WARN[0037] Retrying operation after 1s due to read tcp MY_FIRST_IP:38536->MY_SECOND_IP:443: read: connection reset by peer
WARN[0061] Retrying operation after 2s due to read tcp MY_FIRST_IP:38960->MY_SECOND_IP:443: read: connection reset by peer
WARN[0089] Retrying operation after 4s due to read tcp MY_FIRST_IP:39254->MY_SECOND_IP:443: read: connection reset by peer
WARN[0123] Retrying operation after 8s due to read tcp MY_FIRST_IP:39662->MY_SECOND_IP:443: read: connection reset by peer
WARN[0157] Retrying operation after 16s due to read tcp MY_FIRST_IP:40112->MY_SECOND_IP:443: read: connection reset by peer
error building image: error building stage: failed to get filesystem from image: read tcp MY_FIRST_IP:40540->MY_SECOND_IP:443: read: connection reset by peer
[Pipeline] }
[Pipeline] // withEnv
[Pipeline] }
[Pipeline] // withCredentials
[Pipeline] }
[Pipeline] // container
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // node
[Pipeline] }
[Pipeline] // podTemplate
[Pipeline] End of Pipeline
ERROR: script returned exit code 1
[Bitbucket] Notifying commit build result
[Bitbucket] Build result notified
Finished: FAILURE
@Crapshit could you please help?
We don't get connection resets that often in our environment, and with the mentioned flag I can't see any issues anymore. I could even see the resets with Docker itself, but Docker retries 5 times by default, so it never failed in CI/CD pipelines...
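If the built-in retry is exhausted, the only remaining knob is to retry the whole build from outside Kaniko, similar in spirit to Docker's default retry. A rough sketch of that idea (not something Kaniko provides itself; the executor tag, mounts, destination, and credentials path are assumptions for illustration):

# Re-run the whole kaniko container on failure; running the executor twice
# inside the same container is generally not supported, so the loop wraps
# the container run itself. Registry credentials are assumed to live in
# $HOME/.docker/config.json.
for attempt in 1 2 3 4 5; do
  docker run --rm \
    -v "$PWD":/workspace \
    -v "$HOME/.docker/config.json":/kaniko/.docker/config.json:ro \
    gcr.io/kaniko-project/executor:v1.7.0 \
    --context dir:///workspace \
    --dockerfile /workspace/Dockerfile \
    --destination my-registry.example.com/my-app:latest \
    --image-fs-extract-retry 5 \
    && exit 0
  echo "kaniko attempt ${attempt} failed, retrying..." >&2
  sleep $((attempt * 10))
done
exit 1

In Jenkins the same idea can be expressed with a retry(...) step around the build stage instead of a shell loop.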
the argument worked, but retrying gives the same result
error building image: error building stage: failed to get filesystem from image: read tcp "MY_IP_HERE":34650->"MY_IP_HERE":443: read: connection reset by peer
Experiencing the same issue here; --image-fs-extract-retry 5 didn't seem to help at all.
I've run into this problem now, and I have tested on different machines with different networks. One of them hits this problem every time; the error output is below:
INFO[0032] Unpacking rootfs as cmd COPY package*.json ./ requires it.
error building image: error building stage: failed to get filesystem from image: read tcp 172.17.0.3:60130->104.18.123.25:443: read: connection reset by peer
I'm confused about why the COPY instruction in the Dockerfile would trigger a network connection to something like 104.18.123.25:443 (I don't fully understand how Kaniko works), and it seems to fail because of the network. I've tried many times, both in a container exec and in the CI pipeline, and it always gets stuck at this COPY instruction. Besides that, in the CI pipeline I got the following errors:
INFO[0034] Unpacking rootfs as cmd COPY package*.json ./ requires it.
error building image: error building stage: failed to get filesystem from image: stream error: stream ID 7; INTERNAL_ERROR
and
INFO[0014] Unpacking rootfs as cmd COPY package*.json ./ requires it.
error building image: error building stage: failed to get filesystem from image: stream error: stream ID 3; INTERNAL_ERROR; received from peer
The only solutions I found in the GitHub issues are --image-fs-extract-retry and --push-retry. I would appreciate it if you could show me a way to find the root cause, or how to debug this!
/(γoγ)/~~
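One generic way to get more detail out of a failing build is to raise the executor's log level with the --verbosity flag. A minimal sketch with placeholder paths (--no-push only so that no destination is needed):

/kaniko/executor \
  --context dir:///workspace \
  --dockerfile /workspace/Dockerfile \
  --no-push \
  --verbosity=debug

The trace level is even noisier than debug if that is still not enough.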
We had the same issue. A simple Dockerfile with a FROM and a COPY statement was enough. My interpretation is that COPY requires extracting the FROM image, and the download of that image failed because of the connection reset.
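To make that concrete: Kaniko downloads and unpacks the rootfs of the FROM image only once an instruction actually needs the filesystem, and COPY is such an instruction, which is why the failure surfaces there (the "Unpacking rootfs as cmd COPY ... requires it" log line above). A minimal sketch of this shape of build, run locally through Docker just for illustration (base image, file name, and executor tag are placeholders):

# A bare FROM + COPY is enough to trigger the "Unpacking rootfs ..." step.
touch package.json
printf 'FROM node:18-alpine\nCOPY package*.json ./\n' > Dockerfile

docker run --rm -v "$PWD":/workspace \
  gcr.io/kaniko-project/executor:v1.7.0 \
  --context dir:///workspace \
  --dockerfile /workspace/Dockerfile \
  --no-push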
Thanks for your prompt reply. I've solved it by using a local image registry to host the FROM image; Kaniko was probably getting stuck while pulling the image from the official registry, which is slow to reach from my region. Additionally, can I change the default registry Kaniko uses for such FROM images, e.g. from docker.io to a mirror registry, by configuring Kaniko?
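Regarding the last question: recent Kaniko versions have a --registry-mirror flag, which makes the executor try a mirror for images that don't name a registry explicitly (like a bare FROM node:...) instead of going straight to index.docker.io. A sketch with a placeholder mirror host and destination:

/kaniko/executor \
  --context dir:///workspace \
  --dockerfile /workspace/Dockerfile \
  --destination my-registry.example.com/my-app:latest \
  --registry-mirror mirror.example.com

Alternatively, the FROM line can point at the local or mirror registry explicitly, which is effectively what the local-registry workaround above does.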
Actual behavior
The Kaniko build always fails with this error:
This is the entire log:

Expected behavior
Build to succeed.

To Reproduce
Not sure it is reproducible; other builds are fine, but this specific image and some others fail with a similar error. The executor cmd used:

Additional Information
Image: gcr.io/kaniko-project/executor:debug
Digest: sha256:fcccd2ab9f3892e33fc7f2e950c8e4fc665e7a4c66f6a9d70b300d7a2103592f

I want to better understand the nature of this error, what can cause it, and what the possible fixes are. Thanks.

Triage Notes for the Maintainers
--cache flag