GoogleContainerTools / kaniko

Build Container Images In Kubernetes

failed to get filesystem from image: connection reset by peer #1717

Closed · mosheavni closed 10 months ago

mosheavni commented 3 years ago

Actual behavior
The kaniko build always fails with this error:

error building image: error building stage: failed to get filesystem from image: read tcp 10.50.99.48:34650->52.XXX.141.196:443: read: connection reset by peer

This is the entire log:

INFO[0001] Resolved base name my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 to builder 
INFO[0001] Resolved base name my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 to networks 
INFO[0001] Retrieving image manifest my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 
INFO[0001] Retrieving image my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 from registry my-registry.azurecr.io 
INFO[0001] Retrieving image manifest my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 
INFO[0001] Returning cached image manifest              
INFO[0001] Retrieving image manifest my-registry.azurecr.io/backend/bb/base:latest 
INFO[0001] Retrieving image my-registry.azurecr.io/backend/bb/base:latest from registry my-registry.azurecr.io 
INFO[0001] Built cross stage deps: map[0:[/app/_build/deploy/rel/bb_release /usr/local/cuda-9.1] 1:[/app/nnconfig]] 
INFO[0001] Retrieving image manifest my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 
INFO[0001] Returning cached image manifest              
INFO[0001] Executing 0 build triggers                   
INFO[0001] Unpacking rootfs as cmd RUN pip install jinja2-cli==0.7.0 requires it. 
error building image: error building stage: failed to get filesystem from image: read tcp 10.50.99.48:34650->52.XXX.141.196:443: read: connection reset by peer

Expected behavior
Build to succeed.

To Reproduce
Not sure it is reproducible; other builds are OK, but this specific image and a few others fail with a similar error. The executor command used:

$ /kaniko/executor \
  --context . \
  --dockerfile Dockerfile \
  --destination $DOCKER_BUILD_IMAGE:$DOCKER_PROD_BUILD_TAG \
  --push-retry=4

Additional Information

I want to better understand the nature of this error, what can cause it, and what the possible fixes are. Thanks.

Triage Notes for the Maintainers

Description (Yes/No)
- [ ] Please check if this is a new feature you are proposing
- [X] Please check if the build works in docker but not in kaniko
- [X] Please check if this error is seen when you use the --cache flag
- [X] Please check if your dockerfile is a multistage dockerfile

Crapshit commented 3 years ago

We have the exact same issue; it was already reported in issue #1627. It seems there is already a fix merged (#1685 #6380) that adds a new flag. Waiting for a newer release (> 1.6.0). I also have a support ticket open with the Microsoft Azure team; they said they found the root cause and are rolling out a fix, but I don't have any ETA for this.

Crapshit commented 2 years ago

I got feedback from Microsoft last Friday:

RCA: Azure storage uses multiple frontend nodes within a single storage scale unit to serve multiple storage accounts. As part of regular maintenance of these nodes, we reboot them after draining the existing requests and then put them back into production. As a result of investigating this incident, we have learned that it is possible for storage front-end nodes to be rebooted while the requests are still draining. This will cause any existing connections to the front-end node to be closed, causing a connection reset of associated pipelines.

The precise cause of why the requests are not drained fully is still under investigation, but it is likely due to a faulty feedback logic of when nodes get taken down. Because the load balancer distributes requests evenly across many front-end nodes, clients are unlikely to experience a reset like this if they retry the request.

We are still looking into ways we can proactively detect and mitigate these nodes in the short term. Longer-term, we will have a permanent fix to prevent this issue from happening.

Resolution

  1. We have decreased the reboot frequency on the impacted storage scale units spreading them further apart to reduce the impact
  2. We are further investigating on validating all requests draining to complete before we reboot the front end nodes.
mehdibenfeguir commented 2 years ago

Any update on this issue? Experiencing the exact same issue as @mosheavni.

Crapshit commented 2 years ago

@mehdibenfeguir as a workaround we are using the flag --image-fs-extract-retry in Kaniko 1.7.0. I have had no new reports of these connection resets since.
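For reference, combined with the executor command from the original report it would look roughly like this (the retry count of 5 is just an example value; the flag sets how many times the base image filesystem extraction is retried):

$ /kaniko/executor \
  --context . \
  --dockerfile Dockerfile \
  --destination $DOCKER_BUILD_IMAGE:$DOCKER_PROD_BUILD_TAG \
  --image-fs-extract-retry=5 \
  --push-retry=4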

mehdibenfeguir commented 2 years ago

Do you mean this image: gcr.io/kaniko-project/executor:v1.7.0-debug?

mehdibenfeguir commented 2 years ago

this is the result with --image-fs-extract-retry 5

mehdibenfeguir commented 2 years ago

the argument worked, but retrying gives the same result

error building image: error building stage: failed to get filesystem from image: read tcp "MY_IP_HERE":34650->"MY_IP_HERE":443: read: connection reset by peer

mehdibenfeguir commented 2 years ago

INFO[0012] Unpacking rootfs as cmd RUN mkdir -p /t && cp -r /twgl/common-service/.gradle /t requires it. 
WARN[0037] Retrying operation after 1s due to read tcp MY_FIRST_IP:38536->MY_SECOND_IP:443: read: connection reset by peer 
WARN[0061] Retrying operation after 2s due to read tcp MY_FIRST_IP:38960->MY_SECOND_IP:443: read: connection reset by peer 
WARN[0089] Retrying operation after 4s due to read tcp MY_FIRST_IP:39254->MY_SECOND_IP:443: read: connection reset by peer 
WARN[0123] Retrying operation after 8s due to read tcp MY_FIRST_IP:39662->MY_SECOND_IP:443: read: connection reset by peer 
WARN[0157] Retrying operation after 16s due to read tcp MY_FIRST_IP:40112->MY_SECOND_IP:443: read: connection reset by peer 
error building image: error building stage: failed to get filesystem from image: read tcp MY_FIRST_IP:40540->MY_SECOND_IP:443: read: connection reset by peer
[Pipeline] }
[Pipeline] // withEnv
[Pipeline] }
[Pipeline] // withCredentials
[Pipeline] }
[Pipeline] // container
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // node
[Pipeline] }
[Pipeline] // podTemplate
[Pipeline] End of Pipeline
ERROR: script returned exit code 1
[Bitbucket] Notifying commit build result
[Bitbucket] Build result notified
Finished: FAILURE
mehdibenfeguir commented 2 years ago

@Crapshit could you please help?

Crapshit commented 2 years ago

We don't see these connection resets that often in our environment, and with the mentioned flag I can't see any issues anymore. I could even see the resets with Docker itself, but Docker retries 5 times by default, so it never failed in CI/CD pipelines...

pierreyves-lebrun commented 2 years ago

the argument worked, but retrying gives the same result

error building image: error building stage: failed to get filesystem from image: read tcp "MY_IP_HERE":34650->"MY_IP_HERE":443: read: connection reset by peer

Experiencing the same issue here, --image-fs-extract-retry 5 didn't seem to help at all

lzd-1230 commented 1 year ago

I've run into this problem now, and I tested on different machines with different networks. One of them hits this problem every time, and the error is:

INFO[0032] Unpacking rootfs as cmd COPY package*.json ./ requires it.
error building image: error building stage: failed to get filesystem from image: read tcp 172.17.0.3:60130->104.18.123.25:443: read: connection reset by peer

I'm confused about why the COPY instruction in the Dockerfile would trigger a network connection to 104.18.123.25:443 (I don't really understand kaniko's internals), and it seems to be failing because of the network. I've tried many times, both in a container exec session and in the CI pipeline, and it always gets stuck at this COPY instruction. Besides, in the CI pipeline I got the following errors:

INFO[0034] Unpacking rootfs as cmd COPY package*.json ./ requires it. 
error building image: error building stage: failed to get filesystem from image: stream error: stream ID 7; INTERNAL_ERROR
and
INFO[0014] Unpacking rootfs as cmd COPY package*.json ./ requires it. 
error building image: error building stage: failed to get filesystem from image: stream error: stream ID 3; INTERNAL_ERROR; received from peer

And I found no solutions in the GitHub issues except --image-fs-extract-retry or --push-retry. I would appreciate it if you could show me a way to find the root cause or how to debug this!

Crapshit commented 1 year ago

We had the same issue. A simple Dockerfile with a FROM and a COPY statement was enough to hit it. My interpretation is that COPY requires extracting the FROM image's filesystem, and the download of that image failed because of the connection reset.
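For anyone trying to reproduce this, a minimal Dockerfile along these lines should be enough to trigger the extraction (base image and file names are placeholders, not from the failing builds above):

FROM alpine:3.18
# COPY needs the base image's filesystem on disk, so kaniko downloads and
# unpacks the FROM image here; this is the step that hits the connection reset
COPY app.txt /app/app.txt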

lzd-1230 commented 1 year ago

We had the same issue. A simple Dockerfile with a FROM and a COPY statement was enough to hit it. My interpretation is that COPY requires extracting the FROM image's filesystem, and the download of that image failed because of the connection reset.

Thanks for your prompt reply. I've solved it by using a local image registry to host the FROM image. It probably gets stuck when kaniko pulls the image from the official registry, which is slow to reach from my region. Additionally, can I change the default image registry from docker.io to a mirror registry by configuring kaniko when it pulls such a FROM image?
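(From the docs it looks like there is a --registry-mirror flag that might do this, making pulls of docker.io images go through a mirror first. Would something like the following work? mirror.example.com and the destination are placeholders:)

$ /kaniko/executor \
  --context . \
  --dockerfile Dockerfile \
  --registry-mirror mirror.example.com \
  --destination my-registry.example.com/myapp:latest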