GoogleContainerTools / kaniko

Build Container Images In Kubernetes

failed to get filesystem from image: connection reset by peer #1717

Closed · mosheavni closed 10 months ago

mosheavni commented 3 years ago

Actual behavior
The kaniko build always fails with this error:

error building image: error building stage: failed to get filesystem from image: read tcp 10.50.99.48:34650->52.XXX.141.196:443: read: connection reset by peer

This is the entire log:

INFO[0001] Resolved base name my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 to builder 
INFO[0001] Resolved base name my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 to networks 
INFO[0001] Retrieving image manifest my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 
INFO[0001] Retrieving image my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 from registry my-registry.azurecr.io 
INFO[0001] Retrieving image manifest my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 
INFO[0001] Returning cached image manifest              
INFO[0001] Retrieving image manifest my-registry.azurecr.io/backend/bb/base:latest 
INFO[0001] Retrieving image my-registry.azurecr.io/backend/bb/base:latest from registry my-registry.azurecr.io 
INFO[0001] Built cross stage deps: map[0:[/app/_build/deploy/rel/bb_release /usr/local/cuda-9.1] 1:[/app/nnconfig]] 
INFO[0001] Retrieving image manifest my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 
INFO[0001] Returning cached image manifest              
INFO[0001] Executing 0 build triggers                   
INFO[0001] Unpacking rootfs as cmd RUN pip install jinja2-cli==0.7.0 requires it. 
error building image: error building stage: failed to get filesystem from image: read tcp 10.50.99.48:34650->52.XXX.141.196:443: read: connection reset by peer

Expected behavior
Build to succeed.

To Reproduce
Not sure it is reproducible; other builds are OK, but this specific image and a few others fail with a similar error. The executor command used:

$ /kaniko/executor \
  --context . \
  --dockerfile Dockerfile \
  --destination $DOCKER_BUILD_IMAGE:$DOCKER_PROD_BUILD_TAG \
  --push-retry=4

Additional Information

I want to better understand the nature of this error, what can cause it, and what the possible fixes are. Thanks.

Triage Notes for the Maintainers

Description (Yes/No)
- [ ] Please check if this is a new feature you are proposing
- [X] Please check if the build works in docker but not in kaniko
- [X] Please check if this error is seen when you use the --cache flag
- [X] Please check if your dockerfile is a multistage dockerfile

Crapshit commented 3 years ago

We have the exact same issue; it was already reported in issue #1627. It seems there is already a fix merged (#1685 #6380) that adds a new flag. Waiting for a newer release (> 1.6.0). I also have a support ticket open with the Microsoft Azure team; they said they found the root cause and are rolling out a fix, but I don't have any ETA for this.

Crapshit commented 2 years ago

I got feedback from Microsoft last Friday:

RCA: Azure storage uses multiple frontend nodes within a single storage scale unit to serve multiple storage accounts. As part of regular maintenance of these nodes, we reboot them after draining the existing requests and then put them back into production. As a result of investigating this incident, we have learned that it is possible for storage front-end nodes to be rebooted while the requests are still draining. This will cause any existing connections to the front-end node to be closed, causing a connection reset of associated pipelines.

The precise cause of why the requests are not drained fully is still under investigation, but it is likely due to a faulty feedback logic of when nodes get taken down. Because the load balancer distributes requests evenly across many front-end nodes, clients are unlikely to experience a reset like this if they retry the request.

We are still looking into ways we can proactively detect and mitigate these nodes in the short term. Longer-term, we will have a permanent fix to prevent this issue from happening.

Resolution

  1. We have decreased the reboot frequency on the impacted storage scale units spreading them further apart to reduce the impact
  2. We are further investigating on validating all requests draining to complete before we reboot the front end nodes.
mehdibenfeguir commented 2 years ago

Any update on this issue? Experiencing the exact same issue as @mosheavni.

Crapshit commented 2 years ago

@mehdibenfeguir as a workaround we are using the flag --image-fs-extract-retry in Kaniko 1.7.0. I have had no new reports of these connection resets since.
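For reference, combined with the executor command from the original report it would look roughly like this (the retry count of 5 is just an example value; the flag sets how many times the base image filesystem extraction is retried):

$ /kaniko/executor \
  --context . \
  --dockerfile Dockerfile \
  --destination $DOCKER_BUILD_IMAGE:$DOCKER_PROD_BUILD_TAG \
  --image-fs-extract-retry=5 \
  --push-retry=4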

mehdibenfeguir commented 2 years ago

Do you mean this image: gcr.io/kaniko-project/executor:v1.7.0-debug?

mehdibenfeguir commented 2 years ago

this is the result with --image-fs-extract-retry 5

mehdibenfeguir commented 2 years ago

the argument worked, but retrying gives the same result

error building image: error building stage: failed to get filesystem from image: read tcp "MY_IP_HERE":34650->"MY_IP_HERE":443: read: connection reset by peer

mehdibenfeguir commented 2 years ago

INFO[0012] Unpacking rootfs as cmd RUN mkdir -p /t && cp -r /twgl/common-service/.gradle /t requires it. 
WARN[0037] Retrying operation after 1s due to read tcp MY_FIRST_IP:38536->MY_SECOND_IP:443: read: connection reset by peer 
WARN[0061] Retrying operation after 2s due to read tcp MY_FIRST_IP:38960->MY_SECOND_IP:443: read: connection reset by peer 
WARN[0089] Retrying operation after 4s due to read tcp MY_FIRST_IP:39254->MY_SECOND_IP:443: read: connection reset by peer 
WARN[0123] Retrying operation after 8s due to read tcp MY_FIRST_IP:39662->MY_SECOND_IP:443: read: connection reset by peer 
WARN[0157] Retrying operation after 16s due to read tcp MY_FIRST_IP:40112->MY_SECOND_IP:443: read: connection reset by peer 
error building image: error building stage: failed to get filesystem from image: read tcp MY_FIRST_IP:40540->MY_SECOND_IP:443: read: connection reset by peer
[Pipeline] }
[Pipeline] // withEnv
[Pipeline] }
[Pipeline] // withCredentials
[Pipeline] }
[Pipeline] // container
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // node
[Pipeline] }
[Pipeline] // podTemplate
[Pipeline] End of Pipeline
ERROR: script returned exit code 1
[Bitbucket] Notifying commit build result
[Bitbucket] Build result notified
Finished: FAILURE
mehdibenfeguir commented 2 years ago

@Crapshit could you please help?

Crapshit commented 2 years ago

We don't see these connection resets that often in our environment, and with the mentioned flag I can't see any issues anymore. I could even see the resets with Docker itself, but Docker retries 5 times by default, so it never failed in CI/CD pipelines...

pierreyves-lebrun commented 2 years ago

the argument worked, but retrying gives the same result

error building image: error building stage: failed to get filesystem from image: read tcp "MY_IP_HERE":34650->"MY_IP_HERE":443: read: connection reset by peer

Experiencing the same issue here, --image-fs-extract-retry 5 didn't seem to help at all

lzd-1230 commented 1 year ago

I've run into this problem now, and I tested on different machines with different networks. One of them hits this problem every time, and the error is:

INFO[0032] Unpacking rootfs as cmd COPY package*.json ./ requires it.
error building image: error building stage: failed to get filesystem from image: read tcp 172.17.0.3:60130->104.18.123.25:443: read: connection reset by peer

I'm confused about why the COPY instruction in the Dockerfile would trigger a network connection to 104.18.123.25:443 (I don't really understand kaniko's internals), and it seems to be failing because of the network. I've tried many times, both in a container exec session and in the CI pipeline, and it always gets stuck at this COPY instruction. Besides, in the CI pipeline I got the following errors:

INFO[0034] Unpacking rootfs as cmd COPY package*.json ./ requires it. 
error building image: error building stage: failed to get filesystem from image: stream error: stream ID 7; INTERNAL_ERROR
and
INFO[0014] Unpacking rootfs as cmd COPY package*.json ./ requires it. 
error building image: error building stage: failed to get filesystem from image: stream error: stream ID 3; INTERNAL_ERROR; received from peer

And I found no solutions in the GitHub issues except --image-fs-extract-retry or --push-retry. I would appreciate it if you could show me a way to find the root cause or how to debug this!

Crapshit commented 1 year ago

We had the same issue. A simple Dockerfile with a FROM and a COPY statement was enough to hit it. My interpretation is that COPY requires extracting the FROM image's filesystem, and the download of that image failed because of the connection reset.
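For anyone trying to reproduce this, a minimal Dockerfile along these lines should be enough to trigger the extraction (base image and file names are placeholders, not from the failing builds above):

FROM alpine:3.18
# COPY needs the base image's filesystem on disk, so kaniko downloads and
# unpacks the FROM image here; this is the step that hits the connection reset
COPY app.txt /app/app.txt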

lzd-1230 commented 1 year ago

We had the same issue. A simple Dockerfile with a FROM and a COPY statement was enough to hit it. My interpretation is that COPY requires extracting the FROM image's filesystem, and the download of that image failed because of the connection reset.

Thanks for your prompt reply. I've solved it by using a local image registry to host the FROM image. It probably gets stuck when kaniko pulls the image from the official registry, which is slow to reach from my region. Additionally, can I change the default image registry from docker.io to a mirror registry by configuring kaniko when it pulls such a FROM image?
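(From the docs it looks like there is a --registry-mirror flag that might do this, making pulls of docker.io images go through a mirror first. Would something like the following work? mirror.example.com and the destination are placeholders:)

$ /kaniko/executor \
  --context . \
  --dockerfile Dockerfile \
  --registry-mirror mirror.example.com \
  --destination my-registry.example.com/myapp:latest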