GoogleCloudPlatform / gsutil

A command line tool for interacting with cloud storage services.
Apache License 2.0

Periodic `error: [Errno 2] No such file or directory` on gsutil cp #500

Open · hodgesmr opened this issue 6 years ago

hodgesmr commented 6 years ago

Issue Description

We're attempting to use gsutil to download files as part of our DevOps flow. We have gzipped tar archives in a GCS bucket, and we spin up a Docker container in Kubernetes to pull the archives down.

Periodically, the gsutil cp command will raise an exception:

Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/managers.py", line 240, in serve_client
    request = recv()
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/commands/cp.py", line 27, in <module>
    from gslib import copy_helper
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/copy_helper.py", line 165, in <module>
    else AtomicDict(manager=gslib.util.manager))
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/parallelism_framework_util.py", line 52, in __init__
    self.lock = manager.Lock()
  File "/usr/lib/python2.7/multiprocessing/managers.py", line 667, in temp
    token, exp = self._create(typeid, *args, **kwds)
  File "/usr/lib/python2.7/multiprocessing/managers.py", line 565, in _create
    conn = self._Client(self._address, authkey=self._authkey)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 169, in Client
    c = SocketClient(address)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 304, in SocketClient
    s.connect(address)
  File "/usr/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 2] No such file or directory

We're authenticating with a service account JSON key file:

gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS

and then attempting to download:

gsutil cp gs://<PROJECT-ID>-helm/echoserver/1.8.tgz chart.tgz

I can run this repeatedly on the same file in the same bucket. Sometimes it works fine, other times it raises the above exception.
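
(For illustration only; this is not part of the original report.) Since the failure is intermittent, one way to make such a download step more tolerant is to wrap the copy in a small retry loop. A hypothetical Python sketch, where the paths, attempt count, and delay are placeholders:

import subprocess
import time

# Hypothetical wrapper (not part of the actual pipeline): retry `gsutil cp`
# a few times, since the failure described above only happens some of the time.
def fetch_with_retry(src, dst, attempts=3, delay_seconds=5):
    for attempt in range(1, attempts + 1):
        if subprocess.call(["gsutil", "cp", src, dst]) == 0:
            return True
        print("gsutil cp failed (attempt %d of %d)" % (attempt, attempts))
        if attempt < attempts:
            time.sleep(delay_seconds)
    return False

fetch_with_retry("gs://<PROJECT-ID>-helm/echoserver/1.8.tgz", "chart.tgz")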

More Details

Our Dockerfile installs the google-cloud-sdk like so:

FROM debian:jessie

RUN apt-get update -y
RUN apt-get install -y curl git unzip gnupg lsb-release apt-transport-https openssh-client

WORKDIR /tmp/

ENV GOOGLE_SDK_VERSION 187.0.0
RUN export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)" && \
    echo "deb https://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" > /etc/apt/sources.list.d/google-cloud-sdk.list && \
    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
RUN apt-get update -y
RUN apt-get install -y google-cloud-sdk
RUN gcloud config set core/disable_usage_reporting true
RUN gcloud config set component_manager/disable_update_check true

Our entrypoint is a simple shell script that calls gsutil cp (as described above):

WORKDIR /
ADD run.sh /
RUN ["chmod", "+x", "/run.sh"]
ENTRYPOINT ["/run.sh"]

When this exception occurs, the container does not terminate.

This is run via a Spinnaker pipeline, which deploys the container into our Kubernetes cluster with the following manifest:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/limit-ranger: 'LimitRanger plugin set: cpu request for container
      <PROJECT-ID>-manifestgenerator'
  creationTimestamp: 2018-02-01T21:43:27Z
  name: echoserver-3fc4227dcf281b1c
  namespace: default
  resourceVersion: "2686221"
  selfLink: /api/v1/namespaces/default/pods/echoserver-3fc4227dcf281b1c
  uid: f01cabfb-0798-11e8-83c5-42010a800363
spec:
  containers:
  - args:

    # This is the file being pulled from GCS
    - gs://<PROJECT-ID>-helm/echoserver/1.8.tgz

    env:
    - name: GOOGLE_APPLICATION_CREDENTIALS
      value: /tmp/manifest-generator.json
    image: gcr.io/<PROJECT-ID>/manifest_generator:0.0.1
    imagePullPolicy: Always
    name: <PROJECT-ID>-manifestgenerator
    ports:
    - containerPort: 80
      name: http
      protocol: TCP
    resources:
      requests:
        cpu: 100m
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /tmp/
      name: "1517433956541"
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-wt4w7
      readOnly: true
  dnsPolicy: ClusterFirst
  imagePullSecrets:
  - name: gcr-<PROJECT-ID>
  nodeName: <NODE-NAME>
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.alpha.kubernetes.io/notReady
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.alpha.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: "1517433956541"
    secret:
      defaultMode: 420
      secretName: manifest-generator-sa
  - name: default-token-wt4w7
    secret:
      defaultMode: 420
      secretName: default-token-wt4w7
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2018-02-01T21:43:27Z
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2018-02-01T21:43:29Z
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: 2018-02-01T21:43:27Z
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://868982eaf5bb0d38c008abbbdab3dc494c80165b6453e357dd7a153f89d1a655
    image: gcr.io/<PROJECT-ID>/manifest_generator:0.0.1
    imageID: docker-pullable://gcr.io/<PROJECT-ID>/manifest_generator@sha256:3528171f1b46d85e805abece6ba5cecf19985d281d5a20d19943204417a950b7
    lastState: {}
    name: <PROJECT-ID>-manifestgenerator
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: 2018-02-01T21:43:28Z
  hostIP: 10.128.0.2
  phase: Running
  podIP: 10.52.12.19
  qosClass: Burstable
  startTime: 2018-02-01T21:43:27Z
houglum commented 6 years ago

The error message above comes from Python's multiprocessing module -- specifically, the Manager class. It looks like it's unable to connect to the Manager's server process to create a Lock object. This seems like a pretty low-level Python issue rather than anything gsutil is doing; we don't do anything special to alter the behavior of the Manager class, so this should "just work" ™. I assume this is a fundamental issue with the multiprocessing module when used within certain containerized environments... but I don't have anything to base that on except the stack trace above and the fact that I've only seen this problem occur in something running within a Docker container.
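
In other words, the failing path boils down to roughly this (a sketch, not gsutil's actual source):

import multiprocessing

# Per the traceback: Manager() forks a server process and returns a proxy;
# manager.Lock() then opens a fresh connection back to that server over a
# Unix-domain socket. "[Errno 2] No such file or directory" means the socket
# path the server was listening on doesn't exist when the client tries to
# connect.
manager = multiprocessing.Manager()  # starts the Manager server process
lock = manager.Lock()                # the call that fails in the traceback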

For thoroughness, would you mind running gsutil version -l within the container and posting that output? This should give lots of output, but I'm mainly interested in the Python version and gsutil version being used, along with some metadata about the environment gsutil is invoked from.

hodgesmr commented 6 years ago

@houglum thanks for the response!

Here's the output from gsutil version -l running in the container (run before authenticating with gcloud):

gsutil version: 4.28
checksum: ca9bccbeb7ce0c439a9cfdf998a08dd0 (OK)
boto version: 2.48.0
python version: 2.7.9 (default, Jun 29 2016, 13:08:31) [GCC 4.9.2]
OS: Linux 4.4.0-1027-gke
multiprocessing available: True
using cloud sdk: True
pass cloud sdk credentials to gsutil: False
config path(s): no config found
gsutil path: /usr/lib/google-cloud-sdk/platform/gsutil/gsutil
compiled crcmod: False
installed via package manager: False
editable install: False
houglum commented 6 years ago

Thanks, hodgesmr@. You're on Python 2.7.9 and the most recent version of gsutil, so I'll stick to my original guess :)

A good way to try to confirm this would be to write a script that simply creates a multiprocessing.Manager, then attempts to create a Lock within it (ideally, add some timing at the points shown in the stack trace above to see how far apart the Manager and Lock creation happen, so you can mimic that). If you can deploy and run that in a similar container environment and occasionally reproduce the error, it's almost certainly an issue with multiprocessing on Docker/Kubernetes.
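
Something along these lines, for example (a minimal sketch; the delay and iteration count are only guesses at how far apart gsutil's Manager and Lock creation might be):

import multiprocessing
import sys
import time

def main(delay_seconds=2.0, iterations=100):
    for i in range(iterations):
        # Mimic gsutil's pattern: create a Manager, wait a bit, then ask it
        # for a Lock, which connects back to the Manager's server process.
        manager = multiprocessing.Manager()
        time.sleep(delay_seconds)
        try:
            manager.Lock()
        except Exception as e:
            print("iteration %d: manager.Lock() failed: %r" % (i, e))
            return 1
        finally:
            manager.shutdown()
    print("no failures in %d iterations" % iterations)
    return 0

if __name__ == "__main__":
    sys.exit(main())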

tchristensenowlet commented 5 years ago

I've also seen this on occasion when using tmux. I have not found any kind of pattern for consistent reproducibility, though.