DataBiosphere / dsub

Open-source command-line tool to run batch computing tasks and workflows on backend services such as Google Cloud.
Apache License 2.0

Failure message: wrapping host binaries: pulling image: retry budget exhausted (10 attempts): running ["docker" "pull" "bash"]: exit status 1 (standard error: "Error response from daemon: Get https://registry-1.docker.io/v2/ #218

Closed: carbocation closed this issue 3 years ago

carbocation commented 3 years ago

This is a new error message for me. I checked the GCP status page in case it was transient, but I don't see any active issues.

I am launching a bunch of tasks across a bunch of zones (within a single job) using the google-cls-v2 API with dsub 0.4.3. I am getting the following error, but only on some tasks within this job (while others within the same job are launching and succeeding):

Failure message: wrapping host binaries: pulling image: retry budget exhausted (10 attempts): running ["docker" "pull" "bash"]: exit status 1 (standard error: "Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\n")

This strikes me as an odd message, because (1) I'm running these machines without a public IP, and so (2) I'm only using gcr.io Docker images with them. I don't know why they'd be pointing to registry-1.docker.io.
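A quick way to see the asymmetry from one of these machines is a pair of timed requests (a sketch; it assumes Private Google Access is enabled on the subnet, which routes traffic to Google services such as gcr.io but not to third-party registries):

# gcr.io is reachable via Private Google Access even without an external IP:
curl -m 10 -sI https://gcr.io/v2/ && echo "gcr.io reachable"

# registry-1.docker.io has no route without an external IP (or Cloud NAT), so this times out:
curl -m 10 -sI https://registry-1.docker.io/v2/ || echo "registry-1.docker.io unreachable"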

If this isn't a dsub issue, I can take this to the mailing list instead. (And if there is a quick way for me to get a flavor for "dsub issue vs. not a dsub issue", just let me know so I can self-triage.)

Thanks!

mbookman commented 3 years ago

Can you double-check the task, perhaps by pulling up the underlying API operation?

Unless I am misunderstanding the error, the fact that it is running ["docker" "pull" "bash"] indicates that it is trying to pull the bash image from Dockerhub.

Note that I just saw this same problem for the first time earlier this week, for a workflow run by Cromwell under Terra. In that case, Dockerhub was throttling the connections used to pull down the ubuntu:latest image.
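The underlying Docker behavior is that an unqualified image name is resolved against Docker Hub, so that pull never involves gcr.io at all. For illustration (the second pull uses the cloud-sdk image that appears in dsub's actions):

docker pull bash
# is shorthand for:
docker pull docker.io/library/bash:latest    # served by registry-1.docker.io

docker pull gcr.io/google.com/cloudsdktool/cloud-sdk:294.0.0-slim
# fully qualified: fetched directly from gcr.io, never touching Docker Hub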

carbocation commented 3 years ago

So, a bit interesting. Here is the full output from gcloud alpha lifesciences operations describe 4679186047174961621. It strikes me as... odd that it seems to try (and fail) to connect to registry-1.docker.io when all of the images are from gcr.io, not Dockerhub (at least the ones explicitly mentioned in the output below).


gcloud alpha lifesciences operations describe 4679186047174961621
done: true
error:
  code: 14
  message: 'Execution failed: generic::unavailable: wrapping host binaries: pulling
    image: retry budget exhausted (10 attempts): running ["docker" "pull" "bash"]:
    exit status 1 (standard error: "Error response from daemon: Get https://registry-1.docker.io/v2/:
    net/http: request canceled while waiting for connection (Client.Timeout exceeded
    while awaiting headers)\n")'
metadata:
  '@type': type.googleapis.com/google.cloud.lifesciences.v2beta.Metadata
  createTime: '2021-03-31T19:13:53.704845Z'
  endTime: '2021-03-31T19:30:38.561799866Z'
  events:
  - description: Worker released
    timestamp: '2021-03-31T19:30:38.561799866Z'
    workerReleased:
      instance: google-pipelines-worker-157f9eda0bbba01e17a66d3fd85b15fb
      zone: us-central1-f
  - description: 'Execution failed: generic::unavailable: wrapping host binaries:
      pulling image: retry budget exhausted (10 attempts): running ["docker" "pull"
      "bash"]: exit status 1 (standard error: "Error response from daemon: Get https://registry-1.docker.io/v2/:
      net/http: request canceled while waiting for connection (Client.Timeout exceeded
      while awaiting headers)\n")'
    failed:
      cause: 'Execution failed: generic::unavailable: wrapping host binaries: pulling
        image: retry budget exhausted (10 attempts): running ["docker" "pull" "bash"]:
        exit status 1 (standard error: "Error response from daemon: Get https://registry-1.docker.io/v2/:
        net/http: request canceled while waiting for connection (Client.Timeout exceeded
        while awaiting headers)\n")'
      code: UNAVAILABLE
    timestamp: '2021-03-31T19:30:37.949807010Z'
  - description: Worker "google-pipelines-worker-157f9eda0bbba01e17a66d3fd85b15fb"
      assigned in "us-central1-f" on a "custom-4-8192" machine
    timestamp: '2021-03-31T19:14:03.234988572Z'
    workerAssigned:
      instance: google-pipelines-worker-157f9eda0bbba01e17a66d3fd85b15fb
      machineType: custom-4-8192
      zone: us-central1-f
  labels:
    dsub-version: v0-4-4
    job-id: maf-filter--james--210331-151350-57
    job-name: maf-filter
    task-id: '6'
    user-id: james
  pipeline:
    actions:
    - commands:
      - -c
      - |
        set -o errexit
        set -o nounset
        set -o pipefail

        function get_datestamp() {
          date "+%Y-%m-%d %H:%M:%S"
        }

        function log_info() {
          echo "$(get_datestamp) INFO: $@"
        }

        function log_warning() {
          1>&2 echo "$(get_datestamp) WARNING: $@"
        }

        function log_error() {
          1>&2 echo "$(get_datestamp) ERROR: $@"
        }

        function gsutil_cp() {
          local src="${1}"
          local dst="${2}"
          local content_type="${3}"
          local user_project_name="${4}"

          local headers=""
          if [[ -n "${content_type}" ]]; then
            headers="-h Content-Type:${content_type}"
          fi

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local attempt
          for ((attempt = 0; attempt < 4; attempt++)); do
            log_info "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
            if gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}"; then
              return
            fi
            if (( attempt < 3 )); then
              log_warning "Sleeping 10s before the next attempt of failed gsutil command"
              log_warning "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
              sleep 10s
            fi
          done

          log_error "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
          exit 1
        }

        function log_cp() {
          local src="${1}"
          local dst="${2}"
          local tmp="${3}"
          local check_src="${4}"
          local user_project_name="${5}"

          if [[ "${check_src}" == "true" ]] && [[ ! -e "${src}" ]]; then
            return
          fi

          # Copy the log files to a local temporary location so that our "gsutil cp" is never
          # executed on a file that is changing.

          local tmp_path="${tmp}/$(basename ${src})"
          cp "${src}" "${tmp_path}"

          gsutil_cp "${tmp_path}" "${dst}" "text/plain" "${user_project_name}"
        }

        while [[ ! -d /google/logs/action/6 ]]; do
          mkdir -p /tmp/continuous_logging_action
        log_cp /google/logs/action/4/stdout "${STDOUT_PATH}" /tmp/continuous_logging_action "true" "${USER_PROJECT}" &
        STDOUT_PID=$!
        log_cp /google/logs/action/4/stderr "${STDERR_PATH}" /tmp/continuous_logging_action "true" "${USER_PROJECT}" &
        STDERR_PID=$!

        log_cp /google/logs/output "${LOGGING_PATH}" /tmp/continuous_logging_action "false" "${USER_PROJECT}" &
        LOG_PID=$!

        wait "${STDOUT_PID}"
        wait "${STDERR_PID}"
        wait "${LOG_PID}"

          sleep 60s
        done
      containerName: logging
      entrypoint: /bin/bash
      environment:
        LOGGING_PATH: gs://bucket/path/maf-filter/p-1-maf-1E-2/dsub-logs/maf-filter--james--210331-151350-57.6.log
        STDERR_PATH: gs://bucket/path/maf-filter/p-1-maf-1E-2/dsub-logs/maf-filter--james--210331-151350-57.6-stderr.log
        STDOUT_PATH: gs://bucket/path/maf-filter/p-1-maf-1E-2/dsub-logs/maf-filter--james--210331-151350-57.6-stdout.log
        USER_PROJECT: ''
      imageUri: gcr.io/google.com/cloudsdktool/cloud-sdk:294.0.0-slim
      runInBackground: true
    - commands:
      - -c
      - |+
        #!/bin/bash

        set -o errexit
        set -o nounset
        set -o pipefail

        function get_datestamp() {
          date "+%Y-%m-%d %H:%M:%S"
        }

        function log_info() {
          echo "$(get_datestamp) INFO: $@"
        }

        function log_warning() {
          1>&2 echo "$(get_datestamp) WARNING: $@"
        }

        function log_error() {
          1>&2 echo "$(get_datestamp) ERROR: $@"
        }

        mkdir -m 777 -p "/mnt/data/script" 
        mkdir -m 777 -p "/mnt/data/tmp" 
        mkdir -m 777 -p "/mnt/data/workingdir" 

        echo "${_SCRIPT_REPR}"     | python -c 'import ast
        import sys

        sys.stdout.write(ast.literal_eval(sys.stdin.read()))
        '     > "/mnt/data/script/command.sh"
        chmod a+x "/mnt/data/script/command.sh"

        for ((i=0; i < DIR_COUNT; i++)); do
          DIR_VAR="DIR_${i}"

          log_info "mkdir -m 777 -p "${!DIR_VAR}""
          mkdir -m 777 -p "${!DIR_VAR}"
        done

      containerName: prepare
      entrypoint: /bin/bash
      environment:
        DIR_0: /mnt/data/input/gs/bucket/path/prevalent_disease_af_or_nonnsr
        DIR_1: /mnt/data/output/gs/bucket/path/maf-filter/p-1-maf-1E-2
        DIR_2: /mnt/data/output/gs/bucket/path/maf-filter/p-1-maf-1E-2
        DIR_3: /mnt/data/output/gs/bucket/path/maf-filter/p-1-maf-1E-2
        DIR_COUNT: '4'
        _META_YAML_REPR: "\"create-time: 2021-03-31 15:13:50.570367-04:00\\ndsub-version:\
          \ v0-4-4\\nenvs:\\n  MAF: 1E-2\\n  PVAL: '1'\\njob-id: maf-filter--james--210331-151350-57\\\
          njob-name: maf-filter\\nlogging: gs://bucket/path/maf-filter/p-1-maf-1E-2/dsub-logs/\\\
          nscript-name: command.sh\\ntask-ids: 1-6\\ntasks:\\n- inputs:\\n    file:\
          \ gs://bucket/path/prevalent_disease_af_or_nonnsr/bolt.imputed.tsv.gz\\\
          n  logging-path: gs://bucket/path/maf-filter/p-1-maf-1E-2/dsub-logs/maf-filter--james--210331-151350-57.6.log\\\
          n  outputs:\\n    count_file: gs://bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.n_snps\\\
          n    filtered: gs://bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.tsv.gz\\\
          n    filtered_tbi: gs://bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.tsv.gz.tbi\\\
          n  task-id: '6'\\nuser-id: james\\n\""
        _SCRIPT_REPR: "\"# Must be run on an image that has bgzip and tabix installed\\\
          n\\n# Screen out erroneous SE and ChiSq values ($12>0 && $15>0)\\n# And\
          \ require that MAF and P be within your specified range\\ngunzip -c ${file}\
          \ | awk -v maf=${MAF} -v pval=${PVAL} -F $'\\\\t' 'NR == 1{print $0}; NR\
          \ > 1 && $12>0 && $15>0 && ($16 <= pval) && (($7 > maf) && (1-$7 > maf))\
          \ {print $0}' > output.file\\n\\n# For X, skip the header and append to\
          \ what we already produced\\n# Disabled for now since not produced by REGENIE\\\
          n# gunzip -c ${file_x} | awk -v maf=${MAF} -v pval=${PVAL} -F $'\\\\t' 'NR\
          \ > 1 && $12>0 && $15>0 && ($16 <= pval) && (($7 > maf) && (1-$7 > maf))\
          \ {print $0}' >> output.file\\n\\n# Count the SNPs\\necho $(( $(wc -l <\
          \ output.file) - 1)) > ${count_file}\\n\\n# BGZip\\nbgzip -c output.file\
          \ > ${filtered}\\n\\n# Tabix. This is BOLT-specific formatting\\ntabix \\\
          \\\\n    -s 2 \\\\\\n    -b 3 \\\\\\n    -e 3 \\\\\\n    -S 1 \\\\\\n  \
          \  ${filtered}\\n\""
      imageUri: gcr.io/google.com/cloudsdktool/cloud-sdk:294.0.0-slim
      mounts:
      - disk: datadisk
        path: /mnt/data
    - commands:
      - -c
      - |+
        function get_datestamp() {
          date "+%Y-%m-%d %H:%M:%S"
        }

        function log_info() {
          echo "$(get_datestamp) INFO: $@"
        }

        function log_warning() {
          1>&2 echo "$(get_datestamp) WARNING: $@"
        }

        function log_error() {
          1>&2 echo "$(get_datestamp) ERROR: $@"
        }

        function gsutil_rsync() {
          local src="${1}"
          local dst="${2}"
          local user_project_name="${3}"

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local attempt
          for ((attempt = 0; attempt < 4; attempt++)); do
            log_info "gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}""
            if gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}"; then
              return
            fi
            if (( attempt < 3 )); then
              log_warning "Sleeping 10s before the next attempt of failed gsutil command"
              log_warning "gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}""
              sleep 10s
            fi
          done

          log_error "gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}""
          exit 1
        }

        function gsutil_cp() {
          local src="${1}"
          local dst="${2}"
          local content_type="${3}"
          local user_project_name="${4}"

          local headers=""
          if [[ -n "${content_type}" ]]; then
            headers="-h Content-Type:${content_type}"
          fi

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local attempt
          for ((attempt = 0; attempt < 4; attempt++)); do
            log_info "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
            if gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}"; then
              return
            fi
            if (( attempt < 3 )); then
              log_warning "Sleeping 10s before the next attempt of failed gsutil command"
              log_warning "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
              sleep 10s
            fi
          done

          log_error "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
          exit 1
        }

        function log_cp() {
          local src="${1}"
          local dst="${2}"
          local tmp="${3}"
          local check_src="${4}"
          local user_project_name="${5}"

          if [[ "${check_src}" == "true" ]] && [[ ! -e "${src}" ]]; then
            return
          fi

          # Copy the log files to a local temporary location so that our "gsutil cp" is never
          # executed on a file that is changing.

          local tmp_path="${tmp}/$(basename ${src})"
          cp "${src}" "${tmp_path}"

          gsutil_cp "${tmp_path}" "${dst}" "text/plain" "${user_project_name}"
        }

        set -o errexit
        set -o nounset
        set -o pipefail

        for ((i=0; i < INPUT_COUNT; i++)); do
          INPUT_VAR="INPUT_${i}"
          INPUT_RECURSIVE="INPUT_RECURSIVE_${i}"
          INPUT_SRC="INPUT_SRC_${i}"
          INPUT_DST="INPUT_DST_${i}"

          log_info "Localizing ${!INPUT_VAR}"
          if [[ "${!INPUT_RECURSIVE}" -eq "1" ]]; then
            gsutil_rsync "${!INPUT_SRC}" "${!INPUT_DST}" "${USER_PROJECT}"
          else
            gsutil_cp "${!INPUT_SRC}" "${!INPUT_DST}" "" "${USER_PROJECT}"
          fi
        done

      containerName: localization
      entrypoint: /bin/bash
      environment:
        INPUT_0: file
        INPUT_COUNT: '1'
        INPUT_DST_0: /mnt/data/input/gs/bucket/path/prevalent_disease_af_or_nonnsr/bolt.imputed.tsv.gz
        INPUT_RECURSIVE_0: '0'
        INPUT_SRC_0: gs://bucket/path/prevalent_disease_af_or_nonnsr/bolt.imputed.tsv.gz
        USER_PROJECT: ''
      imageUri: gcr.io/google.com/cloudsdktool/cloud-sdk:294.0.0-slim
      mounts:
      - disk: datadisk
        path: /mnt/data
    - commands:
      - bash
      - -c
      - |
        export TMPDIR="/mnt/data/tmp"
        cd /mnt/data/workingdir

        "/mnt/data/script/command.sh"
      containerName: user-command
      entrypoint: /usr/bin/env
      environment:
        MAF: '1E-2'
        PVAL: '1'
        count_file: /mnt/data/output/gs/bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.n_snps
        file: /mnt/data/input/gs/bucket/path/prevalent_disease_af_or_nonnsr/bolt.imputed.tsv.gz
        filtered: /mnt/data/output/gs/bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.tsv.gz
        filtered_tbi: /mnt/data/output/gs/bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.tsv.gz.tbi
      imageUri: gcr.io/ukbb-analyses/ubuntu:2020-12-22
      mounts:
      - disk: datadisk
        path: /mnt/data
    - commands:
      - -c
      - |+
        function get_datestamp() {
          date "+%Y-%m-%d %H:%M:%S"
        }

        function log_info() {
          echo "$(get_datestamp) INFO: $@"
        }

        function log_warning() {
          1>&2 echo "$(get_datestamp) WARNING: $@"
        }

        function log_error() {
          1>&2 echo "$(get_datestamp) ERROR: $@"
        }

        function gsutil_rsync() {
          local src="${1}"
          local dst="${2}"
          local user_project_name="${3}"

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local attempt
          for ((attempt = 0; attempt < 4; attempt++)); do
            log_info "gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}""
            if gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}"; then
              return
            fi
            if (( attempt < 3 )); then
              log_warning "Sleeping 10s before the next attempt of failed gsutil command"
              log_warning "gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}""
              sleep 10s
            fi
          done

          log_error "gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}""
          exit 1
        }

        function gsutil_cp() {
          local src="${1}"
          local dst="${2}"
          local content_type="${3}"
          local user_project_name="${4}"

          local headers=""
          if [[ -n "${content_type}" ]]; then
            headers="-h Content-Type:${content_type}"
          fi

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local attempt
          for ((attempt = 0; attempt < 4; attempt++)); do
            log_info "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
            if gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}"; then
              return
            fi
            if (( attempt < 3 )); then
              log_warning "Sleeping 10s before the next attempt of failed gsutil command"
              log_warning "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
              sleep 10s
            fi
          done

          log_error "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
          exit 1
        }

        function log_cp() {
          local src="${1}"
          local dst="${2}"
          local tmp="${3}"
          local check_src="${4}"
          local user_project_name="${5}"

          if [[ "${check_src}" == "true" ]] && [[ ! -e "${src}" ]]; then
            return
          fi

          # Copy the log files to a local temporary location so that our "gsutil cp" is never
          # executed on a file that is changing.

          local tmp_path="${tmp}/$(basename ${src})"
          cp "${src}" "${tmp_path}"

          gsutil_cp "${tmp_path}" "${dst}" "text/plain" "${user_project_name}"
        }

        set -o errexit
        set -o nounset
        set -o pipefail

        for ((i=0; i < OUTPUT_COUNT; i++)); do
          OUTPUT_VAR="OUTPUT_${i}"
          OUTPUT_RECURSIVE="OUTPUT_RECURSIVE_${i}"
          OUTPUT_SRC="OUTPUT_SRC_${i}"
          OUTPUT_DST="OUTPUT_DST_${i}"

          log_info "Delocalizing ${!OUTPUT_VAR}"
          if [[ "${!OUTPUT_RECURSIVE}" -eq "1" ]]; then
            gsutil_rsync "${!OUTPUT_SRC}" "${!OUTPUT_DST}" "${USER_PROJECT}"
          else
            gsutil_cp "${!OUTPUT_SRC}" "${!OUTPUT_DST}" "" "${USER_PROJECT}"
          fi
        done

      containerName: delocalization
      entrypoint: /bin/bash
      environment:
        OUTPUT_0: filtered
        OUTPUT_1: filtered_tbi
        OUTPUT_2: count_file
        OUTPUT_COUNT: '3'
        OUTPUT_DST_0: gs://bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.tsv.gz
        OUTPUT_DST_1: gs://bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.tsv.gz.tbi
        OUTPUT_DST_2: gs://bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.n_snps
        OUTPUT_RECURSIVE_0: '0'
        OUTPUT_RECURSIVE_1: '0'
        OUTPUT_RECURSIVE_2: '0'
        OUTPUT_SRC_0: /mnt/data/output/gs/bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.tsv.gz
        OUTPUT_SRC_1: /mnt/data/output/gs/bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.tsv.gz.tbi
        OUTPUT_SRC_2: /mnt/data/output/gs/bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.n_snps
        USER_PROJECT: ''
      imageUri: gcr.io/google.com/cloudsdktool/cloud-sdk:294.0.0-slim
      mounts:
      - disk: datadisk
        path: /mnt/data
    - alwaysRun: true
      commands:
      - -c
      - |+
        set -o errexit
        set -o nounset
        set -o pipefail

        function get_datestamp() {
          date "+%Y-%m-%d %H:%M:%S"
        }

        function log_info() {
          echo "$(get_datestamp) INFO: $@"
        }

        function log_warning() {
          1>&2 echo "$(get_datestamp) WARNING: $@"
        }

        function log_error() {
          1>&2 echo "$(get_datestamp) ERROR: $@"
        }

        function gsutil_cp() {
          local src="${1}"
          local dst="${2}"
          local content_type="${3}"
          local user_project_name="${4}"

          local headers=""
          if [[ -n "${content_type}" ]]; then
            headers="-h Content-Type:${content_type}"
          fi

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local attempt
          for ((attempt = 0; attempt < 4; attempt++)); do
            log_info "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
            if gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}"; then
              return
            fi
            if (( attempt < 3 )); then
              log_warning "Sleeping 10s before the next attempt of failed gsutil command"
              log_warning "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
              sleep 10s
            fi
          done

          log_error "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
          exit 1
        }

        function log_cp() {
          local src="${1}"
          local dst="${2}"
          local tmp="${3}"
          local check_src="${4}"
          local user_project_name="${5}"

          if [[ "${check_src}" == "true" ]] && [[ ! -e "${src}" ]]; then
            return
          fi

          # Copy the log files to a local temporary location so that our "gsutil cp" is never
          # executed on a file that is changing.

          local tmp_path="${tmp}/$(basename ${src})"
          cp "${src}" "${tmp_path}"

          gsutil_cp "${tmp_path}" "${dst}" "text/plain" "${user_project_name}"
        }

        mkdir -p /tmp/logging_action
        log_cp /google/logs/action/4/stdout "${STDOUT_PATH}" /tmp/logging_action "true" "${USER_PROJECT}" &
        STDOUT_PID=$!
        log_cp /google/logs/action/4/stderr "${STDERR_PATH}" /tmp/logging_action "true" "${USER_PROJECT}" &
        STDERR_PID=$!

        log_cp /google/logs/output "${LOGGING_PATH}" /tmp/logging_action "false" "${USER_PROJECT}" &
        LOG_PID=$!

        wait "${STDOUT_PID}"
        wait "${STDERR_PID}"
        wait "${LOG_PID}"

      containerName: final_logging
      entrypoint: /bin/bash
      environment:
        LOGGING_PATH: gs://bucket/path/maf-filter/p-1-maf-1E-2/dsub-logs/maf-filter--james--210331-151350-57.6.log
        STDERR_PATH: gs://bucket/path/maf-filter/p-1-maf-1E-2/dsub-logs/maf-filter--james--210331-151350-57.6-stderr.log
        STDOUT_PATH: gs://bucket/path/maf-filter/p-1-maf-1E-2/dsub-logs/maf-filter--james--210331-151350-57.6-stdout.log
        USER_PROJECT: ''
      imageUri: gcr.io/google.com/cloudsdktool/cloud-sdk:294.0.0-slim
    resources:
      regions:
      - us-central1
      - us-east1
      - us-west1
      virtualMachine:
        bootDiskSizeGb: 10
        bootImage: projects/cloud-lifesciences/global/images/w20210322-0300-rc0060586ad3-0000-2431-9437-089e0828bff8stable
        disks:
        - name: datadisk
          sizeGb: 50
          type: pd-ssd
        labels:
          dsub-version: v0-4-4
          goog-pipelines-worker: 'true'
          job-id: maf-filter--james--210331-151350-57
          job-name: maf-filter
          task-id: '6'
          user-id: james
        machineType: custom-4-8192
        network:
          usePrivateAddress: true
        nvidiaDriverVersion: 450.51.06
        serviceAccount:
          email: default
          scopes:
          - https://www.googleapis.com/auth/bigquery
          - https://www.googleapis.com/auth/compute
          - https://www.googleapis.com/auth/devstorage.full_control
          - https://www.googleapis.com/auth/genomics
          - https://www.googleapis.com/auth/logging.write
          - https://www.googleapis.com/auth/monitoring.write
          - https://www.googleapis.com/auth/cloud-platform
        volumes:
        - persistentDisk:
            sizeGb: 50
            type: pd-ssd
          volume: datadisk
    timeout: 604800s
  startTime: '2021-03-31T19:14:03.234988572Z'
name: projects/783282864357/locations/us-central1/operations/4679186047174961621
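
As an aside, the error alone can be pulled out of an operation, without the full dump, using gcloud's standard --format projection:

gcloud alpha lifesciences operations describe 4679186047174961621 \
    --format='value(error.message)'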
mbookman commented 3 years ago

Hi James,

I agree. The error message does not make sense in the context of that operation. All actions use a gcr.io imageUri, either:

- gcr.io/google.com/cloudsdktool/cloud-sdk:294.0.0-slim
- gcr.io/ukbb-analyses/ubuntu:2020-12-22

And yet it would seem that the Pipelines API is trying to pull the bash Docker image from Dockerhub. I have filed a bug with the HCLS team.

carbocation commented 3 years ago

Thanks!

mbookman commented 3 years ago

I heard back from the Pipelines team that they are aware of the problem and are looking to remove the dependency (the bash image pull from Dockerhub). No specific timeline yet.

Do you have a sense of how frequently you hit this?

carbocation commented 3 years ago

Hard to say, though it seems to be becoming more frequent. For example, I'm hitting it right now (well, I'm not waiting out all the retries, but machines seem to be stuck in the state "VM starting (awaiting worker checkin)" while trying to run a 'hello world' program on a 4-core machine). "VM starting (awaiting worker checkin)" seems to be the state that precedes the retry failure.
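The stuck tasks can be watched with dstat while the job is pending; a sketch (the project is a placeholder, while the job ID is taken from the operation above):

dstat \
    --provider google-cls-v2 \
    --project my-project \
    --jobs 'maf-filter--james--210331-151350-57' \
    --status '*' \
    --full
# with --full, each task's status-detail field shows strings like
# "VM starting (awaiting worker checkin)" before the retry-budget failure lands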

carbocation commented 3 years ago

For what it's worth, I am not seeing a difference between the google-v2 and google-cls-v2 providers. I also don't see any errors on the GCP status page to indicate that there is some global problem. I've tried E2 and N1 machine types with similar results. Currently I don't seem to be able to launch jobs at all unless I remove the --use-private-address flag, though a couple of days ago I eventually was able to. These aren't systematic observations, just anecdotes, but I'm sending them along in case they make something 'click' for someone else.
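For reference, the shape of launch in question, with the flag that seems to matter (a sketch; the project, bucket, image, and file names are placeholders, while the regions and machine type match the operation above):

dsub \
    --provider google-cls-v2 \
    --project my-project \
    --regions us-central1 us-east1 us-west1 \
    --machine-type custom-4-8192 \
    --logging gs://my-bucket/dsub-logs/ \
    --image gcr.io/my-project/my-image:latest \
    --use-private-address \
    --tasks tasks.tsv \
    --script command.sh

# dropping --use-private-address gives each VM an external IP, which restores
# the route to registry-1.docker.io and, per the above, works around the failure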

pbilling commented 3 years ago

I am also consistently experiencing this issue running jobs with internal IPs, pulling from gcr.io. I just launched a batch of 20 jobs and all failed with this same error. Thanks to @carbocation for raising the issue.

wnojopra commented 3 years ago

The latest message from the Pipelines team is that they've heard about this very same issue from a few people, and are working to have it fixed as soon as possible.

wnojopra commented 3 years ago

From the Pipelines team: a fix is rolling out. us-central1 and europe-west2 are fixed, and rollouts to the remaining locations are in progress.

cblacknygc commented 3 years ago

We were hitting this issue through Cromwell in us-central1 and can confirm we haven't seen it since Google told us there was a fix on April 9th.

mbookman commented 3 years ago

Going to close this as the Google team has fixed the issue.