broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition between one off use cases to massive scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

GCPBATCH: accessing private gcr.io docker for callcaching raises error: unauthorized #7356

Closed Lipastomies closed 4 months ago

Lipastomies commented 10 months ago

Hi!

We've been looking into migrating from the PAPIv2 backend to the GCPBATCH backend. Call caching fails on GCPBATCH but not on PAPIv2 when a task uses a private Docker image in gcr.io. Is this a missing feature or a bug? The documentation on the subject could go either way, depending on whether GCPBATCH counts as one of the other backends or as a subset of the Pipelines backend (https://cromwell.readthedocs.io/en/latest/cromwell_features/CallCaching/). I do not think this is a configuration error, since the same config works with the PAPIv2 backend, but if it is, which configuration options are needed to set up gcr.io authentication when using GCPBATCH?

Errors from the Cromwell logs when the task is checked for call caching:

cromwell_1  | 2024-01-11 11:09:38 pool-9-thread-9 INFO  - Manifest request failed for docker manifest V2, falling back to OCI manifest. Image: DockerImageIdentifierWithoutHash(Some(eu.gcr.io),Some(project),image_name,tag)
cromwell_1  | cromwell.docker.registryv2.DockerRegistryV2Abstract$Unauthorized: 401 Unauthorized {"errors":[{"code":"UNAUTHORIZED","message":"You don't have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication"}]}
cromwell_1  |   at cromwell.docker.registryv2.DockerRegistryV2Abstract.$anonfun$getDigestFromResponse$1(DockerRegistryV2Abstract.scala:321)
cromwell_1  |   at map @ fs2.internal.CompileScope.$anonfun$close$9(CompileScope.scala:246)
cromwell_1  |   at flatMap @ fs2.internal.CompileScope.$anonfun$close$6(CompileScope.scala:245)
cromwell_1  |   at map @ fs2.internal.CompileScope.fs2$internal$CompileScope$$traverseError(CompileScope.scala:222)
cromwell_1  |   at flatMap @ fs2.internal.CompileScope.$anonfun$close$4(CompileScope.scala:244)
cromwell_1  |   at map @ fs2.internal.CompileScope.fs2$internal$CompileScope$$traverseError(CompileScope.scala:222)
cromwell_1  |   at flatMap @ fs2.internal.CompileScope.$anonfun$close$2(CompileScope.scala:242)
cromwell_1  |   at flatMap @ fs2.internal.CompileScope.close(CompileScope.scala:241)
cromwell_1  |   at unsafeRunAsyncAndForget @ cromwell.docker.DockerInfoActor.$anonfun$startAndRegisterStream$2(DockerInfoActor.scala:163)
cromwell_1  |   at flatMap @ fs2.internal.CompileScope.$anonfun$openAncestor$2(CompileScope.scala:261)
cromwell_1  |   at flatMap @ fs2.internal.FreeC$.$anonfun$compile$17(Algebra.scala:545)
cromwell_1  |   at map @ fs2.internal.CompileScope.$anonfun$close$9(CompileScope.scala:246)
cromwell_1  |   at flatMap @ fs2.internal.CompileScope.$anonfun$close$6(CompileScope.scala:245)
cromwell_1  |   at map @ fs2.internal.CompileScope.fs2$internal$CompileScope$$traverseError(CompileScope.scala:222)
cromwell_1  |   at flatMap @ fs2.internal.CompileScope.$anonfun$close$4(CompileScope.scala:244)
cromwell_1  |   at map @ fs2.internal.CompileScope.fs2$internal$CompileScope$$traverseError(CompileScope.scala:222)
cromwell_1  | 2024-01-11 11:09:38 cromwell-system-akka.dispatchers.engine-dispatcher-33 WARN  - BackendPreparationActor_for_0845428a:myworkflow.mytask:-1:1 [UUID(0845428a)]: Docker lookup failed
cromwell_1  | java.lang.Exception: Unauthorized to get docker hash eu.gcr.io/project/image_name:tag
cromwell_1  |   at cromwell.engine.workflow.WorkflowDockerLookupActor.cromwell$engine$workflow$WorkflowDockerLookupActor$$handleLookupFailure(WorkflowDockerLookupActor.scala:279)
cromwell_1  |   at cromwell.engine.workflow.WorkflowDockerLookupActor$$anonfun$3.applyOrElse(WorkflowDockerLookupActor.scala:93)
cromwell_1  |   at cromwell.engine.workflow.WorkflowDockerLookupActor$$anonfun$3.applyOrElse(WorkflowDockerLookupActor.scala:78)
cromwell_1  |   at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:35)
cromwell_1  |   at akka.actor.FSM.processEvent(FSM.scala:707)
cromwell_1  |   at akka.actor.FSM.processEvent$(FSM.scala:704)
cromwell_1  |   at cromwell.engine.workflow.WorkflowDockerLookupActor.akka$actor$LoggingFSM$$super$processEvent(WorkflowDockerLookupActor.scala:45)
cromwell_1  |   at akka.actor.LoggingFSM.processEvent(FSM.scala:847)
cromwell_1  |   at akka.actor.LoggingFSM.processEvent$(FSM.scala:829)
cromwell_1  |   at cromwell.engine.workflow.WorkflowDockerLookupActor.processEvent(WorkflowDockerLookupActor.scala:45)
cromwell_1  |   at akka.actor.FSM.akka$actor$FSM$$processMsg(FSM.scala:701)
cromwell_1  |   at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:695)
cromwell_1  |   at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:35)
cromwell_1  |   at cromwell.docker.DockerClientHelper$$anonfun$dockerResponseReceive$1.applyOrElse(DockerClientHelper.scala:16)
cromwell_1  |   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:269)
cromwell_1  |   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:270)
cromwell_1  |   at akka.actor.Actor.aroundReceive(Actor.scala:539)
cromwell_1  |   at akka.actor.Actor.aroundReceive$(Actor.scala:537)
cromwell_1  |   at cromwell.engine.workflow.WorkflowDockerLookupActor.aroundReceive(WorkflowDockerLookupActor.scala:45)
cromwell_1  |   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:614)
cromwell_1  |   at akka.actor.ActorCell.invoke(ActorCell.scala:583)
cromwell_1  |   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268)
cromwell_1  |   at akka.dispatch.Mailbox.run(Mailbox.scala:229)
cromwell_1  |   at akka.dispatch.Mailbox.exec(Mailbox.scala:241)
cromwell_1  |   at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
cromwell_1  |   at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
cromwell_1  |   at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
cromwell_1  |   at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
cromwell_1  | 
cromwell_1  | 2024-01-11 11:09:38 cromwell-system-akka.dispatchers.engine-dispatcher-38 INFO  - BT-322 0845428a:myworkflow.mytask:-1:1 is not eligible for call caching

Used backend: GCPBATCH. Callcaching works with PAPIv2, not on GCPBATCH.
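
The 401 in the log comes from a Docker Registry HTTP API v2 manifest request, which is how an image tag gets resolved to a digest. Below is a minimal sketch for building that kind of request outside Cromwell, so the same lookup can be tried with and without credentials. The function names `parse_image` and `build_manifest_request` are hypothetical, and it is an assumption here that the registry accepts a raw OAuth2 access token as the bearer token; this is not Cromwell's own implementation.

```python
import urllib.request

# Media type of a Docker image manifest, version 2 schema 2.
MANIFEST_V2 = "application/vnd.docker.distribution.manifest.v2+json"


def parse_image(reference):
    """Split 'host/repo/path:tag' into (host, repository, tag).

    Assumes the reference names the registry host explicitly and
    uses a tag rather than a digest. A missing tag becomes 'latest'.
    """
    host, _, remainder = reference.partition("/")
    repository, sep, tag = remainder.rpartition(":")
    if not sep:
        repository, tag = remainder, "latest"
    return host, repository, tag


def build_manifest_request(reference, access_token=None):
    """Build (but do not send) a HEAD request for the image manifest."""
    host, repository, tag = parse_image(reference)
    url = f"https://{host}/v2/{repository}/manifests/{tag}"
    headers = {"Accept": MANIFEST_V2}
    if access_token:
        # Assumption: the token is an OAuth2 access token for an
        # account with read access to the repository.
        headers["Authorization"] = f"Bearer {access_token}"
    return urllib.request.Request(url, headers=headers, method="HEAD")
```

Sending the request with `urllib.request.urlopen(build_manifest_request("eu.gcr.io/project/image_name:tag"))` against a private image would be expected to raise an `HTTPError` with status 401 when no token is passed. With a valid token, a registry that follows the v2 API returns the digest in the `Docker-Content-Digest` response header.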

Workflow used for testing:

workflow myworkflow {
    call mytask
}

task mytask {
    String str = "!"
    command <<<
        echo "hello world ${str}"
    >>>
    output {
        String out = read_string(stdout())
    }

    runtime{
        docker: "eu.gcr.io/project/image_name:tag"
        cpu: "1"
        memory: "500 MB"
        disks: "local-disk 5 HDD"
        zones: "europe-west1-b europe-west1-c europe-west1-d"
        preemptible: 2
        noAddress: true
    }
}

We are running Cromwell from the broadinstitute/cromwell:87-ecd44b6 image. Cromwell configuration:

include required(classpath("application"))

system.new-workflow-poll-rate=1

// increase timeout for http requests..... getting meta-data can timeout for large workflows.
akka.http.server.request-timeout=600s

# Maximum number of input file bytes allowed in order to read each type.
# If exceeded a FileSizeTooBig exception will be thrown.
system {
    job-rate-control {
        jobs = 100
        per = 1 second
    }
  input-read-limits {
      lines = 128000000
      bool = 7
      int = 19
      float = 50
      string = 1280000
      json = 12800000
      tsv = 1280000000
      map = 128000000
      object = 128000000
  }

 # If 'true', a SIGTERM or SIGINT will trigger Cromwell to attempt to gracefully shutdown in server mode,
  # in particular clearing up all queued database writes before letting the JVM shut down.
  # The shutdown is a multi-phase process, each phase having its own configurable timeout. See the Dev Wiki for more details.
    graceful-server-shutdown = true
  max-concurrent-workflows = 5000

  io {
      throttle {
      # # Global Throttling - This is mostly useful for GCS and can be adjusted to match
      # # the quota available on the GCS API
      number-of-requests = 100000
      per = 100 seconds
      }
  }
}

akka {
  # Optionally set / override any akka settings
  http {
    server {
      # Increasing these timeouts allow rest api responses for very large jobs
      # to be returned to the user. When the timeout is reached the server would respond
      # `The server was not able to produce a timely response to your request.`
      # https://gatkforums.broadinstitute.org/wdl/discussion/10209/retrieving-metadata-for-large-workflows
       request-timeout = 600s
       idle-timeout = 600s
    }
  }
}

services {
  MetadataService {
    #class = "cromwell.services.metadata.impl.MetadataServiceActor"
    config {
      metadata-read-row-number-safety-threshold = 2000000
    #   #   For normal usage the default value of 200 should be fine but for larger/production environments we recommend a
    #   #   value of at least 500. There'll be no one size fits all number here so we recommend benchmarking performance and
    #   #   tuning the value to match your environment.
        db-batch-size = 700
    }
  }
}

google {

  application-name = "cromwell"
  auths = [
    {
      name = "application-default"
      scheme = "application_default"
    }
  ]
}

docker {
    hash-lookup {
    method = "remote"
    }
}

engine {
  filesystems {
     gcs {
       auth = "application-default"
     }
  }
}

call-caching {
  enabled = true
}

backend {
  default = GCPBATCH
  providers {
    GCPBATCH {
    // life sciences
      actor-factory = "cromwell.backend.google.batch.GcpBatchBackendLifecycleActorFactory"
      config {
        ## Google project
        project = "$PROJECT"

        ## Base bucket for workflow executions
        root = "$BUCKET"
        name-for-call-caching-purposes: PAPI
        #60000/min in google
        ##genomics-api-queries-per-100-seconds = 90000
        virtual-private-cloud {
            network-name = "$NET"
            subnetwork-name = "$SUBNET"
        }
        // Polling for completion backs-off gradually for slower-running jobs.
        // This is the maximum polling interval (in seconds):
        maximum-polling-interval = 600
        request-workers = 4
        batch-timeout = 7 days
        # Emit a warning if jobs last longer than this amount of time. This might indicate that something got stuck in PAPI.
        slow-job-warning-time: 24 hours
              genomics {
                // A reference to an auth defined in the `google` stanza at the top.  This auth is used to create
                // Pipelines and manipulate auth JSONs.
                auth = "application-default"
                compute-service-account = "default"
                # Restrict access to VM metadata. Useful in cases when untrusted containers are running under a service
                # account not owned by the submitting user
                restrict-metadata-access = false
                ## Location
                location = "europe-west1"

              }

        filesystems {
              gcs {
                // A reference to a potentially different auth for manipulating files via engine functions.
                auth = "application-default"
                project = "$PROJECT"
                caching {
                  # When a cache hit is found, the following duplication strategy will be followed to use the cached outputs
                  # Possible values: "copy", "reference". Defaults to "copy"
                  # "copy": Copy the output files
                  # "reference": DO NOT copy the output files but point to the original output files instead.
                  #              Will still make sure than all the original output files exist and are accessible before
                  #              going forward with the cache hit.
                    duplication-strategy = "reference"
                }
              }
         }

        default-runtime-attributes {
         cpu: 1
         failOnStderr: false
         continueOnReturnCode: 0
         memory: "2 GB"
         bootDiskSizeGb: 10
         # Allowed to be a String, or a list of Strings
         disks: "local-disk 10 HDD"
         noAddress: false
         preemptible: 1
         zones: ["europe-west1-b"]
        }
    }
  }
 }
}

database {
  ...
}

aednichols commented 5 months ago

I believe that in Life Sciences and its predecessors, pull access to private GCR images was granted by the credentials on the job VM. Since Batch is a much larger step change, it could be that this behavior no longer holds true.

@Lipastomies what steps do you take to configure your system to use those private images?

Lipastomies commented 5 months ago

Hi, we have Cromwell running in docker on a GCP VM, and the service account of the GCP VM has access to the image registry. I don't think we are doing anything else to gain access to the private registry.
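
One way to check what the VM's service account can do is to fetch its access token from the Compute Engine metadata server and try it as a bearer token against the registry. The sketch below only builds the metadata request and parses the response. The helper names are hypothetical, and it assumes the code runs on the GCP VM (or in a container on it that can reach the metadata server).

```python
import json
import urllib.request

# Compute Engine metadata endpoint that issues an OAuth2 access token
# for the VM's default service account.
METADATA_TOKEN_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/token"
)


def build_token_request():
    """Build (but do not send) the metadata server token request."""
    # The metadata server rejects requests without this header.
    return urllib.request.Request(
        METADATA_TOKEN_URL, headers={"Metadata-Flavor": "Google"}
    )


def extract_access_token(body):
    """Pull the access token out of the metadata server's JSON response."""
    return json.loads(body)["access_token"]
```

On the VM, `extract_access_token(urllib.request.urlopen(build_token_request()).read())` should return a token for the attached service account. Whether Cromwell's application-default auth resolves to the same account inside the container is an assumption worth verifying.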

aednichols commented 5 months ago

Yeah I gotcha, I meant the VM the job actually runs on, which is what actually pulls the image.

Lipastomies commented 5 months ago

I don't think we do anything other than give the service account the required permissions. The job VMs have been able to pull the images fine; that hasn't been a problem when running on GCPBATCH.

aednichols commented 5 months ago

I see, yes, for some reason my brain dropped the part about it only being an issue for Call Caching. I'll brainstorm a bit on why this could be.