broadinstitute / cromwell

Scientific workflow engine designed for simplicity and scalability. Trivially transition between one-off use cases and massive-scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

while pulling docker from GCS: Bucket is a requester pays bucket but no user project provided #6235

Open freeseek opened 3 years ago

freeseek commented 3 years ago

While trying to pull a Docker image with Cromwell 58, I get the following error:

"message": "Task xxx.xxx:0:1 failed. The job was stopped before the command finished. PAPI error code 2. Execution failed: generic::unknown: pulling image: docker pull: running [\"docker\" \"pull\" \"us.gcr.io/xxx/xxx@sha256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\"]: exit status 1 (standard error: \"error pulling image configuration: error parsing HTTP 400 response body: invalid character '<' looking for beginning of value: \\\"<?xml version='1.0' encoding='UTF-8'?><Error><Code>UserProjectMissing</Code><Message>Bucket is a requester pays bucket but no user project provided.</Message><Details>Bucket is Requester Pays bucket but no billing project id provided for non-owner.</Details></Error>\\\"\\n\")",

I understand that the issue is that the Google bucket where the Docker image is located is Requester Pays and Cromwell does not know what to do in this case, but it is not immediately clear what I should do to fix it. It would be a great improvement if Cromwell could interpret this response and provide a more informative error message, so that the user immediately knows what needs to be addressed.

In particular, I am not fully sure what I should be doing. These are excerpts from my configuration file:

...
engine {
  filesystems {
    gcs {
      auth = "service-account"
      project = "xxx"
    }
  }
}
...
services {
  MetadataService {
    ...
    config {
      carbonite-metadata-service {
        filesystems {
          gcs {
            auth = "service-account"
          }
        }
        ...
      }
    }
  }
}
...
backend {
  default = PAPIv2

  providers {
    PAPIv2 {
      actor-factory = "cromwell.backend.google.pipelines.v2beta.PipelinesApiLifecycleActorFactory"
      config {
        project = "xxx"
        ...
        filesystems {
          gcs {
            auth = "service-account"
            project = "xxx"
            ...
          }
        }
      ...
      }
    }
  }
}
...

Where in the configuration should I tell Cromwell which project to bill when pulling Docker images?

I also do not understand why this issue arises at all, as the Google bucket with the Docker images is a us multi-region bucket and the computation runs in us-central1, so there should be no egress costs when pulling the image and therefore no need for a billing project.
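For what it's worth, one way to confirm the Requester Pays diagnosis from outside Cromwell is with gsutil, which can name a billing project explicitly via its top-level `-u` flag (the bucket and project names below are placeholders, not from my actual setup):

```shell
# Check whether a bucket has Requester Pays enabled
# (reports Enabled or Disabled for the bucket).
gsutil requesterpays get gs://some-requester-pays-bucket

# Listing without a billing project fails on an RP bucket you don't own;
# -u supplies the project to bill for the request, so this succeeds.
gsutil -u my-billing-project ls gs://some-requester-pays-bucket
```

If the second command works while a plain `gsutil ls` returns the same `UserProjectMissing` error as above, that confirms the bucket side of the problem; the open question is how to get Cromwell's Docker pull to do the equivalent.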

Clearly I am not understanding this problem entirely. I would be grateful for a clarification. Thank you!

aednichols commented 3 years ago

I've never heard of pulling a Docker from a bucket. I don't know if we support this.

freeseek commented 3 years ago

When you have a Docker image such as us.gcr.io/broad-gatk/gatk:4.2.0.0, while it lives in a "Google registry", there is still a special bucket associated with it (and you can set Requester Pays permissions on that bucket). My scenario is like this: I have Docker images living in the us.gcr.io/mccarroll-mocha Google registry.
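Concretely, GCR stores image layers in a Cloud Storage bucket named `<region>.artifacts.<project>.appspot.com`, and the Requester Pays flag is toggled on that bucket rather than on the registry itself. A sketch (assuming the standard GCR bucket naming; the project here is mine but the commands are illustrative):

```shell
# The backing bucket for us.gcr.io/mccarroll-mocha images:
gsutil ls gs://us.artifacts.mccarroll-mocha.appspot.com

# Requester Pays is a property of this bucket, not of the registry:
gsutil requesterpays set on gs://us.artifacts.mccarroll-mocha.appspot.com
```

This is why the pull error comes back as a Cloud Storage XML error rather than a registry error: `docker pull` ends up fetching layers from that bucket.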

cjllanwarne commented 3 years ago

@freeseek when you set that bucket setting, are you able to pull the image from anywhere? The advice I got from Google suggests that requester pays on images is not supported and not on their roadmap. It sounds like maybe by setting your GCR bucket to RP, you've found an unsupported loophole in how GCR works behind the scenes?

freeseek commented 3 years ago

I honestly have not pulled Docker images much outside of Cromwell, other than on my laptop for minimal testing. If I try to pull an image manually I do get the same error, as you suggested, even when the Google VM and the GCR bucket are on the same Google Cloud network. Isn't this bad design on Google's part, though? How do I make my Docker images available for my WDLs and on Terra while at the same time preventing actors running the same WDLs from Google Cloud regions on other continents from forcing me to incur egress charges? I must be missing something.

I see two possible alternative partial solutions for this issue:

(i) Is there a way to write a WDL so that it automatically detects whether it should use us.gcr.io, eu.gcr.io, or asia.gcr.io, and selects the one that is closest (and free)? I suppose not, as this would be outside the WDL specification. Curious what you think, though.

(ii) Is there a way to prevent Cromwell running with PAPIv2 from pulling a Docker image multiple times? I wrote WDLs that run on large cohorts (biobank size) and they can scatter task arrays with ~1,000 shards. If this resulted in pulling an image once, absorbing the cost would likely still be scalable, but as it stands it is very inefficient, and when egress costs are involved the cost of running the WDL is almost dominated by pulling images. [Notice also that someone from the VA ran my WDL but I think that, since the computation was performed on an LSF HPC cluster, the image was pulled only once and then reused within the cluster, as I did not notice any significant egress costs when this happened.]
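Regarding (i): outside of WDL itself, the registry selection could in principle be done on the VM, since GCE exposes the instance zone via its metadata server. A minimal sketch (the image name and the coarse zone-to-registry mapping are my assumptions, not anything Cromwell does today):

```shell
#!/bin/sh
# Map a GCE zone string to the nearest gcr.io mirror.
# The mapping is a rough guess at continent boundaries.
zone_to_registry() {
  # metadata returns e.g. "projects/12345/zones/us-central1-a";
  # strip everything up to the last "/" before matching.
  case "${1##*/}" in
    us-*|northamerica-*|southamerica-*) echo "us.gcr.io" ;;
    europe-*)                           echo "eu.gcr.io" ;;
    asia-*|australia-*)                 echo "asia.gcr.io" ;;
    *)                                  echo "gcr.io" ;;
  esac
}

# On a GCE VM the zone would come from the metadata server
# (hypothetical usage; only works inside GCE):
# ZONE=$(curl -s -H "Metadata-Flavor: Google" \
#   http://metadata.google.internal/computeMetadata/v1/instance/zone)
# docker pull "$(zone_to_registry "$ZONE")/mccarroll-mocha/my-image:latest"

zone_to_registry "projects/12345/zones/us-central1-a"  # prints "us.gcr.io"
```

Of course this only helps if the image has already been mirrored into each regional registry, which is the other half of the problem.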

@cjllanwarne thank you for reaching out to Google. I hope this spurs a broader discussion. I am not in urgent need for a fix, but I very much hope a solution is available in the long term.

cwhelan commented 3 years ago

Has there been any further discussion about this issue? Our team was also recently hit by a large egress charge for inter-continent Docker image pulls by Cromwell -- we'd really like to be able to set our image repositories to requester-pays to prevent that.

Having Cromwell/PAPI cache images would also really help to mitigate the problem -- similarly to @freeseek our workflow is structured to scatter some steps quite widely, so one relatively small workflow run can currently result in hundreds of docker pulls of the same image.

aednichols commented 3 years ago

Based on Chris's comment, it is up to Google to implement and they do not currently intend to.

I would suggest reaching out to them to advocate for the feature.

In the meantime, it seems like either (1) making your images private or (2) replicating images across regions could prevent cost incidents such as the one you describe.
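For option (2), replicating an image into each regional registry can be done with plain Docker commands; a sketch, assuming a hypothetical image name and that you have push access to each registry:

```shell
# Mirror one image tag from the US registry into the EU and Asia
# registries, so pulls in those regions stay intra-continent.
SRC="us.gcr.io/my-project/my-image:1.0.0"

docker pull "$SRC"
for REGION in eu asia; do
  DST="${REGION}.gcr.io/my-project/my-image:1.0.0"
  docker tag "$SRC" "$DST"
  docker push "$DST"
done
```

This pays the cross-region transfer once at publish time instead of on every task's pull, but it does nothing to make workflows pick the right mirror automatically.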

freeseek commented 3 years ago

@aednichols I agree with your point regarding Google. However, I feel like there is a huge conflict of interest here: how can Google motivate itself to fix something that could potentially allow them to make a lot of money? How does Google suggest users should fix this problem? It seems a huge financial risk to include docker images in us.gcr.io, eu.gcr.io, and asia.gcr.io as the corresponding buckets need to be public and cannot be set as Requester Pays, so anybody can download them at will. Do you have advice for how to best reach out to them to advocate for this?

Replicating images across regions is currently not very sustainable as it would rely on users' good will and understanding of this complicated problem, as Cromwell does not have a framework to automatically understand within a workflow which docker images it should pull.

If Google does not get their act together, I suppose the Cromwell team will ultimately have to come to terms with the fact that the us.gcr.io, eu.gcr.io, and asia.gcr.io repository solutions are not sustainable, and an alternative will need to be engineered and provided to those writing WDL pipelines. I am not sure what the easiest solution would be, though.

Cromwell currently has some framework for dealing differentially with Files with optional localization when a WDL is run on Google Cloud. Could something be added to Cromwell to let the WDL know in which Google Cloud region its tasks are running, so that at least the best repository could be selected automatically?

cwhelan commented 3 years ago

We can't really make our images private because we want our workflows to be publicly accessible, especially for Terra users.

We can make mirrors of our GCR image repositories across regions -- hopefully that will eliminate this type of event for the most part. But we'll still be dependent on our users to use the right mirrors (as @freeseek just noted above).

aednichols commented 3 years ago

Yeah, I think you're right @freeseek. I suggest filing a new issue, because we would have to think of a different solution than the one discussed here. This issue's title and thrust are "Cromwell should do GCR requester pays", not "Cromwell should provide a solution to problem X".

cwhelan commented 3 years ago

@aednichols I created https://github.com/broadinstitute/cromwell/issues/6442 to continue this discussion and track a solution.