broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition between one off use cases to massive scale production environments
http://cromwell.readthedocs.io/

GCPBATCH issue with private VPC network #7500

Open yihming opened 3 weeks ago

yihming commented 3 weeks ago

Hello,

I'm working on getting our Cromwell server to work with GCP Batch while running in our private VPC network.

However, after following this tutorial, I encounter the following error:

com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: network field is invalid. network: projects/${project_id}/global/networks/${network_id}/ is not matching the expected format: global/networks/([a-z]([-a-z0-9]*[a-z0-9])?)$
    at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:92)
    at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:41)
    at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:86)
    at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66)
    at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
    at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:84)
    at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1133)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1277)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
    at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:574)
    at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:544)
    at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
    at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
    at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
    at com.google.api.gax.grpc.ChannelPool$ReleasingClientCall$1.onClose(ChannelPool.java:541)
    at io.grpc.internal.DelayedClientCall$DelayedListener$3.run(DelayedClientCall.java:489)
    at io.grpc.internal.DelayedClientCall$DelayedListener.delayOrExecute(DelayedClientCall.java:453)
    at io.grpc.internal.DelayedClientCall$DelayedListener.onClose(DelayedClientCall.java:486)
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:576)
    at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:757)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:736)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
    Suppressed: com.google.api.gax.rpc.AsyncTaskException: Asynchronous task failed
        at com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:57)
        at com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112)
        at cromwell.backend.google.batch.api.GcpBatchApiRequestHandler.$anonfun$submit$1(GcpBatchApiRequestHandler.scala:11)
        at cromwell.backend.google.batch.api.GcpBatchApiRequestHandler.withClient(GcpBatchApiRequestHandler.scala:29)
        at cromwell.backend.google.batch.api.GcpBatchApiRequestHandler.submit(GcpBatchApiRequestHandler.scala:9)
        at cromwell.backend.google.batch.actors.GcpBatchBackendSingletonActor$$anonfun$normalReceive$1.$anonfun$applyOrElse$1(GcpBatchBackendSingletonActor.scala:65)
        at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:678)
        at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:467)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

It seems that Cromwell only accepts public VPC networks whose names start with global/networks/..., while my actual network name was automatically prefixed with projects/${projectId}/global/networks/ (as shown in line 1 of the error message above).

I wonder whether this is because something is wrong in my conf file, or because I missed some setup on the GCP Batch side. Thanks!

I'm using Cromwell v87. And my conf file is

...
backend {
    ...
    providers {
        GCPBATCH {
            actor-factory = "cromwell.backend.google.batch.GcpBatchBackendLifecycleActorFactory"
            config {
                ...
                virtual-private-cloud {
                    network-label-key = "my-private-network"
                    subnetwork-label-key = "my-private-subnetwork"
                    auth = "application-default"
                }
                ...
            }
        }
    }
}

where my-private-network and my-private-subnetwork are GCP project labels.

dspeck1 commented 3 weeks ago

Hi @yihming - thanks for providing the detail and log message. Please try removing the trailing / from the network URL, i.e. use projects/gred-cumulus-sb-01-991a49c4/global/networks/vpc-cumulus-sb-01 instead.

yihming commented 3 weeks ago

Hi @dspeck1 ,

Thank you for your immediate help!

I checked the my-private-network and my-private-subnetwork labels in my project (by running gcloud projects describe command), and neither of them has the trailing / (please see attached screenshot).

In fact, these same settings in the virtual-private-cloud stanza worked with the Genomics API for the past 3 years. They only broke recently, when I migrated to GCP Batch.

[Screenshot 2024-08-20 at 14 21 36]

dspeck1 commented 3 weeks ago

Thanks! Sorry, I was looking at it incorrectly. The GCP Batch backend adds the trailing slash. The Genomics API backend added a trailing slash as well, so Google must have changed the validation of the format. We will push a change that fixes it. In the interim, setting the network via the literal option instead of the label should fix it.
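
To illustrate why the trailing slash matters, here is a small Python check against the pattern quoted in the error message. It is only a sketch: it assumes the service applies the pattern as an unanchored search (which would explain why the projects/... prefix is tolerated while the trailing / is not), and the project and network names below are made up.

```python
import re

# Pattern copied from the INVALID_ARGUMENT message in this issue.
EXPECTED_FORMAT = re.compile(r"global/networks/([a-z]([-a-z0-9]*[a-z0-9])?)$")


def matches_expected_format(network: str) -> bool:
    """Return True if the pattern is found anywhere in `network`.

    Using an unanchored search is an assumption; the real validation
    code is not shown in this thread.
    """
    return EXPECTED_FORMAT.search(network) is not None


# Hypothetical names, for illustration only.
with_slash = "projects/example-project/global/networks/example-vpc/"
without_slash = "projects/example-project/global/networks/example-vpc"

print(matches_expected_format(with_slash))
print(matches_expected_format(without_slash))
```

Under that assumption the first value is rejected and the second is accepted, because the pattern ends with $ immediately after the network name.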

yihming commented 3 weeks ago

Thanks! I did see the trailing / is added by Cromwell: https://github.com/broadinstitute/cromwell/blob/develop/supportedBackends/google/batch/src/main/scala/cromwell/backend/google/batch/models/VpcAndSubnetworkProjectLabelValues.scala#L15.

I tried setting the values as literals, as follows:

virtual-private-cloud {
    network-name = "$NETWORK-NAME"
    subnetwork-name = "$SUBNETWORK-NAME"
    auth = "application-default"
}

Here $NETWORK-NAME and $SUBNETWORK-NAME stand for the values of the my-private-network and my-private-subnetwork labels, which I have hidden.

However, my server failed immediately on startup:

2024-08-20 21:43:02 main WARN  - Failed to build GcpBatchConfigurationAttributes on attempt 1 of 3, retrying.
cromwell.backend.google.batch.models.GcpBatchConfigurationAttributes$$anon$1: Google Cloud Batch configuration is not valid: Errors:
Virtual Private Cloud configuration is invalid. Missing keys: `network-label-key`.

It looks like the GCP Batch config requires network-label-key, which is not optional...

yihming commented 3 weeks ago

I then set network-label-key to a non-existing label name, hoping that cromwell could fall back to using literals at runtime:

virtual-private-cloud {
    network-name = "projects/.../global/networks/$NETWORK-NAME"
    subnetwork-name = "regions/.../subnetworks/$SUBNETWORK-NAME"
    network-label-key = "dummy"
    auth = "application-default"
}

With that, it did: Cromwell fell back to the literal values.

yihming commented 3 weeks ago

@dspeck1 Can you confirm whether the subnetwork name specified in subnetwork-name should follow the regions/${region-name}/subnetworks/${subnetwork-name} pattern? I cannot find where Cromwell adds a prefix to the subnetwork name in the source code. Thanks!

yihming commented 3 weeks ago

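I can confirm that using the literal approach instead of project labels works in this case. One just needs to specify the full network and subnetwork names in network-name and subnetwork-name, and set network-label-key to a placeholder label name, as in my previous comment.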

If the Cromwell team can confirm that this is an inconsistency/bug in the GCP Batch backend, I hope it can be fixed so that:

  1. Users don't always have to specify the full names of the private VPC network and subnetwork. Namely, remove the trailing / that Cromwell adds when it automatically attaches prefixes.
  2. When using the literal approach, just make network-label-key not required.
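
The two requests above could look roughly like the following Python sketch. This is not Cromwell's code (the backend is written in Scala); the function names, the dictionary-based config, and the example values are all hypothetical, and only the key names network-name and network-label-key come from this thread.

```python
def full_network_name(project_id: str, network: str) -> str:
    """Expand a short network name to its full form, without a trailing slash.

    Values that already contain a "/" are treated as full names and are
    returned unchanged (an assumption made for this sketch).
    """
    if "/" in network:
        return network
    return f"projects/{project_id}/global/networks/{network}"


def missing_vpc_keys(vpc_config: dict) -> list:
    """Report missing keys, requiring the label key only when no literal is given."""
    if "network-name" in vpc_config or "network-label-key" in vpc_config:
        return []
    return ["network-name or network-label-key"]


# Hypothetical values, for illustration only.
print(full_network_name("example-project", "example-vpc"))
print(missing_vpc_keys({"network-name": "example-vpc", "auth": "application-default"}))
print(missing_vpc_keys({"auth": "application-default"}))
```

With a check along these lines, a literal-only stanza would pass validation without a placeholder network-label-key.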

Thanks!

dspeck1 commented 3 weeks ago

We are working on updating the code to fix the bugs described above and will provide an update when complete.

yihming commented 3 weeks ago

Thank you @dspeck1 so much for your help!

dspeck1 commented 3 weeks ago

Adding notes to issue re: PAPIv2 behavior: