apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0
7.85k stars 4.25k forks source link

[Feature Request][Go SDK]: Have the Dataflow runner use the cloned GCR images instead if the image is unset. #23453

Open lostluck opened 2 years ago

lostluck commented 2 years ago

What would you like to happen?

Java and Python SDKs use cloned external images from Dockerhub cached in GCR by default instead.

https://github.com/apache/beam/search?q=v1beta3%2F

Containers live here: https://console.cloud.google.com/gcr/images/cloud-dataflow/global/v1beta3

The Go SDK should do the same, to reduce startup time. Container paths would be like gcr.io/cloud-dataflow/v1beta3/beam_go_sdk:2.37.0 and similar for each version.

The Go SDK container is significantly smaller than Java and Python's due to the static compilation nature of Go, so this likely reduced issues requiring the proximity caching. However, for parity this would be useful.

~~Images are also hosted on Google Artifact Registry on a Regional basis: eg. https://console.cloud.google.com/artifacts/docker/google.com:dataflow-containers/us-west1/worker/v1beta3%2Fbeam_go_sdk?project=google.com:dataflow-containers

With paths like us-west1-docker.pkg.dev/google.com/dataflow-containers/worker/v1beta3/beam_go_sdk:2.42.0

But this also appears to require certain additional gcloud authentications to work 100%, so unless dataflow service accounts/workers are automatically authenticated for fetching images, it may not yet be worthwhile to implement this.~~

GAR containers are substituted by the dataflow service assuming the GCR paths are used.

Issue Priority

Priority: 3

Issue Component

Component: sdk-go

lostluck commented 2 years ago

Open Question: What's the correct behavior here for Cross Language containers? If it's the "standard" path for a beam release container, but running on dataflow, should we make the substitution for them as well. I'd assume yes.

lostluck commented 2 years ago

Answer from @chamikaramj : for Xlang, the host SDK needs to do the translation of paths. Expansion services have no idea if a request is to be used on Dataflow or not (or whatever runner is executing their reponses).

So the Go Dataflow runner package will need to simply take the default beam image paths, and make the substitutions itself uniformly for all environments in the pipeline.

Around here: https://github.com/apache/beam/blob/56d10f3f9cf1292d214d585b59b4fee9d1bc8f3e/sdks/go/pkg/beam/runners/dataflow/dataflowlib/job.go#L109

chamikaramj commented 2 years ago

We do this for Python SDK here: https://github.com/apache/beam/blob/3c7a4e0f44d27ddac006c1ec83fbdfe157b2cf1a/sdks/python/apache_beam/runners/dataflow/internal/apiclient.py#L830

robertwb commented 2 years ago

We really need to get away from having dataflow-specific containers for anything.

lostluck commented 2 years ago

For Go, the containers are not Dataflow specific. They are the same release containers, cloned to GCR.

It makes sense for reduced user costs to keep things in region/network so there's less likelyhood of ingress costs for them, and possibly reduced startup latency.

chamikaramj commented 2 years ago

They are the same containers for Java/Python as well. We just copy the released Beam SDK harness containers to GCR (for improving reliability/latency I believe).