apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0
7.68k stars 4.19k forks source link

[Bug]: Calling a Java custom PTransform from Go using xlang fails on Dataflow Runner #23189

Open ronoaldo opened 1 year ago

ronoaldo commented 1 year ago

What happened?

As mentioned in #22931 - I am testing xlang support for the Go Runtime. I first was doing it wrong, using the Go direct runner, but then I moved on to test with the Dataflow runner.

I noticed that on Dataflow Runner, I could call a custom Java Ptransform from Python:

Screenshot_20220912_154742

However,I could not call the same PTransform from Go. To submit the job, I started the Java expansion service as I did for the Python run, and called my Go pipeline. I see that the expansions service is called from Go, that the Go runtime stages the .jar files into Cloud Storage, but then the pipeline fails:

image

The only error message available is:

S03:e5/ProcessElementAndRestrictionWithSizing+External/FlatMapElements/FlatMap/ParMultiDo(Anonymous)+External/Filter/ParDo(Anonymous)/ParMultiDo(Anonymous)+stats.Count/stats.keyedCountFn+stats.Count/stats.SumPerKey/CombinePerKey/CoGBK+stats.Count/stats.SumPerKey/CombinePerKey/stats.sumIntFn/Partial+stats.Count/stats.SumPerKey/CombinePerKey/CoGBK/Write failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers: 

  go-job-1-1662153285384063-09021416-tqko-harness-x8dv
      Root cause: The worker lost contact with the service.,

  go-job-1-1662153285384063-09021416-tqko-harness-x8dv
      Root cause: The worker lost contact with the service.,

  go-job-1-1662153285384063-09021416-tqko-harness-x8dv
      Root cause: The worker lost contact with the service.,

  go-job-1-1662153285384063-09021416-tqko-harness-x8dv
      Root cause: The worker lost contact with the service.

Here is a full pipeline log and worker logs downloaded from Logs explorer.

Is this an actual bug or a unsupported workflow? I am trying to follow the Multi language pipelines section of the Beam Programming Model docs, regarding expose the Java PTransform (which I assume is correct since calling from Python works) and consuming it from Go, which I'm not sure if I missed any important steps.

Issue Priority

Priority: 2

Issue Component

Component: sdk-go

lostluck commented 1 year ago

I can confirm that the same should have worked for both, but I don't have the experience with Xlang to understand what's going wrong with what you have here.

The logs indicate that both the Go and the Java containers are starting up, but then the workers simply crash/die.

It would be useful to confirm whether the Python pipeline and the Go pipeline are uploading the same java artifacts however. Dumps of the pipeline protos (which should also be uploaded to the GCS folders) would also be valuable.

chamikaramj commented 1 year ago

I don't see workitems actually executing in the worker log you provided.

Can you check other logs (for example, kubelet) to see if there are any errors related to pipeline setup ?