Open phoerious opened 10 months ago
The logging from boot.go goes to the Logging service endpoint: https://github.com/apache/beam/blob/12a4f677c6949856e4de303c2254acb1abee42f2/sdks/python/container/boot.go#L148
It sounds like the next step is to see how that endpoint is configured when SparkRunner launches SDK containers.
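For anyone digging further: that endpoint is just a gRPC BeamFnLogging service, so one way to see what the SDK harness actually emits is to stand up a bare-bones logging server and point the container's logging endpoint at it. A minimal sketch, assuming the generated FnAPI stubs are importable from apache_beam.portability.api (the exact module path can vary across Beam versions):

```python
# Bare-bones BeamFnLogging server: prints whatever the SDK harness
# streams to the logging endpoint. Sketch only; the port is arbitrary.
from concurrent import futures

import grpc
from apache_beam.portability.api import beam_fn_api_pb2_grpc


class PrintingLoggingServicer(beam_fn_api_pb2_grpc.BeamFnLoggingServicer):
    def Logging(self, request_iterator, context):
        # The harness streams LogEntry.List batches over a bidi stream.
        for batch in request_iterator:
            for entry in batch.log_entries:
                # entry.severity is a numeric enum; entry.message is the text.
                print(entry.severity, entry.message)
        return iter([])  # no LogControl responses needed for this sketch


server = grpc.server(futures.ThreadPoolExecutor(max_workers=2))
beam_fn_api_pb2_grpc.add_BeamFnLoggingServicer_to_server(
    PrintingLoggingServicer(), server)
server.add_insecure_port('[::]:60714')
server.start()
server.wait_for_termination()
```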
I'm getting the same with the TypeScript direct runner (which also uses the Python logging server):
2024-01-03 22:45:28 Jan 03, 2024 8:45:28 PM org.apache.beam.fn.harness.FnHarness main
2024-01-03 22:45:28 SEVERE: Shutting down harness due to exception
2024-01-03 22:45:28 java.util.concurrent.ExecutionException: org.apache.beam.vendor.grpc.v1p54p0.io.grpc.StatusRuntimeException: UNAVAILABLE: Network closed for unknown reason
2024-01-03 22:45:28 at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
2024-01-03 22:45:28 at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2005)
2024-01-03 22:45:28 at org.apache.beam.fn.harness.FnHarness.main(FnHarness.java:381)
2024-01-03 22:45:28 at org.apache.beam.fn.harness.FnHarness.main(FnHarness.java:203)
2024-01-03 22:45:28 at org.apache.beam.fn.harness.FnHarness.main(FnHarness.java:165)
2024-01-03 22:45:28 at org.apache.beam.fn.harness.FnHarness.main(FnHarness.java:111)
2024-01-03 22:45:28 Suppressed: java.lang.RuntimeException: java.lang.IllegalStateException: Logging stream terminated unexpectedly before it was closed by the client with error: org.apache.beam.vendor.grpc.v1p54p0.io.grpc.StatusRuntimeException: UNAVAILABLE: Network closed for unknown reason
2024-01-03 22:45:28 Jan 03, 2024 8:45:28 PM org.apache.beam.fn.harness.FnHarness main
2024-01-03 22:45:28 INFO: Shutting SDK harness down.
2024-01-03 22:45:28 2024/01/03 20:45:28 boot.go: error logging message over FnAPI. endpoint host.docker.internal:60714 error: EOF message follows
2024-01-03 22:45:28 2024/01/03 20:45:28 CRITICAL Java exited: <nil>
FlinkRunner seems to have the same problem:
2024/03/01 16:31:26 boot.go: error logging message over FnAPI. endpoint localhost:42919 error: EOF message follows
2024/03/01 16:31:26 DEBUG Received signal: terminated
I'm not sure where this originates: whether the SDK harness terminates (for some unknown reason) and the logging client then errors out, or vice versa. Either way, the result is that the pipeline cannot start up.
Attaching the log from Flink's TaskManager (the opposite side of the channel):
2024-03-01 16:38:15,559 INFO org.apache.beam.runners.fnexecution.logging.GrpcLoggingService [] - Beam Fn Logging client connected.
2024-03-01 16:38:15,561 WARN org.apache.beam.runners.fnexecution.logging.GrpcLoggingService [] - Logging client failed unexpectedly.
org.apache.beam.vendor.grpc.v1p60p1.io.grpc.StatusRuntimeException: CANCELLED: client cancelled
at org.apache.beam.vendor.grpc.v1p60p1.io.grpc.Status.asRuntimeException(Status.java:529) ~[blob_p-76321efed52f117dff8c6e3700bd80f1a67c6953-384e464b9966599441f52d255f73b6bc:?]
at org.apache.beam.vendor.grpc.v1p60p1.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onCancel(ServerCalls.java:291) [blob_p-76321efed52f117dff8c6e3700bd80f1a67c6953-384e464b9966599441f52d255f73b6bc:?]
at org.apache.beam.vendor.grpc.v1p60p1.io.grpc.PartialForwardingServerCallListener.onCancel(PartialForwardingServerCallListener.java:40) [blob_p-76321efed52f117dff8c6e3700bd80f1a67c6953-384e464b9966599441f52d255f73b6bc:?]
at org.apache.beam.vendor.grpc.v1p60p1.io.grpc.ForwardingServerCallListener.onCancel(ForwardingServerCallListener.java:23) [blob_p-76321efed52f117dff8c6e3700bd80f1a67c6953-384e464b9966599441f52d255f73b6bc:?]
at org.apache.beam.vendor.grpc.v1p60p1.io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onCancel(ForwardingServerCallListener.java:40) [blob_p-76321efed52f117dff8c6e3700bd80f1a67c6953-384e464b9966599441f52d255f73b6bc:?]
at org.apache.beam.vendor.grpc.v1p60p1.io.grpc.Contexts$ContextualizedServerCallListener.onCancel(Contexts.java:96) [blob_p-76321efed52f117dff8c6e3700bd80f1a67c6953-384e464b9966599441f52d255f73b6bc:?]
at org.apache.beam.vendor.grpc.v1p60p1.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.closedInternal(ServerCallImpl.java:370) [blob_p-76321efed52f117dff8c6e3700bd80f1a67c6953-384e464b9966599441f52d255f73b6bc:?]
at org.apache.beam.vendor.grpc.v1p60p1.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.closed(ServerCallImpl.java:359) [blob_p-76321efed52f117dff8c6e3700bd80f1a67c6953-384e464b9966599441f52d255f73b6bc:?]
at org.apache.beam.vendor.grpc.v1p60p1.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1Closed.runInContext(ServerImpl.java:910) [blob_p-76321efed52f117dff8c6e3700bd80f1a67c6953-384e464b9966599441f52d255f73b6bc:?]
at org.apache.beam.vendor.grpc.v1p60p1.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) [blob_p-76321efed52f117dff8c6e3700bd80f1a67c6953-384e464b9966599441f52d255f73b6bc:?]
at org.apache.beam.vendor.grpc.v1p60p1.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) [blob_p-76321efed52f117dff8c6e3700bd80f1a67c6953-384e464b9966599441f52d255f73b6bc:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
@robertwb is there anything to look into? I don't see any obvious reason why this happens, but it seems that the log client is terminated by the worker_pool
(the process receives SIGTERM based on the log).
In my case, something seems to be sending SIGTERM to the forked worker roughly 10 seconds after start. I haven't been able to figure out why this happens yet. The log looks exactly like the one in this issue's description.
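One way to narrow down where the SIGTERM comes from is to install a handler at the top of the job's main module that dumps the stack before the process dies; a sketch, assuming the signal actually reaches the Python worker process rather than only the boot binary:

```python
import os
import signal
import sys
import traceback


def _dump_on_sigterm(signum, frame):
    # Record what the worker was doing when SIGTERM landed (helps
    # correlate with the ~10 s mark), then terminate as usual.
    sys.stderr.write('SIGTERM received in pid %d\n' % os.getpid())
    traceback.print_stack(frame, file=sys.stderr)
    signal.signal(signal.SIGTERM, signal.SIG_DFL)
    os.kill(os.getpid(), signal.SIGTERM)


signal.signal(signal.SIGTERM, _dump_on_sigterm)
```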
What happened?
I'm trying to get the SparkRunner working properly with pre-compiled JARs from my Python job (created with --output_executable=job.jar) that can be run with spark-submit on Kubernetes. Unfortunately, there seems to be a problem with the log routing/plumbing, so most of the job logs just get lost in the weeds.

I first tried the PROCESS environment type in a custom Spark image, but any output from the boot binary in the container gets lost entirely. Hence the job either fails or succeeds, but I have no idea why, unless I attach to the executor container and strace the boot process.

I then tried EXTERNAL, because there I can at least check the logs of the sidecar SDK container. The example job runs, but the stdout/stderr of the job itself isn't fed back to the Spark driver, so even in client mode I have no idea what's going on.

There seems to be some sort of problem with the gRPC log server. The Spark executor container shows a bunch of these stacktraces:
and the SDK container shows
This is repeated several times with different port numbers (I guess for different stages).
With the fruit sampling example from the website, I finally get
in one of the (two) SDK container outputs, but it's never fed back to the Spark executor or the Spark driver, probably due to this log endpoint error.
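For reference, the PROCESS/EXTERNAL attempts described above are selected via pipeline options; a minimal sketch of what such a configuration typically looks like (values are illustrative placeholders, not the actual job's):

```python
# Illustrative only: how the environment type discussed above is usually
# configured on a portable pipeline. Addresses and paths are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=SparkRunner',
    '--output_executable_path=job.jar',      # pre-compiled JAR for spark-submit
    '--environment_type=EXTERNAL',           # or PROCESS, as tried above
    '--environment_config=localhost:50000',  # address of the worker-pool sidecar
])

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(['apple', 'banana', 'cherry'])
     | beam.Map(print))  # output that should come back via the logging endpoint
```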
Here's the job code:
I compile the JAR with
and then spark-submit that to the cluster.

Issue Priority
Priority: 2 (default)
Issue Components