GoogleCloudPlatform / flink-on-k8s-operator

[DEPRECATED] Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.
Apache License 2.0

Failed to submit JobGraph and the exception detail was not enough to detect the reason #405

Open jiamo opened 3 years ago

jiamo commented 3 years ago

Using the latest master build, I created the example session cluster and job cluster with the image flink:1.12.1-scala_2.12-java11.

In the test Docker environment, I ran:

/opt/flink/bin/flink run -m flinksessioncluster-sample-jobmanager:8081 /opt/flink/examples/myfault-1.0-SNAPSHOT.jar
2021-02-04 02:31:03,798 INFO  org.apache.flink.client.cli.CliFrontend                      [] - --------------------------------------------------------------------------------
2021-02-04 02:31:03,801 INFO  org.apache.flink.client.cli.CliFrontend                      [] -  Starting Command Line Client (Version: 1.12.1, Scala: 2.12, Rev:dc404e2, Date:2021-01-09T14:46:36+01:00)
2021-02-04 02:31:03,802 INFO  org.apache.flink.client.cli.CliFrontend                      [] -  OS current user: root
2021-02-04 02:31:03,802 INFO  org.apache.flink.client.cli.CliFrontend                      [] -  Current Hadoop/Kerberos user: <no hadoop dependency found>
2021-02-04 02:31:03,802 INFO  org.apache.flink.client.cli.CliFrontend                      [] -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 11/11.0.10+9
2021-02-04 02:31:03,802 INFO  org.apache.flink.client.cli.CliFrontend                      [] -  Maximum heap size: 709 MiBytes
2021-02-04 02:31:03,802 INFO  org.apache.flink.client.cli.CliFrontend                      [] -  JAVA_HOME: /usr/local/openjdk-11
2021-02-04 02:31:03,803 INFO  org.apache.flink.client.cli.CliFrontend                      [] -  No Hadoop Dependency available
2021-02-04 02:31:03,803 INFO  org.apache.flink.client.cli.CliFrontend                      [] -  JVM Options:
2021-02-04 02:31:03,803 INFO  org.apache.flink.client.cli.CliFrontend                      [] -     -Dlog.file=/opt/flink/log/flink--client-myfault-run-cn9xv.log
2021-02-04 02:31:03,803 INFO  org.apache.flink.client.cli.CliFrontend                      [] -     -Dlog4j.configuration=file:/opt/flink/conf/log4j-cli.properties
2021-02-04 02:31:03,803 INFO  org.apache.flink.client.cli.CliFrontend                      [] -     -Dlog4j.configurationFile=file:/opt/flink/conf/log4j-cli.properties
2021-02-04 02:31:03,803 INFO  org.apache.flink.client.cli.CliFrontend                      [] -     -Dlogback.configurationFile=file:/opt/flink/conf/logback.xml
2021-02-04 02:31:03,804 INFO  org.apache.flink.client.cli.CliFrontend                      [] -  Program Arguments:
2021-02-04 02:31:03,806 INFO  org.apache.flink.client.cli.CliFrontend                      [] -     run
2021-02-04 02:31:03,806 INFO  org.apache.flink.client.cli.CliFrontend                      [] -     -m
2021-02-04 02:31:03,806 INFO  org.apache.flink.client.cli.CliFrontend                      [] -     flinksessioncluster-sample-jobmanager:8081
2021-02-04 02:31:03,806 INFO  org.apache.flink.client.cli.CliFrontend                      [] -     /opt/flink/examples/myfault-1.0-SNAPSHOT.jar
2021-02-04 02:31:03,806 INFO  org.apache.flink.client.cli.CliFrontend                      [] -  Classpath: /opt/flink/lib/flink-csv-1.12.1.jar:/opt/flink/lib/flink-json-1.12.1.jar:/opt/flink/lib/flink-shaded-zookeeper-3.4.14.jar:/opt/flink/lib/flink-table-blink_2.12-1.12.1.jar:/opt/flink/lib/flink-table_2.12-1.12.1.jar:/opt/flink/lib/log4j-1.2-api-2.12.1.jar:/opt/flink/lib/log4j-api-2.12.1.jar:/opt/flink/lib/log4j-core-2.12.1.jar:/opt/flink/lib/log4j-slf4j-impl-2.12.1.jar:/opt/flink/lib/flink-dist_2.12-1.12.1.jar:::
2021-02-04 02:31:03,807 INFO  org.apache.flink.client.cli.CliFrontend                      [] - --------------------------------------------------------------------------------
2021-02-04 02:31:03,811 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: jobmanager.rpc.address, myfault-run-cn9xv
2021-02-04 02:31:03,812 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: jobmanager.rpc.port, 6123
2021-02-04 02:31:03,812 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: jobmanager.memory.process.size, 1600m
2021-02-04 02:31:03,812 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: taskmanager.memory.process.size, 1728m
2021-02-04 02:31:03,812 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2021-02-04 02:31:03,813 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: parallelism.default, 1
2021-02-04 02:31:03,813 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: jobmanager.execution.failover-strategy, region
2021-02-04 02:31:03,814 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: blob.server.port, 6124
2021-02-04 02:31:03,814 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: query.server.port, 6125
2021-02-04 02:31:03,848 INFO  org.apache.flink.client.cli.CliFrontend                      [] - Loading FallbackYarnSessionCli
2021-02-04 02:31:03,945 INFO  org.apache.flink.core.fs.FileSystem                          [] - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2021-02-04 02:31:04,068 INFO  org.apache.flink.runtime.security.modules.HadoopModuleFactory [] - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2021-02-04 02:31:04,082 INFO  org.apache.flink.runtime.security.modules.JaasModule         [] - Jaas file will be created as /tmp/jaas-5146463234971937258.conf.
2021-02-04 02:31:04,093 INFO  org.apache.flink.runtime.security.contexts.HadoopSecurityContextFactory [] - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
2021-02-04 02:31:04,095 INFO  org.apache.flink.client.cli.CliFrontend                      [] - Running 'run' command.
2021-02-04 02:31:04,230 INFO  org.apache.flink.client.cli.CliFrontend                      [] - Building program from JAR file
2021-02-04 02:31:04,325 INFO  org.apache.flink.client.ClientUtils                          [] - Starting program (detached: false)
2021-02-04 02:31:16,070 WARN  org.apache.flink.util.ExecutorUtils                          [] - ExecutorService did not terminate in time. Shutting it down now.
2021-02-04 02:31:16,074 ERROR org.apache.flink.client.cli.CliFrontend                      [] - Error while running the command.
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Failed to execute job 'Fraud Detection'.
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:360) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:213) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:816) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:248) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1058) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1136) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1136) [flink-dist_2.12-1.12.1.jar:1.12.1]
Caused by: org.apache.flink.util.FlinkException: Failed to execute job 'Fraud Detection'.
    at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1918) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:135) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:76) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1782) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.allstoalls.FraudDetectionJob.main(FraudDetectionJob.java:48) ~[?:?]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
    at java.lang.reflect.Method.invoke(Unknown Source) ~[?:?]
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:343) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
    ... 8 more
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
    at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$7(RestClusterClient.java:400) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
    at java.util.concurrent.CompletableFuture.uniExceptionally(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
    at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:364) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture.postFire(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture$UniCompose.tryFire(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture$Completion.run(Unknown Source) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
    at java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Internal server error: Java heap space]
    at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:486) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:466) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
    at java.util.concurrent.CompletableFuture$UniCompose.tryFire(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture$Completion.run(Unknown Source) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
    at java.lang.Thread.run(Unknown Source) ~[?:?]

In the same Docker environment, this command:

/opt/flink/bin/flink run -m flinksessioncluster-sample-jobmanager:8081 /opt/flink/examples/batch/WordCount.jar  --input /opt/flink/README.txt

works fine.

So what is the real cause of [Internal server error: Java heap space]? The same jar works fine on a local Flink cluster.

Is there a way to debug this?

jiamo commented 3 years ago

Figured it out: the default heap size, jobmanager.memory.heap.size 25165824b (24 MiB), is too small.

It works with this config:

  flinkProperties:
    taskmanager.numberOfTaskSlots: "1"
    jobmanager.heap.size: ""                # set empty value (only for Flink version 1.11 or above)
    jobmanager.memory.heap.size:   150mb
    jobmanager.memory.process.size: 1gb   # job manager memory limit  (only for Flink version 1.11 or above)
    taskmanager.heap.size: ""               # set empty value
    taskmanager.memory.process.size: 1gb    # task manager memory limit
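
To put the two heap values side by side, here is a minimal Python sketch that converts the size strings to bytes. The helper name parse_memory_size is hypothetical, and the sketch assumes Flink reads "mb" and "gb" as binary units (MiB and GiB); it is not Flink's own parser.

```python
import re

# Binary multipliers. Assumption: "mb"/"gb" mean MiB/GiB, as in Flink's memory options.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}


def parse_memory_size(text: str) -> int:
    """Convert a size string such as '150mb' or '25165824b' to bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]+)\s*", text.lower())
    if match is None or match.group(2) not in _UNITS:
        raise ValueError(f"unrecognized memory size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


default_heap = parse_memory_size("25165824b")  # default reported in this thread
new_heap = parse_memory_size("150mb")          # value from the config above
print(default_heap // 1024**2, "MiB ->", new_heap // 1024**2, "MiB")
# prints: 24 MiB -> 150 MiB
```

Under that assumption, the new setting gives the JobManager roughly six times the heap it had by default.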

The job can be submitted now. However, the error message does not point to the specific problem.
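
For anyone hitting the same thing, a rough way to cut through a long client log is to print only the last "Caused by:" line, which is where the actual reason sits in the trace above. The sketch below is just an illustration in Python: the function name innermost_cause is made up, and the sample text is an abbreviated copy of the stack trace from this issue.

```python
from typing import Optional


def innermost_cause(stack_trace: str) -> Optional[str]:
    """Return the text of the last 'Caused by:' line, or None if there is none."""
    prefix = "Caused by: "
    causes = [
        line.strip()[len(prefix):]
        for line in stack_trace.splitlines()
        if line.strip().startswith(prefix)
    ]
    return causes[-1] if causes else None


# Abbreviated from the client log in this issue.
sample = """\
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Failed to execute job 'Fraud Detection'.
Caused by: org.apache.flink.util.FlinkException: Failed to execute job 'Fraud Detection'.
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Internal server error: Java heap space]
    at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:486)
"""

print(innermost_cause(sample))
```

Here the innermost cause is a RestClientException carrying "Java heap space" back from the server side, which suggests checking the JobManager's memory settings and log rather than the client's.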