apache / incubator-streampark

Make stream processing easier! Easy-to-use streaming application development framework and operation platform.
https://streampark.apache.org/
Apache License 2.0
3.91k stars 1.01k forks

[Bug] streampark make k8s zk High availability not work #2807

Open J-dfy opened 1 year ago

J-dfy commented 1 year ago

Java Version

1.8.0

Scala Version

2.12.x

StreamPark Version

2.0.0

Flink Version

1.16.1

Deploy Mode

kubernetes-application

What happened

  1. Start a job with StreamPark, then kill the JobManager with the kill command. HA (ZooKeeper) brings up a new JobManager, but the new JobManager is itself killed a few seconds later, so HA does not work.
  2. Start a job with StreamPark, shut StreamPark down, then kill the JobManager with the kill command. HA (ZooKeeper) brings up a new JobManager, and nothing else goes wrong afterwards.

Error Exception

2023-06-20 14:12:11,261 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - --------------------------------------------------------------------------------
2023-06-20 14:12:11,267 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -  Preconfiguration: 
2023-06-20 14:12:11,268 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - 

RESOURCE_PARAMS extraction logs:
jvm_params: -Xmx1073741824 -Xms1073741824 -XX:MaxMetaspaceSize=268435456
dynamic_configs: -D jobmanager.memory.off-heap.size=134217728b -D jobmanager.memory.jvm-overhead.min=201326592b -D jobmanager.memory.jvm-metaspace.size=268435456b -D jobmanager.memory.heap.size=1073741824b -D jobmanager.memory.jvm-overhead.max=201326592b
logs: INFO  [] - Loading configuration property: blob.server.port, 6124
INFO  [] - Loading configuration property: state.checkpoints.num-retained, 1
INFO  [] - Loading configuration property: kubernetes.hostnetwork.enabled, true
INFO  [] - Loading configuration property: jobmanager.execution.failover-strategy, region
INFO  [] - Loading configuration property: high-availability.cluster-id, opswaf
INFO  [] - Loading configuration property: jobmanager.rpc.address, localhost
INFO  [] - Loading configuration property: kubernetes.service-account, flink-service-account
INFO  [] - Loading configuration property: kubernetes.cluster-id, opswaf
INFO  [] - Loading configuration property: high-availability.storageDir, hdfs:///user/flink/ha
INFO  [] - Loading configuration property: $internal.application.program-args, --servers;axcloud
INFO  [] - Loading configuration property: kubernetes.container.image, harbor-pre.jijiaban.net/flink/flink/streamparkflinkjob-flink-opswaf
INFO  [] - Loading configuration property: parallelism.default, 1
INFO  [] - Loading configuration property: kubernetes.namespace, flink
INFO  [] - Loading configuration property: taskmanager.numberOfTaskSlots, 1
INFO  [] - Loading configuration property: kubernetes.rest-service.exposed.type, NodePort
INFO  [] - Loading configuration property: high-availability.jobmanager.port, 6123
INFO  [] - Loading configuration property: kubernetes.jobmanager.node-selector, bu:flink
INFO  [] - Loading configuration property: $internal.application.main, com.huixian.flinkops.stream.application.OpsWafApp
INFO  [] - Loading configuration property: taskmanager.memory.process.size, 1728m
INFO  [] - Loading configuration property: jobmanager.archive.fs.dir, hdfs://nameservice1/user/streampark/historyserver/archive
INFO  [] - Loading configuration property: kubernetes.internal.jobmanager.entrypoint.class, org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint
INFO  [] - Loading configuration property: pipeline.name, OpsWafApp
INFO  [] - Loading configuration property: classloader.resolve-order, child-first
INFO  [] - Loading configuration property: kubernetes.pod-template-file, /data/flink/flink/conf/flink-pod-template.yaml
INFO  [] - Loading configuration property: execution.target, kubernetes-application
INFO  [] - Loading configuration property: jobmanager.memory.process.size, 1600m
INFO  [] - Loading configuration property: jobmanager.rpc.port, 6123
INFO  [] - Loading configuration property: taskmanager.rpc.port, 6122
INFO  [] - Loading configuration property: kubernetes.container.image.pull-policy, Always
INFO  [] - Loading configuration property: high-availability.zookeeper.quorum, 172.16.122.91:2181,172.16.122.92:2181,172.16.122.93:2181
INFO  [] - Loading configuration property: internal.cluster.execution-mode, NORMAL
INFO  [] - Loading configuration property: $internal.pipeline.job-id, 8eea7f717c7de1b06b9902b159a28b9e
INFO  [] - Loading configuration property: high-availability, ZOOKEEPER
INFO  [] - Loading configuration property: pipeline.jars, local:///opt/flink/usrlib/streampark-flinkjob_OpsWafApp.jar
INFO  [] - Loading configuration property: rest.address, localhost
INFO  [] - Loading configuration property: kubernetes.taskmanager.node-selector, bu:flink
INFO  [] - The derived from fraction jvm overhead memory (160.000mb (167772162 bytes)) is less than its min value 192.000mb (201326592 bytes), min value will be used instead
INFO  [] - Final Master Memory configuration:
INFO  [] -   Total Process Memory: 1.563gb (1677721600 bytes)
INFO  [] -     Total Flink Memory: 1.125gb (1207959552 bytes)
INFO  [] -       JVM Heap:         1024.000mb (1073741824 bytes)
INFO  [] -       Off-heap:         128.000mb (134217728 bytes)
INFO  [] -     JVM Metaspace:      256.000mb (268435456 bytes)
INFO  [] -     JVM Overhead:       192.000mb (201326592 bytes)

2023-06-20 14:12:11,269 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - --------------------------------------------------------------------------------
2023-06-20 14:12:11,269 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -  Starting KubernetesApplicationClusterEntrypoint (Version: 1.16.1, Scala: 2.12, Rev:DeadD0d0, Date:1970-01-01T01:00:00+01:00)
2023-06-20 14:12:11,269 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -  OS current user: flink
2023-06-20 14:12:11,881 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -  Current Hadoop/Kerberos user: flink
2023-06-20 14:12:11,881 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -  JVM: OpenJDK 64-Bit Server VM - Temurin - 1.8/25.362-b09
2023-06-20 14:12:11,882 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -  Arch: amd64
2023-06-20 14:12:11,882 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -  Maximum heap size: 989 MiBytes
2023-06-20 14:12:11,882 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -  JAVA_HOME: /opt/java/openjdk
2023-06-20 14:12:11,885 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -  Hadoop version: 3.0.0-cdh6.3.2
2023-06-20 14:12:11,886 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -  JVM Options:
2023-06-20 14:12:11,886 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -     -Xmx1073741824
2023-06-20 14:12:11,886 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -     -Xms1073741824
2023-06-20 14:12:11,886 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -     -XX:MaxMetaspaceSize=268435456
2023-06-20 14:12:11,886 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -     -Dlog.file=/opt/flink/log/flink--kubernetes-application-0-k8s-node-147-26.log
2023-06-20 14:12:11,886 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -     -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
2023-06-20 14:12:11,886 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -     -Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties
2023-06-20 14:12:11,887 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -     -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
2023-06-20 14:12:11,887 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -  Program Arguments:
2023-06-20 14:12:11,888 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -     -D
2023-06-20 14:12:11,888 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -     jobmanager.memory.off-heap.size=134217728b
2023-06-20 14:12:11,888 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -     -D
2023-06-20 14:12:11,888 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -     jobmanager.memory.jvm-overhead.min=201326592b
2023-06-20 14:12:11,889 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -     -D
2023-06-20 14:12:11,889 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -     jobmanager.memory.jvm-metaspace.size=268435456b
2023-06-20 14:12:11,889 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -     -D
2023-06-20 14:12:11,889 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -     jobmanager.memory.heap.size=1073741824b
2023-06-20 14:12:11,889 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -     -D
2023-06-20 14:12:11,890 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -     jobmanager.memory.jvm-overhead.max=201326592b
2023-06-20 14:12:11,890 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -  Classpath: /opt/flink/lib/commons-pool2-2.6.2.jar:/opt/flink/lib/connect-api-2.7.1.jar:/opt/flink/lib/druid-1.1.10.jar:/opt/flink/lib/flink-cep-1.16.1.jar:/opt/flink/lib/flink-connector-files-1.16.1.jar:/opt/flink/lib/flink-connector-hbase-2.2-1.16.1.jar:/opt/flink/lib/flink-connector-jdbc-1.16.1.jar:/opt/flink/lib/flink-connector-kafka-1.16.1.jar:/opt/flink/lib/flink-csv-1.16.1.jar:/opt/flink/lib/flink-json-1.16.1.jar:/opt/flink/lib/flink-queryable-state-runtime-1.16.1.jar:/opt/flink/lib/flink-scala_2.12-1.16.1.jar:/opt/flink/lib/flink-shaded-hadoop-2-uber-3.0.0-cdh6.3.2-10.0.jar:/opt/flink/lib/flink-shaded-zookeeper-3.5.9.jar:/opt/flink/lib/flink-sql-connector-mysql-cdc-2.3.0.jar:/opt/flink/lib/flink-table-api-java-bridge-1.16.1.jar:/opt/flink/lib/flink-table-api-java-uber-1.16.1.jar:/opt/flink/lib/flink-table-planner-loader-1.16.1.jar:/opt/flink/lib/flink-table-runtime-1.16.1.jar:/opt/flink/lib/hadoop-client-3.0.0-cdh6.3.2.jar:/opt/flink/lib/HikariCP-4.0.3.jar:/opt/flink/lib/jedis-3.4.1.jar:/opt/flink/lib/kafka-clients-3.2.3.jar:/opt/flink/lib/log4j-1.2-api-2.17.1.jar:/opt/flink/lib/log4j-api-2.17.1.jar:/opt/flink/lib/log4j-core-2.17.1.jar:/opt/flink/lib/log4j-slf4j-impl-2.17.1.jar:/opt/flink/lib/mysql-connector-java-8.0.27.jar:/opt/flink/lib/taos-jdbcdriver-2.0.38-dist.jar:/opt/flink/lib/flink-dist-1.16.1.jar:::/opt/hadoop/conf:
2023-06-20 14:12:11,890 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - --------------------------------------------------------------------------------
2023-06-20 14:12:11,892 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Registered UNIX signal handlers for [TERM, HUP, INT]
2023-06-20 14:12:11,968 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: blob.server.port, 6124
2023-06-20 14:12:11,968 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: state.checkpoints.num-retained, 1
2023-06-20 14:12:11,968 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: kubernetes.hostnetwork.enabled, true
2023-06-20 14:12:11,969 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: jobmanager.execution.failover-strategy, region
2023-06-20 14:12:11,969 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: high-availability.cluster-id, opswaf
2023-06-20 14:12:11,969 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: jobmanager.rpc.address, localhost
2023-06-20 14:12:11,969 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: kubernetes.service-account, flink-service-account
2023-06-20 14:12:11,970 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: kubernetes.cluster-id, opswaf
2023-06-20 14:12:11,970 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: high-availability.storageDir, hdfs:///user/flink/ha
2023-06-20 14:12:11,970 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: $internal.application.program-args, --servers;axcloud
2023-06-20 14:12:11,970 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: kubernetes.container.image, harbor-pre.jijiaban.net/flink/flink/streamparkflinkjob-flink-opswaf
2023-06-20 14:12:11,970 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: parallelism.default, 1
2023-06-20 14:12:11,970 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: kubernetes.namespace, flink
2023-06-20 14:12:11,971 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2023-06-20 14:12:11,971 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: kubernetes.rest-service.exposed.type, NodePort
2023-06-20 14:12:11,971 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: high-availability.jobmanager.port, 6123
2023-06-20 14:12:11,971 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: kubernetes.jobmanager.node-selector, bu:flink
2023-06-20 14:12:11,971 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: $internal.application.main, com.huixian.flinkops.stream.application.OpsWafApp
2023-06-20 14:12:11,972 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: taskmanager.memory.process.size, 1728m
2023-06-20 14:12:11,972 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: jobmanager.archive.fs.dir, hdfs://nameservice1/user/streampark/historyserver/archive
2023-06-20 14:12:11,972 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: kubernetes.internal.jobmanager.entrypoint.class, org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint
2023-06-20 14:12:11,972 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: pipeline.name, OpsWafApp
2023-06-20 14:12:11,973 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: classloader.resolve-order, child-first
2023-06-20 14:12:11,973 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: kubernetes.pod-template-file, /data/flink/flink/conf/flink-pod-template.yaml
2023-06-20 14:12:11,973 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: execution.target, kubernetes-application
2023-06-20 14:12:11,973 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: jobmanager.memory.process.size, 1600m
2023-06-20 14:12:11,974 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: jobmanager.rpc.port, 6123
2023-06-20 14:12:11,974 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: taskmanager.rpc.port, 6122
2023-06-20 14:12:11,974 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: kubernetes.container.image.pull-policy, Always
2023-06-20 14:12:11,974 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: high-availability.zookeeper.quorum, 172.16.122.91:2181,172.16.122.92:2181,172.16.122.93:2181
2023-06-20 14:12:11,975 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: internal.cluster.execution-mode, NORMAL
2023-06-20 14:12:11,975 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: $internal.pipeline.job-id, 8eea7f717c7de1b06b9902b159a28b9e
2023-06-20 14:12:11,975 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: high-availability, ZOOKEEPER
2023-06-20 14:12:11,975 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: pipeline.jars, local:///opt/flink/usrlib/streampark-flinkjob_OpsWafApp.jar
2023-06-20 14:12:11,975 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: rest.address, localhost
2023-06-20 14:12:11,976 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: kubernetes.taskmanager.node-selector, bu:flink
2023-06-20 14:12:11,976 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading dynamic configuration property: jobmanager.memory.off-heap.size, 134217728b
2023-06-20 14:12:11,976 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading dynamic configuration property: jobmanager.memory.jvm-overhead.min, 201326592b
2023-06-20 14:12:11,976 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading dynamic configuration property: jobmanager.memory.jvm-metaspace.size, 268435456b
2023-06-20 14:12:11,977 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading dynamic configuration property: jobmanager.memory.heap.size, 1073741824b
2023-06-20 14:12:11,977 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading dynamic configuration property: jobmanager.memory.jvm-overhead.max, 201326592b
2023-06-20 14:12:12,608 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
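
The log above can be checked for how quickly the replacement JobManager was told to shut down. The following is a minimal sketch, not part of Flink or StreamPark: the helper names `parse_timestamp` and `seconds_until_sigterm` are hypothetical, and it assumes every relevant line starts with a timestamp in the format shown in the log.

```python
import re
from datetime import datetime

# Leading timestamp of a Flink log line, e.g. "2023-06-20 14:12:11,261".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def parse_timestamp(line):
    """Return the line's leading timestamp, or None if it has none."""
    match = TIMESTAMP.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")


def seconds_until_sigterm(lines):
    """Seconds between the first timestamped line and the SIGTERM line.

    Returns None if no SIGTERM line is found.
    """
    first = None
    for line in lines:
        ts = parse_timestamp(line)
        if ts is None:
            continue  # continuation lines (e.g. RESOURCE_PARAMS output)
        if first is None:
            first = ts
        if "RECEIVED SIGNAL 15: SIGTERM" in line:
            return (ts - first).total_seconds()
    return None


# First and last lines of the log in this report.
sample = [
    "2023-06-20 14:12:11,261 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - ----",
    "2023-06-20 14:12:12,608 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.",
]
elapsed = seconds_until_sigterm(sample)
print(elapsed)
```

In this log the SIGTERM arrives before the entrypoint logs anything after loading its configuration, which suggests the pod was terminated from outside rather than failing on its own.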

Screenshots

No response

J-dfy commented 1 year ago

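[image] I don't understand why the deployment has to be deleted when the pod terminates. For now I have changed it as shown in the image, and with that change high availability works.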
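
The idea behind this workaround can be sketched as a guard that skips deployment deletion while HA is configured. This is only an illustration under stated assumptions: it is not StreamPark's code and does not reproduce the change in the image. The function names `ha_enabled` and `should_delete_deployment` are hypothetical, and deleting the deployment in the non-HA case is itself an assumption. The only detail taken from the report is the `high-availability` key, which is set to `ZOOKEEPER` in the log.

```python
def ha_enabled(flink_conf):
    """True if the Flink configuration selects an HA mode other than NONE.

    flink_conf is a plain dict of Flink configuration keys to string values.
    """
    mode = flink_conf.get("high-availability", "NONE").strip().upper()
    return mode != "NONE"


def should_delete_deployment(pod_terminated, flink_conf):
    """Hypothetical guard: keep the deployment when HA can recover the job.

    With HA configured, a terminated JobManager pod is expected to be
    replaced, so the deployment must survive for the replacement to run.
    """
    if not pod_terminated:
        return False
    return not ha_enabled(flink_conf)


# Configuration value taken from the log in this report.
conf_from_report = {"high-availability": "ZOOKEEPER"}
print(should_delete_deployment(True, conf_from_report))
```

With the configuration from this report the guard returns `False`, so the deployment is left in place and the replacement JobManager is not killed.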

wolfboys commented 1 year ago

cc @Al-assad @MonsterChenzhuo