Closed Nikhil-Devisetti closed 4 months ago
@dgomezleon Hi, did you get a chance to check this out? It would be really helpful if you could provide an update, since it is affecting our prod env.
@dgomezleon Can you help with an update on this?
Hi @Nikhil-Devisetti,
Sorry for the delay here. We are going to check this and update the ticket once we have more information.
Hi @Nikhil-Devisetti,
I can't reproduce the issue when using the latest version of the chart. I edited the values.yaml file
diff --git a/bitnami/flink/values.yaml b/bitnami/flink/values.yaml
index af5c9cc0bb..ae756275dc 100644
--- a/bitnami/flink/values.yaml
+++ b/bitnami/flink/values.yaml
@@ -124,7 +124,10 @@ jobmanager:
## - name: FOO
## value: BAR
##
- extraEnvVars: []
+ extraEnvVars:
+ - name: FLINK_PROPERTIES
+ value: |
+ jobmanager.memory.process.size: 1g
## @param jobmanager.extraEnvVarsCM Name of existing ConfigMap containing extra env vars
##
extraEnvVarsCM: ""
and installed the Bitnami Chart. Once the pods were ready
$ k get pods
NAME READY STATUS RESTARTS AGE
flink-jobmanager-5bdfb7b457-xcxlw 1/1 Running 0 66s
flink-taskmanager-0 1/1 Running 0 66s
I confirmed that there were no errors in the log
$ k logs flink-jobmanager-5bdfb7b457-xcxlw | tail -n 10
2024-05-06 14:08:24,749 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Successfully recovered 0 persisted job graphs.
2024-05-06 14:08:24,880 INFO org.apache.flink.runtime.rpc.pekko.PekkoRpcService [] - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at pekko://flink/user/rpc/dispatcher_0 .
2024-05-06 14:08:24,979 INFO org.apache.flink.runtime.rpc.pekko.PekkoRpcService [] - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at pekko://flink/user/rpc/resourcemanager_1 .
2024-05-06 14:08:25,161 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Starting the resource manager.
2024-05-06 14:08:25,250 INFO org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager [] - Starting the slot manager.
2024-05-06 14:08:25,252 INFO org.apache.flink.runtime.security.token.DefaultDelegationTokenManager [] - Starting tokens update task
2024-05-06 14:08:25,253 WARN org.apache.flink.runtime.security.token.DefaultDelegationTokenManager [] - No tokens obtained so skipping notifications
2024-05-06 14:08:25,253 WARN org.apache.flink.runtime.security.token.DefaultDelegationTokenManager [] - Tokens update task not started because either no tokens obtained or none of the tokens specified its renewal date
2024-05-06 14:08:36,460 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering TaskManager with ResourceID flink-taskmanager-0.flink-taskmanager-headless.test.svc.cluster.local:6122-7d9c19 (pekko.tcp://flink@flink-taskmanager-0.flink-taskmanager-headless.test.svc.cluster.local:6122/user/rpc/taskmanager_0) at ResourceManager
2024-05-06 14:08:36,552 INFO org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager [] - Registering task executor flink-taskmanager-0.flink-taskmanager-headless.test.svc.cluster.local:6122-7d9c19 under 5fe9d0f3507d60d754ea49a8fdaf4764 at the slot manager.
and the conf file included the new parameter
$ k exec flink-jobmanager-5bdfb7b457-xcxlw -- cat /opt/bitnami/flink/conf/config.yaml | tail -n 10
query.server.port: 6125
taskmanager.numberOfTaskSlots: 2
jobmanager.memory.process.size: 1g
jobmanager.bind-host: 0.0.0.0
jobmanager.rpc.address: flink-jobmanager
jobmanager.rpc.bind-port: 6123
jobmanager.rpc.port: 6123
rest.address: flink-jobmanager
rest.bind-address: 0.0.0.0
rest.port: 8081
@jotamartos Hi, thanks for checking. The pods don't go into CrashLoopBackOff immediately after deploying. In our case too, they ran without any issues for a few days and then suddenly started crashing.
I'll redeploy with the latest Helm chart, observe for a few days, and update here accordingly.
Thanks! I tried to obtain more info but couldn't reproduce the issue. I deleted the pod and waited for the deployment to recreate it, but it worked as expected
$ k get pods
NAME READY STATUS RESTARTS AGE
flink-jobmanager-5bdfb7b457-6npqf 1/1 Running 0 2m47s
flink-taskmanager-0 1/1 Running 0 7m52s
$ k exec flink-jobmanager-5bdfb7b457-6npqf -- cat /opt/bitnami/flink/conf/config.yaml | tail -n 10
query.server.port: 6125
taskmanager.numberOfTaskSlots: 2
jobmanager.memory.process.size: 1g
jobmanager.bind-host: 0.0.0.0
jobmanager.rpc.address: flink-jobmanager
jobmanager.rpc.bind-port: 6123
jobmanager.rpc.port: 6123
rest.address: flink-jobmanager
rest.bind-address: 0.0.0.0
rest.port: 8081
Please let us know if you find something relevant.
@jotamartos Hi, I deployed the latest chart (flink-1.1.1) in the dev cluster with override values through Helm and it deployed fine, so I tried to replicate the same in the QA cluster. There the jobmanager pod deployed with the override values, but the taskmanager is failing with a duplicate key error.
If I remove the override values below for the taskmanager, the pods deploy and run.
taskmanager:
extraEnvVars:
- name: FLINK_PROPERTIES
value: |
taskmanager.memory.process.size: 12g
taskmanager.memory.flink.size: 10g
taskmanager.memory.jvm-metaspace.size: 3g
Below details are from qa cluster.
k get po -n flink
NAME READY STATUS RESTARTS AGE
flink-ushur-jobmanager-5ff8d57684-l8cjx 1/1 Running 0 21m
flink-ushur-taskmanager-0 1/1 Running 0 16m
flink-ushur-taskmanager-1 1/1 Running 0 16m
flink-ushur-taskmanager-2 0/1 CrashLoopBackOff 5 (2m21s ago) 5m43s
helm ls -n flink
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
flink-ushur flink 3 2024-05-08 14:14:34.245067644 +0000 UTC deployed flink-1.1.1 1.19.0
k logs flink-ushur-taskmanager-2 -n flink
flink 14:15:31.67 INFO ==>
flink 14:15:31.67 INFO ==> Welcome to the Bitnami flink container
flink 14:15:31.67 INFO ==> Subscribe to project updates by watching https://github.com/bitnami/containers
flink 14:15:31.67 INFO ==> Submit issues and feature requests at https://github.com/bitnami/containers/issues
flink 14:15:31.67 INFO ==> Upgrade to Tanzu Application Catalog for production environments to access custom-configured and pre-packaged software components. Gain enhanced features, including Software Bill of Materials (SBOM), CVE scan result reports, and VEX documents. To learn more, visit https://bitnami.com/enterprise
flink 14:15:31.67 INFO ==>
flink 14:15:31.67 INFO ==> ** Starting Apache Flink taskmanager setup **
flink 14:15:31.83 INFO ==> ** FLINK taskmanager setup finished! **
flink 14:15:31.84 INFO ==> ** Starting Apache Flink Task Manager
[ERROR] The execution result is empty.
[ERROR] Could not get JVM parameters and dynamic configurations properly.
[ERROR] Raw output from BashJavaUtils:
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
INFO [] - Using standard YAML parser to load flink configuration file from /opt/bitnami/flink/conf/config.yaml.
ERROR [] - Failed to parse YAML configuration
org.snakeyaml.engine.v2.exceptions.YamlEngineException: while constructing a mapping
in reader, line 21, column 1
found duplicate key taskmanager.memory.process.size
in reader, line 315, column 1
at org.snakeyaml.engine.v2.constructor.StandardConstructor.processDuplicateKeys(StandardConstructor.java:90) ~[flink-dist-1.19.0.jar:1.19.0]
at org.snakeyaml.engine.v2.constructor.StandardConstructor.flattenMapping(StandardConstructor.java:70) ~[flink-dist-1.19.0.jar:1.19.0]
at org.snakeyaml.engine.v2.constructor.StandardConstructor.constructMapping2ndStep(StandardConstructor.java:119) ~[flink-dist-1.19.0.jar:1.19.0]
at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructMapping(BaseConstructor.java:278) ~[flink-dist-1.19.0.jar:1.19.0]
at org.snakeyaml.engine.v2.constructor.StandardConstructor$ConstructYamlMap.construct(StandardConstructor.java:203) ~[flink-dist-1.19.0.jar:1.19.0]
at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObjectNoCheck(BaseConstructor.java:153) ~[flink-dist-1.19.0.jar:1.19.0]
at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObject(BaseConstructor.java:133) ~[flink-dist-1.19.0.jar:1.19.0]
at org.snakeyaml.engine.v2.constructor.BaseConstructor.construct(BaseConstructor.java:92) ~[flink-dist-1.19.0.jar:1.19.0]
at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructSingleDocument(BaseConstructor.java:79) ~[flink-dist-1.19.0.jar:1.19.0]
at org.snakeyaml.engine.v2.api.Load.loadOne(Load.java:111) ~[flink-dist-1.19.0.jar:1.19.0]
at org.snakeyaml.engine.v2.api.Load.loadFromInputStream(Load.java:123) ~[flink-dist-1.19.0.jar:1.19.0]
at org.apache.flink.configuration.YamlParserUtils.loadYamlFile(YamlParserUtils.java:100) [flink-dist-1.19.0.jar:1.19.0]
at org.apache.flink.configuration.GlobalConfiguration.loadYAMLResource(GlobalConfiguration.java:347) [flink-dist-1.19.0.jar:1.19.0]
at org.apache.flink.configuration.GlobalConfiguration.loadConfiguration(GlobalConfiguration.java:163) [flink-dist-1.19.0.jar:1.19.0]
at org.apache.flink.runtime.util.ConfigurationParserUtils.loadCommonConfiguration(ConfigurationParserUtils.java:154) [flink-dist-1.19.0.jar:1.19.0]
at org.apache.flink.runtime.util.bash.FlinkConfigLoader.loadConfiguration(FlinkConfigLoader.java:41) [flink-dist-1.19.0.jar:2.17.1]
at org.apache.flink.runtime.util.bash.BashJavaUtils.runCommand(BashJavaUtils.java:66) [bash-java-utils.jar:2.17.1]
at org.apache.flink.runtime.util.bash.BashJavaUtils.main(BashJavaUtils.java:56) [bash-java-utils.jar:2.17.1]
Exception in thread "main" java.lang.RuntimeException: Error parsing YAML configuration.
at org.apache.flink.configuration.GlobalConfiguration.loadYAMLResource(GlobalConfiguration.java:352)
at org.apache.flink.configuration.GlobalConfiguration.loadConfiguration(GlobalConfiguration.java:163)
at org.apache.flink.runtime.util.ConfigurationParserUtils.loadCommonConfiguration(ConfigurationParserUtils.java:154)
at org.apache.flink.runtime.util.bash.FlinkConfigLoader.loadConfiguration(FlinkConfigLoader.java:41)
at org.apache.flink.runtime.util.bash.BashJavaUtils.runCommand(BashJavaUtils.java:66)
at org.apache.flink.runtime.util.bash.BashJavaUtils.main(BashJavaUtils.java:56)
Caused by: org.snakeyaml.engine.v2.exceptions.YamlEngineException: while constructing a mapping
in reader, line 21, column 1
found duplicate key taskmanager.memory.process.size
in reader, line 315, column 1
at org.snakeyaml.engine.v2.constructor.StandardConstructor.processDuplicateKeys(StandardConstructor.java:90)
at org.snakeyaml.engine.v2.constructor.StandardConstructor.flattenMapping(StandardConstructor.java:70)
at org.snakeyaml.engine.v2.constructor.StandardConstructor.constructMapping2ndStep(StandardConstructor.java:119)
at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructMapping(BaseConstructor.java:278)
at org.snakeyaml.engine.v2.constructor.StandardConstructor$ConstructYamlMap.construct(StandardConstructor.java:203)
at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObjectNoCheck(BaseConstructor.java:153)
at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObject(BaseConstructor.java:133)
at org.snakeyaml.engine.v2.constructor.BaseConstructor.construct(BaseConstructor.java:92)
at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructSingleDocument(BaseConstructor.java:79)
at org.snakeyaml.engine.v2.api.Load.loadOne(Load.java:111)
at org.snakeyaml.engine.v2.api.Load.loadFromInputStream(Load.java:123)
at org.apache.flink.configuration.YamlParserUtils.loadYamlFile(YamlParserUtils.java:100)
at org.apache.flink.configuration.GlobalConfiguration.loadYAMLResource(GlobalConfiguration.java:347)
... 5 more
After removing the taskmanager override values below, the pods are deployed.
taskmanager:
extraEnvVars:
- name: FLINK_PROPERTIES
value: |
taskmanager.memory.process.size: 12g
taskmanager.memory.flink.size: 10g
taskmanager.memory.jvm-metaspace.size: 3g
k get po -n flink
NAME READY STATUS RESTARTS AGE
flink-ushur-jobmanager-5ff8d57684-l8cjx 1/1 Running 0 28m
flink-ushur-taskmanager-0 1/1 Running 0 81s
flink-ushur-taskmanager-1 1/1 Running 0 81s
flink-ushur-taskmanager-2 1/1 Running 0 3m56s
Hi @Nikhil-Devisetti,
Everything worked for me as expected. I installed version 1.1.1 of the chart in my cluster with the following changes
diff --git a/bitnami/flink/values.yaml b/bitnami/flink/values.yaml
index af5c9cc0bb..006f1aedae 100644
--- a/bitnami/flink/values.yaml
+++ b/bitnami/flink/values.yaml
@@ -124,7 +124,10 @@ jobmanager:
## - name: FOO
## value: BAR
##
- extraEnvVars: []
+ extraEnvVars:
+ - name: FLINK_PROPERTIES
+ value: |
+ jobmanager.memory.process.size: 1g
## @param jobmanager.extraEnvVarsCM Name of existing ConfigMap containing extra env vars
##
extraEnvVarsCM: ""
@@ -506,7 +509,12 @@ taskmanager:
## - name: FOO
## value: BAR
##
- extraEnvVars: []
+ extraEnvVars:
+ - name: FLINK_PROPERTIES
+ value: |
+ taskmanager.memory.process.size: 3g
+ taskmanager.memory.flink.size: 1g
+ taskmanager.memory.jvm-metaspace.size: 1g
## @param taskmanager.extraEnvVarsCM Name of existing ConfigMap containing extra env vars
##
extraEnvVarsCM: ""
The env vars were configured properly
PATH=/opt/bitnami/java/bin:/opt/bitnami/flink/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=flink-taskmanager-0
HOME=/
OS_ARCH=amd64
OS_FLAVOUR=debian-12
OS_NAME=linux
APP_VERSION=1.19.0
BITNAMI_APP_NAME=flink
FLINK_HOME=/opt/bitnami/flink
JAVA_HOME=/opt/bitnami/java
MY_POD_NAME=flink-taskmanager-0
FLINK_CFG_TASKMANAGER_DATA_PORT=6121
FLINK_CFG_METRICS_INTERNAL_QUERY__SERVICE_PORT=6126
BITNAMI_DEBUG=false
FLINK_CFG_TASKMANAGER_BIND__HOST=0.0.0.0
FLINK_PROPERTIES=taskmanager.memory.process.size: 3g
taskmanager.memory.flink.size: 1g
taskmanager.memory.jvm-metaspace.size: 1g
FLINK_MODE=taskmanager
FLINK_CFG_JOBMANAGER_RPC_ADDRESS=flink-jobmanager
FLINK_CFG_JOBMANAGER_RPC_PORT=6123
FLINK_CFG_JOBMANAGER_BIND__HOST=0.0.0.0
FLINK_CFG_TASKMANAGER_RPC_PORT=6122
FLINK_CFG_TASKMANAGER_HOST=flink-taskmanager-0.flink-taskmanager-headless.test.svc.cluster.local
FLINK_JOBMANAGER_PORT_8081_TCP_PROTO=tcp
FLINK_TASKMANAGER_PORT_6121_TCP=tcp://10.184.250.252:6121
FLINK_TASKMANAGER_PORT_6121_TCP_PROTO=tcp
KUBERNETES_SERVICE_PORT=443
FLINK_TASKMANAGER_PORT_6121_TCP_PORT=6121
FLINK_TASKMANAGER_PORT_6126_TCP=tcp://10.184.250.252:6126
KUBERNETES_PORT_443_TCP_PORT=443
FLINK_JOBMANAGER_SERVICE_PORT_HTTP=8081
FLINK_JOBMANAGER_PORT_6123_TCP_PROTO=tcp
FLINK_JOBMANAGER_PORT_6123_TCP_ADDR=10.184.247.241
FLINK_TASKMANAGER_SERVICE_PORT_TCP_DATA=6121
FLINK_TASKMANAGER_PORT_6122_TCP_PORT=6122
FLINK_TASKMANAGER_PORT_6122_TCP_ADDR=10.184.250.252
KUBERNETES_PORT=tcp://10.184.240.1:443
FLINK_JOBMANAGER_SERVICE_PORT=6123
FLINK_JOBMANAGER_PORT_6124_TCP=tcp://10.184.247.241:6124
FLINK_JOBMANAGER_PORT_6124_TCP_PORT=6124
FLINK_TASKMANAGER_SERVICE_PORT_TCP_RPC=6122
FLINK_JOBMANAGER_PORT_6124_TCP_PROTO=tcp
FLINK_JOBMANAGER_PORT=tcp://10.184.247.241:6123
FLINK_JOBMANAGER_PORT_6123_TCP=tcp://10.184.247.241:6123
FLINK_JOBMANAGER_PORT_8081_TCP=tcp://10.184.247.241:8081
FLINK_JOBMANAGER_PORT_8081_TCP_ADDR=10.184.247.241
FLINK_JOBMANAGER_SERVICE_HOST=10.184.247.241
FLINK_JOBMANAGER_SERVICE_PORT_TCP_RPC=6123
FLINK_TASKMANAGER_SERVICE_PORT=6121
FLINK_TASKMANAGER_PORT_6126_TCP_PROTO=tcp
KUBERNETES_SERVICE_HOST=10.184.240.1
KUBERNETES_PORT_443_TCP_PROTO=tcp
FLINK_JOBMANAGER_PORT_6123_TCP_PORT=6123
FLINK_JOBMANAGER_PORT_6124_TCP_ADDR=10.184.247.241
FLINK_TASKMANAGER_SERVICE_PORT_TCP_INTERNAL_METRICS=6126
FLINK_TASKMANAGER_PORT_6121_TCP_ADDR=10.184.250.252
KUBERNETES_SERVICE_PORT_HTTPS=443
KUBERNETES_PORT_443_TCP_ADDR=10.184.240.1
FLINK_TASKMANAGER_SERVICE_HOST=10.184.250.252
FLINK_TASKMANAGER_PORT=tcp://10.184.250.252:6121
FLINK_TASKMANAGER_PORT_6122_TCP_PROTO=tcp
FLINK_TASKMANAGER_PORT_6126_TCP_PORT=6126
KUBERNETES_PORT_443_TCP=tcp://10.184.240.1:443
FLINK_JOBMANAGER_SERVICE_PORT_TCP_BLOB=6124
FLINK_JOBMANAGER_PORT_8081_TCP_PORT=8081
FLINK_TASKMANAGER_PORT_6122_TCP=tcp://10.184.250.252:6122
FLINK_TASKMANAGER_PORT_6126_TCP_ADDR=10.184.250.252
TERM=xterm
The configuration file as well
# dir: hdfs:///completed-jobs/
# # Interval in milliseconds for refreshing the monitored directories.
# fs.refresh-interval: 10000
blob.server.port: 6124
query.server.port: 6125
taskmanager.numberOfTaskSlots: 2
taskmanager.memory.process.size: 3g
taskmanager.memory.flink.size: 1g
taskmanager.memory.jvm-metaspace.size: 1g
jobmanager.bind-host: 0.0.0.0
jobmanager.rpc.address: flink-jobmanager
jobmanager.rpc.port: 6123
metrics.internal.query-service.port: 6126
rest.port: 8081
taskmanager.bind-host: 0.0.0.0
taskmanager.data.port: 6121
taskmanager.host: flink-taskmanager-0.flink-taskmanager-headless.test.svc.cluster.local
taskmanager.rpc.port: 6122
The pods were running
NAME READY STATUS RESTARTS AGE
flink-jobmanager-5bdfb7b457-hzdxk 1/1 Running 0 46s
flink-taskmanager-0 1/1 Running 0 45s
I'm using kubectl 1.29
Client Version: v1.29.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.1-gke.1589018
@jotamartos Thanks for checking from your end. But I'm still seeing the same error for the taskmanager pods in the QA EKS cluster, even after removing the container images from the worker nodes and redeploying the Flink Helm chart.
By any chance, has the Bitnami team come across this duplicate key error in any of your testing? If so, can you help me understand how it was resolved?
Hi @Nikhil-Devisetti,
I'm sorry, but our tests worked properly and we couldn't reproduce the issue. Please ensure you are using a supported version of Kubernetes and that there is no old PV/PVC affecting the deployment.
If you continue running into the issue, I suggest you try a different cluster to check if the problem persists there.
hi @jotamartos ,
we're using k8s 1.28 and there is no PV/PVC mounted for config.yaml, since it comes from the Bitnami Flink image.
Could you check and let me know whether there is any change in the order in which the Flink container reads its configs? Do values from Helm or values baked into the image take precedence?
I have debugged this issue and now understand why it is happening.
The indirect root cause arrived with Flink 1.19, which significantly changed how configuration is handled. The linked documentation describes a major behaviour change:
Duplicated keys:
flink-conf.yaml: Allows duplicated keys and takes the last key-value pair for the corresponding key that appears in the file.
config.yaml: Does not allow duplicated keys, and an error will be reported when loading the configuration.
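The difference between the two semantics can be sketched in shell over a flat key: value file (the /tmp path and the awk-based checks are illustrative stand-ins; Flink's real parser is snakeyaml-engine):

```shell
# Config with a duplicated key, as a restarted entrypoint produces it.
cat > /tmp/demo-config.yaml <<'EOF'
taskmanager.numberOfTaskSlots: 2
taskmanager.memory.process.size: 6g
rest.port: 8081
taskmanager.memory.process.size: 6g
EOF

# Legacy flink-conf.yaml semantics: the last occurrence silently wins.
awk -F': ' '{v[$1]=$2} END {print v["taskmanager.memory.process.size"]}' /tmp/demo-config.yaml  # prints 6g

# Flink 1.19 config.yaml semantics: any duplicated key is a fatal error.
awk -F': ' '{print $1}' /tmp/demo-config.yaml | sort | uniq -d \
  | while read -r key; do echo "ERROR: found duplicate key $key"; done
```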
So the error:
flink 07:20:57.96 INFO ==> ** Starting Apache Flink Task Manager
[ERROR] The execution result is empty.
[ERROR] Could not get JVM parameters and dynamic configurations properly.
[ERROR] Raw output from BashJavaUtils:
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
INFO [] - Using standard YAML parser to load flink configuration file from /opt/bitnami/flink/conf/config.yaml.
ERROR [] - Failed to parse YAML configuration
org.snakeyaml.engine.v2.exceptions.YamlEngineException: while constructing a mapping
in reader, line 21, column 1
found duplicate key taskmanager.memory.process.size
in reader, line 313, column 1
at org.snakeyaml.engine.v2.constructor.StandardConstructor.processDuplicateKeys(StandardConstructor.java:90) ~[flink-dist-1.19.0.jar:1.19.0]
at org.snakeyaml.engine.v2.constructor.StandardConstructor.flattenMapping(StandardConstructor.java:70) ~[flink-dist-1.19.0.jar:1.19.0]
at org.snakeyaml.engine.v2.constructor.StandardConstructor.constructMapping2ndStep(StandardConstructor.java:119) ~[flink-dist-1.19.0.jar:1.19.0]
at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructMapping(BaseConstructor.java:278) ~[flink-dist-1.19.0.jar:1.19.0]
at org.snakeyaml.engine.v2.constructor.StandardConstructor$ConstructYamlMap.construct(StandardConstructor.java:203) ~[flink-dist-1.19.0.jar:1.19.0]
at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObjectNoCheck(BaseConstructor.java:153) ~[flink-dist-1.19.0.jar:1.19.0]
at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObject(BaseConstructor.java:133) ~[flink-dist-1.19.0.jar:1.19.0]
at org.snakeyaml.engine.v2.constructor.BaseConstructor.construct(BaseConstructor.java:92) ~[flink-dist-1.19.0.jar:1.19.0]
at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructSingleDocument(BaseConstructor.java:79) ~[flink-dist-1.19.0.jar:1.19.0]
at org.snakeyaml.engine.v2.api.Load.loadOne(Load.java:111) ~[flink-dist-1.19.0.jar:1.19.0]
at org.snakeyaml.engine.v2.api.Load.loadFromInputStream(Load.java:123) ~[flink-dist-1.19.0.jar:1.19.0]
at org.apache.flink.configuration.YamlParserUtils.loadYamlFile(YamlParserUtils.java:100) [flink-dist-1.19.0.jar:1.19.0]
at org.apache.flink.configuration.GlobalConfiguration.loadYAMLResource(GlobalConfiguration.java:347) [flink-dist-1.19.0.jar:1.19.0]
at org.apache.flink.configuration.GlobalConfiguration.loadConfiguration(GlobalConfiguration.java:163) [flink-dist-1.19.0.jar:1.19.0]
at org.apache.flink.runtime.util.ConfigurationParserUtils.loadCommonConfiguration(ConfigurationParserUtils.java:154) [flink-dist-1.19.0.jar:1.19.0]
comes from the switch to the new YAML config file and the duplicated configuration keys.
Using values:
$ cat values.yaml
jobmanager:
extraEnvVars:
- name: FLINK_PROPERTIES
value: |
jobmanager.memory.process.size: 1g
taskmanager:
extraEnvVars:
- name: FLINK_PROPERTIES
value: |
taskmanager.memory.process.size: 2g
we cannot reproduce this issue. That is because of how the Bitnami Docker image handles configuration injection: the script https://github.com/bitnami/containers/blob/main/bitnami/flink/1/debian-12/rootfs/opt/bitnami/scripts/flink-env.sh takes the env variables and the default configuration, merges all of them (also eliminating any overlaps), and injects the result into /opt/bitnami/flink/conf/config.yaml, which is what Flink itself reads.
This configuration process is bundled into the container entrypoint: https://github.com/bitnami/containers/blob/fe7ac75eb9e200ed06efeeb6a11bdef77f922cd7/bitnami/flink/1/debian-12/Dockerfile#L60
At this point the issue does not appear: the config file looks fine and the pods start and run as expected, but ...
Containers can crash or run out of memory. In such cases Kubernetes (via the configured health checks) detects the failure and performs a restart, and this is the whole problem: on a restart Kubernetes does not recreate the pod, it just re-runs the entrypoint. On this second run the configuration builder runs again and injects its incremental changes into config.yaml,
so the configuration goes from correct to corrupted (given the Flink 1.19 changes and the move to strict YAML parsing).
.............
blob.server.port: 6124
query.server.port: 6125
taskmanager.numberOfTaskSlots: 4
taskmanager.memory.process.size: 6g
jobmanager.bind-host: 0.0.0.0
jobmanager.rpc.address: flink-jobmanager
jobmanager.rpc.port: 6123
metrics.internal.query-service.port: 6126
rest.port: 8081
taskmanager.bind-host: 0.0.0.0
taskmanager.data.port: 6121
taskmanager.host: flink-taskmanager-0.flink-taskmanager-headless.flink.svc.cluster.local
taskmanager.rpc.port: 6122
taskmanager.memory.process.size: 6g
so as a result taskmanager.memory.process.size: 6g
appears twice, which the strict YAML parser rejects as a duplicate key.
The reason for this behaviour is the -n flag in https://github.com/bitnami/containers/blob/fe7ac75eb9e200ed06efeeb6a11bdef77f922cd7/bitnami/flink/1/debian-12/rootfs/opt/bitnami/scripts/flink/entrypoint.sh#L25:
-n, --no-clobber
do not overwrite an existing file (overrides a -u or
previous -i option). See also --update
So on a restart the entrypoint re-reads the env variables and injects them into the config a second time.
This explains why the issue appears only from time to time: it requires a pod restart (a restart does not clean up ephemeral container data, whereas pod deletion cleans it up and avoids the issue).
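The effect of cp -n across two entrypoint runs can be reproduced in isolation (the /tmp paths below are illustrative, not the image's real layout):

```shell
# First "entrypoint run": copy the default config into place, then
# append the settings derived from FLINK_PROPERTIES.
rm -rf /tmp/conf && mkdir -p /tmp/conf-default /tmp/conf
echo 'taskmanager.numberOfTaskSlots: 2' > /tmp/conf-default/config.yaml
cp -n /tmp/conf-default/config.yaml /tmp/conf/config.yaml
echo 'taskmanager.memory.process.size: 6g' >> /tmp/conf/config.yaml

# Simulated container restart: -n (no-clobber) leaves the already
# modified file in place, so the same key is appended a second time.
# (|| true: newer coreutils make a skipped cp -n exit nonzero)
cp -n /tmp/conf-default/config.yaml /tmp/conf/config.yaml || true
echo 'taskmanager.memory.process.size: 6g' >> /tmp/conf/config.yaml

grep -c 'taskmanager.memory.process.size' /tmp/conf/config.yaml  # prints 2
```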
The simplest solution would probably be to remove the -n flag
from https://github.com/bitnami/containers/blob/fe7ac75eb9e200ed06efeeb6a11bdef77f922cd7/bitnami/flink/1/debian-12/rootfs/opt/bitnami/scripts/flink/entrypoint.sh#L25.
A harder solution, which will also be needed in the future, is to rewrite the entrypoint configuration scripts to operate on proper YAML and add a duplicate-key merging mechanism there.
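As a sketch of the second approach, a dedup pass over flat key: value lines (keeping the last occurrence, i.e. the legacy flink-conf.yaml semantics) could look like this; dedup_config is a hypothetical helper, not part of the Bitnami scripts, and it does not handle nested YAML:

```shell
# Keep only the last occurrence of each top-level key. This only
# handles flat "key: value" lines, not nested YAML structures.
dedup_config() {
  tac "$1" | awk -F': ' '!seen[$1]++' | tac
}

printf '%s\n' \
  'taskmanager.numberOfTaskSlots: 4' \
  'taskmanager.memory.process.size: 6g' \
  'rest.port: 8081' \
  'taskmanager.memory.process.size: 6g' > /tmp/dup.yaml

dedup_config /tmp/dup.yaml  # each key appears exactly once
```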
To replicate the issue, just run a container with a huge taskmanager.memory.process.size so that the Java process crashes during start; that triggers the mentioned problem.
In my case I replicated it by running:
jobmanager:
extraEnvVars:
- name: FLINK_PROPERTIES
value: |
jobmanager.memory.process.size: 1g
taskmanager:
extraEnvVars:
- name: FLINK_PROPERTIES
value: |
taskmanager.memory.process.size: 9g
on worker nodes that have 8 GB of physical memory, so Java was not able to reserve the memory, crashed, and the issue was triggered.
I have also opened https://github.com/bitnami/containers/issues/67010 for this.
Hi @daroga0002,
Thank you for taking the time to create the PR but I'm pretty sure that the issue is not that one. I just tried to reproduce the issue again, but we couldn't do so.
We created a custom.yaml file with the values you just mentioned above
jobmanager:
extraEnvVars:
- name: FLINK_PROPERTIES
value: |
jobmanager.memory.process.size: 1g
taskmanager:
extraEnvVars:
- name: FLINK_PROPERTIES
value: |
taskmanager.memory.process.size: 1g
And installed the chart using that file
helm install flink -f values_custom.yaml bitnami/flink
Once the pods were running, we confirmed that the configuration file looked correct
$ k exec flink-taskmanager-0 -- cat /opt/bitnami/flink/conf/config.yaml | tail -n 20
Warning: Use tokens from the TokenRequest API or manually created secret-based tokens instead of auto-generated secret-based tokens.
# fs:
# # Comma separated list of directories to monitor for completed jobs.
# dir: hdfs:///completed-jobs/
# # Interval in milliseconds for refreshing the monitored directories.
# fs.refresh-interval: 10000
blob.server.port: 6124
query.server.port: 6125
taskmanager.numberOfTaskSlots: 2
taskmanager.memory.process.size: 1g
jobmanager.bind-host: 0.0.0.0
jobmanager.rpc.address: flink-jobmanager
jobmanager.rpc.port: 6123
metrics.internal.query-service.port: 6126
rest.port: 8081
taskmanager.bind-host: 0.0.0.0
taskmanager.data.port: 6121
taskmanager.host: flink-taskmanager-0.flink-taskmanager-headless.test.svc.cluster.local
taskmanager.rpc.port: 6122
and deleted the pod
k delete pod flink-taskmanager-0
The pod was recreated and it worked as expected
$ k get pods
NAME READY STATUS RESTARTS AGE
flink-jobmanager-6c4fc5fd77-l7fpw 1/1 Running 0 3m2s
flink-taskmanager-0 1/1 Running 0 22s
$ k exec flink-taskmanager-0 -- cat /opt/bitnami/flink/conf/config.yaml | tail -n 20
# fs:
# # Comma separated list of directories to monitor for completed jobs.
# dir: hdfs:///completed-jobs/
# # Interval in milliseconds for refreshing the monitored directories.
# fs.refresh-interval: 10000
blob.server.port: 6124
query.server.port: 6125
taskmanager.numberOfTaskSlots: 2
taskmanager.memory.process.size: 1g
jobmanager.bind-host: 0.0.0.0
jobmanager.rpc.address: flink-jobmanager
jobmanager.rpc.port: 6123
metrics.internal.query-service.port: 6126
rest.port: 8081
taskmanager.bind-host: 0.0.0.0
taskmanager.data.port: 6121
taskmanager.host: flink-taskmanager-0.flink-taskmanager-headless.test.svc.cluster.local
taskmanager.rpc.port: 6122
Are these the steps you are following when reproducing the issue? What version of Kubernetes and the Bitnami Flink chart are you using?
$ k version
Client Version: v1.29.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.1-gke.1589020
Regarding your PR, I can't accept it because it'll break the use case where the user provides a custom configuration file. The copy action would overwrite the entire customization.
It is not about deleting the pod, but about the situation where the pod restarts itself.
Deleting a pod (even in a StatefulSet) removes the pod's ephemeral disk, while a restart (due to a panic or crash) preserves it.
To replicate it, just run a pod with a higher memory setting (taskmanager.memory.process.size) than your workers have (for example, 8 GB workers and a 9 GB value); Java will then panic because the memory cannot be reserved, Kubernetes will restart the pod, and you will have the issue replicated.
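The restart vs. delete distinction can be simulated locally (a hypothetical /tmp path stands in for the container's writable layer):

```shell
# "Restart": the writable layer survives, so a key appended by the
# previous entrypoint run is still present and gets appended again.
rm -rf /tmp/layer && mkdir -p /tmp/layer
echo 'taskmanager.memory.process.size: 9g' >> /tmp/layer/config.yaml   # run 1
echo 'taskmanager.memory.process.size: 9g' >> /tmp/layer/config.yaml   # run 2, after restart
grep -c 'process.size' /tmp/layer/config.yaml   # prints 2: duplicate key

# "Delete + recreate": the layer is rebuilt from the image, so the
# entrypoint starts from a clean config and no duplicate appears.
rm -rf /tmp/layer && mkdir -p /tmp/layer
echo 'taskmanager.memory.process.size: 9g' >> /tmp/layer/config.yaml
grep -c 'process.size' /tmp/layer/config.yaml   # prints 1
```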
Ok @daroga0002,
I just reproduced the issue when using "9g". Let us review the problem and take a look at your PR again
Hi @daroga0002,
We just found the issue and we are going to provide you with different solutions to solve it.
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.
Good day
The situation is repeating itself, even though I am using the latest Flink Helm chart.
INFO [] - Using standard YAML parser to load flink configuration file from /opt/bitnami/flink/conf/config.yaml.
ERROR [] - Failed to parse YAML configuration
org.snakeyaml.engine.v2.exceptions.YamlEngineException: while constructing a mapping
in reader, line 21, column 1
found duplicate key security.kerberos.login.keytab
in reader, line 319, column 1
at org.snakeyaml.engine.v2.constructor.StandardConstructor.processDuplicateKeys(StandardConstructor.java:90) ~[flink-dist-1.19.1.jar:1.19.1]
at org.snakeyaml.engine.v2.constructor.StandardConstructor.flattenMapping(StandardConstructor.java:70) ~[flink-dist-1.19.1.jar:1.19.1]
at org.snakeyaml.engine.v2.constructor.StandardConstructor.constructMapping2ndStep(StandardConstructor.java:119) ~[flink-dist-1.19.1.jar:1.19.1]
at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructMapping(BaseConstructor.java:278) ~[flink-dist-1.19.1.jar:1.19.1]
at org.snakeyaml.engine.v2.constructor.StandardConstructor$ConstructYamlMap.construct(StandardConstructor.java:203) ~[flink-dist-1.19.1.jar:1.19.1]
at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObjectNoCheck(BaseConstructor.java:153) ~[flink-dist-1.19.1.jar:1.19.1]
at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObject(BaseConstructor.java:133) ~[flink-dist-1.19.1.jar:1.19.1]
at org.snakeyaml.engine.v2.constructor.BaseConstructor.construct(BaseConstructor.java:92) ~[flink-dist-1.19.1.jar:1.19.1]
at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructSingleDocument(BaseConstructor.java:79) ~[flink-dist-1.19.1.jar:1.19.1]
at org.snakeyaml.engine.v2.api.Load.loadOne(Load.java:111) ~[flink-dist-1.19.1.jar:1.19.1]
at org.snakeyaml.engine.v2.api.Load.loadFromInputStream(Load.java:123) ~[flink-dist-1.19.1.jar:1.19.1]
at org.apache.flink.configuration.YamlParserUtils.loadYamlFile(YamlParserUtils.java:100) [flink-dist-1.19.1.jar:1.19.1]
at org.apache.flink.configuration.GlobalConfiguration.loadYAMLResource(GlobalConfiguration.java:347) [flink-dist-1.19.1.jar:1.19.1]
at org.apache.flink.configuration.GlobalConfiguration.loadConfiguration(GlobalConfiguration.java:163) [flink-dist-1.19.1.jar:1.19.1]
at org.apache.flink.runtime.util.ConfigurationParserUtils.loadCommonConfiguration(ConfigurationParserUtils.java:154) [flink-dist-1.19.1.jar:1.19.1]
at org.apache.flink.runtime.util.bash.FlinkConfigLoader.loadConfiguration(FlinkConfigLoader.java:41) [flink-dist-1.19.1.jar:2.17.1]
at org.apache.flink.runtime.util.bash.BashJavaUtils.runCommand(BashJavaUtils.java:66) [bash-java-utils.jar:2.17.1]
at org.apache.flink.runtime.util.bash.BashJavaUtils.main(BashJavaUtils.java:56) [bash-java-utils.jar:2.17.1]
Exception in thread "main" java.lang.RuntimeException: Error parsing YAML configuration.
at org.apache.flink.configuration.GlobalConfiguration.loadYAMLResource(GlobalConfiguration.java:352)
at org.apache.flink.configuration.GlobalConfiguration.loadConfiguration(GlobalConfiguration.java:163)
at org.apache.flink.runtime.util.ConfigurationParserUtils.loadCommonConfiguration(ConfigurationParserUtils.java:154)
at org.apache.flink.runtime.util.bash.FlinkConfigLoader.loadConfiguration(FlinkConfigLoader.java:41)
at org.apache.flink.runtime.util.bash.BashJavaUtils.runCommand(BashJavaUtils.java:66)
at org.apache.flink.runtime.util.bash.BashJavaUtils.main(BashJavaUtils.java:56)
Caused by: org.snakeyaml.engine.v2.exceptions.YamlEngineException: while constructing a mapping
in reader, line 21, column 1
found duplicate key security.kerberos.login.keytab
in reader, line 319, column 1
at org.snakeyaml.engine.v2.constructor.StandardConstructor.processDuplicateKeys(StandardConstructor.java:90)
at org.snakeyaml.engine.v2.constructor.StandardConstructor.flattenMapping(StandardConstructor.java:70)
at org.snakeyaml.engine.v2.constructor.StandardConstructor.constructMapping2ndStep(StandardConstructor.java:119)
at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructMapping(BaseConstructor.java:278)
at org.snakeyaml.engine.v2.constructor.StandardConstructor$ConstructYamlMap.construct(StandardConstructor.java:203)
at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObjectNoCheck(BaseConstructor.java:153)
at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObject(BaseConstructor.java:133)
at org.snakeyaml.engine.v2.constructor.BaseConstructor.construct(BaseConstructor.java:92)
at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructSingleDocument(BaseConstructor.java:79)
at org.snakeyaml.engine.v2.api.Load.loadOne(Load.java:111)
at org.snakeyaml.engine.v2.api.Load.loadFromInputStream(Load.java:123)
at org.apache.flink.configuration.YamlParserUtils.loadYamlFile(YamlParserUtils.java:100)
at org.apache.flink.configuration.GlobalConfiguration.loadYAMLResource(GlobalConfiguration.java:347)
... 5 more
values.yaml
taskmanager:
  extraEnvVars:
    - name: FLINK_PROPERTIES
      value: |
        security.kerberos.login.keytab: /vault/secrets/name.keytab
        security.kerberos.login.principal: name@name.name.RU
        security.kerberos.login.use-ticket-cache: false
        javax.security.auth.useSubjectCredsOnly: false
        sun.security.krb5.debug: true
        java.security.krb5.conf: /opt/bitnami/flink/krb5krb5.conf
        java.security.auth.login.config: /opt/bitnami/flink/krb5/jaas.conf
jobmanager:
  extraEnvVars:
    - name: FLINK_PROPERTIES
      value: |
        security.kerberos.login.contexts: Client,KafkaClient
        security.kerberos.login.keytab: /vault/secrets/name.keytab
        security.kerberos.login.principal: name@name.name.RU
        security.kerberos.login.use-ticket-cache: false
        javax.security.auth.useSubjectCredsOnly: false
        sun.security.krb5.debug: true
        java.security.krb5.conf: /opt/bitnami/flink/krb5krb5.conf
        java.security.auth.login.config: /opt/bitnami/flink/krb5/jaas.conf
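The error above comes from a duplicate `security.kerberos.login.keytab` key in the generated config file (the SnakeYAML engine used by Flink 1.19's config loader rejects duplicate mapping keys). One quick way to confirm which keys got duplicated is to count top-level keys in the rendered config. Below is a minimal, self-contained sketch; the sample file path and contents are illustrative, not the chart's actual output (against a live pod you would run the same `awk` over `/opt/bitnami/flink/conf/config.yaml` via `kubectl exec`):

```shell
#!/bin/sh
# Illustrative only: build a sample config that mimics a config.yaml where
# FLINK_PROPERTIES lines were appended twice across container restarts.
cat > /tmp/sample-config.yaml <<'EOF'
jobmanager.rpc.address: flink-jobmanager
security.kerberos.login.keytab: /vault/secrets/name.keytab
security.kerberos.login.keytab: /vault/secrets/name.keytab
EOF

# Print every top-level key that appears more than once.
awk -F': ' '{count[$1]++} END {for (k in count) if (count[k] > 1) print k}' \
    /tmp/sample-config.yaml
```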
deployment.yaml
volumeMounts:
  - name: empty-dir
    mountPath: /tmp
    subPath: tmp-dir
  - name: empty-dir
    mountPath: /opt/bitnami/flink/conf
    subPath: app-conf-dir
  - name: empty-dir
    mountPath: /opt/bitnami/flink/log
    subPath: app-logs-dir
  # HACK: Workaround to bypass the libflink.sh persist_app logic
  - name: empty-dir
    mountPath: /bitnami/flink/conf
    subPath: app-conf-dir
  - name: flink-config
    mountPath: /opt/bitnami/flink/krb5
statefulset.yaml
volumeMounts:
  - name: empty-dir
    mountPath: /tmp
    subPath: tmp-dir
  - name: empty-dir
    mountPath: /opt/bitnami/flink/conf
    subPath: app-conf-dir
  - name: empty-dir
    mountPath: /opt/bitnami/flink/log
    subPath: app-logs-dir
  # HACK: Workaround to bypass the libflink.sh persist_app logic
  - name: empty-dir
    mountPath: /bitnami/flink/conf
    subPath: app-conf-dir
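If patching the rendered templates directly is undesirable, a similar workaround can usually be expressed from values.yaml alone. This sketch assumes the chart exposes `jobmanager.extraVolumeMounts` and `taskmanager.extraVolumeMounts` (most Bitnami charts do; check the chart's values.yaml to confirm):

```yaml
# Hedged sketch: mount the emptyDir over the persisted conf dir via chart
# values instead of editing deployment.yaml/statefulset.yaml by hand.
jobmanager:
  extraVolumeMounts:
    - name: empty-dir
      mountPath: /bitnami/flink/conf
      subPath: app-conf-dir
taskmanager:
  extraVolumeMounts:
    - name: empty-dir
      mountPath: /bitnami/flink/conf
      subPath: app-conf-dir
```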
Is it possible to fix this without modifying the chart sources? Or what am I doing wrong?
Caused by: java.lang.IllegalArgumentException: Could not find a 'KafkaClient' entry in the JAAS configuration. System property 'java.security.auth.login.config' is /tmp/jaas-11710686897277381671.conf
at org.apache.kafka.common.security.JaasContext.defaultContext(JaasContext.java:133) ~[blob_p-258eac3f3889e5e999b10b7e084a31232ef0747a-1396b03ecf51fac0ee16593172a4a69b:?]
at org.apache.kafka.common.security.JaasContext.load(JaasContext.java:98) ~[blob_p-258eac3f3889e5e999b10b7e084a31232ef0747a-1396b03ecf51fac0ee16593172a4a69b:?]
at org.apache.kafka.common.security.JaasContext.loadClientContext(JaasContext.java:84) ~[blob_p-258eac3f3889e5e999b10b7e084a31232ef0747a-1396b03ecf51fac0ee16593172a4a69b:?]
at org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:124) ~[blob_p-258eac3f3889e5e999b10b7e084a31232ef0747a-1396b03ecf51fac0ee16593172a4a69b:?]
at org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:67) ~[blob_p-258eac3f3889e5e999b10b7e084a31232ef0747a-1396b03ecf51fac0ee16593172a4a69b:?]
at org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:99) ~[blob_p-258eac3f3889e5e999b10b7e084a31232ef0747a-1396b03ecf51fac0ee16593172a4a69b:?]
at org.apache.kafka.clients.producer.KafkaProducer.newSender(KafkaProducer.java:450) ~[blob_p-258eac3f3889e5e999b10b7e084a31232ef0747a-1396b03ecf51fac0ee16593172a4a69b:?]
at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:421) ~[blob_p-258eac3f3889e5e999b10b7e084a31232ef0747a-1396b03ecf51fac0ee16593172a4a69b:?]
... 22 more
2024-08-30 09:54:05,173 INFO org.apache.flink.runtime.taskmanager.Task [] - Freeing task resources for Sink: Writer -> Sink: Committer (1/1)#0 (f9161ca24c3aba2ce7ac4bb83c111093_20ba6b65f97481d5570070de90e4e791_0_0).
2024-08-30 09:54:05,264 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Un-registering task and sending final execution state FAILED to JobManager for task Sink: Writer -> Sink: Committer (1/1)#0 f9161ca24c3aba2ce7ac4bb83c111093_20ba6b65f97481d5570070de90e4e791_0_0.
2024-08-30 09:54:06,368 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot TaskSlot(index:40, state:ACTIVE, resource profile: ResourceProfile{cpuCores=1, taskHeapMemory=9.600mb (10066329 bytes), taskOffHeapMemory=0 bytes, managedMemory=12.800mb (13421773 bytes), networkMemory=3.200mb (3355443 bytes)}, allocationId: 4c61ce397e4580ad4c4fdfac502b138a, jobId: f986637fb3f57f923e8e6027d40dd994).
2024-08-30 09:54:06,371 INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Remove job f986637fb3f57f923e8e6027d40dd994 from job leader monitoring.
2024-08-30 09:54:06,371 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Close JobManager connection for job f986637fb3f57f923e8e6027d40dd994.
Name and Version
bitnami/flink 1.0.1
What architecture are you using?
amd64
What steps will reproduce the bug?
We used the flink-1.0.1 Helm chart to deploy Flink in our EKS cluster. In our Helm values, we override the values below for the jobmanager and taskmanager.
When we deploy the Helm chart, the pods run without any issues, but after a few days they suddenly go into a crash loop because the container inside fails with the errors below.
Jobmanager:
Taskmanager:
Could someone check and help resolve this issue as soon as possible? It is blocking our activities.
Are you using any custom parameters or values?
What is the expected behavior?
The Flink pods should be up and running without any issues.
What do you see instead?
The pods start crash-looping after a few days.