bitnami / charts

Bitnami Helm Charts
https://bitnami.com

Flink containers are crashlooping due to duplicate key error #25457

Closed Nikhil-Devisetti closed 4 months ago

Nikhil-Devisetti commented 6 months ago

Name and Version

bitnami/flink 1.0.1

What architecture are you using?

amd64

What steps will reproduce the bug?

We used the flink 1.0.1 Helm chart to deploy Flink in our EKS cluster. In our Helm values, we override the following values for the jobmanager and taskmanager.

jobmanager:
  extraEnvVars:
  - name: FLINK_PROPERTIES
    value: |
      jobmanager.memory.process.size: 12g
taskmanager:
  extraEnvVars:
  - name: FLINK_PROPERTIES
    value: |
      taskmanager.memory.process.size: 12g

When we deploy the Helm chart, the pods run without any issues, but after a few days they suddenly go into a crashloop because the container inside fails with the errors below.

Jobmanager:

flink 07:25:41.69 INFO  ==>
flink 07:25:41.69 INFO  ==> Welcome to the Bitnami flink container
flink 07:25:41.70 INFO  ==> Subscribe to project updates by watching https://github.com/bitnami/containers
flink 07:25:41.70 INFO  ==> Submit issues and feature requests at https://github.com/bitnami/containers/issues
flink 07:25:41.70 INFO  ==> Upgrade to Tanzu Application Catalog for production environments to access custom-configured and pre-packaged software components. Gain enhanced features, including Software Bill of Materials (SBOM), CVE scan result reports, and VEX documents. To learn more, visit https://bitnami.com/enterprise
flink 07:25:41.70 INFO  ==>
flink 07:25:41.70 INFO  ==> ** Starting Apache Flink jobmanager setup **
flink 07:25:41.84 INFO  ==> ** FLINK jobmanager setup finished! **

flink 07:25:41.86 INFO  ==> ** Starting Apache Flink Job Manager
[ERROR] The execution result is empty.
[ERROR] Could not get JVM parameters and dynamic configurations properly.
[ERROR] Raw output from BashJavaUtils:
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
INFO  [] - Using standard YAML parser to load flink configuration file from /opt/bitnami/flink/conf/config.yaml.
ERROR [] - Failed to parse YAML configuration
org.snakeyaml.engine.v2.exceptions.YamlEngineException: while constructing a mapping
 in reader, line 21, column 1
found duplicate key jobmanager.memory.process.size
 in reader, line 316, column 1

    at org.snakeyaml.engine.v2.constructor.StandardConstructor.processDuplicateKeys(StandardConstructor.java:90) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.StandardConstructor.flattenMapping(StandardConstructor.java:70) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.StandardConstructor.constructMapping2ndStep(StandardConstructor.java:119) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructMapping(BaseConstructor.java:278) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.StandardConstructor$ConstructYamlMap.construct(StandardConstructor.java:203) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObjectNoCheck(BaseConstructor.java:153) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObject(BaseConstructor.java:133) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.construct(BaseConstructor.java:92) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructSingleDocument(BaseConstructor.java:79) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.api.Load.loadOne(Load.java:111) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.api.Load.loadFromInputStream(Load.java:123) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.apache.flink.configuration.YamlParserUtils.loadYamlFile(YamlParserUtils.java:100) [flink-dist-1.19.0.jar:1.19.0]
    at org.apache.flink.configuration.GlobalConfiguration.loadYAMLResource(GlobalConfiguration.java:347) [flink-dist-1.19.0.jar:1.19.0]
    at org.apache.flink.configuration.GlobalConfiguration.loadConfiguration(GlobalConfiguration.java:163) [flink-dist-1.19.0.jar:1.19.0]
    at org.apache.flink.runtime.util.ConfigurationParserUtils.loadCommonConfiguration(ConfigurationParserUtils.java:154) [flink-dist-1.19.0.jar:1.19.0]
    at org.apache.flink.runtime.util.bash.FlinkConfigLoader.loadConfiguration(FlinkConfigLoader.java:41) [flink-dist-1.19.0.jar:2.17.1]
    at org.apache.flink.runtime.util.bash.BashJavaUtils.runCommand(BashJavaUtils.java:68) [bash-java-utils.jar:2.17.1]
    at org.apache.flink.runtime.util.bash.BashJavaUtils.main(BashJavaUtils.java:56) [bash-java-utils.jar:2.17.1]
Exception in thread "main" java.lang.RuntimeException: Error parsing YAML configuration.
    at org.apache.flink.configuration.GlobalConfiguration.loadYAMLResource(GlobalConfiguration.java:352)
    at org.apache.flink.configuration.GlobalConfiguration.loadConfiguration(GlobalConfiguration.java:163)
    at org.apache.flink.runtime.util.ConfigurationParserUtils.loadCommonConfiguration(ConfigurationParserUtils.java:154)
    at org.apache.flink.runtime.util.bash.FlinkConfigLoader.loadConfiguration(FlinkConfigLoader.java:41)
    at org.apache.flink.runtime.util.bash.BashJavaUtils.runCommand(BashJavaUtils.java:68)
    at org.apache.flink.runtime.util.bash.BashJavaUtils.main(BashJavaUtils.java:56)
Caused by: org.snakeyaml.engine.v2.exceptions.YamlEngineException: while constructing a mapping
 in reader, line 21, column 1
found duplicate key jobmanager.memory.process.size
 in reader, line 316, column 1

    at org.snakeyaml.engine.v2.constructor.StandardConstructor.processDuplicateKeys(StandardConstructor.java:90)
    at org.snakeyaml.engine.v2.constructor.StandardConstructor.flattenMapping(StandardConstructor.java:70)
    at org.snakeyaml.engine.v2.constructor.StandardConstructor.constructMapping2ndStep(StandardConstructor.java:119)
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructMapping(BaseConstructor.java:278)
    at org.snakeyaml.engine.v2.constructor.StandardConstructor$ConstructYamlMap.construct(StandardConstructor.java:203)
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObjectNoCheck(BaseConstructor.java:153)
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObject(BaseConstructor.java:133)
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.construct(BaseConstructor.java:92)
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructSingleDocument(BaseConstructor.java:79)
    at org.snakeyaml.engine.v2.api.Load.loadOne(Load.java:111)
    at org.snakeyaml.engine.v2.api.Load.loadFromInputStream(Load.java:123)
    at org.apache.flink.configuration.YamlParserUtils.loadYamlFile(YamlParserUtils.java:100)
    at org.apache.flink.configuration.GlobalConfiguration.loadYAMLResource(GlobalConfiguration.java:347)
    ... 5 more

Taskmanager:

flink 07:31:51.69 INFO  ==>
flink 07:31:51.69 INFO  ==> Welcome to the Bitnami flink container
flink 07:31:51.69 INFO  ==> Subscribe to project updates by watching https://github.com/bitnami/containers
flink 07:31:51.69 INFO  ==> Submit issues and feature requests at https://github.com/bitnami/containers/issues
flink 07:31:51.69 INFO  ==> Upgrade to Tanzu Application Catalog for production environments to access custom-configured and pre-packaged software components. Gain enhanced features, including Software Bill of Materials (SBOM), CVE scan result reports, and VEX documents. To learn more, visit https://bitnami.com/enterprise
flink 07:31:51.69 INFO  ==>
flink 07:31:51.70 INFO  ==> ** Starting Apache Flink taskmanager setup **

flink 07:31:51.83 INFO  ==> ** FLINK taskmanager setup finished! **
flink 07:31:51.84 INFO  ==> ** Starting Apache Flink Task Manager
[ERROR] The execution result is empty.
[ERROR] Could not get JVM parameters and dynamic configurations properly.
[ERROR] Raw output from BashJavaUtils:
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
INFO  [] - Using standard YAML parser to load flink configuration file from /opt/bitnami/flink/conf/config.yaml.
ERROR [] - Failed to parse YAML configuration
org.snakeyaml.engine.v2.exceptions.YamlEngineException: while constructing a mapping
 in reader, line 21, column 1
found duplicate key taskmanager.memory.process.size
 in reader, line 313, column 1

    at org.snakeyaml.engine.v2.constructor.StandardConstructor.processDuplicateKeys(StandardConstructor.java:90) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.StandardConstructor.flattenMapping(StandardConstructor.java:70) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.StandardConstructor.constructMapping2ndStep(StandardConstructor.java:119) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructMapping(BaseConstructor.java:278) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.StandardConstructor$ConstructYamlMap.construct(StandardConstructor.java:203) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObjectNoCheck(BaseConstructor.java:153) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObject(BaseConstructor.java:133) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.construct(BaseConstructor.java:92) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructSingleDocument(BaseConstructor.java:79) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.api.Load.loadOne(Load.java:111) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.api.Load.loadFromInputStream(Load.java:123) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.apache.flink.configuration.YamlParserUtils.loadYamlFile(YamlParserUtils.java:100) [flink-dist-1.19.0.jar:1.19.0]
    at org.apache.flink.configuration.GlobalConfiguration.loadYAMLResource(GlobalConfiguration.java:347) [flink-dist-1.19.0.jar:1.19.0]
    at org.apache.flink.configuration.GlobalConfiguration.loadConfiguration(GlobalConfiguration.java:163) [flink-dist-1.19.0.jar:1.19.0]
    at org.apache.flink.runtime.util.ConfigurationParserUtils.loadCommonConfiguration(ConfigurationParserUtils.java:154) [flink-dist-1.19.0.jar:1.19.0]
    at org.apache.flink.runtime.util.bash.FlinkConfigLoader.loadConfiguration(FlinkConfigLoader.java:41) [flink-dist-1.19.0.jar:2.17.1]
    at org.apache.flink.runtime.util.bash.BashJavaUtils.runCommand(BashJavaUtils.java:66) [bash-java-utils.jar:2.17.1]
    at org.apache.flink.runtime.util.bash.BashJavaUtils.main(BashJavaUtils.java:56) [bash-java-utils.jar:2.17.1]
Exception in thread "main" java.lang.RuntimeException: Error parsing YAML configuration.
    at org.apache.flink.configuration.GlobalConfiguration.loadYAMLResource(GlobalConfiguration.java:352)
    at org.apache.flink.configuration.GlobalConfiguration.loadConfiguration(GlobalConfiguration.java:163)
    at org.apache.flink.runtime.util.ConfigurationParserUtils.loadCommonConfiguration(ConfigurationParserUtils.java:154)
    at org.apache.flink.runtime.util.bash.FlinkConfigLoader.loadConfiguration(FlinkConfigLoader.java:41)
    at org.apache.flink.runtime.util.bash.BashJavaUtils.runCommand(BashJavaUtils.java:66)
    at org.apache.flink.runtime.util.bash.BashJavaUtils.main(BashJavaUtils.java:56)
Caused by: org.snakeyaml.engine.v2.exceptions.YamlEngineException: while constructing a mapping
 in reader, line 21, column 1
found duplicate key taskmanager.memory.process.size
 in reader, line 313, column 1

    at org.snakeyaml.engine.v2.constructor.StandardConstructor.processDuplicateKeys(StandardConstructor.java:90)
    at org.snakeyaml.engine.v2.constructor.StandardConstructor.flattenMapping(StandardConstructor.java:70)
    at org.snakeyaml.engine.v2.constructor.StandardConstructor.constructMapping2ndStep(StandardConstructor.java:119)
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructMapping(BaseConstructor.java:278)
    at org.snakeyaml.engine.v2.constructor.StandardConstructor$ConstructYamlMap.construct(StandardConstructor.java:203)
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObjectNoCheck(BaseConstructor.java:153)
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObject(BaseConstructor.java:133)
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.construct(BaseConstructor.java:92)
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructSingleDocument(BaseConstructor.java:79)
    at org.snakeyaml.engine.v2.api.Load.loadOne(Load.java:111)
    at org.snakeyaml.engine.v2.api.Load.loadFromInputStream(Load.java:123)
    at org.apache.flink.configuration.YamlParserUtils.loadYamlFile(YamlParserUtils.java:100)
    at org.apache.flink.configuration.GlobalConfiguration.loadYAMLResource(GlobalConfiguration.java:347)
    ... 5 more

Could someone check and help resolve this issue at the earliest? It is blocking our activities.

Are you using any custom parameters or values?

jobmanager:
  extraEnvVars:
  - name: FLINK_PROPERTIES
    value: |
      jobmanager.memory.process.size: 12g
taskmanager:
  extraEnvVars:
  - name: FLINK_PROPERTIES
    value: |
      taskmanager.memory.process.size: 12g

What is the expected behavior?

Flink pods should be up and running without any issues.

What do you see instead?

Pods start crashlooping after a few days.

Nikhil-Devisetti commented 6 months ago

@dgomezleon Hi, did you get a chance to check this out? It would be really helpful if you could provide an update, since it is affecting our prod environment.

Nikhil-Devisetti commented 6 months ago

@dgomezleon Can you help with an update on this?

jotamartos commented 6 months ago

Hi @Nikhil-Devisetti,

Sorry for the delay here. We are going to check this and update the ticket once we have more information.

jotamartos commented 6 months ago

Hi @Nikhil-Devisetti,

I can't reproduce the issue when using the latest version of the chart. I edited the values.yaml file

diff --git a/bitnami/flink/values.yaml b/bitnami/flink/values.yaml
index af5c9cc0bb..ae756275dc 100644
--- a/bitnami/flink/values.yaml
+++ b/bitnami/flink/values.yaml
@@ -124,7 +124,10 @@ jobmanager:
   ##  - name: FOO
   ##    value: BAR
   ##
-  extraEnvVars: []
+  extraEnvVars:
+  - name: FLINK_PROPERTIES
+    value: |
+      jobmanager.memory.process.size: 1g
   ## @param jobmanager.extraEnvVarsCM Name of existing ConfigMap containing extra env vars
   ##
   extraEnvVarsCM: ""

and installed the Bitnami Chart. Once the pods were ready

$ k get pods
NAME                                READY   STATUS    RESTARTS   AGE
flink-jobmanager-5bdfb7b457-xcxlw   1/1     Running   0          66s
flink-taskmanager-0                 1/1     Running   0          66s

I confirmed that there were no errors in the log

$ k logs flink-jobmanager-5bdfb7b457-xcxlw | tail -n 10
2024-05-06 14:08:24,749 INFO  org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Successfully recovered 0 persisted job graphs.
2024-05-06 14:08:24,880 INFO  org.apache.flink.runtime.rpc.pekko.PekkoRpcService           [] - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at pekko://flink/user/rpc/dispatcher_0 .
2024-05-06 14:08:24,979 INFO  org.apache.flink.runtime.rpc.pekko.PekkoRpcService           [] - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at pekko://flink/user/rpc/resourcemanager_1 .
2024-05-06 14:08:25,161 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Starting the resource manager.
2024-05-06 14:08:25,250 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager [] - Starting the slot manager.
2024-05-06 14:08:25,252 INFO  org.apache.flink.runtime.security.token.DefaultDelegationTokenManager [] - Starting tokens update task
2024-05-06 14:08:25,253 WARN  org.apache.flink.runtime.security.token.DefaultDelegationTokenManager [] - No tokens obtained so skipping notifications
2024-05-06 14:08:25,253 WARN  org.apache.flink.runtime.security.token.DefaultDelegationTokenManager [] - Tokens update task not started because either no tokens obtained or none of the tokens specified its renewal date
2024-05-06 14:08:36,460 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering TaskManager with ResourceID flink-taskmanager-0.flink-taskmanager-headless.test.svc.cluster.local:6122-7d9c19 (pekko.tcp://flink@flink-taskmanager-0.flink-taskmanager-headless.test.svc.cluster.local:6122/user/rpc/taskmanager_0) at ResourceManager
2024-05-06 14:08:36,552 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager [] - Registering task executor flink-taskmanager-0.flink-taskmanager-headless.test.svc.cluster.local:6122-7d9c19 under 5fe9d0f3507d60d754ea49a8fdaf4764 at the slot manager.

and the conf file included the new parameter

$ k exec flink-jobmanager-5bdfb7b457-xcxlw -- cat /opt/bitnami/flink/conf/config.yaml | tail -n 10
query.server.port: 6125
taskmanager.numberOfTaskSlots: 2
jobmanager.memory.process.size: 1g
jobmanager.bind-host: 0.0.0.0
jobmanager.rpc.address: flink-jobmanager
jobmanager.rpc.bind-port: 6123
jobmanager.rpc.port: 6123
rest.address: flink-jobmanager
rest.bind-address: 0.0.0.0
rest.port: 8081
Nikhil-Devisetti commented 6 months ago

@jotamartos Hi, thanks for checking. The pods don't go into a crashloop immediately after deployment. In our case, too, they ran without any issues for a few days and then suddenly started crashing.

I'll redeploy with the latest Helm chart, observe for a few days, and update here accordingly.

jotamartos commented 6 months ago

Thanks! I tried to obtain more info, but I couldn't reproduce the issue. I deleted the pod and waited for the deployment to recreate it, but it worked as expected

$ k get pods
NAME                                READY   STATUS    RESTARTS   AGE
flink-jobmanager-5bdfb7b457-6npqf   1/1     Running   0          2m47s
flink-taskmanager-0                 1/1     Running   0          7m52s

$ k exec flink-jobmanager-5bdfb7b457-6npqf -- cat /opt/bitnami/flink/conf/config.yaml | tail -n 10
query.server.port: 6125
taskmanager.numberOfTaskSlots: 2
jobmanager.memory.process.size: 1g
jobmanager.bind-host: 0.0.0.0
jobmanager.rpc.address: flink-jobmanager
jobmanager.rpc.bind-port: 6123
jobmanager.rpc.port: 6123
rest.address: flink-jobmanager
rest.bind-address: 0.0.0.0
rest.port: 8081

Please let us know if you find something relevant.

Nikhil-Devisetti commented 6 months ago

@jotamartos Hi, I deployed the latest chart (flink-1.1.1) in our dev cluster with override values through Helm and it deployed fine, so I tried to replicate the same in the QA cluster. There, the jobmanager pod deployed with the override values, but the taskmanager is failing with the duplicate key error.

If I remove the below override values for the taskmanager, the pods deploy and run.

taskmanager:
  extraEnvVars:
  - name: FLINK_PROPERTIES
    value: |
      taskmanager.memory.process.size: 12g
      taskmanager.memory.flink.size: 10g
      taskmanager.memory.jvm-metaspace.size: 3g
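
As a quick diagnostic for the duplicate key error reported above, the rendered config can be scanned for repeated top-level keys. The following is a minimal sketch, not part of the chart or of Flink; it assumes the file was copied locally first, e.g. with `kubectl exec <taskmanager-pod> -- cat /opt/bitnami/flink/conf/config.yaml > config.yaml`:

```python
# Sketch: report top-level keys that appear more than once in a Flink
# config.yaml (Flink 1.19's strict YAML parser rejects such duplicates).
from collections import Counter

def find_duplicate_keys(text: str) -> list:
    counts = Counter()
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines, comments, and indented (nested) lines.
        if not stripped or stripped.startswith("#") or line[0].isspace():
            continue
        if ":" in stripped:
            counts[stripped.split(":", 1)[0].strip()] += 1
    return [key for key, n in counts.items() if n > 1]

sample = """\
taskmanager.numberOfTaskSlots: 2
taskmanager.memory.process.size: 12g
jobmanager.rpc.port: 6123
taskmanager.memory.process.size: 12g
"""
print(find_duplicate_keys(sample))  # ['taskmanager.memory.process.size']
```

Any key this prints is one that would make the strict parser fail at container startup.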

The details below are from the QA cluster.

$ k get po -n flink
NAME                                      READY   STATUS             RESTARTS        AGE
flink-ushur-jobmanager-5ff8d57684-l8cjx   1/1     Running            0               21m
flink-ushur-taskmanager-0                 1/1     Running            0               16m
flink-ushur-taskmanager-1                 1/1     Running            0               16m
flink-ushur-taskmanager-2                 0/1     CrashLoopBackOff   5 (2m21s ago)   5m43s
$ helm ls -n flink
NAME        NAMESPACE   REVISION    UPDATED                                 STATUS      CHART       APP VERSION
flink-ushur flink       3           2024-05-08 14:14:34.245067644 +0000 UTC deployed    flink-1.1.1 1.19.0
$ k logs flink-ushur-taskmanager-2 -n flink
flink 14:15:31.67 INFO  ==>
flink 14:15:31.67 INFO  ==> Welcome to the Bitnami flink container
flink 14:15:31.67 INFO  ==> Subscribe to project updates by watching https://github.com/bitnami/containers
flink 14:15:31.67 INFO  ==> Submit issues and feature requests at https://github.com/bitnami/containers/issues
flink 14:15:31.67 INFO  ==> Upgrade to Tanzu Application Catalog for production environments to access custom-configured and pre-packaged software components. Gain enhanced features, including Software Bill of Materials (SBOM), CVE scan result reports, and VEX documents. To learn more, visit https://bitnami.com/enterprise
flink 14:15:31.67 INFO  ==>
flink 14:15:31.67 INFO  ==> ** Starting Apache Flink taskmanager setup **
flink 14:15:31.83 INFO  ==> ** FLINK taskmanager setup finished! **

flink 14:15:31.84 INFO  ==> ** Starting Apache Flink Task Manager
[ERROR] The execution result is empty.
[ERROR] Could not get JVM parameters and dynamic configurations properly.
[ERROR] Raw output from BashJavaUtils:
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
INFO  [] - Using standard YAML parser to load flink configuration file from /opt/bitnami/flink/conf/config.yaml.
ERROR [] - Failed to parse YAML configuration
org.snakeyaml.engine.v2.exceptions.YamlEngineException: while constructing a mapping
 in reader, line 21, column 1
found duplicate key taskmanager.memory.process.size
 in reader, line 315, column 1

    at org.snakeyaml.engine.v2.constructor.StandardConstructor.processDuplicateKeys(StandardConstructor.java:90) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.StandardConstructor.flattenMapping(StandardConstructor.java:70) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.StandardConstructor.constructMapping2ndStep(StandardConstructor.java:119) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructMapping(BaseConstructor.java:278) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.StandardConstructor$ConstructYamlMap.construct(StandardConstructor.java:203) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObjectNoCheck(BaseConstructor.java:153) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObject(BaseConstructor.java:133) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.construct(BaseConstructor.java:92) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructSingleDocument(BaseConstructor.java:79) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.api.Load.loadOne(Load.java:111) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.snakeyaml.engine.v2.api.Load.loadFromInputStream(Load.java:123) ~[flink-dist-1.19.0.jar:1.19.0]
    at org.apache.flink.configuration.YamlParserUtils.loadYamlFile(YamlParserUtils.java:100) [flink-dist-1.19.0.jar:1.19.0]
    at org.apache.flink.configuration.GlobalConfiguration.loadYAMLResource(GlobalConfiguration.java:347) [flink-dist-1.19.0.jar:1.19.0]
    at org.apache.flink.configuration.GlobalConfiguration.loadConfiguration(GlobalConfiguration.java:163) [flink-dist-1.19.0.jar:1.19.0]
    at org.apache.flink.runtime.util.ConfigurationParserUtils.loadCommonConfiguration(ConfigurationParserUtils.java:154) [flink-dist-1.19.0.jar:1.19.0]
    at org.apache.flink.runtime.util.bash.FlinkConfigLoader.loadConfiguration(FlinkConfigLoader.java:41) [flink-dist-1.19.0.jar:2.17.1]
    at org.apache.flink.runtime.util.bash.BashJavaUtils.runCommand(BashJavaUtils.java:66) [bash-java-utils.jar:2.17.1]
    at org.apache.flink.runtime.util.bash.BashJavaUtils.main(BashJavaUtils.java:56) [bash-java-utils.jar:2.17.1]
Exception in thread "main" java.lang.RuntimeException: Error parsing YAML configuration.
    at org.apache.flink.configuration.GlobalConfiguration.loadYAMLResource(GlobalConfiguration.java:352)
    at org.apache.flink.configuration.GlobalConfiguration.loadConfiguration(GlobalConfiguration.java:163)
    at org.apache.flink.runtime.util.ConfigurationParserUtils.loadCommonConfiguration(ConfigurationParserUtils.java:154)
    at org.apache.flink.runtime.util.bash.FlinkConfigLoader.loadConfiguration(FlinkConfigLoader.java:41)
    at org.apache.flink.runtime.util.bash.BashJavaUtils.runCommand(BashJavaUtils.java:66)
    at org.apache.flink.runtime.util.bash.BashJavaUtils.main(BashJavaUtils.java:56)
Caused by: org.snakeyaml.engine.v2.exceptions.YamlEngineException: while constructing a mapping
 in reader, line 21, column 1
found duplicate key taskmanager.memory.process.size
 in reader, line 315, column 1

    at org.snakeyaml.engine.v2.constructor.StandardConstructor.processDuplicateKeys(StandardConstructor.java:90)
    at org.snakeyaml.engine.v2.constructor.StandardConstructor.flattenMapping(StandardConstructor.java:70)
    at org.snakeyaml.engine.v2.constructor.StandardConstructor.constructMapping2ndStep(StandardConstructor.java:119)
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructMapping(BaseConstructor.java:278)
    at org.snakeyaml.engine.v2.constructor.StandardConstructor$ConstructYamlMap.construct(StandardConstructor.java:203)
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObjectNoCheck(BaseConstructor.java:153)
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObject(BaseConstructor.java:133)
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.construct(BaseConstructor.java:92)
    at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructSingleDocument(BaseConstructor.java:79)
    at org.snakeyaml.engine.v2.api.Load.loadOne(Load.java:111)
    at org.snakeyaml.engine.v2.api.Load.loadFromInputStream(Load.java:123)
    at org.apache.flink.configuration.YamlParserUtils.loadYamlFile(YamlParserUtils.java:100)
    at org.apache.flink.configuration.GlobalConfiguration.loadYAMLResource(GlobalConfiguration.java:347)
    ... 5 more

After removing the taskmanager override values below, the pods were deployed.

taskmanager:
  extraEnvVars:
  - name: FLINK_PROPERTIES
    value: |
      taskmanager.memory.process.size: 12g
      taskmanager.memory.flink.size: 10g
      taskmanager.memory.jvm-metaspace.size: 3g
$ k get po -n flink
NAME                                      READY   STATUS      RESTARTS   AGE
flink-ushur-jobmanager-5ff8d57684-l8cjx   1/1     Running     0          28m
flink-ushur-taskmanager-0                 1/1     Running     0          81s
flink-ushur-taskmanager-1                 1/1     Running     0          81s
flink-ushur-taskmanager-2                 1/1     Running     0          3m56s
jotamartos commented 5 months ago

Hi @Nikhil-Devisetti,

Everything worked for me as expected. I installed version 1.1.1 of the chart in my cluster with the following changes

diff --git a/bitnami/flink/values.yaml b/bitnami/flink/values.yaml
index af5c9cc0bb..006f1aedae 100644
--- a/bitnami/flink/values.yaml
+++ b/bitnami/flink/values.yaml
@@ -124,7 +124,10 @@ jobmanager:
   ##  - name: FOO
   ##    value: BAR
   ##
-  extraEnvVars: []
+  extraEnvVars:
+  - name: FLINK_PROPERTIES
+    value: |
+      jobmanager.memory.process.size: 1g
   ## @param jobmanager.extraEnvVarsCM Name of existing ConfigMap containing extra env vars
   ##
   extraEnvVarsCM: ""
@@ -506,7 +509,12 @@ taskmanager:
   ##  - name: FOO
   ##    value: BAR
   ##
-  extraEnvVars: []
+  extraEnvVars:
+  - name: FLINK_PROPERTIES
+    value: |
+      taskmanager.memory.process.size: 3g
+      taskmanager.memory.flink.size: 1g
+      taskmanager.memory.jvm-metaspace.size: 1g
   ## @param taskmanager.extraEnvVarsCM Name of existing ConfigMap containing extra env vars
   ##
   extraEnvVarsCM: ""

The env vars were configured properly

PATH=/opt/bitnami/java/bin:/opt/bitnami/flink/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=flink-taskmanager-0
HOME=/
OS_ARCH=amd64
OS_FLAVOUR=debian-12
OS_NAME=linux
APP_VERSION=1.19.0
BITNAMI_APP_NAME=flink
FLINK_HOME=/opt/bitnami/flink
JAVA_HOME=/opt/bitnami/java
MY_POD_NAME=flink-taskmanager-0
FLINK_CFG_TASKMANAGER_DATA_PORT=6121
FLINK_CFG_METRICS_INTERNAL_QUERY__SERVICE_PORT=6126
BITNAMI_DEBUG=false
FLINK_CFG_TASKMANAGER_BIND__HOST=0.0.0.0
FLINK_PROPERTIES=taskmanager.memory.process.size: 3g
taskmanager.memory.flink.size: 1g
taskmanager.memory.jvm-metaspace.size: 1g

FLINK_MODE=taskmanager
FLINK_CFG_JOBMANAGER_RPC_ADDRESS=flink-jobmanager
FLINK_CFG_JOBMANAGER_RPC_PORT=6123
FLINK_CFG_JOBMANAGER_BIND__HOST=0.0.0.0
FLINK_CFG_TASKMANAGER_RPC_PORT=6122
FLINK_CFG_TASKMANAGER_HOST=flink-taskmanager-0.flink-taskmanager-headless.test.svc.cluster.local
FLINK_JOBMANAGER_PORT_8081_TCP_PROTO=tcp
FLINK_TASKMANAGER_PORT_6121_TCP=tcp://10.184.250.252:6121
FLINK_TASKMANAGER_PORT_6121_TCP_PROTO=tcp
KUBERNETES_SERVICE_PORT=443
FLINK_TASKMANAGER_PORT_6121_TCP_PORT=6121
FLINK_TASKMANAGER_PORT_6126_TCP=tcp://10.184.250.252:6126
KUBERNETES_PORT_443_TCP_PORT=443
FLINK_JOBMANAGER_SERVICE_PORT_HTTP=8081
FLINK_JOBMANAGER_PORT_6123_TCP_PROTO=tcp
FLINK_JOBMANAGER_PORT_6123_TCP_ADDR=10.184.247.241
FLINK_TASKMANAGER_SERVICE_PORT_TCP_DATA=6121
FLINK_TASKMANAGER_PORT_6122_TCP_PORT=6122
FLINK_TASKMANAGER_PORT_6122_TCP_ADDR=10.184.250.252
KUBERNETES_PORT=tcp://10.184.240.1:443
FLINK_JOBMANAGER_SERVICE_PORT=6123
FLINK_JOBMANAGER_PORT_6124_TCP=tcp://10.184.247.241:6124
FLINK_JOBMANAGER_PORT_6124_TCP_PORT=6124
FLINK_TASKMANAGER_SERVICE_PORT_TCP_RPC=6122
FLINK_JOBMANAGER_PORT_6124_TCP_PROTO=tcp
FLINK_JOBMANAGER_PORT=tcp://10.184.247.241:6123
FLINK_JOBMANAGER_PORT_6123_TCP=tcp://10.184.247.241:6123
FLINK_JOBMANAGER_PORT_8081_TCP=tcp://10.184.247.241:8081
FLINK_JOBMANAGER_PORT_8081_TCP_ADDR=10.184.247.241
FLINK_JOBMANAGER_SERVICE_HOST=10.184.247.241
FLINK_JOBMANAGER_SERVICE_PORT_TCP_RPC=6123
FLINK_TASKMANAGER_SERVICE_PORT=6121
FLINK_TASKMANAGER_PORT_6126_TCP_PROTO=tcp
KUBERNETES_SERVICE_HOST=10.184.240.1
KUBERNETES_PORT_443_TCP_PROTO=tcp
FLINK_JOBMANAGER_PORT_6123_TCP_PORT=6123
FLINK_JOBMANAGER_PORT_6124_TCP_ADDR=10.184.247.241
FLINK_TASKMANAGER_SERVICE_PORT_TCP_INTERNAL_METRICS=6126
FLINK_TASKMANAGER_PORT_6121_TCP_ADDR=10.184.250.252
KUBERNETES_SERVICE_PORT_HTTPS=443
KUBERNETES_PORT_443_TCP_ADDR=10.184.240.1
FLINK_TASKMANAGER_SERVICE_HOST=10.184.250.252
FLINK_TASKMANAGER_PORT=tcp://10.184.250.252:6121
FLINK_TASKMANAGER_PORT_6122_TCP_PROTO=tcp
FLINK_TASKMANAGER_PORT_6126_TCP_PORT=6126
KUBERNETES_PORT_443_TCP=tcp://10.184.240.1:443
FLINK_JOBMANAGER_SERVICE_PORT_TCP_BLOB=6124
FLINK_JOBMANAGER_PORT_8081_TCP_PORT=8081
FLINK_TASKMANAGER_PORT_6122_TCP=tcp://10.184.250.252:6122
FLINK_TASKMANAGER_PORT_6126_TCP_ADDR=10.184.250.252
TERM=xterm

The configuration file as well

#       dir: hdfs:///completed-jobs/
#       # Interval in milliseconds for refreshing the monitored directories.
#       fs.refresh-interval: 10000

blob.server.port: 6124
query.server.port: 6125
taskmanager.numberOfTaskSlots: 2
taskmanager.memory.process.size: 3g
taskmanager.memory.flink.size: 1g
taskmanager.memory.jvm-metaspace.size: 1g

jobmanager.bind-host: 0.0.0.0
jobmanager.rpc.address: flink-jobmanager
jobmanager.rpc.port: 6123
metrics.internal.query-service.port: 6126
rest.port: 8081
taskmanager.bind-host: 0.0.0.0
taskmanager.data.port: 6121
taskmanager.host: flink-taskmanager-0.flink-taskmanager-headless.test.svc.cluster.local
taskmanager.rpc.port: 6122

The pods were running

NAME                                READY   STATUS    RESTARTS   AGE
flink-jobmanager-5bdfb7b457-hzdxk   1/1     Running   0          46s
flink-taskmanager-0                 1/1     Running   0          45s

I'm using kubectl 1.29

Client Version: v1.29.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.1-gke.1589018
Nikhil-Devisetti commented 5 months ago

@jotamartos Thanks for checking on your end. However, I'm still seeing the same error for the taskmanager pods in our QA EKS cluster, even after removing the container images from the worker nodes and redeploying the Flink Helm chart.

By any chance, did the Bitnami team come across this duplicate-key error during any of your testing? If so, can you help me understand how it was resolved?

jotamartos commented 5 months ago

Hi @Nikhil-Devisetti,

I'm sorry, but our tests passed and we couldn't reproduce the issue. Please ensure you are using a supported version of Kubernetes and that there is no old PV/PVC affecting the deployment.

If you continue running into the issue, I suggest you try a different cluster to check if the problem persists there.

Nikhil-Devisetti commented 5 months ago

hi @jotamartos ,

We're using Kubernetes 1.28, and there is no PV/PVC mounted for config.yaml since it comes from the Bitnami Flink image.

Could you check whether there has been any change in the order in which the Flink container reads its configuration? Do values from the Helm chart or values from the image take precedence?

daroga0002 commented 5 months ago

I have debugged this issue and now understand why it is happening.

The indirect root cause arrived with Flink 1.19, which made major changes to how configuration is handled. The Flink 1.19 release notes describe the key behaviour change:

Duplicated keys:

flink-conf.yaml: Allows duplicated keys and takes the last key-value pair for the corresponding key that appears in the file.
config.yaml: Does not allow duplicated keys, and an error will be reported when loading the configuration.
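The strict behaviour of config.yaml can be checked locally. Here is a minimal sketch (the file path and sample keys are made up for illustration) that flags any flat key appearing more than once, which is roughly the condition the strict parser rejects:

```shell
# Build a sample flat config with a duplicated key (illustration only).
cat > /tmp/sample-config.yaml <<'EOF'
taskmanager.memory.process.size: 3g
taskmanager.numberOfTaskSlots: 2
taskmanager.memory.process.size: 6g
EOF

# Print every flat key that occurs more than once; Flink 1.19's strict
# parser fails to load a file where this prints anything.
grep -E '^[A-Za-z][^:]*:' /tmp/sample-config.yaml | cut -d: -f1 | sort | uniq -d
# -> taskmanager.memory.process.size
```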

So the following error:

flink 07:20:57.96 INFO  ==> ** Starting Apache Flink Task Manager

[ERROR] The execution result is empty.
[ERROR] Could not get JVM parameters and dynamic configurations properly.
[ERROR] Raw output from BashJavaUtils:
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
INFO  [] - Using standard YAML parser to load flink configuration file from /opt/bitnami/flink/conf/config.yaml.
ERROR [] - Failed to parse YAML configuration
org.snakeyaml.engine.v2.exceptions.YamlEngineException: while constructing a mapping
 in reader, line 21, column 1
found duplicate key taskmanager.memory.process.size
 in reader, line 313, column 1

        at org.snakeyaml.engine.v2.constructor.StandardConstructor.processDuplicateKeys(StandardConstructor.java:90) ~[flink-dist-1.19.0.jar:1.19.0]
        at org.snakeyaml.engine.v2.constructor.StandardConstructor.flattenMapping(StandardConstructor.java:70) ~[flink-dist-1.19.0.jar:1.19.0]
        at org.snakeyaml.engine.v2.constructor.StandardConstructor.constructMapping2ndStep(StandardConstructor.java:119) ~[flink-dist-1.19.0.jar:1.19.0]
        at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructMapping(BaseConstructor.java:278) ~[flink-dist-1.19.0.jar:1.19.0]
        at org.snakeyaml.engine.v2.constructor.StandardConstructor$ConstructYamlMap.construct(StandardConstructor.java:203) ~[flink-dist-1.19.0.jar:1.19.0]
        at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObjectNoCheck(BaseConstructor.java:153) ~[flink-dist-1.19.0.jar:1.19.0]
        at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObject(BaseConstructor.java:133) ~[flink-dist-1.19.0.jar:1.19.0]
        at org.snakeyaml.engine.v2.constructor.BaseConstructor.construct(BaseConstructor.java:92) ~[flink-dist-1.19.0.jar:1.19.0]
        at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructSingleDocument(BaseConstructor.java:79) ~[flink-dist-1.19.0.jar:1.19.0]
        at org.snakeyaml.engine.v2.api.Load.loadOne(Load.java:111) ~[flink-dist-1.19.0.jar:1.19.0]
        at org.snakeyaml.engine.v2.api.Load.loadFromInputStream(Load.java:123) ~[flink-dist-1.19.0.jar:1.19.0]
        at org.apache.flink.configuration.YamlParserUtils.loadYamlFile(YamlParserUtils.java:100) [flink-dist-1.19.0.jar:1.19.0]
        at org.apache.flink.configuration.GlobalConfiguration.loadYAMLResource(GlobalConfiguration.java:347) [flink-dist-1.19.0.jar:1.19.0]
        at org.apache.flink.configuration.GlobalConfiguration.loadConfiguration(GlobalConfiguration.java:163) [flink-dist-1.19.0.jar:1.19.0]
        at org.apache.flink.runtime.util.ConfigurationParserUtils.loadCommonConfiguration(ConfigurationParserUtils.java:154) [flink-dist-1.19.0.jar:1.19.0]

comes from the switch to the YAML config file format combined with duplicated configuration keys.

Using values:

$ cat values.yaml
jobmanager:
  extraEnvVars:
  - name: FLINK_PROPERTIES
    value: |
      jobmanager.memory.process.size: 1g
taskmanager:
  extraEnvVars:
  - name: FLINK_PROPERTIES
    value: |
      taskmanager.memory.process.size: 2g

we cannot reproduce the issue on a fresh install. This is because of how the Bitnami Docker image handles configuration injection: the script https://github.com/bitnami/containers/blob/main/bitnami/flink/1/debian-12/rootfs/opt/bitnami/scripts/flink-env.sh takes the ENV variables and the defaults from conf-default, merges all configurations (eliminating any overlaps), and injects the result into /opt/bitnami/flink/conf/config.yaml, which is then used by Flink itself.

Process of configuration is bundled into container entrypoint: https://github.com/bitnami/containers/blob/fe7ac75eb9e200ed06efeeb6a11bdef77f922cd7/bitnami/flink/1/debian-12/Dockerfile#L60

At this point the issue doesn't appear: looking at the config file, everything is fine and the pods start and run as expected, but ...

Containers can crash or run out of memory. In such cases the kubelet (its health check detects the bad state) restarts the container, and this is the heart of the problem: on a restart, Kubernetes does not recreate the pod but simply re-runs the container entrypoint. The second run of the entrypoint executes the configuration builder again and injects incremental changes into config.yaml, so the configuration goes from correct to corrupted (given the Flink 1.19 move to strict YAML syntax).

.............
blob.server.port: 6124
query.server.port: 6125
taskmanager.numberOfTaskSlots: 4
taskmanager.memory.process.size: 6g

jobmanager.bind-host: 0.0.0.0
jobmanager.rpc.address: flink-jobmanager
jobmanager.rpc.port: 6123
metrics.internal.query-service.port: 6126
rest.port: 8081
taskmanager.bind-host: 0.0.0.0
taskmanager.data.port: 6121
taskmanager.host: flink-taskmanager-0.flink-taskmanager-headless.flink.svc.cluster.local
taskmanager.rpc.port: 6122
taskmanager.memory.process.size: 6g

As a result, taskmanager.memory.process.size: 6g appears twice, which the strict YAML parser rejects as a duplicate key.

The reason for this behaviour is the -n flag in https://github.com/bitnami/containers/blob/fe7ac75eb9e200ed06efeeb6a11bdef77f922cd7/bitnami/flink/1/debian-12/rootfs/opt/bitnami/scripts/flink/entrypoint.sh#L25:

       -n, --no-clobber
              do not overwrite an existing file (overrides a -u or
              previous -i option). See also --update

So on a restart the entrypoint re-reads the ENV variables and injects them into the config a second time.
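The effect can be simulated outside the container. The sketch below (paths and property values are hypothetical) mimics the no-clobber copy plus append sequence: the second "entrypoint" run finds the config already present, skips the copy, and appends the env-derived property again:

```shell
# Stand-ins for the image defaults and the live config (hypothetical paths).
DEFAULT_CONF=/tmp/conf-default.yaml
RUNTIME_CONF=/tmp/runtime-config.yaml
FLINK_PROPERTIES='taskmanager.memory.process.size: 6g'

printf 'taskmanager.numberOfTaskSlots: 2\n' > "$DEFAULT_CONF"
rm -f "$RUNTIME_CONF"

run_entrypoint() {
  # cp -n copies the defaults only when the target does not exist yet,
  # so a restarted container keeps the already-modified file ...
  cp -n "$DEFAULT_CONF" "$RUNTIME_CONF" 2>/dev/null || true
  # ... and then the env-derived properties are appended once more.
  printf '%s\n' "$FLINK_PROPERTIES" >> "$RUNTIME_CONF"
}

run_entrypoint   # first container start
run_entrypoint   # container restart on the same filesystem
grep -c '^taskmanager.memory.process.size' "$RUNTIME_CONF"   # -> 2
```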

This explains why the issue only appears from time to time: it requires a container restart (a restart does not clean up the ephemeral container data, whereas deleting the pod cleans it up and resolves the issue).

The simplest solution would probably be to remove -n from https://github.com/bitnami/containers/blob/fe7ac75eb9e200ed06efeeb6a11bdef77f922cd7/bitnami/flink/1/debian-12/rootfs/opt/bitnami/scripts/flink/entrypoint.sh#L25.

A harder solution, which will also be needed in the future, is to rewrite the entrypoint configuration scripts to emit proper YAML with a duplicate-key merging mechanism.
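One possible shape for such a merge (a sketch only, with hypothetical file paths; not the Bitnami implementation) is to delete any existing occurrence of a key before appending it, which makes the injection idempotent across restarts:

```shell
CONF=/tmp/merge-config.yaml
printf 'taskmanager.numberOfTaskSlots: 2\ntaskmanager.memory.process.size: 3g\n' > "$CONF"

# Replace a flat "key: value" property, keeping at most one occurrence.
merge_property() {
  key=${1%%:*}
  grep -v "^${key}:" "$CONF" > "${CONF}.tmp"
  mv "${CONF}.tmp" "$CONF"
  printf '%s\n' "$1" >> "$CONF"
}

merge_property 'taskmanager.memory.process.size: 6g'
merge_property 'taskmanager.memory.process.size: 6g'   # a rerun changes nothing
grep -c '^taskmanager.memory.process.size' "$CONF"     # -> 1
```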

To replicate the issue, just run a container with a taskmanager.memory.process.size large enough to crash the Java process during start; that triggers the problem described above.

In my case I replicated it by running:

jobmanager:
  extraEnvVars:
  - name: FLINK_PROPERTIES
    value: |
      jobmanager.memory.process.size: 1g
taskmanager:
  extraEnvVars:
  - name: FLINK_PROPERTIES
    value: |
      taskmanager.memory.process.size: 9g

on worker nodes with 8 GB of physical memory, so Java could not reserve the memory and crashed, which triggered the issue.

I have also opened https://github.com/bitnami/containers/issues/67010 for this

jotamartos commented 5 months ago

Hi @daroga0002,

Thank you for taking the time to create the PR, but I'm pretty sure the issue is not that one. I just tried to reproduce it again and couldn't.

We created a values_custom.yaml file with the values you mentioned above

jobmanager:
  extraEnvVars:
  - name: FLINK_PROPERTIES
    value: |
      jobmanager.memory.process.size: 1g
taskmanager:
  extraEnvVars:
  - name: FLINK_PROPERTIES
    value: |
      taskmanager.memory.process.size: 1g

And installed the chart using that file

helm install flink -f values_custom.yaml bitnami/flink

Once pods were running, we confirmed that the configuration file looked correct

$ k exec flink-taskmanager-0 -- cat /opt/bitnami/flink/conf/config.yaml | tail -n 20
Warning: Use tokens from the TokenRequest API or manually created secret-based tokens instead of auto-generated secret-based tokens.
#     fs:
#       # Comma separated list of directories to monitor for completed jobs.
#       dir: hdfs:///completed-jobs/
#       # Interval in milliseconds for refreshing the monitored directories.
#       fs.refresh-interval: 10000

blob.server.port: 6124
query.server.port: 6125
taskmanager.numberOfTaskSlots: 2
taskmanager.memory.process.size: 1g

jobmanager.bind-host: 0.0.0.0
jobmanager.rpc.address: flink-jobmanager
jobmanager.rpc.port: 6123
metrics.internal.query-service.port: 6126
rest.port: 8081
taskmanager.bind-host: 0.0.0.0
taskmanager.data.port: 6121
taskmanager.host: flink-taskmanager-0.flink-taskmanager-headless.test.svc.cluster.local
taskmanager.rpc.port: 6122

and deleted the pod

k delete pod flink-taskmanager-0

The pod was recreated and it worked as expected

$ k get pods
NAME                                READY   STATUS    RESTARTS   AGE
flink-jobmanager-6c4fc5fd77-l7fpw   1/1     Running   0          3m2s
flink-taskmanager-0                 1/1     Running   0          22s

$ k exec flink-taskmanager-0 -- cat /opt/bitnami/flink/conf/config.yaml | tail -n 20
#     fs:
#       # Comma separated list of directories to monitor for completed jobs.
#       dir: hdfs:///completed-jobs/
#       # Interval in milliseconds for refreshing the monitored directories.
#       fs.refresh-interval: 10000

blob.server.port: 6124
query.server.port: 6125
taskmanager.numberOfTaskSlots: 2
taskmanager.memory.process.size: 1g

jobmanager.bind-host: 0.0.0.0
jobmanager.rpc.address: flink-jobmanager
jobmanager.rpc.port: 6123
metrics.internal.query-service.port: 6126
rest.port: 8081
taskmanager.bind-host: 0.0.0.0
taskmanager.data.port: 6121
taskmanager.host: flink-taskmanager-0.flink-taskmanager-headless.test.svc.cluster.local
taskmanager.rpc.port: 6122

Are these the steps you are following when reproducing the issue? What version of Kubernetes and the Bitnami Flink chart are you using?

$ k version

Client Version: v1.29.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.1-gke.1589020

Regarding your PR, I can't accept it because it would break the use case where the user provides a custom configuration file: the copy action would overwrite the entire customization.

daroga0002 commented 5 months ago

It is not about deleting the pod, but about the situation where the pod restarts itself.

Deleting a pod (even in a StatefulSet) removes the pod's ephemeral disk; restarting a pod (after a panic or crash) preserves it.

To replicate it, set taskmanager.memory.process.size higher than the physical memory of your workers (for example, 9g on an 8 GB worker). Java will fail to reserve the memory and crash, Kubernetes will restart the pod, and the issue will be reproduced.
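The restart-versus-delete distinction can also be mimicked locally without a cluster (all paths below are made up): a "restart" reuses the container's writable layer, while "delete and recreate" starts from a pristine copy of the image:

```shell
IMAGE_CONF=/tmp/image-config.yaml   # pristine config baked into the image
POD_DIR=/tmp/pod-layer              # stands in for the container's writable layer
rm -rf "$POD_DIR"

printf 'taskmanager.numberOfTaskSlots: 2\n' > "$IMAGE_CONF"

start_container() {
  mkdir -p "$POD_DIR"
  # Copy the defaults only on first start (the no-clobber behaviour) ...
  [ -f "$POD_DIR/config.yaml" ] || cp "$IMAGE_CONF" "$POD_DIR/config.yaml"
  # ... then append the env-derived property on every start.
  printf 'taskmanager.memory.process.size: 6g\n' >> "$POD_DIR/config.yaml"
}

start_container                                          # initial start
start_container                                          # restart: layer survives
grep -c 'memory.process.size' "$POD_DIR/config.yaml"     # -> 2 (duplicate key)

rm -rf "$POD_DIR"                                        # pod deletion discards the layer
start_container                                          # fresh pod
grep -c 'memory.process.size' "$POD_DIR/config.yaml"     # -> 1 (clean again)
```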

jotamartos commented 5 months ago

Ok @daroga0002,

I just reproduced the issue using "9g". Let us review the problem and take another look at your PR

jotamartos commented 5 months ago

Hi @daroga0002,

We just found the issue and are going to work on different solutions to fix it. These are the relevant code paths:

https://github.com/bitnami/containers/blob/main/bitnami/sonarqube/10/debian-12/rootfs/opt/bitnami/scripts/libsonarqube.sh#L134

https://github.com/bitnami/containers/blob/main/bitnami/flink/1/debian-12/rootfs/opt/bitnami/scripts/libflink.sh#L82

github-actions[bot] commented 5 months ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

github-actions[bot] commented 4 months ago

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.

Ilya201091 commented 2 months ago

Good day

The issue is happening again, even though I'm using the latest Flink Helm chart.

INFO  [] - Using standard YAML parser to load flink configuration file from /opt/bitnami/flink/conf/config.yaml.
ERROR [] - Failed to parse YAML configuration
org.snakeyaml.engine.v2.exceptions.YamlEngineException: while constructing a mapping
 in reader, line 21, column 1
found duplicate key security.kerberos.login.keytab
 in reader, line 319, column 1

        at org.snakeyaml.engine.v2.constructor.StandardConstructor.processDuplicateKeys(StandardConstructor.java:90) ~[flink-dist-1.19.1.jar:1.19.1]
        at org.snakeyaml.engine.v2.constructor.StandardConstructor.flattenMapping(StandardConstructor.java:70) ~[flink-dist-1.19.1.jar:1.19.1]
        at org.snakeyaml.engine.v2.constructor.StandardConstructor.constructMapping2ndStep(StandardConstructor.java:119) ~[flink-dist-1.19.1.jar:1.19.1]
        at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructMapping(BaseConstructor.java:278) ~[flink-dist-1.19.1.jar:1.19.1]
        at org.snakeyaml.engine.v2.constructor.StandardConstructor$ConstructYamlMap.construct(StandardConstructor.java:203) ~[flink-dist-1.19.1.jar:1.19.1]
        at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObjectNoCheck(BaseConstructor.java:153) ~[flink-dist-1.19.1.jar:1.19.1]
        at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObject(BaseConstructor.java:133) ~[flink-dist-1.19.1.jar:1.19.1]
        at org.snakeyaml.engine.v2.constructor.BaseConstructor.construct(BaseConstructor.java:92) ~[flink-dist-1.19.1.jar:1.19.1]
        at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructSingleDocument(BaseConstructor.java:79) ~[flink-dist-1.19.1.jar:1.19.1]
        at org.snakeyaml.engine.v2.api.Load.loadOne(Load.java:111) ~[flink-dist-1.19.1.jar:1.19.1]
        at org.snakeyaml.engine.v2.api.Load.loadFromInputStream(Load.java:123) ~[flink-dist-1.19.1.jar:1.19.1]
        at org.apache.flink.configuration.YamlParserUtils.loadYamlFile(YamlParserUtils.java:100) [flink-dist-1.19.1.jar:1.19.1]
        at org.apache.flink.configuration.GlobalConfiguration.loadYAMLResource(GlobalConfiguration.java:347) [flink-dist-1.19.1.jar:1.19.1]
        at org.apache.flink.configuration.GlobalConfiguration.loadConfiguration(GlobalConfiguration.java:163) [flink-dist-1.19.1.jar:1.19.1]
        at org.apache.flink.runtime.util.ConfigurationParserUtils.loadCommonConfiguration(ConfigurationParserUtils.java:154) [flink-dist-1.19.1.jar:1.19.1]
        at org.apache.flink.runtime.util.bash.FlinkConfigLoader.loadConfiguration(FlinkConfigLoader.java:41) [flink-dist-1.19.1.jar:2.17.1]
        at org.apache.flink.runtime.util.bash.BashJavaUtils.runCommand(BashJavaUtils.java:66) [bash-java-utils.jar:2.17.1]
        at org.apache.flink.runtime.util.bash.BashJavaUtils.main(BashJavaUtils.java:56) [bash-java-utils.jar:2.17.1]
Exception in thread "main" java.lang.RuntimeException: Error parsing YAML configuration.
        at org.apache.flink.configuration.GlobalConfiguration.loadYAMLResource(GlobalConfiguration.java:352)
        at org.apache.flink.configuration.GlobalConfiguration.loadConfiguration(GlobalConfiguration.java:163)
        at org.apache.flink.runtime.util.ConfigurationParserUtils.loadCommonConfiguration(ConfigurationParserUtils.java:154)
        at org.apache.flink.runtime.util.bash.FlinkConfigLoader.loadConfiguration(FlinkConfigLoader.java:41)
        at org.apache.flink.runtime.util.bash.BashJavaUtils.runCommand(BashJavaUtils.java:66)
        at org.apache.flink.runtime.util.bash.BashJavaUtils.main(BashJavaUtils.java:56)
Caused by: org.snakeyaml.engine.v2.exceptions.YamlEngineException: while constructing a mapping
 in reader, line 21, column 1
found duplicate key security.kerberos.login.keytab
 in reader, line 319, column 1

        at org.snakeyaml.engine.v2.constructor.StandardConstructor.processDuplicateKeys(StandardConstructor.java:90)
        at org.snakeyaml.engine.v2.constructor.StandardConstructor.flattenMapping(StandardConstructor.java:70)
        at org.snakeyaml.engine.v2.constructor.StandardConstructor.constructMapping2ndStep(StandardConstructor.java:119)
        at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructMapping(BaseConstructor.java:278)
        at org.snakeyaml.engine.v2.constructor.StandardConstructor$ConstructYamlMap.construct(StandardConstructor.java:203)
        at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObjectNoCheck(BaseConstructor.java:153)
        at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructObject(BaseConstructor.java:133)
        at org.snakeyaml.engine.v2.constructor.BaseConstructor.construct(BaseConstructor.java:92)
        at org.snakeyaml.engine.v2.constructor.BaseConstructor.constructSingleDocument(BaseConstructor.java:79)
        at org.snakeyaml.engine.v2.api.Load.loadOne(Load.java:111)
        at org.snakeyaml.engine.v2.api.Load.loadFromInputStream(Load.java:123)
        at org.apache.flink.configuration.YamlParserUtils.loadYamlFile(YamlParserUtils.java:100)
        at org.apache.flink.configuration.GlobalConfiguration.loadYAMLResource(GlobalConfiguration.java:347)
        ... 5 more

values.yaml

taskmanager:
  extraEnvVars:
    - name: FLINK_PROPERTIES
      value: |
        security.kerberos.login.keytab: /vault/secrets/name.keytab
        security.kerberos.login.principal: name@name.name.RU
        security.kerberos.login.use-ticket-cache: false
        javax.security.auth.useSubjectCredsOnly: false
        sun.security.krb5.debug: true
        java.security.krb5.conf: /opt/bitnami/flink/krb5krb5.conf
        java.security.auth.login.config: /opt/bitnami/flink/krb5/jaas.conf

jobmanager:
  extraEnvVars:
    - name: FLINK_PROPERTIES
      value: |
        security.kerberos.login.contexts: Client,KafkaClient
        security.kerberos.login.keytab: /vault/secrets/name.keytab
        security.kerberos.login.principal: name@name.name.RU
        security.kerberos.login.use-ticket-cache: false
        javax.security.auth.useSubjectCredsOnly: false
        sun.security.krb5.debug: true
        java.security.krb5.conf: /opt/bitnami/flink/krb5krb5.conf
        java.security.auth.login.config: /opt/bitnami/flink/krb5/jaas.conf

deployment.yaml

 volumeMounts:
            - name: empty-dir
              mountPath: /tmp
              subPath: tmp-dir
            - name: empty-dir
              mountPath: /opt/bitnami/flink/conf
              subPath: app-conf-dir
            - name: empty-dir
              mountPath: /opt/bitnami/flink/log
              subPath: app-logs-dir
            # HACK: Workaround to bypass the libflink.sh persist_app logic
            - name: empty-dir
              mountPath: /bitnami/flink/conf
              subPath: app-conf-dir
            - name: flink-config
              mountPath: /opt/bitnami/flink/krb5

statefulset.yaml

 volumeMounts:
            - name: empty-dir
              mountPath: /tmp
              subPath: tmp-dir
            - name: empty-dir
              mountPath: /opt/bitnami/flink/conf
              subPath: app-conf-dir
            - name: empty-dir
              mountPath: /opt/bitnami/flink/log
              subPath: app-logs-dir
            # HACK: Workaround to bypass the libflink.sh persist_app logic
            - name: empty-dir
              mountPath: /bitnami/flink/conf
              subPath: app-conf-dir

Is it possible to fix this without modifying the sources? Or what am I doing wrong?

Ilya201091 commented 2 months ago

Caused by: java.lang.IllegalArgumentException: Could not find a 'KafkaClient' entry in the JAAS configuration. System property 'java.security.auth.login.config' is /tmp/jaas-11710686897277381671.conf
        at org.apache.kafka.common.security.JaasContext.defaultContext(JaasContext.java:133) ~[blob_p-258eac3f3889e5e999b10b7e084a31232ef0747a-1396b03ecf51fac0ee16593172a4a69b:?]
        at org.apache.kafka.common.security.JaasContext.load(JaasContext.java:98) ~[blob_p-258eac3f3889e5e999b10b7e084a31232ef0747a-1396b03ecf51fac0ee16593172a4a69b:?]
        at org.apache.kafka.common.security.JaasContext.loadClientContext(JaasContext.java:84) ~[blob_p-258eac3f3889e5e999b10b7e084a31232ef0747a-1396b03ecf51fac0ee16593172a4a69b:?]
        at org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:124) ~[blob_p-258eac3f3889e5e999b10b7e084a31232ef0747a-1396b03ecf51fac0ee16593172a4a69b:?]
        at org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:67) ~[blob_p-258eac3f3889e5e999b10b7e084a31232ef0747a-1396b03ecf51fac0ee16593172a4a69b:?]
        at org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:99) ~[blob_p-258eac3f3889e5e999b10b7e084a31232ef0747a-1396b03ecf51fac0ee16593172a4a69b:?]
        at org.apache.kafka.clients.producer.KafkaProducer.newSender(KafkaProducer.java:450) ~[blob_p-258eac3f3889e5e999b10b7e084a31232ef0747a-1396b03ecf51fac0ee16593172a4a69b:?]
        at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:421) ~[blob_p-258eac3f3889e5e999b10b7e084a31232ef0747a-1396b03ecf51fac0ee16593172a4a69b:?]
        ... 22 more
2024-08-30 09:54:05,173 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Freeing task resources for Sink: Writer -> Sink: Committer (1/1)#0 (f9161ca24c3aba2ce7ac4bb83c111093_20ba6b65f97481d5570070de90e4e791_0_0).
2024-08-30 09:54:05,264 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Un-registering task and sending final execution state FAILED to JobManager for task Sink: Writer -> Sink: Committer (1/1)#0 f9161ca24c3aba2ce7ac4bb83c111093_20ba6b65f97481d5570070de90e4e791_0_0.
2024-08-30 09:54:06,368 INFO  org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot TaskSlot(index:40, state:ACTIVE, resource profile: ResourceProfile{cpuCores=1, taskHeapMemory=9.600mb (10066329 bytes), taskOffHeapMemory=0 bytes, managedMemory=12.800mb (13421773 bytes), networkMemory=3.200mb (3355443 bytes)}, allocationId: 4c61ce397e4580ad4c4fdfac502b138a, jobId: f986637fb3f57f923e8e6027d40dd994).
2024-08-30 09:54:06,371 INFO  org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Remove job f986637fb3f57f923e8e6027d40dd994 from job leader monitoring.
2024-08-30 09:54:06,371 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Close JobManager connection for job f986637fb3f57f923e8e6027d40dd994.