Open kinderyj opened 4 years ago
The Flink docker-entrypoint.sh may try to edit flink-conf.yaml with runtime values in some cases. When using the operator, we should avoid relying on these runtime values and instead declare them in the Flink properties of the FlinkCluster CR.
I'm thinking about making /opt/flink/conf/flink-conf.yaml editable. One solution could be to mount the ConfigMap at another location (say /tmp/flink-conf) first, then use an init container to copy it over to /opt/flink/conf, where it becomes writable. But I'm not sure about the value of this change. Is it necessary?
@elanv what do you think?
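A minimal sketch of the copy step such an init container could run, assuming the ConfigMap is mounted read-only at /tmp/flink-conf and /opt/flink/conf is a writable volume (for example an emptyDir) shared with the Flink container. The function name copy_flink_conf is hypothetical, and this is not something the operator currently does:

```shell
#!/bin/sh
# Sketch only: copy a read-only ConfigMap mount into a writable conf directory.
# copy_flink_conf is a hypothetical helper name; both paths are assumptions.
copy_flink_conf() {
  src_dir="$1"   # e.g. /tmp/flink-conf (ConfigMap, read-only)
  dst_dir="$2"   # e.g. /opt/flink/conf (writable volume)
  mkdir -p "$dst_dir" || return 1
  # ConfigMap volumes expose files as symlinks; -L copies the file contents
  # instead of the links, so the results are regular files.
  cp -L "$src_dir"/* "$dst_dir"/ || return 1
  # The source files are read-only, so grant the owner write permission.
  chmod u+w "$dst_dir"/*
}
```

The init container would then call something like copy_flink_conf /tmp/flink-conf /opt/flink/conf before the Flink container starts.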
The docker-entrypoint.sh in the official Flink Docker image tries to update configuration related to connections between job managers and task managers. Basically, all of the sed errors relate to these connection config updates, which the operator already performs itself. Therefore those errors have no practical effect.
But the official image allows additional changes to flink-conf.yaml through the environment variable FLINK_PROPERTIES. The entrypoint script doesn't seem well suited to k8s, but if users want to use it, we can provide a mount option for the config file as you said. @functicons
Alternatively, we could make flink-docker use FLINK_CONF_DIR.
Flink upstream init script uses FLINK_CONF_DIR. https://github.com/apache/flink/blob/ab37867c474a9e5754e79045fa06c9ac145a3787/flink-dist/src/main/flink-bin/bin/config.sh#L315
But flink-docker doesn't use it. https://github.com/apache/flink-docker/blob/master/1.10/scala_2.12-debian/docker-entrypoint.sh#L23
I think flink-docker should be fixed; otherwise a user-provided FLINK_CONF_DIR does not work. I'll try to make the change in flink-docker first.
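To illustrate the kind of change meant here, a rough sketch of how an entrypoint could honor FLINK_CONF_DIR. The helper name resolve_conf_file and the /opt/flink/conf fallback are assumptions for illustration, not the actual flink-docker code:

```shell
#!/bin/sh
# Sketch only: pick the config file from FLINK_CONF_DIR when it is set,
# otherwise fall back to an assumed default location.
# resolve_conf_file is a hypothetical helper name.
resolve_conf_file() {
  conf_dir="${FLINK_CONF_DIR:-/opt/flink/conf}"
  echo "${conf_dir}/flink-conf.yaml"
}

CONF_FILE="$(resolve_conf_file)"
```

With this, pointing FLINK_CONF_DIR at a writable directory would let the script edit the file there instead of the read-only ConfigMap mount.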
Great, thanks!
Even if FLINK_CONF_DIR is enabled, JM/TM need init containers to support this, but the CRD doesn't have them now. @functicons What do you think about adding init containers to JM/TM?
I think it is okay to add a default init container to JM/TM.
The Flink community seems to be discussing overall improvements to the Docker image. We need to watch the progress first.
@elanv do you have a suggested workaround for now? This issue seems to break starting a job session cluster with images later than 1.9
@abuckenheimer This issue does not seem to break Flink startup, because the Flink configuration is provided by a Kubernetes ConfigMap. For Flink version 1.10 or above, I think the failure may be caused by the new memory configuration constraints. If so, you can try the workaround described in https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/issues/288#issuecomment-673810397
@elanv sorry, I'm a Flink newbie, so I'm not entirely sure where this is breaking, but my symptoms seem to match this issue more than the one you're pointing to (I'm not trying to set any flinkProperties at all). Following the instructions here, I set up the operator, and then when I try to deploy a slightly modified FlinkCluster from the sample (attached below), the task manager fails on startup with these logs:
# kubectl logs flinksessioncluster-taskmanager-59fc4b4554-xssvr
Starting Task Manager
sed: couldn't open temporary file /opt/flink/conf/sed9XFLof: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/sedEJRCaf: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/sedVrYL7f: Read-only file system
/docker-entrypoint.sh: 72: /docker-entrypoint.sh: cannot create /opt/flink/conf/flink-conf.yaml: Read-only file system
/docker-entrypoint.sh: 120: /docker-entrypoint.sh: cannot create /opt/flink/conf/flink-conf.yaml.tmp: Read-only file system
[ERROR] The execution result is empty.
[ERROR] Could not get JVM parameters and dynamic configurations properly.
Full FlinkCluster definition:
apiVersion: flinkoperator.k8s.io/v1beta1
kind: FlinkCluster
metadata:
  name: flinksessioncluster
spec:
  image:
    # name: flink:1.8 <- works
    # name: flink:1.9 <- works
    # name: flink:1.10 <- fails
    name: flink:1.11 # <- fails
    pullPolicy: Always
  jobManager:
    accessScope: Cluster
    ports:
      ui: 8081
    resources:
      limits:
        memory: "1024Mi"
        cpu: "200m"
  taskManager:
    replicas: 1
    resources:
      limits:
        memory: "1024Mi"
        cpu: "200m"
    volumes:
      - name: cache-volume
        emptyDir: {}
    volumeMounts:
      - mountPath: /cache
        name: cache-volume
@abuckenheimer The sed errors in the logs are not what causes Flink to fail. It seems my explanation of this problem was insufficient. The Flink operator automatically sets {jobmanager,taskmanager}.heap.size to the container's memory limit minus a certain margin so that the Flink process runs stably. This operator feature causes problems because it interferes with Flink's new memory management feature. The message "Could not get JVM parameters and dynamic configurations properly." indicates that the problem is related to memory misconfiguration.
So if you follow the workaround in comment https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/issues/288#issuecomment-673810397, I think your issue will be resolved.
I'm thinking about making /opt/flink/conf/flink-conf.yaml editable. One solution could be to mount the ConfigMap at another location (say /tmp/flink-conf) first, then use an init container to copy it over to /opt/flink/conf, where it becomes writable. But I'm not sure about the value of this change. Is it necessary?
@elanv what do you think?
Hi @functicons, I have a use case where I think making conf/flink-conf.yaml editable is necessary:
We are using an S3-compatible storage for savepoint/checkpoint storage, the so-called WOS (Whatever Object Storage). WOS doesn't support IAM, only AccessKey and SecretKey. If we put the AK/SK in Spec.FlinkProperties like below, whoever can run kubectl describe flinkcluster can steal the AK/SK very easily.
spec:
  [...]
  flinkProperties:
    s3.access-key: "my-access-key"
    s3.secret-key: "my-secret-key"
However, if we make conf/flink-conf.yaml editable, we can use environment variables and a Secret to protect our AK/SK, since the official Flink image already supports envsubst: https://github.com/apache/flink-docker/blob/master/1.13/scala_2.12-java8-debian/docker-entrypoint.sh#L88
spec:
  envFrom:
    - secretRef:
        name: flink-conf-secret
  flinkProperties:
    s3.access-key: ${S3_ACCESS_KEY}
    s3.secret-key: ${S3_SECRET_KEY}
Please kindly share your thoughts. Thanks
The script docker-entrypoint.sh tries to set a deprecated setting, query.server.port:
set_config_option query.server.port 6125
The setting query.server.port is deprecated in favor of queryable-state.server.ports:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/queryable_state/#state-server
When updating docker-entrypoint.sh, we have to update the script to use the up-to-date settings.
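For illustration, a small sketch of a set_config_option-style helper and how the call could look with the newer key. This is a hypothetical re-implementation, not the script's actual code, and the third argument (the config file path) is an assumption added so the sketch is self-contained:

```shell
#!/bin/sh
# Sketch only: override an existing "key: value" entry or append a new one.
# Hypothetical re-implementation; the conf_file argument is an assumption.
set_config_option() {
  option="$1"
  value="$2"
  conf_file="$3"
  # Escape dots so the key is matched literally in the regular expressions.
  escaped_option=$(printf '%s' "$option" | sed 's/\./\\./g')
  if grep -q "^${escaped_option}:" "$conf_file"; then
    # Rewrite via a temporary file, which requires a writable conf directory.
    sed "s|^${escaped_option}:.*|${option}: ${value}|" "$conf_file" > "${conf_file}.tmp" &&
      mv "${conf_file}.tmp" "$conf_file"
  else
    printf '%s: %s\n' "$option" "$value" >> "$conf_file"
  fi
}
```

The deprecated call would then become something like set_config_option queryable-state.server.ports 6125 "$CONF_FILE", assuming CONF_FILE points at a writable flink-conf.yaml.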
I deployed the flink-on-k8s-operator in my k8s cluster successfully, but when I created a cluster with kubectl apply -f flinkoperator_v1beta1_flinksessioncluster.yaml, there were some error logs from the docker-entrypoint.sh of the taskmanager/jobmanager, see below,
As we know, /opt/flink/conf/flink-conf.yaml is a ConfigMap file and can't be edited by users, so it seems that docker-entrypoint.sh is not compatible with k8s?