GoogleCloudPlatform / flink-on-k8s-operator

[DEPRECATED] Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.
Apache License 2.0

docker-entrypoint.sh: cannot create /opt/flink/conf/flink-conf.yaml.tmp: Read-only file system #213

Open kinderyj opened 4 years ago

kinderyj commented 4 years ago

I deployed the flink-on-k8s-operator in my k8s cluster successfully, but when I created a cluster with kubectl apply -f flinkoperator_v1beta1_flinksessioncluster.yaml, docker-entrypoint.sh logged errors in both the taskmanager and the jobmanager, as shown below:

kubectl logs flinksessioncluster-sample-taskmanager-b7d7849f8-6z7zr --all-containers
Starting Task Manager
sed: couldn't open temporary file /opt/flink/conf/sedmzkeY2: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/sedW8t3K3: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/sededxmy3: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/sedh7Y2L6: Read-only file system
/docker-entrypoint.sh: 126: /docker-entrypoint.sh: cannot create /opt/flink/conf/flink-conf.yaml.tmp: Read-only file system
kubectl logs flinksessioncluster-sample-jobmanager-64656cbb89-bnfhw
Starting Job Manager
sed: couldn't open temporary file /opt/flink/conf/sedOd2K7J: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/sedCKKXFI: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/sedi1j8NL: Read-only file system
/docker-entrypoint.sh: 88: /docker-entrypoint.sh: cannot create /opt/flink/conf/flink-conf.yaml.tmp: Read-only file system

As we know, /opt/flink/conf/flink-conf.yaml is mounted from a ConfigMap, so it is read-only and can't be edited inside the container. It seems that docker-entrypoint.sh is not compatible with k8s?

functicons commented 4 years ago

The Flink docker-entrypoint.sh might try to edit flink-conf.yaml with runtime values in some cases. When using the operator, we should avoid relying on these runtime values and instead declare them in the Flink properties of the FlinkCluster CR.

functicons commented 4 years ago

I'm thinking about making /opt/flink/conf/flink-conf.yaml editable. One solution could be to mount the configmap to another location (say /tmp/flink-conf) first, then use an init container to copy it over to /opt/flink/conf, where it becomes writable. But I'm not sure about the value of this change. Is it necessary?

@elanv what do you think?
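
The init-container idea above can be sketched as a small shell step. This is only a minimal sketch under stated assumptions: the function name copy_flink_conf is hypothetical, the two paths are the ones mentioned in the comment, and the destination is assumed to be a writable volume (for example an emptyDir) shared with the Flink container.

```shell
#!/bin/sh
# Hypothetical init-container step (not part of the operator or flink-docker):
# copy a read-only ConfigMap mount into a writable directory.
copy_flink_conf() {
    src_dir="$1"   # e.g. /tmp/flink-conf (ConfigMap mount, read-only)
    dst_dir="$2"   # e.g. /opt/flink/conf (assumed writable volume)
    mkdir -p "$dst_dir" || return 1
    for f in "$src_dir"/*; do
        # Skip anything that is not a regular file (or a symlink to one).
        [ -f "$f" ] || continue
        # -L follows symlinks, since ConfigMap volumes expose files as symlinks.
        cp -L "$f" "$dst_dir/" || return 1
    done
    # Make the copies writable so the entrypoint can edit them.
    chmod -R u+w "$dst_dir"
}

# Example call inside the init container:
#   copy_flink_conf /tmp/flink-conf /opt/flink/conf
```

Copying instead of mounting directly means the entrypoint's sed and redirect calls operate on ordinary files, at the cost of the container no longer seeing later ConfigMap updates.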

elanv commented 4 years ago

docker-entrypoint.sh in the official Flink Docker image tries to update the configuration for connections between job managers and task managers. Basically, all of the sed errors come from those connection config updates, which the operator already performs itself. Therefore those errors have no actual effect.

note: https://github.com/apache/flink-docker/blob/master/1.10/scala_2.12-debian/docker-entrypoint.sh#L62-L130

But the official image also allows additional changes to flink-conf.yaml through the environment variable FLINK_PROPERTIES. The entrypoint script doesn't seem well suited to k8s, but if users want to use it, we can provide a mount option for the config file as you said. @functicons

Alternatively, we could make flink-docker use FLINK_CONF_DIR.

Flink upstream init script uses FLINK_CONF_DIR. https://github.com/apache/flink/blob/ab37867c474a9e5754e79045fa06c9ac145a3787/flink-dist/src/main/flink-bin/bin/config.sh#L315

But flink-docker doesn't use it. https://github.com/apache/flink-docker/blob/master/1.10/scala_2.12-debian/docker-entrypoint.sh#L23

I think flink-docker should be fixed; otherwise a user-provided FLINK_CONF_DIR does not work. I'll try to make that change in flink-docker first.
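
What honoring FLINK_CONF_DIR in the entrypoint could look like, as a hedged sketch: the helper name resolve_conf_file is hypothetical, and the fallback to ${FLINK_HOME}/conf assumes the entrypoint currently derives the path from FLINK_HOME, as the linked line suggests.

```shell
#!/bin/sh
# Hypothetical helper (not the actual flink-docker code): pick the config
# file from FLINK_CONF_DIR when it is set, else fall back to FLINK_HOME/conf.
resolve_conf_file() {
    flink_home="${FLINK_HOME:-/opt/flink}"
    conf_dir="${FLINK_CONF_DIR:-${flink_home}/conf}"
    printf '%s\n' "${conf_dir}/flink-conf.yaml"
}

# Example use in an entrypoint:
#   CONF_FILE="$(resolve_conf_file)"
```

With this pattern, pointing FLINK_CONF_DIR at a writable directory would redirect the entrypoint's edits away from the read-only ConfigMap mount.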

functicons commented 4 years ago

Great, thanks!

elanv commented 4 years ago

Even if FLINK_CONF_DIR is enabled, JM/TM need init containers to support this, but the CRD doesn't have them now. @functicons What do you think about adding init containers to JM/TM?

functicons commented 4 years ago

I think it is okay to add a default init container to JM/TM.

elanv commented 4 years ago

The Flink community seems to be discussing an overall improvement of the Docker image. We need to watch the progress first.

https://cwiki.apache.org/confluence/display/FLINK/FLIP-111%3A+Docker+image+unification?src=contextnavpagetreemode

abuckenheimer commented 4 years ago

@elanv do you have a suggested workaround for now? This issue seems to break starting a job session cluster with images later than 1.9

elanv commented 4 years ago

@abuckenheimer This issue does not seem to break Flink startup, because the Flink configuration is provided by a Kubernetes ConfigMap. For Flink version 1.10 or above, I think the failure may be caused by the new memory configuration constraints. If so, you can try the workaround described in https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/issues/288#issuecomment-673810397

abuckenheimer commented 4 years ago

@elanv sorry, I'm a Flink newbie, so I'm not entirely sure where this is breaking, but my symptoms seem to match this issue more than the one you're pointing to (I'm not trying to set any flinkProperties at all). Following the instructions here, I set up the operator, and when I try to deploy a FlinkCluster slightly modified from the sample (attached below), the task manager fails on startup with these logs:

# kubectl logs flinksessioncluster-taskmanager-59fc4b4554-xssvr
Starting Task Manager
sed: couldn't open temporary file /opt/flink/conf/sed9XFLof: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/sedEJRCaf: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/sedVrYL7f: Read-only file system
/docker-entrypoint.sh: 72: /docker-entrypoint.sh: cannot create /opt/flink/conf/flink-conf.yaml: Read-only file system
/docker-entrypoint.sh: 120: /docker-entrypoint.sh: cannot create /opt/flink/conf/flink-conf.yaml.tmp: Read-only file system
[ERROR] The execution result is empty.
[ERROR] Could not get JVM parameters and dynamic configurations properly.

full FlinkCluster definition:

apiVersion: flinkoperator.k8s.io/v1beta1
kind: FlinkCluster
metadata:
  name: flinksessioncluster
spec:
  image:
    #  name: flink:1.8  <- works
    #  name: flink:1.9  <- works
    #  name: flink:1.10  <- fails
    name: flink:1.11  # <- fails
    pullPolicy: Always
  jobManager:
    accessScope: Cluster
    ports:
      ui: 8081
    resources:
      limits:
        memory: "1024Mi"
        cpu: "200m"
  taskManager:
    replicas: 1
    resources:
      limits:
        memory: "1024Mi"
        cpu: "200m"
    volumes:
      - name: cache-volume
        emptyDir: {}
    volumeMounts:
      - mountPath: /cache
        name: cache-volume

elanv commented 4 years ago

@abuckenheimer The sed errors in the logs are not what causes Flink to fail. It seems that my explanation of this problem was insufficient. The Flink operator automatically sets {jobmanager,taskmanager}.heap.size to the container's memory limit minus a certain margin so that the Flink process runs stably. This feature of the operator causes problems because it interferes with Flink's new memory management feature. The message "Could not get JVM parameters and dynamic configurations properly." indicates that the problem is related to memory misconfiguration.

So if you follow the workaround in comment https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/issues/288#issuecomment-673810397 I think your issue will be resolved.
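
The heap sizing described above amounts to a subtraction. Here is a minimal sketch of that arithmetic, where the function name heap_size_mb and the 600 MiB margin are hypothetical; the operator's actual margin and defaults are not stated in this thread.

```shell
#!/bin/sh
# Hypothetical illustration of "memory limit minus a margin".
# Both arguments are whole numbers of MiB.
heap_size_mb() {
    limit_mb="$1"    # container memory limit
    margin_mb="$2"   # assumed margin reserved outside the heap
    heap_mb=$(( limit_mb - margin_mb ))
    if [ "$heap_mb" -le 0 ]; then
        echo "margin leaves no room for the heap" >&2
        return 1
    fi
    echo "$heap_mb"
}

# With the 1024Mi limit from the FlinkCluster definition above and an
# assumed 600 MiB margin:
#   heap_size_mb 1024 600   # prints 424
```

A small limit leaves a small derived heap, which is one plausible way for such a value to conflict with the memory checks that Flink 1.10 and later perform at startup.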

briannqc commented 2 years ago

Quoting @functicons: "I'm thinking about making /opt/flink/conf/flink-conf.yaml editable, one solution could be: mount the configmap to another location (say /tmp/flink-conf) first, then use an init container to copy it over to the /opt/flink/conf, then it becomes writable. But I'm not sure about the value of this change, is it necessary? @elanv what do you think?"

Hi @functicons I have a use case where I think making conf/flink-conf.yaml editable is necessary:

We are using an S3-compatible storage for savepoint and checkpoint storage, called WOS (Whatever Object Storage). WOS doesn't support IAM, only an AccessKey and SecretKey (AK/SK). If we put the AK/SK in Spec.FlinkProperties like below, anyone who can run kubectl describe flinkcluster can steal the AK/SK very easily.

spec:
    [...]
    flinkProperties:
        s3.access-key: "my-access-key"
        s3.secret-key: "my-secret-key"

However, if we make conf/flink-conf.yaml editable, we can use environment variables and a Secret to protect our AK/SK, since the official Flink image already supports envsubst: https://github.com/apache/flink-docker/blob/master/1.13/scala_2.12-java8-debian/docker-entrypoint.sh#L88

spec:
    envFrom:
        - secretRef:
                name: flink-conf-secret
    flinkProperties:
        s3.access-key: ${S3_ACCESS_KEY}
        s3.secret-key: ${S3_SECRET_KEY}

Please kindly share with me your thoughts. Thanks
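
The substitution step this proposal relies on can be sketched as follows. This is a hedged sketch, not the entrypoint's actual code: the function name render_conf is hypothetical, the variable names come from the example above, and the sed branch is only a fallback for environments where gettext's envsubst is not installed.

```shell
#!/bin/sh
# Hypothetical sketch: expand ${S3_ACCESS_KEY} and ${S3_SECRET_KEY} in a
# config template using values taken from the environment (e.g. a Secret).
render_conf() {
    template="$1"
    output="$2"
    if command -v envsubst >/dev/null 2>&1; then
        # Restrict substitution to the two named variables.
        envsubst '${S3_ACCESS_KEY} ${S3_SECRET_KEY}' < "$template" > "$output"
    else
        # Fallback without envsubst; breaks if a value contains '|' or '&'.
        sed -e "s|\${S3_ACCESS_KEY}|${S3_ACCESS_KEY}|g" \
            -e "s|\${S3_SECRET_KEY}|${S3_SECRET_KEY}|g" \
            "$template" > "$output"
    fi
}

# Example (both variables must be exported for envsubst to see them):
#   render_conf /tmp/flink-conf/flink-conf.yaml /opt/flink/conf/flink-conf.yaml
```

The output path has to be writable, which is why this use case depends on the config file not being a read-only ConfigMap mount.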

tashoyan commented 2 years ago

The script docker-entrypoint.sh sets the deprecated setting query.server.port:

set_config_option query.server.port 6125

https://github.com/apache/flink-docker/blob/master/1.15/scala_2.12-java8-debian/docker-entrypoint.sh#L80

The setting query.server.port is deprecated in favor of queryable-state.server.ports: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/queryable_state/#state-server

When updating docker-entrypoint.sh, we should also switch the script to the current setting names.
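
A sketch of what that switch might look like. The set_config_option below is a simplified stand-in written for this example (it takes the config file as a third argument, which is an assumption and not the real helper's signature), and carrying over the value 6125 from the deprecated call is also an assumption.

```shell
#!/bin/sh
# Simplified, hypothetical stand-in for the entrypoint's helper:
# replace "option: ..." if present, otherwise append it.
set_config_option() {
    option="$1"
    value="$2"
    conf_file="$3"
    # Escape dots so they match literally in the patterns below.
    escaped=$(printf '%s' "$option" | sed 's/\./\\./g')
    if grep -q "^${escaped}:" "$conf_file"; then
        # Write to a temp file and move it, avoiding non-portable sed -i.
        sed "s|^${escaped}:.*|${option}: ${value}|" "$conf_file" \
            > "${conf_file}.tmp" && mv "${conf_file}.tmp" "$conf_file"
    else
        printf '%s\n' "${option}: ${value}" >> "$conf_file"
    fi
}

# Deprecated key:
#   set_config_option query.server.port 6125 "$CONF_FILE"
# Current key, per the linked documentation:
#   set_config_option queryable-state.server.ports 6125 "$CONF_FILE"
```

Like the original helper, this still needs a writable config file, so it does not by itself resolve the read-only error reported in this issue.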