Open fil1o opened 7 years ago
Thanks, will take a look!
On Tue, Oct 10, 2017 at 4:34 AM, fil1o notifications@github.com wrote:
I'm using DC/OS Version 1.10.0 with dcos-flink-2-11:1.3.1-1.1 Flink is configured to run in HA mode.
high-availability=zookeeper high-availability.zookeeper.quorum=ctrl1.filio.bg:2181,ctrl2.filio.bg:2181 ,ctrl3.filio:2181 high-availability.zookeeper.storageDir=/mnt/glusterfs/ flink zookeeper.sasl.disable=true state.checkpoints.dir=file:///mnt/glusterfs/flink/checkpoints state.savepoints.dir=file:///mnt/glusterfs/flink/savepoints
Flink runs fine until Marathon tries to restart it (after a crash or a manual restart). This is the error. I1010 10:37:57.033221 199 sched.cpp:1187] Got error 'Framework has been removed' I1010 10:37:57.033287 199 sched.cpp:2055] Asked to abort the driver I1010 10:37:57.033898 199 sched.cpp:1233] Aborting framework 33f4dcca-592a-4821-b150-eb1cd8e1c8f2-0002
33f4dcca-592a-4821-b150-eb1cd8e1c8f2-0002 is the ID of the previous instance of Flink found in Zookeeper/flink/default/mesos- workers/frameworkId If I delete the key from Zookeeper Flink starts normally and all previously running tasks are restored.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/mesosphere/dcos-flink-service/issues/44, or mute the thread https://github.com/notifications/unsubscribe-auth/AKTwZmp7_icFplgtSj2FyeEAXZv225X5ks5sq1ZIgaJpZM4PzyWH .
Yeah, I've been running into the same problem recently.
We got the same exception in our setup. :(
We still get this error after upgrading to version 1.4.0. We cannot go into production until this issue has been resolved. There is no high availability this way. We have to manually delete entries from Zookeeper to get Flink running after a crash.
The error message means that failover timeout has been exceeded. When a framework fails, Mesos keeps that framework's tasks running to allow it time to recover. Eventually a timeout is reached and Mesos kills the tasks. If the framework later reconnects, Mesos responds with "Framework has been removed" - a problem that can be solved by cleaning out the ZK state.
The framework timeout value is configured with the Flink config option: mesos.failover-timeout
(default: 10 minutes).
A possible improvement would be to have Flink automatically clean the ZK state and retry in this situation.
@fil1o, @todormazgalov, and @rradnev - does this characterization agree with your observations? Thanks.
It seem about right. I will make one clarification. Even if I suspend Flink manually -Suspend (set instances to 0) -Wait for wait for all task managers to shut down -Resume Flink I still get "Framework has been removed" unless I clean up ZK .
@fil1o what you describe is consistent with my explanation. Mesos shuts down the task managers once the failover timeout is reached. When the job manager is later resumed, it receives the 'removed' (due to timeout) message.
The remaining question is whether the failover timeout value is being configured properly.
Yep we're hitting this as well with DCOS 1.10 and dcos-flink-2-11:1.3.1-1.1 and Flink is configured to run in HA mode. So, to understand things:
We implemented our own workaround. We use zkCli.sh to clean up frameworkId from zookeeper on every start. You will lose all task managers on job restart or job manager failure, but at least Flink can start up and resume the job.
#! /bin/bash
[[ ":$PATH:" != *":/opt/mesosphere/bin:"* ]] && PATH="${PATH}:/opt/mesosphere/bin"
/usr/share/zookeeper/bin/zkCli.sh -server zookeeper.address.here:2181 <<EOF
delete /flink/default/mesos-workers/frameworkId
quit
EOF
Hint. zkCli.sh can be found inside Flink container.
Thanks @fil1o. I have a few questions:
It is obviously quite unfortunate to lose the TMs in the case of JM failover, but that workaround does make sense. The ultimate fix is to have Flink respond to this error by clearing out its own state such that it registers as a new framework. Mesos isn't prescriptive about how the framework should react to this situation. I will open a ticket and contribute a fix for 1.5.0 (will update this comment with a bug #).
Opened FLINK-8541.
Just following up after we've done more digging around.
It seems we were getting this error when a task manager died, as opposed to what I have written in my previous message. What happened then was that the job manager was also killed (Flink Mesos framework was removed) because the minimum number of task managers available was no longer met. When the job manager is restarted it cannot re-submit the framework under the same id (which it recovered from zookeeper) since it was removed previously.
Setting mesos.maximum-failed-tasks to -1 and increasing mesos.failover-timeout to 1hr seems to have fixed things for us in case of both job manager container loss or task manager container loss or service restart, without having to add the above delete script.
By the way, the above delete script is leaking task managers (with mesos.failover-timeout set to 1hr and mesos.maximum-failed-tasks left as the default). In case the job manager container gets killed when it comes back up it will create a new framework with a new id and thus a new task managers but the old ones are still around and reconnect to the cluster.
I'm using DC/OS Version 1.10.0 with dcos-flink-2-11:1.3.1-1.1 Flink is configured to run in HA mode.
high-availability=zookeeper high-availability.zookeeper.quorum=ctrl1.filio.bg:2181,ctrl2.filio.bg:2181,ctrl3.filio:2181 high-availability.zookeeper.storageDir=/mnt/glusterfs/flink zookeeper.sasl.disable=true state.checkpoints.dir=file:///mnt/glusterfs/flink/checkpoints state.savepoints.dir=file:///mnt/glusterfs/flink/savepoints
Flink runs fine until Marathon tries to restart it (after a crash or a manual restart).
This is the error.
I1010 10:37:57.033221 199 sched.cpp:1187] Got error 'Framework has been removed' I1010 10:37:57.033287 199 sched.cpp:2055] Asked to abort the driver I1010 10:37:57.033898 199 sched.cpp:1233] Aborting framework 33f4dcca-592a-4821-b150-eb1cd8e1c8f2-0002
33f4dcca-592a-4821-b150-eb1cd8e1c8f2-0002 is the ID of the previous instance of Flink found in Zookeeper/flink/default/mesos-workers/frameworkId If I delete the key from Zookeeper Flink starts normally and all previously running tasks are restored.