d2iq-archive / dcos-flink-service


Got error 'Framework has been removed' on restart #44

Open fil1o opened 7 years ago

fil1o commented 7 years ago

I'm using DC/OS Version 1.10.0 with dcos-flink-2-11:1.3.1-1.1. Flink is configured to run in HA mode with the following settings:

high-availability=zookeeper
high-availability.zookeeper.quorum=ctrl1.filio.bg:2181,ctrl2.filio.bg:2181,ctrl3.filio:2181
high-availability.zookeeper.storageDir=/mnt/glusterfs/flink
zookeeper.sasl.disable=true
state.checkpoints.dir=file:///mnt/glusterfs/flink/checkpoints
state.savepoints.dir=file:///mnt/glusterfs/flink/savepoints

Flink runs fine until Marathon tries to restart it (after a crash or a manual restart).

This is the error.

I1010 10:37:57.033221 199 sched.cpp:1187] Got error 'Framework has been removed'
I1010 10:37:57.033287 199 sched.cpp:2055] Asked to abort the driver
I1010 10:37:57.033898 199 sched.cpp:1233] Aborting framework 33f4dcca-592a-4821-b150-eb1cd8e1c8f2-0002

33f4dcca-592a-4821-b150-eb1cd8e1c8f2-0002 is the ID of the previous Flink instance, stored in Zookeeper under /flink/default/mesos-workers/frameworkId. If I delete that key from Zookeeper, Flink starts normally and all previously running tasks are restored.

joerg84 commented 7 years ago

Thanks, will take a look!


todormazgalov commented 6 years ago

Yeah, I've been running into the same problem recently.

rradnev commented 6 years ago

We got the same exception in our setup. :(

fil1o commented 6 years ago

We still get this error after upgrading to version 1.4.0. We cannot go into production until this issue has been resolved. There is no high availability this way. We have to manually delete entries from Zookeeper to get Flink running after a crash.

EronWright commented 6 years ago

The error message means that the failover timeout has been exceeded. When a framework disconnects, Mesos keeps that framework's tasks running to give it time to recover. Once the timeout is reached, Mesos kills the tasks and removes the framework. If the framework later tries to reconnect under its old ID, Mesos responds with "Framework has been removed" - a problem that can be solved by cleaning out the ZK state.

The framework timeout value is configured with the Flink config option: mesos.failover-timeout (default: 10 minutes).
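
For example, bumping the timeout to one hour would look something like this (a sketch only, using the same key=value property style as the original post; mesos.failover-timeout is specified in seconds):

# Allow the framework up to one hour to reconnect before Mesos kills its tasks
# and removes the framework (the default is 600 seconds, i.e. 10 minutes).
mesos.failover-timeout=3600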

A possible improvement would be to have Flink automatically clean the ZK state and retry in this situation.

@fil1o, @todormazgalov, and @rradnev - does this characterization agree with your observations? Thanks.

fil1o commented 6 years ago

It seems about right. One clarification: even if I suspend Flink manually

- Suspend (set instances to 0)
- Wait for all task managers to shut down
- Resume Flink

I still get "Framework has been removed" unless I clean up ZK.

EronWright commented 6 years ago

@fil1o what you describe is consistent with my explanation. Mesos shuts down the task managers once the failover timeout is reached. When the job manager is later resumed, it receives the 'removed' (due to timeout) message.

The remaining question is whether the failover timeout value is being configured properly.

asicoe commented 6 years ago

Yep, we're hitting this as well with DC/OS 1.10 and dcos-flink-2-11:1.3.1-1.1, with Flink configured to run in HA mode. So, to understand things:

  1. Is there currently no workaround other than increasing mesos.failover-timeout to something extremely high? Has anyone solved this another way?
  2. Is there any work in progress to solve this issue (a PR, etc.)?

fil1o commented 6 years ago

We implemented our own workaround. We use zkCli.sh to clean up frameworkId from zookeeper on every start. You will lose all task managers on job restart or job manager failure, but at least Flink can start up and resume the job.

#!/bin/bash
# Make sure the DC/OS-bundled binaries under /opt/mesosphere/bin are on the PATH.
[[ ":$PATH:" != *":/opt/mesosphere/bin:"* ]] && PATH="${PATH}:/opt/mesosphere/bin"
# Delete the stale framework ID left behind by the previous job manager so that
# Flink registers with Mesos as a fresh framework on startup.
/usr/share/zookeeper/bin/zkCli.sh -server zookeeper.address.here:2181 <<EOF
delete /flink/default/mesos-workers/frameworkId
quit
EOF

Hint: zkCli.sh can be found inside the Flink container.

asicoe commented 6 years ago

Thanks @fil1o. I have a few questions:

  1. When do you call the script, i.e. how do you make sure it gets executed when a job manager fails? Are you using the vanilla DC/OS service?
  2. Above you imply the script needs to be called on job restart. Do you mean automatic restart? Doesn't Flink handle that for us? In our case we only hit the problem described in this issue when the job manager gets restarted. Once the job manager and the task managers are up and the jobs are deployed, it seems to recover from any automatic job restart or task manager failure.
  3. When you say it can start up and resume the job, it will be the same job with the same state as before, right?
fil1o commented 6 years ago

  1. We use the original DC/OS service, but we edit the configuration: Edit -> Service -> Command is set to rm_framework_id.sh && /sbin/init.sh (see the sketch after this list). rm_framework_id.sh needs to be at the same path on every worker; it's best to use a distributed file system and mount that folder into the Flink container.
  2. The above will call the script every time a new job manager is started.
  3. Right. As long as you have a checkpoint to resume from.
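
For reference, the resulting service command ends up being roughly the line below (a sketch only; the exact location of rm_framework_id.sh is just an illustration, assuming the script lives on the GlusterFS mount from the original configuration):

# Clear the stale framework ID, then start Flink as the package normally would.
/mnt/glusterfs/flink/rm_framework_id.sh && /sbin/init.sh
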
EronWright commented 6 years ago

It is obviously quite unfortunate to lose the TMs in the case of JM failover, but that workaround does make sense. The ultimate fix is to have Flink respond to this error by clearing out its own state such that it registers as a new framework. Mesos isn't prescriptive about how the framework should react to this situation. I will open a ticket and contribute a fix for 1.5.0 (will update this comment with a bug #).

EronWright commented 6 years ago

Opened FLINK-8541.

asicoe commented 6 years ago

Just following up after we've done more digging around.

It seems we were getting this error when a task manager died, as opposed to what I wrote in my previous message. What happened was that the job manager was also killed (the Flink Mesos framework was removed) because the minimum number of available task managers was no longer met. When the job manager was restarted, it could not re-register the framework under the same ID (which it recovered from Zookeeper), since that framework had already been removed.

Setting mesos.maximum-failed-tasks to -1 and increasing mesos.failover-timeout to 1 hour seems to have fixed things for us for job manager container loss, task manager container loss, and service restarts, without having to add the delete script above.
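
In Flink config terms that corresponds roughly to the following (a sketch in the same key=value property style as earlier in this thread; mesos.failover-timeout is in seconds):

# Do not tear the cluster down after a fixed number of failed task managers.
mesos.maximum-failed-tasks=-1
# Give the job manager up to one hour to reconnect before Mesos removes the framework.
mesos.failover-timeout=3600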

By the way, the delete script above leaks task managers (with mesos.failover-timeout set to 1 hour and mesos.maximum-failed-tasks left at the default). If the job manager container gets killed, when it comes back up it creates a new framework with a new ID and thus new task managers, but the old ones are still around and reconnect to the cluster.