ContainerSolutions / mesos-starter

https://container-solutions.com/mesos-starter/

Framework id is being written as a full protobuf object, not a string #47

Closed philwinder closed 8 years ago

philwinder commented 8 years ago

Sometimes there is state left over in zookeeper when shutting down a framework. When a new framework starts, it thinks there are three running tasks, when in fact there are none.

To replicate (confirmed using Kibana on a real-life Mesos cluster on AWS):

  1. Start framework with marathon.
  2. Use marathon to destroy framework
  3. Start framework with marathon. The new framework will then kill all previous tasks and not start any new ones.

Workaround: delete the /${framework_name}/tasks zNode in ZooKeeper.
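For reference, removing that node from an interactive zkCli session might look like this (server address is a placeholder; on older ZooKeeper releases the recursive delete command is `rmr`, on newer ones it is `deleteall`):

```
[zk: zookeeper:2181(CONNECTED) 0] rmr /${framework_name}/tasks
```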

mwl commented 8 years ago

Any clue whether "Use marathon to destroy framework" results in a SIGTERM or a SIGKILL? Sounds like it's a SIGKILL.

mwl commented 8 years ago

On SIGTERM you should see the following line at the end of the log

2016-03-21 13:04:04.731  INFO 27256 --- [       Thread-1] c.c.mesos.scheduler.UniversalScheduler   : Scheduler stopped
philwinder commented 8 years ago

I think it is a SIGKILL. Also, interestingly, once the restarted framework is up and running, it actually kills all the old instances of the framework. Updating original comment.

mwl commented 8 years ago

Another workaround could be to use the shutdown endpoint in Spring Boot: https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-endpoints.html
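A sketch of what that could look like with the Spring Boot 1.x actuator (property names assumed from that generation of the docs):

```properties
# application.properties — enable the actuator shutdown endpoint
# (sensitive=false disables auth; only suitable for testing)
endpoints.shutdown.enabled=true
endpoints.shutdown.sensitive=false
```

A `curl -X POST http://localhost:8080/shutdown` against the running scheduler would then trigger a graceful shutdown, giving the "Scheduler stopped" hook a chance to run.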

philwinder commented 8 years ago

Definitely no SIGTERM:

2016-03-21 13:02:55.018  INFO 1 --- [      Thread-65] c.c.mesos.scheduler.UniversalScheduler   : Finished evaluating 4 offers. Accepted 0 offers and rejected 4
Killing docker task
Shutting down
<EOF>
philwinder commented 8 years ago

I worry that in a real failure case, it will remove all the tasks and restart none. Testing.

mwl commented 8 years ago

Killing docker task says it all.

How are you stopping the application in Marathon? Curl on some endpoint?

philwinder commented 8 years ago

Using the GUI: click the cog icon, hit destroy. It's the usual practice when messing with Marathon.

mwl commented 8 years ago

There's a failover timeout set to 60 seconds. Any chance you hit that?

Note to self: raise it and make it configurable.

philwinder commented 8 years ago

What does that setting mean? And how would it affect killing and not restarting tasks?

Scratch that. It fails. The restarted scheduler kills all the other tasks and then never restarts any.

philwinder commented 8 years ago

So I think the bug actually has nothing to do with Marathon. I think it's the scheduler reaping tasks when it shouldn't.

mwl commented 8 years ago

It kills the tasks associated with the framework ID if a new scheduler doesn't show up and take over before the timeout. The ZooKeeper state isn't being flushed, so that could explain the behaviour?

philwinder commented 8 years ago

Ah right. That explains the killing behaviour. But I definitely restarted within this time, and then I can watch the tasks get killed a few tens of seconds later.

philwinder commented 8 years ago
  1. Start framework in docker mode:
2016-03-21 13:23:30.466  INFO 1 --- [       Thread-5] c.c.mesos.scheduler.UniversalScheduler   : Framework registrered with frameworkId=68728969-b184-41b6-944f-15606e6b14ce-0004
  2. Kill scheduler container.
  3. Scheduler restarts:
2016-03-21 13:24:29.402  INFO 1 --- [       Thread-5] c.c.mesos.scheduler.UniversalScheduler   : Framework registrered with frameworkId=��sr7com.google.protobuf.GeneratedMessageLite$SerializedForm[asBytest[BLmessageClassNametLjava/lang/String;xpur[B���T�xp+
)68728969-b184-41b6-944f-15606e6b14ce-0004t#org.apache.mesos.Protos$FrameworkID

The message indicates that a full protobuf instance is being written to ZooKeeper where a plain ID string is expected.
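For illustration, a minimal sketch of the difference (the `FrameworkId` class here is a hypothetical stand-in for `org.apache.mesos.Protos.FrameworkID`, whose Java-serialization proxy `GeneratedMessageLite$SerializedForm` is exactly what shows up in the garbled log above):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

public class FrameworkIdSerialization {

    // Hypothetical stand-in for the real protobuf FrameworkID message.
    static class FrameworkId implements Serializable {
        final String value;
        FrameworkId(String value) { this.value = value; }
    }

    // Buggy path: Java-serialize the whole object. The stream carries
    // class metadata (the "sr ... SerializedForm" bytes in the log)
    // wrapped around the actual ID.
    static byte[] serializeWholeObject(FrameworkId id) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(id);
        }
        return buf.toByteArray();
    }

    // Fixed path: persist only the ID string's bytes, so a restarted
    // scheduler reads back a plain framework ID.
    static byte[] serializeValueOnly(FrameworkId id) {
        return id.value.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        FrameworkId id = new FrameworkId("68728969-b184-41b6-944f-15606e6b14ce-0004");
        System.out.println("whole object: " + serializeWholeObject(id).length + " bytes");
        System.out.println("value only:   " + serializeValueOnly(id).length + " bytes");
    }
}
```

A scheduler that wrote the object form but reads back expecting a bare ID string would then fail to recognise its own framework ID, which matches the "kills everything, restarts nothing" behaviour described earlier in the thread.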

mwl commented 8 years ago

@philwinder Could you take a look at this to verify if it solves the issue?

philwinder commented 8 years ago

LGTM. Tested and confirmed fixed. First scheduler:

2016-03-21 14:51:05.680  INFO 1 --- [       Thread-5] c.c.mesos.scheduler.UniversalScheduler   : Framework registrered with frameworkId=68728969-b184-41b6-944f-15606e6b14ce-0005

docker kill... etc. Second scheduler:

2016-03-21 14:56:27.641  INFO 1 --- [       Thread-5] c.c.mesos.scheduler.UniversalScheduler   : Framework registrered with frameworkId=68728969-b184-41b6-944f-15606e6b14ce-0005

(screenshot: 2016-03-21 14:59:09)

Approved with PullApprove