ContainerSolutions / mesos-starter

https://container-solutions.com/mesos-starter/

Framework id is being written as a full protobuf object, not a string #47

Closed philwinder closed 8 years ago

philwinder commented 8 years ago

Sometimes there is state left over in zookeeper when shutting down a framework. When a new framework starts, it thinks there are three running tasks, when in fact there are none.

To replicate (confirmed using Kibana on a real-life Mesos cluster on AWS):

  1. Start framework with marathon.
  2. Use marathon to destroy framework
  3. Start framework with marathon. The new framework will then kill all previous tasks and not start any new ones.

Workaround: delete the /${framework_name}/tasks zNode in ZooKeeper.
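For reference, removing that node from an interactive zkCli session might look like this (server address is a placeholder; on older ZooKeeper releases the recursive delete command is `rmr`, on newer ones it is `deleteall`):

```
[zk: zookeeper:2181(CONNECTED) 0] rmr /${framework_name}/tasks
```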

mwl commented 8 years ago

Any clue whether "Use marathon to destroy framework" results in a SIGTERM or a SIGKILL? Sounds like it's a SIGKILL.

mwl commented 8 years ago

On SIGTERM you should see the following line at the end of the log

2016-03-21 13:04:04.731  INFO 27256 --- [       Thread-1] c.c.mesos.scheduler.UniversalScheduler   : Scheduler stopped
philwinder commented 8 years ago

I think it is a SIGKILL. Also, interestingly, once the restarted framework is up and running, it actually kills all the old instances of the framework. Updating original comment.

mwl commented 8 years ago

Another workaround could be to use the shutdown endpoint in Spring Boot: https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-endpoints.html
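A sketch of what that could look like with the Spring Boot 1.x actuator (property names assumed from that generation of the docs):

```properties
# application.properties — enable the actuator shutdown endpoint
# (sensitive=false disables auth; only suitable for testing)
endpoints.shutdown.enabled=true
endpoints.shutdown.sensitive=false
```

A `curl -X POST http://localhost:8080/shutdown` against the running scheduler would then trigger a graceful shutdown, giving the "Scheduler stopped" hook a chance to run.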

philwinder commented 8 years ago

Definitely no SIGTERM:

2016-03-21 13:02:55.018  INFO 1 --- [      Thread-65] c.c.mesos.scheduler.UniversalScheduler   : Finished evaluating 4 offers. Accepted 0 offers and rejected 4
Killing docker task
Shutting down
<EOF>
philwinder commented 8 years ago

I worry that in a real failure case, it will remove all the tasks and restart none. Testing.

mwl commented 8 years ago

Killing docker task says it all.

How are you stopping the application in Marathon? Curl on some endpoint?

philwinder commented 8 years ago

Using the GUI: click the cog icon, hit destroy. It's the usual practice when messing with Marathon.

mwl commented 8 years ago

There's a failover timeout set to 60 seconds. Any chance you hit that?

Note to self: raise it and make it configurable.

philwinder commented 8 years ago

What does that setting mean? And how would it affect killing and not restarting tasks?

Scratch that. It fails. The restarted scheduler kills all the other tasks and then never restarts any.

philwinder commented 8 years ago

So I think the bug actually has nothing to do with Marathon. I think it's the scheduler reaping tasks when it shouldn't.

mwl commented 8 years ago

It kills the tasks associated with the framework ID if a new scheduler doesn't show up and take over before the timeout. The ZooKeeper state isn't being flushed, so that could explain the behaviour?

philwinder commented 8 years ago

Ah right. That explains the killing behaviour. But I definitely restarted within this time, and then I can watch the tasks get killed a few tens of seconds later.

philwinder commented 8 years ago
  1. Start framework in docker mode:
2016-03-21 13:23:30.466  INFO 1 --- [       Thread-5] c.c.mesos.scheduler.UniversalScheduler   : Framework registrered with frameworkId=68728969-b184-41b6-944f-15606e6b14ce-0004
  2. Kill scheduler container.
  3. Scheduler restarts:
2016-03-21 13:24:29.402  INFO 1 --- [       Thread-5] c.c.mesos.scheduler.UniversalScheduler   : Framework registrered with frameworkId=��sr7com.google.protobuf.GeneratedMessageLite$SerializedForm[asBytest[BLmessageClassNametLjava/lang/String;xpur[B���T�xp+
)68728969-b184-41b6-944f-15606e6b14ce-0004t#org.apache.mesos.Protos$FrameworkID

The message indicates that a full protobuf instance is being written to ZooKeeper where a plain ID string is expected.
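For illustration, a minimal sketch of the difference (the `FrameworkId` class here is a hypothetical stand-in for `org.apache.mesos.Protos.FrameworkID`, whose Java-serialization proxy `GeneratedMessageLite$SerializedForm` is exactly what shows up in the garbled log above):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

public class FrameworkIdSerialization {

    // Hypothetical stand-in for the real protobuf FrameworkID message.
    static class FrameworkId implements Serializable {
        final String value;
        FrameworkId(String value) { this.value = value; }
    }

    // Buggy path: Java-serialize the whole object. The stream carries
    // class metadata (the "sr ... SerializedForm" bytes in the log)
    // wrapped around the actual ID.
    static byte[] serializeWholeObject(FrameworkId id) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(id);
        }
        return buf.toByteArray();
    }

    // Fixed path: persist only the ID string's bytes, so a restarted
    // scheduler reads back a plain framework ID.
    static byte[] serializeValueOnly(FrameworkId id) {
        return id.value.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        FrameworkId id = new FrameworkId("68728969-b184-41b6-944f-15606e6b14ce-0004");
        System.out.println("whole object: " + serializeWholeObject(id).length + " bytes");
        System.out.println("value only:   " + serializeValueOnly(id).length + " bytes");
    }
}
```

A scheduler that wrote the object form but reads back expecting a bare ID string would then fail to recognise its own framework ID, which matches the "kills everything, restarts nothing" behaviour described earlier in the thread.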

mwl commented 8 years ago

@philwinder Could you take a look at this to verify if it solves the issue?

philwinder commented 8 years ago

LGTM. Tested and confirmed fixed. First scheduler:

2016-03-21 14:51:05.680  INFO 1 --- [       Thread-5] c.c.mesos.scheduler.UniversalScheduler   : Framework registrered with frameworkId=68728969-b184-41b6-944f-15606e6b14ce-0005

docker kill... etc. Second scheduler:

2016-03-21 14:56:27.641  INFO 1 --- [       Thread-5] c.c.mesos.scheduler.UniversalScheduler   : Framework registrered with frameworkId=68728969-b184-41b6-944f-15606e6b14ce-0005

(screenshot: 2016-03-21 14:59:09)

Approved with PullApprove