Any clue whether using Marathon to destroy the framework results in a SIGTERM or a SIGKILL? Sounds like it's a SIGKILL.
On SIGTERM you should see the following line at the end of the log:
2016-03-21 13:04:04.731 INFO 27256 --- [ Thread-1] c.c.mesos.scheduler.UniversalScheduler : Scheduler stopped
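For context, that log line can only come from a shutdown hook, and the JVM runs shutdown hooks on SIGTERM but never on SIGKILL. A minimal sketch of the pattern (the hook body is an assumption, not the framework's actual code):

```java
import org.apache.mesos.Protos.Status;
import org.apache.mesos.SchedulerDriver;

public final class ShutdownHookExample {
    // Register a JVM shutdown hook; it runs on SIGTERM, but SIGKILL bypasses it,
    // which is why the "Scheduler stopped" line never appears in that case.
    public static void register(final SchedulerDriver driver) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            // stop(true) requests failover: Mesos keeps the tasks alive and waits
            // failover_timeout seconds for a new scheduler to re-register.
            Status status = driver.stop(true);
            System.out.println("Scheduler stopped with status " + status);
        }));
    }
}
```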
I think it is a SIGKILL. Also, interestingly, when you restart the framework, once it is up and running it actually kills all the old instances of the framework. Updating the original comment.
Another workaround could be to use the shutdown endpoint in Spring Boot, https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-endpoints.html
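If that route is taken, note that the endpoint is disabled by default; a sketch for Spring Boot 1.x, going by the linked docs:

```
# application.properties — the shutdown endpoint is disabled by default
endpoints.shutdown.enabled=true
```

A POST to /shutdown then stops the application gracefully, which should take the SIGTERM-style shutdown path above rather than Marathon's SIGKILL.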
Definitely no SIGTERM:
2016-03-21 13:02:55.018 INFO 1 --- [ Thread-65] c.c.mesos.scheduler.UniversalScheduler : Finished evaluating 4 offers. Accepted 0 offers and rejected 4
Killing docker task
Shutting down
<EOF>
I worry that in a real failure case, it will remove all the tasks and restart none. Testing.
Killing docker task says it all.
How are you stopping the application in Marathon? Curl on some endpoint?
Using the GUI: click the cog icon, hit destroy. That's the usual practice when messing with Marathon.
There's a failover_timeout set to 60 seconds. Any chance you hit that?
Note to self: raise it and make it configurable.
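For reference, failover_timeout is declared on the FrameworkInfo at registration time; a minimal sketch (the 60 seconds matches this thread, the name and other fields are illustrative):

```java
import org.apache.mesos.Protos.FrameworkInfo;

public final class FrameworkInfoExample {
    // failover_timeout tells Mesos how long (in seconds) to keep the framework's
    // tasks alive after the scheduler disconnects, waiting for a re-registration.
    // Once it expires, Mesos tears the framework down and kills its tasks.
    public static FrameworkInfo build() {
        return FrameworkInfo.newBuilder()
                .setUser("")                // empty string: let Mesos pick the current user
                .setName("elasticsearch")   // assumed framework name, illustrative only
                .setFailoverTimeout(60.0)   // the 60 s mentioned above
                .setCheckpoint(true)
                .build();
    }
}
```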
What does that setting mean? And how would it affect killing and not-restarting tasks?
Scratch that. It fails: the restarted scheduler kills all the other tasks and then never restarts any. So I think the bug actually has nothing to do with Marathon. I think it's something to do with tasks being reaped when they shouldn't be.
Mesos kills the tasks associated with the framework ID if a new scheduler doesn't show up and take over before the timeout. The ZooKeeper state isn't being flushed, so that could explain the behaviour?
Ah right. That explains the killing behaviour. But I definitely restarted within this time, and then I can watch the tasks get killed a few tens of seconds later.
2016-03-21 13:23:30.466 INFO 1 --- [ Thread-5] c.c.mesos.scheduler.UniversalScheduler : Framework registrered with frameworkId=68728969-b184-41b6-944f-15606e6b14ce-0004
2016-03-21 13:24:29.402 INFO 1 --- [ Thread-5] c.c.mesos.scheduler.UniversalScheduler : Framework registrered with frameworkId=��sr7com.google.protobuf.GeneratedMessageLite$SerializedForm[asBytest[BLmessageClassNametLjava/lang/String;xpur[B���T�xp+
)68728969-b184-41b6-944f-15606e6b14ce-0004t#org.apache.mesos.Protos$FrameworkID
The message indicates that a full Java-serialized protobuf instance is being written to ZooKeeper where a plain framework ID is expected.
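That is consistent with the GeneratedMessageLite$SerializedForm wrapper in the log above: Java-serializing a protobuf message writes exactly that wrapper. A sketch of the buggy pattern versus the fix (illustrative, not the framework's actual code):

```java
import org.apache.mesos.Protos.FrameworkID;

public final class FrameworkIdSerialization {

    // Buggy pattern: Java serialization wraps the message in
    // GeneratedMessageLite$SerializedForm — the exact garbage seen in the log.
    static byte[] buggy(FrameworkID id) throws java.io.IOException {
        java.io.ByteArrayOutputStream bytes = new java.io.ByteArrayOutputStream();
        try (java.io.ObjectOutputStream out = new java.io.ObjectOutputStream(bytes)) {
            out.writeObject(id); // compiles fine: protobuf messages are Serializable
        }
        return bytes.toByteArray();
    }

    // Fix: persist only the ID string itself (or use id.toByteArray() and
    // FrameworkID.parseFrom(bytes) if the full message must round-trip).
    static byte[] fixed(FrameworkID id) {
        return id.getValue().getBytes(java.nio.charset.StandardCharsets.UTF_8);
    }
}
```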
@philwinder Could you take a look at this to verify if it solves the issue?
LGTM. Tested and fixed. First scheduler:
2016-03-21 14:51:05.680 INFO 1 --- [ Thread-5] c.c.mesos.scheduler.UniversalScheduler : Framework registrered with frameworkId=68728969-b184-41b6-944f-15606e6b14ce-0005
docker kill... etc. Second scheduler:
2016-03-21 14:56:27.641 INFO 1 --- [ Thread-5] c.c.mesos.scheduler.UniversalScheduler : Framework registrered with frameworkId=68728969-b184-41b6-944f-15606e6b14ce-0005
Sometimes there is state left over in ZooKeeper when shutting down a framework. When a new framework starts, it thinks there are three running tasks when in fact there are none.
To replicate (confirmed using Kibana on a real-life Mesos cluster on AWS):
Workaround: delete the /${framework_name}/tasks zNode in ZooKeeper.
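A sketch of that workaround with Apache Curator (the connection string and framework name are assumptions; ${framework_name} follows this thread's zNode layout):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public final class ClearStaleTasks {
    public static void main(String[] args) throws Exception {
        // ZooKeeper ensemble address — an assumption, substitute your own.
        try (CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-host:2181", new ExponentialBackoffRetry(1000, 3))) {
            client.start();
            // Remove the stale task state so a fresh scheduler starts clean.
            client.delete().deletingChildrenIfNeeded()
                  .forPath("/elasticsearch/tasks"); // assuming framework_name=elasticsearch
        }
    }
}
```

The same thing can be done interactively with rmr /${framework_name}/tasks from zkCli.sh.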