executor shutdown doesn't wait for pods to die before terminating

jdef commented 10 years ago

Upon shutdown, killPodForTask is called for each running task. Because killPodForTask doesn't wait for confirmation that the "kill signal" has been processed, there's no guarantee that all pods are actually shut down by the time the executor terminates. The result is orphaned pods left running on a slave.

jdef commented 9 years ago

started work on this in PR #144

A related Mesos JIRA (exposing the executor shutdown grace period timeout) would be nice to have, but that change was recently committed and then reverted in Mesos.

jdef commented 9 years ago

xref 94dd0ef

jdef commented 9 years ago

Just realized that in the executor I have access to the slave pid and that I can use that to:

- build an http request to the slave's state.json endpoint
- extract the flags property and examine the executor_shutdown_grace_period value

flags example:

{"flags":{
"attributes":"host:development-3273-57b.c.k8s-mesos.internal",
"cgroups_enable_cfs":"false",
"cgroups_hierarchy":"\/sys\/fs\/cgroup",
"cgroups_limit_swap":"false",
"cgroups_root":"mesos",
"checkpoint":"true",
"containerizers":"docker,mesos",
"default_role":"*",
"disk_watch_interval":"1mins",
"docker":"docker",
"docker_remove_delay":"6hrs",
"docker_sandbox_directory":"\/mnt\/mesos\/sandbox",
"executor_registration_timeout":"5mins",
"executor_shutdown_grace_period":"20secs",
"frameworks_home":"",
"gc_delay":"1weeks",
"hadoop_home":"",
"help":"false",
"hostname":"10.150.52.19",
"initialize_driver_logging":"true",
"ip":"10.150.52.19",
"isolation":"posix\/cpu,posix\/mem",
"launcher_dir":"\/usr\/local\/libexec\/mesos",
"log_dir":"\/var\/log\/mesos",
"logbufsecs":"0",
"logging_level":"INFO",
"master":"zk:\/\/10.223.113.227:2181\/mesos",
"perf_duration":"10secs",
"perf_interval":"1mins",
"port":"5051",
"quiet":"false",
"recover":"reconnect",
"recovery_timeout":"15mins",
"registration_backoff_factor":"1secs",
"resource_monitoring_interval":"1secs",
"strict":"true",
"switch_user":"true",
"version":"false",
"work_dir":"\/tmp\/mesos"
}}

jdef commented 8 years ago

we now tag docker containers with the executor container UUID to enable GC of orphan containers (#739)

jdef commented 8 years ago

also, there's some recent work that should land in mesos as part of v0.28 that better communicates the shutdown grace period to executors

d2iq-archive / kubernetes-mesos

executor shutdown doesn't wait for pods to die before terminating #66