Open jdef opened 10 years ago
started work on this in PR #144
the related mesos JIRA (exposing the executor shutdown grace period timeout) would be nice to have, but it was recently committed and then reverted in mesos.
xref 94dd0ef
Just realized that in the executor I have access to the slave pid, and that I can use that to look at the flags property and examine the executor_shutdown_grace_period value.
flags example:
{"flags":{
"attributes":"host:development-3273-57b.c.k8s-mesos.internal",
"cgroups_enable_cfs":"false",
"cgroups_hierarchy":"\/sys\/fs\/cgroup",
"cgroups_limit_swap":"false",
"cgroups_root":"mesos",
"checkpoint":"true",
"containerizers":"docker,mesos",
"default_role":"*",
"disk_watch_interval":"1mins",
"docker":"docker",
"docker_remove_delay":"6hrs",
"docker_sandbox_directory":"\/mnt\/mesos\/sandbox",
"executor_registration_timeout":"5mins",
"executor_shutdown_grace_period":"20secs",
"frameworks_home":"",
"gc_delay":"1weeks",
"hadoop_home":"",
"help":"false",
"hostname":"10.150.52.19",
"initialize_driver_logging":"true",
"ip":"10.150.52.19",
"isolation":"posix\/cpu,posix\/mem",
"launcher_dir":"\/usr\/local\/libexec\/mesos",
"log_dir":"\/var\/log\/mesos",
"logbufsecs":"0",
"logging_level":"INFO",
"master":"zk:\/\/10.223.113.227:2181\/mesos",
"perf_duration":"10secs",
"perf_interval":"1mins",
"port":"5051",
"quiet":"false",
"recover":"reconnect",
"recovery_timeout":"15mins",
"registration_backoff_factor":"1secs",
"resource_monitoring_interval":"1secs",
"strict":"true",
"switch_user":"true",
"version":"false",
"work_dir":"\/tmp\/mesos"
}}
we now tag docker containers w/ the executor container UUID to enable GC of orphan containers #739
also, there's some recent work that should land in mesos as part of v0.28 that better communicates the shutdown grace period to executors
Upon shutdown killPodForTask is called for each running task, but since killPodForTask doesn't wait for confirmation that the "kill signal" has been processed, there's no guarantee that the pods are actually all shut down upon executor termination. The result is orphaned pods left running on a slave.