Service update fail even after monitoring period

orenye commented 5 years ago

[x] This is a bug report
[ ] This is a feature request
[x] I searched existing issues before opening this one

Expected behavior

docker service update should not fail if an updated task dies after the monitoring period has finished.

Actual behavior

The update fails.

Steps to reproduce the behavior

Create a service with a few replicas.
Set a health check that returns a failure, so that the rolling update does not continue to the next task.
Issue a docker service update command that restarts the first task.
Wait longer than the monitoring period, then kill the task.

According to the documentation at https://docs.docker.com/engine/swarm/services:

An individual task update is considered to have failed if the task doesn’t start up, or if it stops running within the monitoring period specified with the --update-monitor flag. The default value for --update-monitor is 30 seconds, which means that a task failing in the first 30 seconds after its started counts towards the service update failure threshold, and a failure after that is not counted.

Two things are wrong:

The default update-monitor is 5s, not 30s.
Even when the task fails after the monitoring period has finished, the failure is counted and the update is paused.
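
To check which monitoring period a service is actually using, rather than relying on the documented default, something like the following sketch may help. The helper names (monitor_ns_to_seconds, show_update_monitor, set_update_monitor) are hypothetical and not part of Docker. The sketch also assumes that the inspect output reports the monitor duration in nanoseconds when rendered with the json template function.

```shell
# Convert a nanosecond duration into whole seconds.
monitor_ns_to_seconds() {
    echo $(( $1 / 1000000000 ))
}

# Print the effective monitoring period of a service, in seconds.
# Must run on a swarm manager. Assumption: the value is emitted in nanoseconds.
show_update_monitor() {
    ns=$(docker service inspect --format '{{json .Spec.UpdateConfig.Monitor}}' "$1") || return 1
    monitor_ns_to_seconds "$ns"
}

# Set the monitoring period explicitly instead of relying on the default.
# $1 is the service name, $2 a duration such as 30s.
set_update_monitor() {
    docker service update --update-monitor "$2" "$1"
}
```

For example, show_update_monitor uptest would be expected to print 5 on a service created with default settings, and set_update_monitor uptest 30s would set the period to the value the documentation describes.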

Output of docker version:

Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.5
 Git commit:   9ee9f40
 Built:        Thu Apr 26 07:20:16 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   9ee9f40
  Built:        Thu Apr 26 07:23:58 2018
  OS/Arch:      linux/amd64
  Experimental: true

Output of docker info:

Containers: 22
 Running: 16
 Paused: 0
 Stopped: 6
Images: 58
Server Version: 18.03.1-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: ijr5i8og247xbcramw3rmtrzo
 Is Manager: true
 ClusterID: m0y42wt2pk38n821unml2x57l
 Managers: 3
 Nodes: 3
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 172.31.205.240
 Manager Addresses:
  172.31.205.240:2377
  172.31.205.242:2377
  172.31.205.243:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.15.9-1.el7.elrepo.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 15.67GiB
Name: paas-lehi-1
ID: FDPY:TOBZ:6OG4:Z4PN:FQVH:4HCR:SDWE:3QBQ:5ZGB:WSZ4:ZPBR:UZW5
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: true
Insecure Registries:
 globe:5000
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.)

orenye commented 5 years ago

Improved steps to reproduce:

Create a service as follows: docker service create --entrypoint sleep --health-cmd "sh -c ps -ef |grep -v grep |grep 3333" --health-interval 10s --health-retries 5 --name uptest --replicas=3 alpine 3333
Wait for the service to run happily.
Update the service in a way that the healthcheck will fail: docker service update --args 4444 uptest

The first task is restarted. Its health check then fails 5 times at 10-second intervals, which takes far longer than the monitoring period, and the task is restarted again. The service update then moves to the paused state.
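
The timing in the steps above can be sketched as a rough calculation: with --health-interval 10s and --health-retries 5, the task is declared unhealthy about 50 seconds after it starts, well past the 5-second monitoring period. The helper names below are hypothetical, and the estimate ignores the run time of each probe and any start period.

```shell
# Approximate seconds until a task whose health check always fails
# is declared unhealthy: interval multiplied by retries.
unhealthy_after_seconds() {
    echo $(( $1 * $2 ))
}

# Succeeds when the health failure lands after the monitoring period.
# $1 = health interval (s), $2 = retries, $3 = monitoring period (s).
fails_after_monitor() {
    [ "$(unhealthy_after_seconds "$1" "$2")" -gt "$3" ]
}

# Print the update state of a service (for example "updating" or "paused").
# Must run on a swarm manager, on a service that has been updated at least once.
update_state() {
    docker service inspect --format '{{.UpdateStatus.State}}' "$1"
}
```

With the values from this report, fails_after_monitor 10 5 5 succeeds, and running update_state uptest about a minute after the update should show whether the update was paused.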

trapier commented 5 years ago

Thanks for the report @orenye! I've opened a pull request to address the documentation technical error (update-monitor defaults to 5s) and gap ("failure" includes health checks) at docker/docker.github.io#9443.

orenye commented 5 years ago

Thanks

Regards, Oren


docker / for-linux