2022-06-08T05:12:25,770292688+0000 - INFO - Docker root dir within ephemeral temp disk: /mnt/resource/docker
2022-06-08T05:12:25,771988035+0000 - INFO - Checking for Nvidia Hardware
00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (AGP disabled) (rev 03)
00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 01)
00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 02)
00:08.0 VGA compatible controller: Microsoft Corporation Hyper-V virtual VGA
2022-06-08T05:12:26,119040957+0000 - INFO - No Nvidia card(s) detected!
2022-06-08T05:12:26+0000 - DEBUG - Logging into 1 Docker registry servers...
2022-06-08T05:12:26+0000 - DEBUG - Logging into Docker registry: scrumsalesdockerprd.azurecr.io with user: scrumsalesdockerprd
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
WARNING! Your password will be stored unencrypted in /mnt/resource/batch/tasks/startup/wd/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store
Login Succeeded
2022-06-08T05:12:26+0000 - INFO - Docker registry logins completed.
2022-06-08T05:12:26+0000 - WARNING - No Singularity registry servers found.
2022-06-08T05:12:26,348386782+0000 - DEBUG - VM size standard_a2_v2 does not have IB RDMA
2022-06-08T05:12:26,349782721+0000 - DEBUG - Not an RDMA capable VM size, skipping IB detection/setup
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:22:48:e8:38:ff brd ff:ff:ff:ff:ff:ff
inet 10.6.3.23/25 brd 10.6.3.127 scope global noprefixroute eth0
valid_lft forever preferred_lft forever
inet6 fe80::222:48ff:fee8:38ff/64 scope link
valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:bf:43:8c:ee brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
2022-06-08T05:12:26,354430051+0000 - INFO - Batch Insights disabled.
2022-06-08T05:12:26,355991295+0000 - INFO - Prometheus node exporter disabled.
2022-06-08T05:12:26,360116311+0000 - INFO - Prometheus cAdvisor disabled.
2022-06-08T05:12:26,361768457+0000 - DEBUG - Pulling Docker Image: mcr.microsoft.com/blobxfer:1.9.4 (fallback: 0)
1.9.4: Pulling from blobxfer
89d9c30c1d48: Pulling fs layer
6de18253c5d3: Pulling fs layer
89d9c30c1d48: Verifying Checksum
89d9c30c1d48: Download complete
6de18253c5d3: Verifying Checksum
6de18253c5d3: Download complete
89d9c30c1d48: Pull complete
6de18253c5d3: Pull complete
Digest: sha256:94192812382de05b77d8766720cf4c22cc84fd15c86a838a043366fa2047af83
Status: Downloaded newer image for mcr.microsoft.com/blobxfer:1.9.4
mcr.microsoft.com/blobxfer:1.9.4
2022-06-08T05:12:34,397626166+0000 - DEBUG - Pulling Docker Image: mcr.microsoft.com/azure-batch/shipyard:3.9.1-cargo (fallback: 0)
2022-06-08T05:13:04,792495588+0000 - ERROR - Error response from daemon: Head "https://mcr.microsoft.com/v2/azure-batch/shipyard/manifests/3.9.1-cargo": dial tcp 204.79.197.219:443: i/o timeout
2022-06-08T05:13:04,794030532+0000 - ERROR - No fallback registry specified, terminating
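The failing step is a plain TCP dial to port 443 of the registry endpoint, so one way to narrow this down is to probe that port from the affected node. The following is a minimal sketch, not part of Batch Shipyard; `can_connect` and its default timeout are hypothetical names and values chosen for illustration.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Timeouts, refused connections and DNS failures are all OSError subclasses.
        return False
```

Calling `can_connect("mcr.microsoft.com")` on the node shortly after a failure would show whether the endpoint is still unreachable or whether the timeout was transient.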
Additional Comments
This issue occurs about once a month.
We added the following parameter to the pool.yaml file, but it did not work as expected:
reboot_on_start_task_failed: true
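Since the timeout is intermittent, a possible stopgap would be to retry the pull with a growing delay before giving up. This is only a sketch under that assumption and is not how Batch Shipyard's node prep behaves; `pull_with_retry`, `attempts`, `base_delay`, and the injectable `run` and `sleep` hooks are hypothetical.

```python
import subprocess
import time
from typing import Callable, Sequence


def pull_with_retry(
    image: str,
    attempts: int = 5,
    base_delay: float = 2.0,
    run: Callable[[Sequence[str]], int] = lambda cmd: subprocess.run(cmd).returncode,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `docker pull image` up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(attempts):
        if run(["docker", "pull", image]) == 0:
            return True
        # Do not wait after the final failed attempt.
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return False
```

With the defaults, the waits between attempts are 2, 4, 8 and 16 seconds. The `run` and `sleep` parameters exist only so the logic can be exercised without Docker installed.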
Problem Description
2022-06-08T05:12:34,397626166+0000 - DEBUG - Pulling Docker Image: mcr.microsoft.com/azure-batch/shipyard:3.9.1-cargo (fallback: 0)
2022-06-08T05:13:04,792495588+0000 - ERROR - Error response from daemon: Head "https://mcr.microsoft.com/v2/azure-batch/shipyard/manifests/3.9.1-cargo": dial tcp 204.79.197.219:443: i/o timeout
2022-06-08T05:13:04,794030532+0000 - ERROR - No fallback registry specified, terminating
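The gap between the start of the pull and the error can be computed directly from the two timestamps. The sketch below assumes both stamps fall on the same day; `seconds_of_day` is a hypothetical helper. The result is roughly 30 seconds, which would be consistent with a fixed connect timeout on the daemon side, although the log itself does not state that.

```python
from decimal import Decimal


def seconds_of_day(stamp: str) -> Decimal:
    """Convert 'YYYY-MM-DDTHH:MM:SS,nnnnnnnnn+0000' to seconds since midnight."""
    clock = stamp.split("T")[1].split("+")[0]  # 'HH:MM:SS,nnnnnnnnn'
    hours, minutes, seconds = clock.split(":")
    return (Decimal(hours) * 3600 + Decimal(minutes) * 60
            + Decimal(seconds.replace(",", ".")))


start = "2022-06-08T05:12:34,397626166+0000"   # pull of shipyard:3.9.1-cargo begins
failed = "2022-06-08T05:13:04,792495588+0000"  # daemon reports the i/o timeout
elapsed = seconds_of_day(failed) - seconds_of_day(start)
print(elapsed)  # 30.394869422
```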
Batch Shipyard Version
3.9.1
Batch Pool Configuration
pool_specification:
  id: scrum-pool
  vm_configuration:
    platform_image:
      offer: centos-container
      publisher: microsoft-azure-batch
      sku: 7-8
  vm_count:
    dedicated: 1
    low_priority: 0
  vm_size: Standard_A2_v2
  reboot_on_start_task_failed: true
Additional Logs
Linux 15f0febd29bf4e55b151a692b61a1eac000000 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
2022-06-08T05:12:25,187903573+0000 - WARNING - Unknown DISTRIB_CODENAME.
2022-06-08T05:12:25,202975095+0000 - INFO - Prep start
Configuration:
Custom image: 0
Native mode: 1
OS Distribution: centos 7
Batch Shipyard version: 3.9.1
Blobxfer version: 1.9.4
Singularity version:
User mountpoint:
Mount path: /mnt/resource/batch/tasks/mounts
Batch Insights: 0
Prometheus: NE=, CA=,
Network optimization: 0
Encryption cert thumbprint:
Install Kata Containers: 0
Default container runtime: runc
Install BeeGFS BeeOND: 0
Storage cluster mounts (1):
Custom mount:
Install LIS:
GPU:
GPU ignore warnings:
Azure Blob: 0
Azure File: 0
GlusterFS on compute: 0
HPN-SSH: 0
Enable Azure Batch group for Docker access:
Fallback registry:
Docker image preload delay: 0
Cascade via container: 1
Concurrent source downloads: 10
Block on images:
# Singularity decryption certs:
2022-06-08T05:12:25,265201338+0000 - INFO - Ephemeral disk discovered as /dev/sdb
2022-06-08T05:12:25,281373591+0000 - DEBUG - lsblk: /dev/sdb1 8:17 0 20G 0 part /mnt/resource
2022-06-08T05:12:25,298301166+0000 - INFO - ephemeral: /dev/sdb1 (encrypted=0 user=/mnt/resource)
2022-06-08T05:12:25,370942501+0000 - INFO - VmSize=standard_a2_v2 RDMA=0
2022-06-08T05:12:25,374938513+0000 - INFO - LIS installation not required
2022-06-08T05:12:25,376557558+0000 - INFO - No singularity decryption certificates defined
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2022-06-08 05:10:04 UTC; 2min 20s ago
Docs: https://docs.docker.com
Main PID: 1164 (dockerd)
Tasks: 12
Memory: 843.6M
CGroup: /system.slice/docker.service
└─1164 /usr/bin/dockerd -H fd:// --containerd /var/run/containerd/containerd.sock
Jun 08 05:09:59 15f0febd29bf4e55b151a692b61a1eac000000 dockerd[1164]: time="2022-06-08T05:09:59.844055023Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Jun 08 05:09:59 15f0febd29bf4e55b151a692b61a1eac000000 dockerd[1164]: time="2022-06-08T05:09:59.844076823Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///var/run/containerd/containerd.sock 0 }] }" module=grpc
Jun 08 05:09:59 15f0febd29bf4e55b151a692b61a1eac000000 dockerd[1164]: time="2022-06-08T05:09:59.844087822Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Jun 08 05:10:00 15f0febd29bf4e55b151a692b61a1eac000000 dockerd[1164]: time="2022-06-08T05:10:00.549123795Z" level=info msg="Loading containers: start."
Jun 08 05:10:02 15f0febd29bf4e55b151a692b61a1eac000000 dockerd[1164]: time="2022-06-08T05:10:02.622634623Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Jun 08 05:10:03 15f0febd29bf4e55b151a692b61a1eac000000 dockerd[1164]: time="2022-06-08T05:10:03.637093010Z" level=info msg="Loading containers: done."
Jun 08 05:10:04 15f0febd29bf4e55b151a692b61a1eac000000 dockerd[1164]: time="2022-06-08T05:10:04.350500980Z" level=info msg="Docker daemon" commit=847da184ad5048b27f5bdf9d53d070f731b43180 graphdriver(s)=overlay2 version=20.10.11+azure-3
Jun 08 05:10:04 15f0febd29bf4e55b151a692b61a1eac000000 dockerd[1164]: time="2022-06-08T05:10:04.351261171Z" level=info msg="Daemon has completed initialization"
Jun 08 05:10:04 15f0febd29bf4e55b151a692b61a1eac000000 systemd[1]: Started Docker Application Container Engine.
Jun 08 05:10:04 15f0febd29bf4e55b151a692b61a1eac000000 dockerd[1164]: time="2022-06-08T05:10:04.707034818Z" level=info msg="API listen on /var/run/docker.sock"
Client:
Version: 20.10.11+azure-3
API version: 1.41
Go version: go1.16.12
Git commit: dea9396e184290f638ea873c76db7c80efd5a1d2
Built: Wed Nov 17 23:49:46 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
Server:
 Engine:
  Version: 20.10.11+azure-3
  API version: 1.41 (minimum version 1.12)
  Go version: go1.16.12
  Git commit: 847da184ad5048b27f5bdf9d53d070f731b43180
  Built: Thu Nov 18 00:21:59 2021
  OS/Arch: linux/amd64
  Experimental: false
 containerd:
  Version: 1.4.12+azure-1
  GitCommit: 7b11cfaabd73bb80907dd23182b9347b4245eb5d
 runc:
  Version: 1.0.3
  GitCommit: f46b6ba2c9314cfc8caae24a32ec5fe9ef1059fe
 docker-init:
  Version: 0.19.0
  GitCommit:
Client:
 Context: default
 Debug Mode: false
Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 2
 Server Version: 20.10.11+azure-3
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7b11cfaabd73bb80907dd23182b9347b4245eb5d
 runc version: f46b6ba2c9314cfc8caae24a32ec5fe9ef1059fe
 init version:
 Kernel Version: 3.10.0-1127.19.1.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 3.701GiB
 Name: 15f0febd29bf4e55b151a692b61a1eac000000
 ID: 5NBN:SBAR:2RJ4:4N7S:MW57:D6OV:YS2L:LIOZ:BXZ7:QMWC:J5AU:KNJJ
 Docker Root Dir: /mnt/resource/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries: 127.0.0.0/8
 Live Restore Enabled: false