Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

AML fastai example "custom docker image" does not work #1482

Open hamelsmu opened 3 years ago

hamelsmu commented 3 years ago

@keijik @cody-dkdc

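Running the fastai "custom docker image" example does not work: the run fails during job environment preparation because pulling fastdotai/fastai:latest from registry-1.docker.io times out on every attempt.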

Things I tried:

Requests

I've included the logs from my attempted AML Run from this notebook below 👇🏽
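Since the same timeout repeats many times in the logs, here is a minimal sketch for summarizing which registry the pull failed against and on which attempts. It only assumes the "Running Docker Command attempt N failed" line format seen in these logs; the name `summarize_pull_failures` and the regular expression are made up for illustration and are not part of any Azure ML tooling.

```python
import re

# Matches log lines such as:
#   "... Running Docker Command attempt 3 failed with client timeout err: ...
#    Get https://registry-1.docker.io/v2/: net/http: request canceled ..."
# Group 1 is the attempt number, group 2 is the registry scheme and host.
ATTEMPT_RE = re.compile(
    r"Running Docker Command attempt (\d+) failed.*?Get (https?://[^/\s]+)"
)


def summarize_pull_failures(log_text):
    """Return a dict mapping each registry URL to its sorted failed attempt numbers.

    Lines that the log repeats (for example inside the nested '>>>' output)
    are counted once per attempt number.
    """
    failures = {}
    for line in log_text.splitlines():
        match = ATTEMPT_RE.search(line)
        if match:
            attempt = int(match.group(1))
            registry = match.group(2)
            failures.setdefault(registry, set()).add(attempt)
    return {registry: sorted(attempts) for registry, attempts in failures.items()}
```

Calling `summarize_pull_failures` on the full log text should give one entry per registry host, which makes it easy to confirm that every attempt failed against Docker Hub rather than against the workspace's own registry.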

2021-05-22T15:06:36Z Successfully mounted a/an Blobfuse File System at /mnt/batch/tasks/shared/LS_root/jobs/dds-ml-east/azureml/fastai-custom-image_1621695915_a4bb441e/mounts/workspaceblobstore
2021-05-22T15:06:36Z Starting output-watcher...
2021-05-22T15:06:37Z IsDedicatedCompute == True, won't poll for Low Pri Preemption
2021-05-22T15:06:37Z Executing 'Copy ACR Details file' on 10.8.96.89
2021-05-22T15:06:37Z Copy ACR Details file succeeded on 10.8.96.89. Output: 
>>>   
>>>   
2021-05-22T15:06:52Z Running Docker Command attempt 1 failed with client timeout err: exit status 1,Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
. See documentation for error details: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-faq#docker-pull-fails-with-error-nethttp-request-canceled-while-waiting-for-connection-clienttimeout-exceeded-while-awaiting-headers
2021-05-22T15:06:57Z Force Restart Docker Service
2021-05-22T15:06:57Z 
2021-05-22T15:06:57Z Waiting for docker daemon to come up.
2021-05-22T15:06:57Z Docker daemon is active
2021-05-22T15:06:57Z Retry Docker Command...
2021-05-22T15:07:12Z Running Docker Command attempt 2 failed with client timeout err: exit status 1,Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
. See documentation for error details: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-faq#docker-pull-fails-with-error-nethttp-request-canceled-while-waiting-for-connection-clienttimeout-exceeded-while-awaiting-headers
2021-05-22T15:07:21Z Force Restart Docker Service
2021-05-22T15:07:21Z 
2021-05-22T15:07:22Z Waiting for docker daemon to come up.
2021-05-22T15:07:22Z Docker daemon is active
2021-05-22T15:07:22Z Retry Docker Command...
2021-05-22T15:07:37Z Running Docker Command attempt 3 failed with client timeout err: exit status 1,Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
. See documentation for error details: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-faq#docker-pull-fails-with-error-nethttp-request-canceled-while-waiting-for-connection-clienttimeout-exceeded-while-awaiting-headers
2021-05-22T15:07:53Z Force Restart Docker Service
2021-05-22T15:07:53Z 
2021-05-22T15:07:53Z Waiting for docker daemon to come up.
2021-05-22T15:07:53Z Docker daemon is active
2021-05-22T15:07:53Z Retry Docker Command...
2021-05-22T15:08:09Z Running Docker Command attempt 4 failed with client timeout err: exit status 1,Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
. See documentation for error details: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-faq#docker-pull-fails-with-error-nethttp-request-canceled-while-waiting-for-connection-clienttimeout-exceeded-while-awaiting-headers
2021-05-22T15:08:41Z Force Restart Docker Service
2021-05-22T15:08:42Z 
2021-05-22T15:08:42Z Waiting for docker daemon to come up.
2021-05-22T15:08:42Z Docker daemon is active
2021-05-22T15:08:42Z Retry Docker Command...
2021-05-22T15:08:57Z Running Docker Command attempt 5 failed with client timeout err: exit status 1,Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
. See documentation for error details: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-faq#docker-pull-fails-with-error-nethttp-request-canceled-while-waiting-for-connection-clienttimeout-exceeded-while-awaiting-headers
2021-05-22T15:10:02Z Force Restart Docker Service
2021-05-22T15:10:02Z 
2021-05-22T15:10:02Z Waiting for docker daemon to come up.
2021-05-22T15:10:02Z Docker daemon is active
2021-05-22T15:10:02Z Retry Docker Command...
2021-05-22T15:10:17Z Running Docker Command attempt 6 failed with client timeout err: exit status 1,Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
. See documentation for error details: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-faq#docker-pull-fails-with-error-nethttp-request-canceled-while-waiting-for-connection-clienttimeout-exceeded-while-awaiting-headers
2021-05-22T15:12:26Z Force Restart Docker Service
2021-05-22T15:12:26Z 
2021-05-22T15:12:26Z Waiting for docker daemon to come up.
2021-05-22T15:12:26Z Docker daemon is active
2021-05-22T15:12:26Z Retry Docker Command...
2021-05-22T15:12:41Z Running Docker Command attempt 7 failed with client timeout err: exit status 1,Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
. See documentation for error details: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-faq#docker-pull-fails-with-error-nethttp-request-canceled-while-waiting-for-connection-clienttimeout-exceeded-while-awaiting-headers
2021-05-22T15:13:11Z Job environment preparation failed on 10.8.96.89. Output: 
>>>   2021/05/22 15:06:35 Starting App Insight Logger for task:  prepareJobEnvironment
>>>   2021/05/22 15:06:35 Version: 3.0.01597.0004 Branch: 2021-05-17-bing-hotfix Commit: 974f3e4
>>>   2021/05/22 15:06:35 runtime.GOOS linux
>>>   2021/05/22 15:06:35 Checking if '/tmp' exists
>>>   2021/05/22 15:06:35 Reading dyanamic configs
>>>   2021/05/22 15:06:35 Container sas url: https://baiscriptseastusprod.blob.core.windows.net/aihosttools?sv=2018-03-28&sr=c&si=aihosttoolspolicy&sig=gCpFfTbL8hPl%2BzV43hBdfOZC4SuKqZoJraIo10S4%2FYw%3D
>>>   2021/05/22 15:06:35 Failed to read from file /mnt/batch/tasks/startup/wd/az_resource/xdsenv.variable/azsecpack.variables, open /mnt/batch/tasks/startup/wd/az_resource/xdsenv.variable/azsecpack.variables: no such file or directory
>>>   2021/05/22 15:06:35 [in autoUpgradeFromJobNodeSetup] Is Azsecpack installer on host: false. Is Azsecpack enabled: false,
>>>   2021/05/22 15:06:35 Starting Azsecpack installation on machine: bf9f722d45714167beb968edcea13f1600000E#398a6654-997b-47e9-b12b-9515b896b4de#91095667-e119-4555-acea-1826488492f0#ds-tengri-resources-eastus#dds-ml-east#dds-ml
>>>   2021/05/22 15:06:35 Is Azsecpack enabled: false, GetDisableVsatlsscan: true
>>>   2021/05/22 15:06:35 Turning off azsecpack, if it is already running
>>>   2021/05/22 15:06:35 [doTurnOffAzsecpack] output:Unit mdsd.service could not be found.
>>>   ,err:exit status 1.
>>>   2021/05/22 15:06:35 OS patching disabled by dynamic configs. Skipping.
>>>   2021/05/22 15:06:35 Job: AZ_BATCHAI_JOB_NAME does not turn on the DetonationChamber
>>>   2021/05/22 15:06:35 Start to getting gpu count by running nvidia-smi command
>>>   2021/05/22 15:06:35 GPU count found on the node: 0
>>>   2021/05/22 15:06:35 AMLComputeXDSEndpoint:  https://6e64c585-4845-4356-b1e0-a28ca62f252a.workspace.eastus.cert.api.azureml.ms/xdsbatchai
>>>   2021/05/22 15:06:35 AMLComputeXDSApiVersion:  2018-02-01
>>>   2021/05/22 15:06:35 Creating directory /mnt/batch/tasks/shared/LS_root/jobs/dds-ml-east/azureml/fastai-custom-image_1621695915_a4bb441e/config
>>>   2021/05/22 15:06:35 This is not a aml-workstation (compute instance), current offer type: amlcompute. Starting identity responder as part of prepareJobEnvironment.
>>>   2021/05/22 15:06:35 Starting identity responder.
>>>   2021/05/22 15:06:35 Starting identity responder.
>>>   2021/05/22 15:06:35 Failed to open file /mnt/batch/tasks/shared/LS_root/jobs/dds-ml-east/azureml/fastai-custom-image_1621695915_a4bb441e/config/.batchai.IdentityResponder.envlist: open /mnt/batch/tasks/shared/LS_root/jobs/dds-ml-east/azureml/fastai-custom-image_1621695915_a4bb441e/config/.batchai.IdentityResponder.envlist: no such file or directory
>>>   2021/05/22 15:06:35 Logfile used for identity responder: /mnt/batch/tasks/workitems/051f9434-a110-4ced-be03-f37876075345/job-1/fastai-custom-image__f2308802-f3fd-4f69-9710-581502704959/IdentityResponderLog-tvmps_63c0616b393d93f50f271aee1053d8f6130f081c9a609118eb8f1295575dc40c_d.txt
>>>   2021/05/22 15:06:35 Logfile used for identity responder: /mnt/batch/tasks/workitems/051f9434-a110-4ced-be03-f37876075345/job-1/fastai-custom-image__f2308802-f3fd-4f69-9710-581502704959/IdentityResponderLog-tvmps_63c0616b393d93f50f271aee1053d8f6130f081c9a609118eb8f1295575dc40c_d.txt
>>>   2021/05/22 15:06:35 Started Identity Responder for job.
>>>   2021/05/22 15:06:35 Started Identity Responder for job.
>>>   2021/05/22 15:06:35 Creating directory /mnt/batch/tasks/shared/LS_root/jobs/dds-ml-east/azureml/fastai-custom-image_1621695915_a4bb441e/wd
>>>   2021/05/22 15:06:35 Creating directory /mnt/batch/tasks/shared/LS_root/jobs/dds-ml-east/azureml/fastai-custom-image_1621695915_a4bb441e/shared
>>>   2021/05/22 15:06:35 From the policy service, the filtering patterns is: , data store is 
>>>   2021/05/22 15:06:35 Mounting job level file systems
>>>   2021/05/22 15:06:35 Creating directory /mnt/batch/tasks/shared/LS_root/jobs/dds-ml-east/azureml/fastai-custom-image_1621695915_a4bb441e/mounts
>>>   2021/05/22 15:06:35 Attempting to read datastore credentials file: /mnt/batch/tasks/shared/LS_root/jobs/dds-ml-east/azureml/fastai-custom-image_1621695915_a4bb441e/config/.amlcompute.datastorecredentials
>>>   2021/05/22 15:06:35 Datastore credentials file not found, skipping.
>>>   2021/05/22 15:06:35 Attempting to read runtime sas tokens file: /mnt/batch/tasks/shared/LS_root/jobs/dds-ml-east/azureml/fastai-custom-image_1621695915_a4bb441e/config/.master.runtimesastokens
>>>   2021/05/22 15:06:35 Runtime sas tokens file not found, skipping.
>>>   2021/05/22 15:06:35 No NFS configured
>>>   2021/05/22 15:06:35 No Azure File Shares configured
>>>   2021/05/22 15:06:35 Mounting blob file systems
>>>   2021/05/22 15:06:35 Blobfuse runtime version 1.3.6
>>>   2021/05/22 15:06:35 Mounting azureml-blobstore-6e64c585-4845-4356-b1e0-a28ca62f252a container from ddsmleast9411768689 account at /mnt/batch/tasks/shared/LS_root/jobs/dds-ml-east/azureml/fastai-custom-image_1621695915_a4bb441e/mounts/workspaceblobstore
>>>   2021/05/22 15:06:35 Using Compute Identity to authenticate Blobfuse: false.
>>>   2021/05/22 15:06:35 Using Compute Identity to authenticate Blobfuse: false.
>>>   2021/05/22 15:06:35 Blobfuse cache size set to 11257 MB.
>>>   2021/05/22 15:06:35 Running following command: /bin/bash -c sudo blobfuse /mnt/batch/tasks/shared/LS_root/jobs/dds-ml-east/azureml/fastai-custom-image_1621695915_a4bb441e/mounts/workspaceblobstore --tmp-path=/mnt/batch/tasks/shared/LS_root/jobs/dds-ml-east/azureml/fastai-custom-image_1621695915_a4bb441e/caches/workspaceblobstore --file-cache-timeout-in-seconds=1000000 --cache-size-mb=11257 -o nonempty -o allow_other --config-file=/mnt/batch/tasks/shared/LS_root/jobs/dds-ml-east/azureml/fastai-custom-image_1621695915_a4bb441e/configs/workspaceblobstore.cfg --log-level=LOG_WARNING
>>>   2021/05/22 15:06:35 Successfully mounted a/an Blobfuse File System at /mnt/batch/tasks/shared/LS_root/jobs/dds-ml-east/azureml/fastai-custom-image_1621695915_a4bb441e/mounts/workspaceblobstore
>>>   2021/05/22 15:06:36 Waiting for blobfs to be mounted at /mnt/batch/tasks/shared/LS_root/jobs/dds-ml-east/azureml/fastai-custom-image_1621695915_a4bb441e/mounts/workspaceblobstore
>>>   2021/05/22 15:06:36 Successfully mounted azureml-blobstore-6e64c585-4845-4356-b1e0-a28ca62f252a container from ddsmleast9411768689 account at /mnt/batch/tasks/shared/LS_root/jobs/dds-ml-east/azureml/fastai-custom-image_1621695915_a4bb441e/mounts/workspaceblobstore
>>>   2021/05/22 15:06:36 Created run_id directory: /mnt/batch/tasks/shared/LS_root/jobs/dds-ml-east/azureml/fastai-custom-image_1621695915_a4bb441e/mounts/workspaceblobstore/azureml/fastai-custom-image_1621695915_a4bb441e
>>>   2021/05/22 15:06:36 No unmanaged file systems configured
>>>   2021/05/22 15:06:36 Start to getting gpu count by running nvidia-smi command
>>>   2021/05/22 15:06:36 From the policy service, the filtering patterns is: , data store is 
>>>   2021/05/22 15:06:36 Creating directory /mnt/batch/tasks/shared/LS_root/jobs/dds-ml-east/azureml/fastai-custom-image_1621695915_a4bb441e/mounts/workspaceblobstore/azureml/fastai-custom-image_1621695915_a4bb441e/azureml_compute_logs
>>>   2021/05/22 15:06:36 Creating directory /mnt/batch/tasks/shared/LS_root/jobs/dds-ml-east/azureml/fastai-custom-image_1621695915_a4bb441e/mounts/workspaceblobstore/azureml/fastai-custom-image_1621695915_a4bb441e/logs
>>>   2021/05/22 15:06:36 Creating directory /mnt/batch/tasks/shared/LS_root/jobs/dds-ml-east/azureml/fastai-custom-image_1621695915_a4bb441e/mounts/workspaceblobstore/azureml/fastai-custom-image_1621695915_a4bb441e/outputs
>>>   2021/05/22 15:06:36 Starting output-watcher...
>>>   2021/05/22 15:06:36 Single file input dataset is enabled.
>>>   2021/05/22 15:06:36 Start to pulling docker image: fastdotai/fastai:latest
>>>   2021/05/22 15:06:36 Start pull docker image: fastdotai
>>>   2021/05/22 15:06:36 Getting credentials for image fastdotai/fastai:latest with url 
>>>   2021/05/22 15:06:36 Container registry is not ACR.
>>>   2021/05/22 15:06:36 Skip getting ACR Credentials from Identity and will be getting it from EMS
>>>   2021/05/22 15:06:36 Getting ACR Credentials from EMS for environment fastai:Autosave_2021-05-22T15:05:19Z_b9284463
>>>   2021/05/22 15:06:36 Requesting XDS for registry details.
>>>   2021/05/22 15:06:36 Attempt 1 of http call to https://6e64c585-4845-4356-b1e0-a28ca62f252a.workspace.eastus.cert.api.azureml.ms/xdsbatchai/hosttoolapi/subscriptions/91095667-e119-4555-acea-1826488492f0/resourceGroups/ds-tengri-resources-eastus/workspaces/dds-ml-east/clusters/dds-ml/nodes/tvmps_63c0616b393d93f50f271aee1053d8f6130f081c9a609118eb8f1295575dc40c_d?api-version=2018-02-01
>>>   2021/05/22 15:06:37 Got container registry details from credentials service for registry address: .
>>>   2021/05/22 15:06:37 Writing ACR Details to file...
>>>   2021/05/22 15:06:37 Copying ACR Details file to worker nodes...
>>>   2021/05/22 15:06:37 Executing 'Copy ACR Details file' on 10.8.96.89
>>>   2021/05/22 15:06:37 Begin executing 'Copy ACR Details file' task on Node
>>>   2021/05/22 15:06:37 'Copy ACR Details file' task Node result: succeeded
>>>   2021/05/22 15:06:37 Copy ACR Details file succeeded on 10.8.96.89. Output: 
>>>   >>>   
>>>   >>>   
>>>   2021/05/22 15:06:37 EncryptedDockerRegistryPassword is empty.
>>>   2021/05/22 15:06:37 EMS returned empty credentials for environment fastai
>>>   2021/05/22 15:06:37 Save docker credentials for image fastdotai/fastai:latest in /mnt/batch/tasks/shared/LS_root/jobs/dds-ml-east/azureml/fastai-custom-image_1621695915_a4bb441e/wd/docker_login_6FE00B6271AD80D6
>>>   2021/05/22 15:06:37 The login info is empty, skipping login to the docker registry.
>>>   2021/05/22 15:06:37 Start run pull docker image command
>>>   2021/05/22 15:06:40 Not exporting to RunHistory as the exporter is either stopped or there is no data.
>>>   Stopped: false
>>>   OriginalData: 18
>>>   FilteredData: 0.
>>>   2021/05/22 15:06:52 Running Docker Command attempt 1 failed with client timeout err: exit status 1,Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
>>>   . See documentation for error details: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-faq#docker-pull-fails-with-error-nethttp-request-canceled-while-waiting-for-connection-clienttimeout-exceeded-while-awaiting-headers
>>>   2021/05/22 15:06:52 Running Docker Command attempt 1 failed with client timeout err: exit status 1,Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
>>>   . See documentation for error details: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-faq#docker-pull-fails-with-error-nethttp-request-canceled-while-waiting-for-connection-clienttimeout-exceeded-while-awaiting-headers
>>>   2021/05/22 15:06:57 Force Restart Docker Service
>>>   2021/05/22 15:06:57 Force Restart Docker Service
>>>   2021/05/22 15:06:57 
>>>   2021/05/22 15:06:57 
>>>   2021/05/22 15:06:57 Last 20 lines of Docker daemon log file, fetched after force restart:
>>>    time="2021-05-22T15:06:57.302348600Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///var/run/containerd/containerd.sock 0  <nil>}] <nil>}" module=grpc
>>>   time="2021-05-22T15:06:57.302735600Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
>>>   time="2021-05-22T15:06:57.303901500Z" level=info msg="parsed scheme: \"unix\"" module=grpc
>>>   time="2021-05-22T15:06:57.303923800Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
>>>   time="2021-05-22T15:06:57.303956200Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///var/run/containerd/containerd.sock 0  <nil>}] <nil>}" module=grpc
>>>   time="2021-05-22T15:06:57.303966600Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
>>>   time="2021-05-22T15:06:57.309015500Z" level=info msg="[graphdriver] using prior storage driver: overlay2"
>>>   time="2021-05-22T15:06:57.311528200Z" level=warning msg="Your kernel does not support swap memory limit"
>>>   time="2021-05-22T15:06:57.311564200Z" level=warning msg="Your kernel does not support cgroup rt period"
>>>   time="2021-05-22T15:06:57.311571500Z" level=warning msg="Your kernel does not support cgroup rt runtime"
>>>   time="2021-05-22T15:06:57.311577100Z" level=warning msg="Your kernel does not support cgroup blkio weight"
>>>   time="2021-05-22T15:06:57.311582400Z" level=warning msg="Your kernel does not support cgroup blkio weight_device"
>>>   time="2021-05-22T15:06:57.311699200Z" level=info msg="Loading containers: start."
>>>   time="2021-05-22T15:06:57.393926400Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
>>>   time="2021-05-22T15:06:57.424473200Z" level=info msg="Loading containers: done."
>>>   time="2021-05-22T15:06:57.439214700Z" level=warning msg="Not using native diff for overlay2, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled" storage-driver=overlay2
>>>   time="2021-05-22T15:06:57.440205600Z" level=info msg="Docker daemon" commit=7d75c1d40d88ddef08653dbd611f41df42bdf087 graphdriver(s)=overlay2 version=19.03.14+azure
>>>   time="2021-05-22T15:06:57.440454200Z" level=info msg="Daemon has completed initialization"
>>>   time="2021-05-22T15:06:57.463106300Z" level=info msg="API listen on /var/run/docker.sock"
>>>   Started Docker Application Container Engine.
>>>   
>>>   2021/05/22 15:06:57 Finished restarting docker service if needed
>>>   2021/05/22 15:06:57 Waiting for docker daemon to come up.
>>>   2021/05/22 15:06:57 Waiting for docker daemon to come up.
>>>   2021/05/22 15:06:57 Docker daemon is active
>>>   2021/05/22 15:06:57 Docker daemon is active
>>>   2021/05/22 15:06:57 Retry Docker Command...
>>>   2021/05/22 15:06:57 Retry Docker Command...
>>>   2021/05/22 15:07:12 Running Docker Command attempt 2 failed with client timeout err: exit status 1,Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
>>>   . See documentation for error details: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-faq#docker-pull-fails-with-error-nethttp-request-canceled-while-waiting-for-connection-clienttimeout-exceeded-while-awaiting-headers
>>>   2021/05/22 15:07:12 Running Docker Command attempt 2 failed with client timeout err: exit status 1,Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
>>>   . See documentation for error details: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-faq#docker-pull-fails-with-error-nethttp-request-canceled-while-waiting-for-connection-clienttimeout-exceeded-while-awaiting-headers
>>>   2021/05/22 15:07:21 Force Restart Docker Service
>>>   2021/05/22 15:07:21 Force Restart Docker Service
>>>   2021/05/22 15:07:21 
>>>   2021/05/22 15:07:21 
>>>   2021/05/22 15:07:21 Last 20 lines of Docker daemon log file, fetched after force restart:
>>>    time="2021-05-22T15:07:21.719788000Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///var/run/containerd/containerd.sock 0  <nil>}] <nil>}" module=grpc
>>>   time="2021-05-22T15:07:21.719798300Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
>>>   time="2021-05-22T15:07:21.721117500Z" level=info msg="parsed scheme: \"unix\"" module=grpc
>>>   time="2021-05-22T15:07:21.721284300Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
>>>   time="2021-05-22T15:07:21.721449400Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///var/run/containerd/containerd.sock 0  <nil>}] <nil>}" module=grpc
>>>   time="2021-05-22T15:07:21.721592000Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
>>>   time="2021-05-22T15:07:21.729531400Z" level=info msg="[graphdriver] using prior storage driver: overlay2"
>>>   time="2021-05-22T15:07:21.731169400Z" level=warning msg="Your kernel does not support swap memory limit"
>>>   time="2021-05-22T15:07:21.731189800Z" level=warning msg="Your kernel does not support cgroup rt period"
>>>   time="2021-05-22T15:07:21.731196600Z" level=warning msg="Your kernel does not support cgroup rt runtime"
>>>   time="2021-05-22T15:07:21.731202200Z" level=warning msg="Your kernel does not support cgroup blkio weight"
>>>   time="2021-05-22T15:07:21.731207600Z" level=warning msg="Your kernel does not support cgroup blkio weight_device"
>>>   time="2021-05-22T15:07:21.731574600Z" level=info msg="Loading containers: start."
>>>   time="2021-05-22T15:07:21.820365400Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
>>>   time="2021-05-22T15:07:21.858103900Z" level=info msg="Loading containers: done."
>>>   time="2021-05-22T15:07:21.874279400Z" level=warning msg="Not using native diff for overlay2, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled" storage-driver=overlay2
>>>   time="2021-05-22T15:07:21.874631700Z" level=info msg="Docker daemon" commit=7d75c1d40d88ddef08653dbd611f41df42bdf087 graphdriver(s)=overlay2 version=19.03.14+azure
>>>   time="2021-05-22T15:07:21.874679200Z" level=info msg="Daemon has completed initialization"
>>>   time="2021-05-22T15:07:21.887387900Z" level=info msg="API listen on /var/run/docker.sock"
>>>   Started Docker Application Container Engine.
>>>   
>>>   2021/05/22 15:07:22 Finished restarting docker service if needed
>>>   2021/05/22 15:07:22 Waiting for docker daemon to come up.
>>>   2021/05/22 15:07:22 Waiting for docker daemon to come up.
>>>   2021/05/22 15:07:22 Docker daemon is active
>>>   2021/05/22 15:07:22 Docker daemon is active
>>>   2021/05/22 15:07:22 Retry Docker Command...
>>>   2021/05/22 15:07:22 Retry Docker Command...
>>>   2021/05/22 15:07:37 Running Docker Command attempt 3 failed with client timeout err: exit status 1,Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
>>>   . See documentation for error details: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-faq#docker-pull-fails-with-error-nethttp-request-canceled-while-waiting-for-connection-clienttimeout-exceeded-while-awaiting-headers
>>>   2021/05/22 15:07:37 Running Docker Command attempt 3 failed with client timeout err: exit status 1,Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
>>>   . See documentation for error details: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-faq#docker-pull-fails-with-error-nethttp-request-canceled-while-waiting-for-connection-clienttimeout-exceeded-while-awaiting-headers
>>>   2021/05/22 15:07:53 Force Restart Docker Service
>>>   2021/05/22 15:07:53 Force Restart Docker Service
>>>   2021/05/22 15:07:53 
>>>   2021/05/22 15:07:53 
>>>   2021/05/22 15:07:53 Last 20 lines of Docker daemon log file, fetched after force restart:
>>>    time="2021-05-22T15:07:53.712328900Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///var/run/containerd/containerd.sock 0  <nil>}] <nil>}" module=grpc
>>>   time="2021-05-22T15:07:53.712338500Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
>>>   time="2021-05-22T15:07:53.714142100Z" level=info msg="parsed scheme: \"unix\"" module=grpc
>>>   time="2021-05-22T15:07:53.714241000Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
>>>   time="2021-05-22T15:07:53.714448200Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///var/run/containerd/containerd.sock 0  <nil>}] <nil>}" module=grpc
>>>   time="2021-05-22T15:07:53.714533300Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
>>>   time="2021-05-22T15:07:53.722682800Z" level=info msg="[graphdriver] using prior storage driver: overlay2"
>>>   time="2021-05-22T15:07:53.723940700Z" level=warning msg="Your kernel does not support swap memory limit"
>>>   time="2021-05-22T15:07:53.723958800Z" level=warning msg="Your kernel does not support cgroup rt period"
>>>   time="2021-05-22T15:07:53.723966800Z" level=warning msg="Your kernel does not support cgroup rt runtime"
>>>   time="2021-05-22T15:07:53.723972800Z" level=warning msg="Your kernel does not support cgroup blkio weight"
>>>   time="2021-05-22T15:07:53.723978700Z" level=warning msg="Your kernel does not support cgroup blkio weight_device"
>>>   time="2021-05-22T15:07:53.724092000Z" level=info msg="Loading containers: start."
>>>   time="2021-05-22T15:07:53.805591800Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
>>>   time="2021-05-22T15:07:53.843376900Z" level=info msg="Loading containers: done."
>>>   time="2021-05-22T15:07:53.856206300Z" level=warning msg="Not using native diff for overlay2, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled" storage-driver=overlay2
>>>   time="2021-05-22T15:07:53.856574000Z" level=info msg="Docker daemon" commit=7d75c1d40d88ddef08653dbd611f41df42bdf087 graphdriver(s)=overlay2 version=19.03.14+azure
>>>   time="2021-05-22T15:07:53.856712700Z" level=info msg="Daemon has completed initialization"
>>>   Started Docker Application Container Engine.
>>>   time="2021-05-22T15:07:53.871094800Z" level=info msg="API listen on /var/run/docker.sock"
>>>   
>>>   2021/05/22 15:07:53 Finished restarting docker service if needed
>>>   2021/05/22 15:07:53 Waiting for docker daemon to come up.
>>>   2021/05/22 15:07:53 Waiting for docker daemon to come up.
>>>   2021/05/22 15:07:53 Docker daemon is active
>>>   2021/05/22 15:07:53 Docker daemon is active
>>>   2021/05/22 15:07:53 Retry Docker Command...
>>>   2021/05/22 15:07:53 Retry Docker Command...
>>>   2021/05/22 15:08:09 Running Docker Command attempt 4 failed with client timeout err: exit status 1,Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
>>>   . See documentation for error details: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-faq#docker-pull-fails-with-error-nethttp-request-canceled-while-waiting-for-connection-clienttimeout-exceeded-while-awaiting-headers
>>>   2021/05/22 15:08:09 Running Docker Command attempt 4 failed with client timeout err: exit status 1,Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
>>>   . See documentation for error details: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-faq#docker-pull-fails-with-error-nethttp-request-canceled-while-waiting-for-connection-clienttimeout-exceeded-while-awaiting-headers
>>>   2021/05/22 15:08:41 Force Restart Docker Service
>>>   2021/05/22 15:08:41 Force Restart Docker Service
>>>   2021/05/22 15:08:42 
>>>   2021/05/22 15:08:42 
>>>   2021/05/22 15:08:42 Last 20 lines of Docker daemon log file, fetched after force restart:
>>>    time="2021-05-22T15:08:42.087658100Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///var/run/containerd/containerd.sock 0  <nil>}] <nil>}" module=grpc
>>>   time="2021-05-22T15:08:42.087667500Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
>>>   time="2021-05-22T15:08:42.089030500Z" level=info msg="parsed scheme: \"unix\"" module=grpc
>>>   time="2021-05-22T15:08:42.089058200Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
>>>   time="2021-05-22T15:08:42.089076100Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///var/run/containerd/containerd.sock 0  <nil>}] <nil>}" module=grpc
>>>   time="2021-05-22T15:08:42.089089600Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
>>>   time="2021-05-22T15:08:42.098016300Z" level=info msg="[graphdriver] using prior storage driver: overlay2"
>>>   time="2021-05-22T15:08:42.099754700Z" level=warning msg="Your kernel does not support swap memory limit"
>>>   time="2021-05-22T15:08:42.099778000Z" level=warning msg="Your kernel does not support cgroup rt period"
>>>   time="2021-05-22T15:08:42.099785600Z" level=warning msg="Your kernel does not support cgroup rt runtime"
>>>   time="2021-05-22T15:08:42.099791900Z" level=warning msg="Your kernel does not support cgroup blkio weight"
>>>   time="2021-05-22T15:08:42.099798100Z" level=warning msg="Your kernel does not support cgroup blkio weight_device"
>>>   time="2021-05-22T15:08:42.099938600Z" level=info msg="Loading containers: start."
>>>   time="2021-05-22T15:08:42.180795500Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
>>>   time="2021-05-22T15:08:42.212181600Z" level=info msg="Loading containers: done."
>>>   time="2021-05-22T15:08:42.228966500Z" level=warning msg="Not using native diff for overlay2, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled" storage-driver=overlay2
>>>   time="2021-05-22T15:08:42.229346600Z" level=info msg="Docker daemon" commit=7d75c1d40d88ddef08653dbd611f41df42bdf087 graphdriver(s)=overlay2 version=19.03.14+azure
>>>   time="2021-05-22T15:08:42.229412500Z" level=info msg="Daemon has completed initialization"
>>>   time="2021-05-22T15:08:42.242934500Z" level=info msg="API listen on /var/run/docker.sock"
>>>   Started Docker Application Container Engine.
>>>   
>>>   2021/05/22 15:08:42 Finished restarting docker service if needed
>>>   2021/05/22 15:08:42 Waiting for docker daemon to come up.
>>>   2021/05/22 15:08:42 Waiting for docker daemon to come up.
>>>   2021/05/22 15:08:42 Docker daemon is active
>>>   2021/05/22 15:08:42 Docker daemon is active
>>>   2021/05/22 15:08:42 Retry Docker Command...
>>>   2021/05/22 15:08:42 Retry Docker Command...
>>>   2021/05/22 15:08:57 Running Docker Command attempt 5 failed with client timeout err: exit status 1,Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
>>>   . See documentation for error details: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-faq#docker-pull-fails-with-error-nethttp-request-canceled-while-waiting-for-connection-clienttimeout-exceeded-while-awaiting-headers
>>>   2021/05/22 15:08:57 Running Docker Command attempt 5 failed with client timeout err: exit status 1,Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
>>>   . See documentation for error details: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-faq#docker-pull-fails-with-error-nethttp-request-canceled-while-waiting-for-connection-clienttimeout-exceeded-while-awaiting-headers
>>>   2021/05/22 15:10:02 Force Restart Docker Service
>>>   2021/05/22 15:10:02 Force Restart Docker Service
>>>   2021/05/22 15:10:02 
>>>   2021/05/22 15:10:02 
>>>   2021/05/22 15:10:02 Last 20 lines of Docker daemon log file, fetched after force restart:
>>>    time="2021-05-22T15:10:02.554674900Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///var/run/containerd/containerd.sock 0  <nil>}] <nil>}" module=grpc
>>>   time="2021-05-22T15:10:02.554683700Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
>>>   time="2021-05-22T15:10:02.555814300Z" level=info msg="parsed scheme: \"unix\"" module=grpc
>>>   time="2021-05-22T15:10:02.556471100Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
>>>   time="2021-05-22T15:10:02.556488600Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///var/run/containerd/containerd.sock 0  <nil>}] <nil>}" module=grpc
>>>   time="2021-05-22T15:10:02.556501300Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
>>>   time="2021-05-22T15:10:02.564340600Z" level=info msg="[graphdriver] using prior storage driver: overlay2"
>>>   time="2021-05-22T15:10:02.565848300Z" level=warning msg="Your kernel does not support swap memory limit"
>>>   time="2021-05-22T15:10:02.565884700Z" level=warning msg="Your kernel does not support cgroup rt period"
>>>   time="2021-05-22T15:10:02.565891600Z" level=warning msg="Your kernel does not support cgroup rt runtime"
>>>   time="2021-05-22T15:10:02.565896900Z" level=warning msg="Your kernel does not support cgroup blkio weight"
>>>   time="2021-05-22T15:10:02.565903600Z" level=warning msg="Your kernel does not support cgroup blkio weight_device"
>>>   time="2021-05-22T15:10:02.566121800Z" level=info msg="Loading containers: start."
>>>   time="2021-05-22T15:10:02.651460900Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
>>>   time="2021-05-22T15:10:02.682343000Z" level=info msg="Loading containers: done."
>>>   time="2021-05-22T15:10:02.698387100Z" level=warning msg="Not using native diff for overlay2, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled" storage-driver=overlay2
>>>   time="2021-05-22T15:10:02.698780400Z" level=info msg="Docker daemon" commit=7d75c1d40d88ddef08653dbd611f41df42bdf087 graphdriver(s)=overlay2 version=19.03.14+azure
>>>   time="2021-05-22T15:10:02.698851500Z" level=info msg="Daemon has completed initialization"
>>>   time="2021-05-22T15:10:02.715795400Z" level=info msg="API listen on /var/run/docker.sock"
>>>   Started Docker Application Container Engine.
>>>   
>>>   2021/05/22 15:10:02 Finished restarting docker service if needed
>>>   2021/05/22 15:10:02 Waiting for docker daemon to come up.
>>>   2021/05/22 15:10:02 Waiting for docker daemon to come up.
>>>   2021/05/22 15:10:02 Docker daemon is active
>>>   2021/05/22 15:10:02 Docker daemon is active
>>>   2021/05/22 15:10:02 Retry Docker Command...
>>>   2021/05/22 15:10:02 Retry Docker Command...
>>>   2021/05/22 15:10:17 Running Docker Command attempt 6 failed with client timeout err: exit status 1,Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
>>>   . See documentation for error details: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-faq#docker-pull-fails-with-error-nethttp-request-canceled-while-waiting-for-connection-clienttimeout-exceeded-while-awaiting-headers
>>>   2021/05/22 15:10:17 Running Docker Command attempt 6 failed with client timeout err: exit status 1,Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
>>>   . See documentation for error details: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-faq#docker-pull-fails-with-error-nethttp-request-canceled-while-waiting-for-connection-clienttimeout-exceeded-while-awaiting-headers
>>>   2021/05/22 15:12:26 Force Restart Docker Service
>>>   2021/05/22 15:12:26 Force Restart Docker Service
>>>   2021/05/22 15:12:26 
>>>   2021/05/22 15:12:26 
>>>   2021/05/22 15:12:26 Last 20 lines of Docker daemon log file, fetched after force restart:
>>>    time="2021-05-22T15:12:26.277283800Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///var/run/containerd/containerd.sock 0  <nil>}] <nil>}" module=grpc
>>>   time="2021-05-22T15:12:26.277294100Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
>>>   time="2021-05-22T15:12:26.279050800Z" level=info msg="parsed scheme: \"unix\"" module=grpc
>>>   time="2021-05-22T15:12:26.279075400Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
>>>   time="2021-05-22T15:12:26.279089200Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///var/run/containerd/containerd.sock 0  <nil>}] <nil>}" module=grpc
>>>   time="2021-05-22T15:12:26.279097600Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
>>>   time="2021-05-22T15:12:26.283940400Z" level=info msg="[graphdriver] using prior storage driver: overlay2"
>>>   time="2021-05-22T15:12:26.285551700Z" level=warning msg="Your kernel does not support swap memory limit"
>>>   time="2021-05-22T15:12:26.285572600Z" level=warning msg="Your kernel does not support cgroup rt period"
>>>   time="2021-05-22T15:12:26.285579900Z" level=warning msg="Your kernel does not support cgroup rt runtime"
>>>   time="2021-05-22T15:12:26.285585600Z" level=warning msg="Your kernel does not support cgroup blkio weight"
>>>   time="2021-05-22T15:12:26.285591500Z" level=warning msg="Your kernel does not support cgroup blkio weight_device"
>>>   time="2021-05-22T15:12:26.285715100Z" level=info msg="Loading containers: start."
>>>   time="2021-05-22T15:12:26.364858000Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
>>>   time="2021-05-22T15:12:26.396268400Z" level=info msg="Loading containers: done."
>>>   time="2021-05-22T15:12:26.412820700Z" level=warning msg="Not using native diff for overlay2, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled" storage-driver=overlay2
>>>   time="2021-05-22T15:12:26.413208200Z" level=info msg="Docker daemon" commit=7d75c1d40d88ddef08653dbd611f41df42bdf087 graphdriver(s)=overlay2 version=19.03.14+azure
>>>   time="2021-05-22T15:12:26.413260000Z" level=info msg="Daemon has completed initialization"
>>>   time="2021-05-22T15:12:26.432611400Z" level=info msg="API listen on /var/run/docker.sock"
>>>   Started Docker Application Container Engine.
>>>   
>>>   2021/05/22 15:12:26 Finished restarting docker service if needed
>>>   2021/05/22 15:12:26 Waiting for docker daemon to come up.
>>>   2021/05/22 15:12:26 Waiting for docker daemon to come up.
>>>   2021/05/22 15:12:26 Docker daemon is active
>>>   2021/05/22 15:12:26 Docker daemon is active
>>>   2021/05/22 15:12:26 Retry Docker Command...
>>>   2021/05/22 15:12:26 Retry Docker Command...
>>>   2021/05/22 15:12:41 Running Docker Command attempt 7 failed with client timeout err: exit status 1,Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
>>>   . See documentation for error details: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-faq#docker-pull-fails-with-error-nethttp-request-canceled-while-waiting-for-connection-clienttimeout-exceeded-while-awaiting-headers
>>>   2021/05/22 15:12:41 Running Docker Command attempt 7 failed with client timeout err: exit status 1,Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
>>>   . See documentation for error details: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-faq#docker-pull-fails-with-error-nethttp-request-canceled-while-waiting-for-connection-clienttimeout-exceeded-while-awaiting-headers
>>>   2021/05/22 15:12:41 Run docker command to pull public image failed with error: Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
>>>   .
>>>   2021/05/22 15:12:41 Docker config dir /mnt/batch/tasks/shared/LS_root/jobs/dds-ml-east/azureml/fastai-custom-image_1621695915_a4bb441e/wd/docker_login_6FE00B6271AD80D6 does not exist, skip removing it
>>>   2021/05/22 15:12:41 Pull docker image time: 6m4.6839193s
>>>   
>>>   2021/05/22 15:12:41 Get credentials or pull docker image failed with err: Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
>>>   , skipping start Docker Container
>>>   2021/05/22 15:12:41 Starting Container fail with err Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
>>>   
>>>   2021/05/22 15:12:41 Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
>>>   
>>>   2021/05/22 15:12:45 Attempt 1 of http call to https://6e64c585-4845-4356-b1e0-a28ca62f252a.workspace.eastus.api.azureml.ms/history/v1.0/private/subscriptions/91095667-e119-4555-acea-1826488492f0/resourceGroups/ds-tengri-resources-eastus/providers/Microsoft.MachineLearningServices/workspaces/DDS-ML-EAST/runs/fastai-custom-image_1621695915_a4bb441e/spans
>>>   2021/05/22 15:13:11 Time Out after 20 second retries for flushing the logs, doing another retry before exiting
>>>   2021/05/22 15:13:11 Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
>>>   
>>>   
2021-05-22T15:13:11Z PostJobNodeHealthCheck
2021-05-22T15:13:11Z Executing 'Post job node health check' on 10.8.96.89
2021-05-22T15:13:41Z Post job node health check succeeded on 10.8.96.89. Output: 
>>>   2021/05/22 15:13:11 Starting App Insight Logger for task:  postJobNodeHealthCheck
>>>   2021/05/22 15:13:11 Version: 3.0.01597.0004 Branch: 2021-05-17-bing-hotfix Commit: 974f3e4
>>>   2021/05/22 15:13:11 Start Post-job node health check
>>>   2021/05/22 15:13:11 PostJobNodeHealthCheck
>>>   2021/05/22 15:13:11 GetDBE: get DBE error
>>>   2021/05/22 15:13:11 No system error was found
>>>   2021/05/22 15:13:11 DBEOutput: 
>>>   2021/05/22 15:13:11 GetOOM: get OOM error
>>>   2021/05/22 15:13:11 No system error was found
>>>   2021/05/22 15:13:11 Skipping NCCL CUDA Error Check because it's not enabled in dynamic config
>>>   2021/05/22 15:13:11 This is a cpu cluster, skipping gpu usage check
>>>   2021/05/22 15:13:11 Not exporting to RunHistory as the exporter is either stopped or there is no data.
>>>   Stopped: false
>>>   OriginalData: 1
>>>   FilteredData: 0.
>>>   2021/05/22 15:13:11 Process Exiting with Code:  0
>>>   2021/05/22 15:13:41 Time Out after 20 second retries for flushing the logs, doing another retry before exiting
>>>   
2021-05-22T15:13:41Z Executing 'JobRelease task' on 10.8.96.89
2021-05-22T15:14:12Z JobRelease task succeeded on 10.8.96.89. Output: 
>>>   2021/05/22 15:13:41 Starting App Insight Logger for task:  jobRelease
>>>   2021/05/22 15:13:41 Version: 3.0.01597.0004 Branch: 2021-05-17-bing-hotfix Commit: 974f3e4
>>>   2021/05/22 15:13:42 Exit since job container is not in running state.
>>>   2021/05/22 15:14:12 Time Out after 20 second retries for flushing the logs, doing another retry before exiting
>>>   2021/05/22 15:14:12 App Insight Client has already been closed
>>>   2021/05/22 15:14:12 Not exporting to RunHistory as the exporter is either stopped or there is no data.
>>>   Stopped: false
>>>   OriginalData: 1
>>>   FilteredData: 0.
>>>   
2021-05-22T15:14:12Z Executing 'Collect error information from workers' on 10.8.96.89
2021-05-22T15:14:12Z Collect error information from workers succeeded on 10.8.96.89. Output: 
>>>   
>>>   
2021-05-22T15:14:12Z Executing 'Job environment clean-up' on 10.8.96.89
2021-05-22T15:14:12Z Removing container fastai-custom-image_1621695915_a4bb441e exited with 1, Error: No such container: fastai-custom-image_1621695915_a4bb441e

hamelsmu commented 3 years ago

cc: @gregce

vivram commented 3 years ago

The notebook pulls its image from Docker Hub: fastai_env.docker.base_image = "fastdotai/fastai:latest"

Since your workspace is behind Private Link, can you check whether Docker Hub is accessible from within your cluster? If not, can you push the image to your workspace ACR and retry with that image instead?
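
One way to run the Docker Hub check from a compute node is a short standard-library probe. This is a minimal sketch, not something taken from the notebook: `can_reach_registry` and its `opener` parameter are hypothetical names, and the URL is the one that appears in the failing log. Any HTTP response, including the 401 a registry normally returns to an unauthenticated request, is treated as evidence that the network path works.

```python
import socket
import urllib.error
import urllib.request

# Endpoint taken from the error in the run log above.
DOCKER_HUB_URL = "https://registry-1.docker.io/v2/"


def can_reach_registry(url=DOCKER_HUB_URL, timeout=10, opener=urllib.request.urlopen):
    """Return True if the registry answers at all, False on a network failure.

    `opener` is injectable (a hypothetical convenience) so the check can be
    exercised without real network access.
    """
    try:
        opener(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        # The server responded (e.g. 401 Unauthorized), so it is reachable.
        return True
    except (urllib.error.URLError, socket.timeout, OSError):
        # DNS failure, blocked egress, or timeout: the registry is not reachable.
        return False


# Example use on the cluster node:
#   print("Docker Hub reachable:", can_reach_registry())
```

If this returns False on the cluster but True on a machine outside the virtual network, the network setup is the likely cause of the failed pull.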

hamelsmu commented 3 years ago

Yes, it works if I push the image to my private registry. Perhaps update the docs to say that images from a public Docker registry may not be pullable when the workspace is behind a private link?

You can close this issue if you like.
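
For anyone hitting the same problem, the workaround changes the image reference the run uses. The sketch below only builds the new reference string; the image still has to be pushed to the private registry separately. `to_private_registry_ref` is a hypothetical helper and `myworkspaceacr.azurecr.io` is a placeholder address, not a value from this thread.

```python
def to_private_registry_ref(image, registry):
    """Re-target an image reference at a private registry.

    Follows Docker's convention that the first path component names a
    registry host only if it contains '.' or ':' or equals 'localhost'.
    """
    first, sep, rest = image.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        image = rest  # drop the original registry host, e.g. docker.io
    return f"{registry.rstrip('/')}/{image}"


# Placeholder registry address; substitute the workspace's own ACR login server.
PRIVATE_REGISTRY = "myworkspaceacr.azurecr.io"

private_image = to_private_registry_ref("fastdotai/fastai:latest", PRIVATE_REGISTRY)
print(private_image)
```

The resulting string would presumably go wherever the notebook sets `fastai_env.docker.base_image`, assuming the compute is able to authenticate against that registry.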

vivram commented 3 years ago

With the option to specify a private registry called out explicitly in this doc, I think there is enough coverage in the notebook and in the public docs. What is missing is that the error message does not indicate that the network setup can cause the image pull to fail.

We can fix the error message here.
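
As a rough illustration of what a more helpful message could look like, the sketch below appends a network hint when a pull error matches the timeout seen in the log. This is an assumption about the shape of the fix, not the actual Azure ML implementation: `add_network_hint` and the hint wording are hypothetical.

```python
# Substring from the failing pull in the log above.
TIMEOUT_MARKER = "Client.Timeout exceeded while awaiting headers"

# Hypothetical wording for the extra guidance.
NETWORK_HINT = (
    "The registry could not be reached. If the workspace or compute is behind "
    "a private link or restricted virtual network, public registries such as "
    "Docker Hub may be blocked; consider pulling the image from the workspace ACR."
)


def add_network_hint(error_message):
    """Append a network-setup hint to image pull errors that look like timeouts."""
    if TIMEOUT_MARKER in error_message:
        return f"{error_message.rstrip()}\n{NETWORK_HINT}"
    return error_message
```

Errors that do not contain the timeout marker pass through unchanged, so unrelated failures are not mislabeled as network problems.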