Azure / batch-shipyard

Simplify HPC and Batch workloads on Azure
MIT License

blobxfer.exe is missing #358

Open adamzhangsm opened 4 years ago

adamzhangsm commented 4 years ago

Problem Description

Shipyard runs the following script in the start task to prepare the Docker images and download blobxfer.exe:

https://github.com/Azure/batch-shipyard/blob/ff49d187a4a082305c97e3e26946f6560ea68f56/scripts/windows/shipyard_nodeprep_nativedocker.ps1

If the node is rebooted, or becomes unhealthy and then healthy again, the start task is triggered again. However, it does not download them again because the marker file is already in place:

    if (Test-Path $NodePrepFinished -pathType Leaf) {
        Write-Host "$NodePrepFinished file exists, assuming successful completion of node prep"
        exit 0
    }

The stdout.txt of a start task run in which blobxfer.exe was not downloaded will be attached later.

From the node agent log, we can see that the folder D:\batch\tasks\startup was deleted by the recovery or GC task. So blobxfer.exe was there when the node first joined the pool, but it was deleted along with D:\batch\tasks\startup. It will never be downloaded again because of the check in shipyard_nodeprep_nativedocker.ps1.

Batch Shipyard Version: latest version

Steps to Reproduce

  1. Use Shipyard to create a node and a task.
  2. Shipyard downloads blobxfer.exe to D:\batch\tasks\startup.
  3. Restart the node or restart the node agent service.
  4. The startup folder D:\batch\tasks\startup is deleted and blobxfer.exe is not downloaded again, so Shipyard can no longer pull any container image.
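The end state of step 4 could be confirmed on an affected node with a small check like the one below. This is only a sketch: `Get-NodePrepState`, its parameters, and the `Stale` property are hypothetical names, not part of Batch Shipyard, and the caller has to supply the actual marker file and blobxfer.exe paths for the node.

```powershell
# Hypothetical diagnostic; the function name and both parameters are illustrative
# and do not exist in Batch Shipyard.
function Get-NodePrepState {
    param(
        [Parameter(Mandatory = $true)][string]$MarkerPath,
        [Parameter(Mandatory = $true)][string]$BlobxferPath
    )

    # Both checks require a regular file, not a directory.
    $markerExists = Test-Path -LiteralPath $MarkerPath -PathType Leaf
    $binaryExists = Test-Path -LiteralPath $BlobxferPath -PathType Leaf

    [PSCustomObject]@{
        MarkerExists   = $markerExists
        BlobxferExists = $binaryExists
        # The state described in this issue: prep is marked done, but the binary is gone.
        Stale          = ($markerExists -and -not $binaryExists)
    }
}
```

If `Stale` is true after step 3, the node has hit the problem described here.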

Expected Results

The script should download blobxfer.exe again
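One possible shape for that behavior is to honor the marker file only when the files it vouches for are still present. The sketch below is a suggestion, not the script's current logic: `Test-NodePrepComplete` and `$blobxferPath` are hypothetical names, and the download step itself is left out because it depends on how the node prep script fetches blobxfer.exe.

```powershell
# Hypothetical helper; names are illustrative and not taken from
# shipyard_nodeprep_nativedocker.ps1.
function Test-NodePrepComplete {
    param(
        [Parameter(Mandatory = $true)][string]$MarkerPath,
        [Parameter(Mandatory = $true)][string[]]$RequiredFiles
    )

    # No marker file: node prep has never completed on this node.
    if (-not (Test-Path -LiteralPath $MarkerPath -PathType Leaf)) {
        return $false
    }

    # The marker exists, but a file it vouches for may have been deleted since.
    foreach ($file in $RequiredFiles) {
        if (-not (Test-Path -LiteralPath $file -PathType Leaf)) {
            Write-Host "$MarkerPath exists but $file is missing, node prep must be repeated"
            return $false
        }
    }
    return $true
}

# Possible use in place of the marker-only check ($blobxferPath is an assumed variable):
# if (Test-NodePrepComplete -MarkerPath $NodePrepFinished -RequiredFiles @($blobxferPath)) {
#     Write-Host "$NodePrepFinished file exists, assuming successful completion of node prep"
#     exit 0
# }
```

With a check like this, a deleted startup folder would cause the script to fall through to the download step instead of exiting early.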

Actual Results

blobxfer.exe is not downloaded again.

Redacted Configuration

INSERT RELEVANT YAML FILES

Additional Logs

INSERT ADDITIONAL LOGS HERE

Additional Comments

adamzhangsm commented 4 years ago

Here is the stdout.txt of the startup task:

Configuration [Native Docker, Windows]:

    Batch Shipyard version: 3.9.1
    Blobxfer version: 1.9.4
    Mounts path: D:\batch\tasks\mounts
    Custom image: False
    Encrypted:
    Azure File: False

    Directory: D:\batch\tasks\volatile\startup

Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----        8/18/2020   8:59 PM              0 .save
Client:
 Debug Mode: false
 Plugins:
  cluster: Manage Docker Enterprise clusters (Mirantis Inc., v1.4.0)

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 19.03.11
 Storage Driver: windowsfilter
  Windows:
 Logging Driver: json-file
 Plugins:
  Volume: local
  Network: ics internal l2bridge l2tunnel nat null overlay private transparent
  Log: awslogs etwlogs fluentd gcplogs gelf json-file local logentries splunk syslog
 Swarm: inactive
 Default Isolation: process
 Kernel Version: 10.0 14393 (14393.3866.amd64fre.rs1_release.200805-1327)
 Operating System: Windows Server 2016 Datacenter Version 1607 (OS Build 14393.3866)
 OSType: windows
 Architecture: x86_64
 CPUs: 4
 Total Memory: 14GiB
 Name: ad9858181000000
 ID: CYSL:UYRR:D3XA:OWUY:5WT6:WPFT:LXBC:CAWG:ZU5D:H4PT:2MNP:L2M6
 Docker Root Dir: C:\ProgramData\docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: this node is not a swarm manager - check license status on a manager node

D:\batch\tasks\volatile.batch_shipyard_node_prep_finished file exists, assuming successful completion of node prep