Azure / batch-shipyard

Simplify HPC and Batch workloads on Azure
MIT License

TensorFlow-CPU quickstart issues #369

Open fuglede opened 3 years ago

fuglede commented 3 years ago

Following the TensorFlow-CPU quickstart, I run into a couple of issues:

  1. When creating the pool, I get the following error:

RuntimeError: Could not find an Azure Batch Node Agent Sku for this offer=ubuntuserver publisher=canonical sku=16.04-lts. You can list the valid and available Marketplace images with the command: account images

Judging by the Azure Portal, only 18.04 is currently available; indeed, changing the sku in pool.yaml from 16.04-LTS to 18.04-LTS is enough to get rid of this error. This probably affects many of the bundled recipes:

batch-shipyard/recipes$ grep -R 16.04 .
./Caffe-CPU/config/pool.yaml:      sku: 16.04-LTS
./Caffe-GPU/config/pool.yaml:      sku: 16.04-LTS
./Caffe2-CPU/config/pool.yaml:      sku: 16.04-LTS
./Caffe2-GPU/config/pool.yaml:      sku: 16.04-LTS
./Chainer-CPU/config/pool.yaml:      sku: 16.04-LTS
./Chainer-GPU/config/pool.yaml:      sku: 16.04-LTS
./CNTK-CPU-Infiniband-IntelMPI/docker/Dockerfile:FROM ubuntu:16.04
./CNTK-CPU-OpenMPI/config/multinode/pool.yaml:      sku: 16.04-LTS
./CNTK-CPU-OpenMPI/config/singlenode/pool.yaml:      sku: 16.04-LTS
./CNTK-GPU-Infiniband-IntelMPI/docker/Dockerfile:FROM nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
./CNTK-GPU-OpenMPI/config/multinode-multigpu/pool.yaml:      sku: 16.04-LTS
./CNTK-GPU-OpenMPI/config/singlenode-multigpu/pool.yaml:      sku: 16.04-LTS
./CNTK-GPU-OpenMPI/config/singlenode-singlegpu/pool.yaml:      sku: 16.04-LTS
./FFmpeg-GPU/config/pool.yaml:      sku: 16.04-LTS
./HPMLA-CPU-OpenMPI/config/pool.yaml:      sku: 16.04-LTS
./HPMLA-CPU-OpenMPI/Data-Shredding/README.md:* [TPN Ubuntu Container](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/TPN_Ubuntu%20Container_16-04-FINAL.txt)
./HPMLA-CPU-OpenMPI/docker/Dockerfile:FROM ubuntu:16.04
./HPMLA-CPU-OpenMPI/docker/README.md:* [TPN Ubuntu Container](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/TPN_Ubuntu%20Container_16-04-FINAL.txt)
./HPMLA-CPU-OpenMPI/README.md:* [TPN Ubuntu Container](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/TPN_Ubuntu%20Container_16-04-FINAL.txt)
./Keras+Theano-CPU/config/pool.yaml:      sku: 16.04-LTS
./Keras+Theano-GPU/config/pool.yaml:      sku: 16.04-LTS
./MXNet-CPU/config/multinode/pool.yaml:      sku: 16.04-LTS
./MXNet-CPU/config/singlenode/pool.yaml:      sku: 16.04-LTS
./MXNet-CPU/docker/Dockerfile:FROM ubuntu:16.04
./MXNet-GPU/config/multinode/pool.yaml:      sku: 16.04-LTS
./MXNet-GPU/config/singlenode/pool.yaml:      sku: 16.04-LTS
./NAMD-GPU/config/pool.yaml:      sku: 16.04-LTS
./NAMD-TCP/config/pool.yaml:      sku: 16.04-LTS
./RemoteFS-GlusterFS+BatchPool/config/pool.yaml:      sku: 16.04-LTS
./TensorFlow-CPU/config/pool.yaml:      sku: 16.04-LTS
./TensorFlow-Distributed/config/cpu/pool.yaml:      sku: 16.04-LTS
./TensorFlow-Distributed/config/gpu/pool.yaml:      sku: 16.04-LTS
./TensorFlow-GPU/config/docker/pool.yaml:      sku: 16.04-LTS
./TensorFlow-GPU/config/singularity/pool.yaml:      sku: 16.04-LTS
./Torch-CPU/config/pool.yaml:      sku: 16.04-LTS
./Torch-CPU/docker/Dockerfile:FROM ubuntu:16.04
./Torch-GPU/config/pool.yaml:      sku: 16.04-LTS
  2. After the pool is created and I try to create the included job, I get another error:
    $ ../shipyard jobs add --tail stdout.txt
    2021-09-16 10:16:30.581 INFO - Adding job tensorflowjob to pool tensorflow-cpu
    2021-09-16 10:16:30.673 DEBUG - constructing 1 task specifications for submission to job tensorflowjob
    2021-09-16 10:16:30.738 DEBUG - submitting 1 task specifications to job tensorflowjob
    2021-09-16 10:16:30.741 DEBUG - submitting 1 tasks (0 -> 0) to job tensorflowjob
    2021-09-16 10:16:30.971 INFO - submitted all 1 tasks to job tensorflowjob
    2021-09-16 10:16:30.971 DEBUG - attempting to stream file stdout.txt from job=tensorflowjob task=task-00000
    Traceback (most recent call last):
    File "/mnt/c/Users/username/repos/batch-shipyard/shipyard.py", line 3136, in <module>
    cli()
    File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
    File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
    File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
    File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
    File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
    File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
    File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/decorators.py", line 64, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
    File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
    File "/mnt/c/Users/username/repos/batch-shipyard/shipyard.py", line 1968, in jobs_add
    convoy.fleet.action_jobs_add(
    File "/mnt/c/Users/username/repos/batch-shipyard/convoy/fleet.py", line 4065, in action_jobs_add
    batch.add_jobs(
    File "/mnt/c/Users/username/repos/batch-shipyard/convoy/batch.py", line 5892, in add_jobs
    stream_file_and_wait_for_task(
    File "/mnt/c/Users/username/repos/batch-shipyard/convoy/batch.py", line 3309, in stream_file_and_wait_for_task
    tfp = batch_client.file.get_properties_from_task(
    File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/azure/batch/operations/_file_operations.py", line 328, in get_properties_from_task
    raise models.BatchErrorException(self._deserialize, response)
    azure.batch.models._models_py3.BatchErrorException: Request encountered an exception.
    Code: None
    Message: None

Removing the resource_files section is enough to work around the issue, which is probably unsurprising: the given blob_source (https://raw.githubusercontent.com/tensorflow/models/master/tutorials/image/mnist/convolutional.py) returns a 404.
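For the first issue, the manual edit could be applied to every affected recipe in one pass. Below is a minimal sketch, not something shipped with batch-shipyard: the helper names `bump_sku` and `bump_recipes` are hypothetical, and it assumes the value always appears as a plain, unquoted `sku: 16.04-LTS` line as in the grep output above. I have only confirmed that 18.04-LTS works for TensorFlow-CPU, so other recipes may need further changes. Dockerfile lines such as `FROM ubuntu:16.04` are left alone on purpose, on the assumption that container base images are unrelated to the Marketplace SKU error.

```python
import re
from pathlib import Path
from typing import Dict, Tuple

# Matches a whole line such as "      sku: 16.04-LTS" and captures the
# indentation plus the "sku:" key so both can be written back unchanged.
SKU_LINE = re.compile(r"^([ \t]*sku:[ \t]*)16\.04-LTS[ \t]*$", re.MULTILINE)


def bump_sku(text: str, new_sku: str = "18.04-LTS") -> Tuple[str, int]:
    """Return (updated_text, number_of_lines_changed)."""
    return SKU_LINE.subn(lambda m: m.group(1) + new_sku, text)


def bump_recipes(root: Path, dry_run: bool = True) -> Dict[str, int]:
    """Update every pool.yaml below root; only report when dry_run is True."""
    changed = {}
    for path in sorted(root.rglob("pool.yaml")):
        original = path.read_text(encoding="utf-8")
        updated, count = bump_sku(original)
        if count:
            changed[str(path)] = count
            if not dry_run:
                path.write_text(updated, encoding="utf-8")
    return changed


if __name__ == "__main__":
    # Run from batch-shipyard/recipes; dry run only, no files are modified.
    for name, count in bump_recipes(Path("."), dry_run=True).items():
        print(f"{name}: {count} line(s) would change")
```

Running it with `dry_run=True` first lists the files that would change, which can be compared against the grep output before writing anything with `dry_run=False`.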