Following the TensorFlow CPU quickstart, I ran into a couple of issues.
When creating the pool, I get a
RuntimeError: Could not find an Azure Batch Node Agent Sku for this offer=ubuntuserver publisher=canonical sku=16.04-lts. You can list the valid and available Marketplace images with the command: account images
From a look at the Azure Portal, it looks like only 18.04 is currently available; indeed, changing pool.yml to use 18.04-LTS instead is enough to get rid of this issue. This probably affects many of the bundled recipes.
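To get a rough idea of how many recipes are affected, something like the following could be used. It is only a sketch: it assumes the pool configs are named pool.yml or pool.yaml and contain a plain `sku: ...` line, and the function name `find_pools_with_sku` is made up for illustration, not part of Batch Shipyard.

```python
import re
from pathlib import Path

# Matches a YAML line such as "sku: 16.04-LTS" (optional quotes around the value).
SKU_LINE = re.compile(r"^\s*sku:\s*['\"]?(?P<sku>[^'\"\s#]+)", re.IGNORECASE)


def find_pools_with_sku(recipes_root, sku="16.04-LTS"):
    """Return sorted paths of pool.yml/pool.yaml files under recipes_root that use sku.

    Hypothetical helper: assumes one plain "sku:" line per pool config.
    """
    hits = []
    for path in Path(recipes_root).rglob("pool.y*ml"):
        for line in path.read_text(encoding="utf-8").splitlines():
            match = SKU_LINE.match(line)
            if match and match.group("sku").lower() == sku.lower():
                hits.append(path)
                break  # one hit per file is enough
    return sorted(hits)
```

Run from the repository root, `find_pools_with_sku("recipes")` would list the pool configs that still reference 16.04-LTS (assuming the recipes live in a `recipes` directory).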
After the pool is created and I try to create the included job, I get another error:
$ ../shipyard jobs add --tail stdout.txt
2021-09-16 10:16:30.581 INFO - Adding job tensorflowjob to pool tensorflow-cpu
2021-09-16 10:16:30.673 DEBUG - constructing 1 task specifications for submission to job tensorflowjob
2021-09-16 10:16:30.738 DEBUG - submitting 1 task specifications to job tensorflowjob
2021-09-16 10:16:30.741 DEBUG - submitting 1 tasks (0 -> 0) to job tensorflowjob
2021-09-16 10:16:30.971 INFO - submitted all 1 tasks to job tensorflowjob
2021-09-16 10:16:30.971 DEBUG - attempting to stream file stdout.txt from job=tensorflowjob task=task-00000
Traceback (most recent call last):
File "/mnt/c/Users/username/repos/batch-shipyard/shipyard.py", line 3136, in <module>
cli()
File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/decorators.py", line 64, in new_func
return ctx.invoke(f, obj, *args, **kwargs)
File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/mnt/c/Users/username/repos/batch-shipyard/shipyard.py", line 1968, in jobs_add
convoy.fleet.action_jobs_add(
File "/mnt/c/Users/username/repos/batch-shipyard/convoy/fleet.py", line 4065, in action_jobs_add
batch.add_jobs(
File "/mnt/c/Users/username/repos/batch-shipyard/convoy/batch.py", line 5892, in add_jobs
stream_file_and_wait_for_task(
File "/mnt/c/Users/username/repos/batch-shipyard/convoy/batch.py", line 3309, in stream_file_and_wait_for_task
tfp = batch_client.file.get_properties_from_task(
File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/azure/batch/operations/_file_operations.py", line 328, in get_properties_from_task
raise models.BatchErrorException(self._deserialize, response)
azure.batch.models._models_py3.BatchErrorException: Request encountered an exception.
Code: None
Message: None
Removing the resource_files section is enough to take care of the issue; this is probably unsurprising, as the given blob_source (https://raw.githubusercontent.com/tensorflow/models/master/tutorials/image/mnist/convolutional.py) 404s.
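For what it's worth, a dead blob_source like this can be caught before submitting the job with a quick check along these lines. This is a minimal sketch using only the Python standard library; `url_status` is a hypothetical helper (not something Batch Shipyard provides), and the `opener` parameter is only there so the check can be exercised without network access.

```python
import urllib.error
import urllib.request


def url_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code of a HEAD request to url, or None if unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 when the file no longer exists
    except urllib.error.URLError:
        return None  # DNS failure, refused connection, and similar


if __name__ == "__main__":
    source = (
        "https://raw.githubusercontent.com/tensorflow/models/"
        "master/tutorials/image/mnist/convolutional.py"
    )
    print(source, "->", url_status(source))
```

Anything other than 200 for a blob_source suggests the corresponding resource_files entry needs to be fixed or removed.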