Azure / batch-shipyard

Simplify HPC and Batch workloads on Azure

cannot specify all start task failed nodes or unusable with a specific node id #249

Closed by veonua 5 years ago

veonua commented 5 years ago

Problem Description

  File "convoy/batch.py", line 825, in wait_for_pool_ready
  File "convoy/batch.py", line 680, in _block_for_nodes_ready
RuntimeError: Please inspect both the node status above and files found within the caffe-cpu//startup directory (in the current working directory) if available. If this error appears non-transient, please submit an issue on GitHub, if not you can delete these nodes with "pool nodes del --all-start-task-failed" first prior to the resize operation.
[30591] Failed to execute script shipyard

veon@ubuntu:~/work/InvoiceRecognition/services/config$ ../batch-shipyard-3.6.1-cli-linux-x86_64 pool nodes del --all-start-task-failed
Traceback (most recent call last):
  File "shipyard.py", line 2812, in <module>
  File "site-packages/click/core.py", line 764, in __call__
  File "site-packages/click/core.py", line 717, in main
  File "site-packages/click/core.py", line 1137, in invoke
  File "site-packages/click/core.py", line 1137, in invoke
  File "site-packages/click/core.py", line 1137, in invoke
  File "site-packages/click/core.py", line 956, in invoke
  File "site-packages/click/core.py", line 555, in invoke
  File "site-packages/click/decorators.py", line 64, in new_func
  File "site-packages/click/core.py", line 555, in invoke
  File "shipyard.py", line 1815, in nodes_del
  File "convoy/fleet.py", line 3576, in action_pool_nodes_del
ValueError: cannot specify all start task failed nodes or unusable with a specific node id
[30682] Failed to execute script shipyard

Batch Shipyard Version

3.6.1

Steps to Reproduce

I just tried to run my pool.

Expected Results

The pool can be cleaned up with the command suggested in the error message.

Actual Results

cannot specify all start task failed nodes or unusable with a specific node id

alfpark commented 5 years ago

This issue was identified and fixed in master@17e26f091b92a8606bec1492f9948a0c02ba08af.

Workarounds:

  1. Get the fix via the latest master or the Docker CLI image.
  2. Delete nodes using Batch Explorer or the Portal.
  3. Delete the entire pool.
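
For anyone who would rather script workaround 2 than click through Batch Explorer or the Portal, here is a rough sketch of the same node removal done through the azure-batch Python SDK. This is not part of Batch Shipyard and was not posted in this thread. It assumes you already have an authenticated `azure.batch.BatchServiceClient`; the helper names `select_failed_node_ids` and `remove_failed_nodes` and the pool id in the demo are made up for illustration.

```python
from collections import namedtuple

# Node state strings as reported by the Azure Batch service.
FAILED_STATES = frozenset({"starttaskfailed", "unusable"})


def select_failed_node_ids(nodes, states=FAILED_STATES):
    """Return the ids of nodes whose state is in `states`.

    `node.state` may be a plain string or an enum member exposing `.value`.
    """
    selected = []
    for node in nodes:
        state = getattr(node.state, "value", node.state)
        if state in states:
            selected.append(node.id)
    return selected


def remove_failed_nodes(batch_client, pool_id):
    """Remove start-task-failed and unusable nodes from `pool_id`.

    Assumption: `batch_client` is an authenticated
    azure.batch.BatchServiceClient. The SDK import is deferred so the
    selection helper can be tried without the SDK installed.
    """
    import azure.batch.models as batchmodels

    node_ids = select_failed_node_ids(batch_client.compute_node.list(pool_id))
    if node_ids:
        batch_client.pool.remove_nodes(
            pool_id, batchmodels.NodeRemoveParameter(node_list=node_ids))
    return node_ids


if __name__ == "__main__":
    # Dry run against stand-in node objects (hypothetical ids and states).
    Node = namedtuple("Node", ["id", "state"])
    sample = [
        Node("tvm-1", "idle"),
        Node("tvm-2", "starttaskfailed"),
        Node("tvm-3", "unusable"),
    ]
    print(select_failed_node_ids(sample))
```

With a real client the call would look like `remove_failed_nodes(client, "my-pool")`. Note that the service accepts at most 100 node ids per remove request, so a larger pool would need several passes, waiting for the pool to return to a steady allocation state between them.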