DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Cluster scaler terminates entirely on a spot request failure #1699

Open joelarmstrong opened 7 years ago

joelarmstrong commented 7 years ago

Currently the AWS provisioner will terminate the entire workflow if it hits the spot request limit:

Exception in thread preemptable-scaler:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/local/lib/python2.7/dist-packages/bd2k/util/threading.py", line 51, in run
    self.tryRun( )
  File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/clusterScaler.py", line 462, in tryRun
    preemptable=self.preemptable)
  File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/abstractProvisioner.py", line 177, in setNodeCount
    preemptable=preemptable)
  File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/aws/awsProvisioner.py", line 581, in _addNodes
    tentative=True)
  File "/usr/local/lib/python2.7/dist-packages/cgcloud/lib/ec2.py", line 356, in create_spot_instances
    requests = ec2.request_spot_instances( price, image_id, count=num_instances, **spec )
  File "/usr/local/lib/python2.7/dist-packages/boto/ec2/connection.py", line 1638, in request_spot_instances
    verb='POST')
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 1186, in get_list
    raise self.ResponseError(response.status, response.reason, body)
EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>MaxSpotInstanceCountExceeded</Code><Message>Max spot instance count exceeded</Message></Error></Errors><RequestID>e0d171f5-f10d-4319-b3c3-6571fb2a9462</RequestID></Response>

EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>MaxSpotInstanceCountExceeded</Code><Message>Max spot instance count exceeded</Message></Error></Errors><RequestID>e0d171f5-f10d-4319-b3c3-6571fb2a9462</RequestID></Response>
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/clusterScaler.py", line 292, in check
    scalerThread.join(timeout=0)
  File "/usr/local/lib/python2.7/dist-packages/bd2k/util/threading.py", line 51, in run
    self.tryRun( )
  File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/clusterScaler.py", line 462, in tryRun
    preemptable=self.preemptable)
  File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/abstractProvisioner.py", line 177, in setNodeCount
    preemptable=preemptable)
  File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/aws/awsProvisioner.py", line 581, in _addNodes
    tentative=True)
  File "/usr/local/lib/python2.7/dist-packages/cgcloud/lib/ec2.py", line 356, in create_spot_instances
    requests = ec2.request_spot_instances( price, image_id, count=num_instances, **spec )
  File "/usr/local/lib/python2.7/dist-packages/boto/ec2/connection.py", line 1638, in request_spot_instances
    verb='POST')
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 1186, in get_list
    raise self.ResponseError(response.status, response.reason, body)
EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>MaxSpotInstanceCountExceeded</Code><Message>Max spot instance count exceeded</Message></Error></Errors><RequestID>e0d171f5-f10d-4319-b3c3-6571fb2a9462</RequestID></Response>
Waiting for workers to shutdown
Forcing provisioner to reduce cluster size to zero.

I'd suggest that we instead log a warning, (possibly) decrease the number of requested instances, and keep trying without killing the workflow. Users can easily exceed their limit without realizing it, especially if they share an AWS account or have a new one. Unfortunately I can't submit a patch for this because I can't test it: my spot limit is 0 thanks to the AWS account reshuffle (not that I'm bitter about that :)).
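A minimal sketch of the warn, reduce, and retry behavior suggested above. It is not Toil's actual code: `FakeEC2Error` is a hypothetical stand-in for boto's `EC2ResponseError`, `request_with_backoff` and `request_fn` are made-up names, and halving the count is only one possible reduction policy. The only detail taken from the traceback is the `MaxSpotInstanceCountExceeded` error code.

```python
import logging

log = logging.getLogger(__name__)

# Error code copied from the traceback in this issue.
SPOT_LIMIT_CODE = "MaxSpotInstanceCountExceeded"


class FakeEC2Error(Exception):
    """Hypothetical stand-in for boto's EC2ResponseError (assumption)."""

    def __init__(self, error_code):
        super().__init__(error_code)
        self.error_code = error_code


def request_with_backoff(request_fn, wanted):
    """Call request_fn(count), halving count whenever the spot limit is hit.

    Returns the number of instances actually requested (0 if none could be).
    Any error other than the spot-limit error is re-raised unchanged.
    """
    count = wanted
    while count > 0:
        try:
            request_fn(count)
            return count
        except FakeEC2Error as e:
            if e.error_code != SPOT_LIMIT_CODE:
                raise
            log.warning("Spot limit hit requesting %d instances; trying %d",
                        count, count // 2)
            count //= 2
    # Nothing could be requested; the caller keeps running and can retry later.
    return 0
```

With something like this, the scaler thread would see a smaller (possibly zero) node count on a limit error rather than an uncaught exception, so the workflow would keep running.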

Issue is synchronized with this Jira Story. Issue Number: TOIL-169

adamnovak commented 5 years ago

When this error occurs, Toil does not stop the running cluster nodes (see #2196). That makes this bug extremely dangerous.

unito-bot commented 2 years ago

➤ Melaina Legaspi commented:

Marking this ticket as low priority; we haven’t addressed it in many years.

unito-bot commented 2 years ago

➤ Melaina Legaspi commented:

Adam Novak: “This needs to be reproduced, and the best approach would be to mock the spot market.”
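One way mocking the spot market might look, using the standard library's `unittest.mock`. This is a sketch under stated assumptions, not Toil's test suite: `SpotLimitError`, `add_nodes`, and `make_mock_ec2` are hypothetical names, and the mocked `request_spot_instances` takes only `count`, a simplification of the boto call shown in the traceback.

```python
from unittest import mock


class SpotLimitError(Exception):
    """Hypothetical stand-in for boto's EC2ResponseError (assumption)."""
    error_code = "MaxSpotInstanceCountExceeded"


def add_nodes(ec2, count):
    """Hypothetical scaler step: return nodes added, or 0 if the limit is hit."""
    try:
        ec2.request_spot_instances(count=count)
    except SpotLimitError:
        return 0
    return count


def make_mock_ec2(limit):
    """Build a mock EC2 connection that rejects requests above `limit`."""
    def fake_request(count):
        if count > limit:
            raise SpotLimitError()
        # Made-up request IDs; real IDs come from AWS.
        return ["sir-%d" % i for i in range(count)]

    ec2 = mock.Mock()
    ec2.request_spot_instances.side_effect = fake_request
    return ec2
```

A test could then build `make_mock_ec2(0)` to reproduce the zero-limit account described in this issue and assert that the scaler step returns instead of raising, without touching AWS.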