Fix error handling, allow hosts to come up within 180 seconds

Issue #, if available:

If multiple hosts come up, and one or more takes ~10 seconds or more to come up than the primary host, this code times out as the primary tries to get status from all the hosts:

2020-09-10T04:12:27.334-07:00   09-10 11:12 urllib3.connectionpool WARNING Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7f8c1c2210>: Failed to establish a new connection: [Errno 111] Connection refused')': /
2020-09-10T04:12:29.334-07:00   09-10 11:12 urllib3.connectionpool WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7f8c1c2b10>: Failed to establish a new connection: [Errno 111] Connection refused')': /
2020-09-10T04:12:33.335-07:00   09-10 11:12 urllib3.connectionpool WARNING Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7f8c1ccd10>: Failed to establish a new connection: [Errno 111] Connection refused')': /
2020-09-10T04:12:41.338-07:00   09-10 11:12 urllib3.connectionpool WARNING Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7f8c1cc750>: Failed to establish a new connection: [Errno 111] Connection refused')': /
2020-09-10T04:12:41.338-07:00
09-10 11:12 smspark-submit INFO     exiting with code 1: Algorithm Error: (caused by ConnectionError): Exception while getting status for host algo-2: HTTPConnectionPool(host='algo-2', port=5555): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7f8c1ccbd0>: Failed to establish a new connection: [Errno 111] Connection refused'))

That's about 14 seconds of retries. Which means the intended timeout of 180s isn't reached.

I could consistently reproduce this by adding this code:

        if not self._is_primary_host:
            # simulate starting up slower than primary.
            sleep_time = 150
            self.logger.info(f"sleeping for {sleep_time} seconds")
            time.sleep(sleep_time)

here:

https://github.com/aws/sagemaker-spark-container/blob/9093da731b01e942980ffbff7f224c0a079c24ba/src/smspark/job.py#L84-L95

Description of changes:

Two changes:

(1) Catch the ConnectionError in the predicate function, turn it into False so the retry is hit. (2) The status_client no longer wraps the exception with an AlgorithmError before reraising it. This means caller doesn't have to unwrap the exception before handling it. (Otherwise, the code change (1) would have to catch an AlgorithmError.)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

aws / sagemaker-spark-container

Fix error handling, allow hosts to come up within 180 seconds #17