Curl returned error code 55 --- failed to save tensorflow checkpoint on S3

vbvg2008 commented 6 years ago

Please fill out the form below.

System Information

Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): TensorFlow
Framework Version: 1.11.0
Python Version: 2.7
CPU or GPU: GPU
Python SDK Version: default
Are you using a custom image: yes

Describe the problem

I was training my code on p3.8xlarge, and after several training steps, the curl error will pop up and cause the training to fail. I tried 3 times, the error occurred at 3900 steps for the first time, 4200 steps for the second time and at 7200 steps for the third time.

logs

tensorflow - Saving checkpoints for 7200 into s3://....../checkpoints/model.ckpt. tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 55 tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:137 : Unknown: : Unable to connect to endpoint ERROR - container_support.training - uncaught exception during training: : Unable to connect to endpoint

nadiaya commented 6 years ago

What is the size of the checkpoint?

The tensorflow plugin for S3 file system does not support checkpoints bigger than 5GB.

vbvg2008 commented 6 years ago

@nadiaya the size of checkpoint is ~90MB. It seems like this error happens in a random manner, every time I save checkpoint to S3, there's a chance this error might happen. For example, when I was saving my checkpoint every 300 straining steps, sometimes this error happens around 10,000 training steps, sometimes this error show up around 4,000 steps.

The training job fails after the error.

nadiaya commented 6 years ago

Would it be possible to see the full stack trace with the error? And training script you have been using?

Do you get the same error if running outside of SageMaker?

vbvg2008 commented 6 years ago

full stack trace:

2018-11-02 19:06:20,204 INFO - tensorflow - Saving checkpoints for 18600 into s3://....model.ckpt.
2018-11-02 19:06:20.388378: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 55
2018-11-02 19:06:20.388425: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-11-02 19:06:20.388437: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 0 ms before attempting again.
2018-11-02 19:06:20.446791: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 55
2018-11-02 19:06:20.446871: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-11-02 19:06:20.446887: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 50 ms before attempting again.
2018-11-02 19:06:20.564552: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 55
2018-11-02 19:06:20.564585: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-11-02 19:06:20.564596: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 100 ms before attempting again.
2018-11-02 19:06:20.812535: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 55
2018-11-02 19:06:20.812569: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-11-02 19:06:20.812582: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 200 ms before attempting again.
2018-11-02 19:06:21.088961: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 55
2018-11-02 19:06:21.089003: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-11-02 19:06:21.089017: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 400 ms before attempting again.
2018-11-02 19:06:21.549034: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 55
2018-11-02 19:06:21.549081: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-11-02 19:06:21.549093: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 800 ms before attempting again.
2018-11-02 19:06:22.440253: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 55
2018-11-02 19:06:22.440295: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-11-02 19:06:22.440306: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 1600 ms before attempting again.
2018-11-02 19:06:24.160717: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 55
2018-11-02 19:06:24.160758: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-11-02 19:06:24.160770: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 3200 ms before attempting again.
2018-11-02 19:06:27.438495: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 55
2018-11-02 19:06:27.438536: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-11-02 19:06:27.438550: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 6400 ms before attempting again.
2018-11-02 19:06:33.956772: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 55
2018-11-02 19:06:33.956812: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-11-02 19:06:33.956824: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 12800 ms before attempting again.
2018-11-02 19:06:46.823362: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 55
2018-11-02 19:06:46.823400: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-11-02 19:06:46.891715: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:137 : Unknown: : Unable to connect to endpoint
2018-11-02 19:06:47,350 ERROR - container_support.training - uncaught exception during training: : Unable to connect to endpoint
#011 [[{{node save/SaveV2}} = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](save/ShardedFilename, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, save/mul_10, save/mul_11, save/mul_12, save/mul_13, save/mul_14, save/mul_15, save/mul_16, save/mul_17, save/mul_18, save/mul_19)]]
Caused by op u'save/SaveV2', defined at:
File "/usr/local/bin/entry.py", line 28, in <module>
modes[mode]()
File "/usr/local/lib/python2.7/dist-packages/container_support/training.py", line 36, in start
fw.train()
File "/usr/local/lib/python2.7/dist-packages/tf_container/train_entry_point.py", line 173, in train
train_wrapper.train()
File "/usr/local/lib/python2.7/dist-packages/tf_container/trainer.py", line 73, in train
tf.estimator.train_and_evaluate(estimator=estimator, train_spec=train_spec, eval_spec=eval_spec)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
return executor.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 637, in run
getattr(self, task_to_run)()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 674, in run_master
self._start_distributed_training(saving_listeners=saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 788, in _start_distributed_training
saving_listeners=saving_listeners)
File "/opt/ml/code/extensions/estimator.py", line 339, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/opt/ml/code/extensions/estimator.py", line 1162, in _train_model
return self._train_model_distributed(input_fn, hooks, saving_listeners)
File "/opt/ml/code/extensions/estimator.py", line 1309, in _train_model_distributed
saving_listeners)
File "/opt/ml/code/extensions/estimator.py", line 1389, in _train_with_estimator_spec
log_step_count_steps=self._config.log_step_count_steps) as mon_sess:
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
return self._sess_creator.create_session()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 557, in create_session
self._scaffold.finalize()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 215, in finalize
self._saver.build()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1106, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1143, in _build
build_save=build_save, build_restore=build_restore)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 778, in _build_internal
save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 369, in _AddShardedSaveOps
return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 343, in _AddShardedSaveOpsForV2
sharded_saves.append(self._AddSaveOps(sharded_filename, saveables))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 284, in _AddSaveOps
save = self.save_op(filename_tensor, saveables)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 202, in save_op
tensors)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1690, in save_v2
shape_and_slices=shape_and_slices, tensors=tensors, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3272, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1768, in __init__
self._traceback = tf_stack.extract_stack()
UnknownError (see above for traceback): : Unable to connect to endpoint

There's no error running outside of Sagemaker because the checkpoint doesn't have to be uploaded to S3 bucket.

nadiaya commented 6 years ago

There are a few other known reasons that cause TensorFlow S3 File System Plugin issues:

does the S3 bucket, where the checkpoints are being written to, has object versioning turned on?
do you use encryption? client side or service side?

gdj0nes commented 5 years ago

@nadiaya We are getting the same problem using script mode with TF 1.11.0 with Python 3 using both the SageMaker S3 buckets and our own bucket by specifying a model_dir. The issues appears to happen in the process of writing the checkpoints. However it is able to write most the summaries before failing. We do not experience this error when running on our local machines and saving to the s3 bucket. Errors resulting from attempting to run the working model locally but reading from the SageMaker produced artifacts suggests the write did not finish because variables are not found in the checkpoint.

Furthermore the issue seems closely tied to the process of saving the model weights. A temp dir of the weight is built but the connection seems to exit before this is cleaned up.

The following error emerges after the initial train and eval construction

algo-1-X4BT1_1  | 2018-12-03 22:52:05.519559: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
algo-1-X4BT1_1  | 2018-12-03 22:52:05.519783: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
algo-1-X4BT1_1  | 2018-12-03 22:52:05.587283: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
algo-1-X4BT1_1  | 2018-12-03 22:52:05.587645: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
algo-1-X4BT1_1  | 2018-12-03 22:52:48.810402: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
algo-1-X4BT1_1  | 2018-12-03 22:52:48.810545: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
algo-1-X4BT1_1  | 2018-12-03 22:53:32.464510: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 28
algo-1-X4BT1_1  | 2018-12-03 22:53:32.464739: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
algo-1-X4BT1_1  | 2018-12-03 22:53:32.464848: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 0 ms before attempting again.
algo-1-X4BT1_1  | 2018-12-03 22:53:35.774847: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 28
algo-1-X4BT1_1  | 2018-12-03 22:53:35.774988: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
algo-1-X4BT1_1  | 2018-12-03 22:53:35.775155: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 50 ms before attempting again.
algo-1-X4BT1_1  | 2018-12-03 22:53:39.376903: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 28
algo-1-X4BT1_1  | 2018-12-03 22:53:39.376969: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
algo-1-X4BT1_1  | 2018-12-03 22:53:39.377067: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 100 ms before attempting again.
algo-1-X4BT1_1  | 2018-12-03 22:53:42.802456: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 28
algo-1-X4BT1_1  | 2018-12-03 22:53:42.802568: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
algo-1-X4BT1_1  | 2018-12-03 22:53:42.802729: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 200 ms before attempting again.
algo-1-X4BT1_1  | 2018-12-03 22:53:46.654457: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 28
algo-1-X4BT1_1  | 2018-12-03 22:53:46.654530: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
algo-1-X4BT1_1  | 2018-12-03 22:53:46.654636: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 400 ms before attempting again.
algo-1-X4BT1_1  | 2018-12-03 22:53:50.301838: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 28
algo-1-X4BT1_1  | 2018-12-03 22:53:50.301935: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
algo-1-X4BT1_1  | 2018-12-03 22:53:50.302105: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 800 ms before attempting again.
algo-1-X4BT1_1  | 2018-12-03 22:53:54.539857: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 28
algo-1-X4BT1_1  | 2018-12-03 22:53:54.540024: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
algo-1-X4BT1_1  | 2018-12-03 22:53:54.540099: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 1600 ms before attempting again.
algo-1-X4BT1_1  | 2018-12-03 22:53:59.827496: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 28
algo-1-X4BT1_1  | 2018-12-03 22:53:59.827554: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
algo-1-X4BT1_1  | 2018-12-03 22:53:59.827593: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 3200 ms before attempting again.
algo-1-X4BT1_1  | 2018-12-03 22:54:06.337327: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 28
algo-1-X4BT1_1  | 2018-12-03 22:54:06.337473: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
algo-1-X4BT1_1  | 2018-12-03 22:54:06.337673: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 6400 ms before attempting again.
algo-1-X4BT1_1  | 2018-12-03 22:54:15.885651: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 28
algo-1-X4BT1_1  | 2018-12-03 22:54:15.886023: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
algo-1-X4BT1_1  | 2018-12-03 22:54:15.886075: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 12800 ms before attempting again.
algo-1-X4BT1_1  | 2018-12-03 22:54:35.367398: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 28
algo-1-X4BT1_1  | 2018-12-03 22:54:35.367750: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
algo-1-X4BT1_1  | 2018-12-03 22:54:35.367859: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 0 ms before attempting again.
algo-1-X4BT1_1  | 2018-12-03 22:54:38.503032: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 28
algo-1-X4BT1_1  | 2018-12-03 22:54:38.503099: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
algo-1-X4BT1_1  | 2018-12-03 22:54:38.503137: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 50 ms before attempting again.
algo-1-X4BT1_1  | 2018-12-03 22:54:42.086329: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 28
algo-1-X4BT1_1  | 2018-12-03 22:54:42.086425: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
algo-1-X4BT1_1  | 2018-12-03 22:54:42.086582: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 100 ms before attempting again.
algo-1-X4BT1_1  | 2018-12-03 22:54:45.487994: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 28
algo-1-X4BT1_1  | 2018-12-03 22:54:45.488089: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
algo-1-X4BT1_1  | 2018-12-03 22:54:45.488241: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 200 ms before attempting again.
algo-1-X4BT1_1  | 2018-12-03 22:54:49.191080: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 28
algo-1-X4BT1_1  | 2018-12-03 22:54:49.191179: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
algo-1-X4BT1_1  | 2018-12-03 22:54:49.191343: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 400 ms before attempting again.
algo-1-X4BT1_1  | 2018-12-03 22:54:52.729371: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 28
algo-1-X4BT1_1  | 2018-12-03 22:54:52.729467: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
algo-1-X4BT1_1  | 2018-12-03 22:54:52.729659: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 800 ms before attempting again.
algo-1-X4BT1_1  | 2018-12-03 22:54:57.300598: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
algo-1-X4BT1_1  | 2018-12-03 22:54:57.300750: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.

nadiaya commented 5 years ago

Thank you so much! This is very useful information!

We do not experience this error when running on our local machines and saving to the s3 bucket.

How exactly are you running locally? Do you run just the script? or using SageMaker Python SDK 'local' instance type?

NKNY commented 5 years ago

Dear @nadiaya,

Have there been any updates regards finding the cause of the issue? For us this problem has been persisting for the last 3 weeks. It is hard to precisely quantify as on some rare days all training requests seem to succeed but on others more often that not the jobs fail at some point during training. There are a few different cases we’ve encountered (with associated error logs).

1) Training script dies when restoring from a checkpoint (curl 55). Checkpoint seems to be written properly. https://gist.github.com/NKNY/44de05c3b22d2803b561be171a5df011

2) Training script dies when writing a checkpoint at some point after the training had started (curl 55). Checkpoint writing is not finished. For example instead of model.ckpt-480 data, index and meta files being in the s3 model dir there is a folder model.ckpt-480_temp_5c7b0f1901f842568f42c989c49a4307 containing data files of the same sizes as previous checkpoints, a smaller-than-previous-timesteps index file and no meta file. https://gist.github.com/NKNY/4bd591341881567fcb2944fe158e568f

3) Training script dies when writing the checkpoint at timestep 0 (curl 28). This error, unlike the ones above, seems to have disappeared after setting S3_CONNECT_TIMEOUT_MSEC and S3_REQUEST_TIMEOUT_MSEC to ’60000’ via os.environ at the start of the training script. https://gist.github.com/NKNY/ed8084bd1ef9a019267877bcdc9cd655

4) When the input data was stored as numpy arrays (and then later read via tensorflow.data.Dataset.from_tensor_slices) it lead to a ~1GB graph.pbtxt file. When writing such a file during timestep 0 the training script would most (but not all) of the times die. We then converted the data to tfrecords, leading to a 2mb graph.pbtxt and no crashes when writing that file. https://gist.github.com/NKNY/72fec59e424d441b3b13c249ede51f6a

The first two types of failures can happen at any point during the training without having changed any of the training parameters e.g. when running the same script 3 times: first time an error happens 60 steps into the training, the second run can be 1500 steps and the third run can finish successfully (2000 steps).

Sagemaker script mode environment [EU (Ireland)]:

py_version: 'py3'
train_instance_count: 1
train_instance_type: 'ml.p2.xlarge'
framework_version: ‘1.12.0'

Size of checkpoints is 500-1000mb due to embeddings trained as a part of an RNN.

When executed on our own machines (outside of sagemaker environment, storing the input data and writing checkpoints locally without making any calls to s3) the training is always completed.

Please let me know if I can assist further. Thank you

andremoeller commented 5 years ago

@NKNY @vbvg2008 @gdj0nes

The SageMaker TensorFlow 1.12 release includes fixes to the S3 plugin that should solve most (possibly all) of these "Unable to connect to endpoint" issues.

Would it be possible for you to use framework_version '1.12'?

albertlim commented 5 years ago

@andremoeller where can we get the correct SageMaker TensorFlow 1.12 release, and could you let us know what the cause is? We're also running into the exact same issue.

andremoeller commented 5 years ago

@albertlim

It seems like the other user seems to already be using 1.12, but you would supply framework_version='1.12' argument to your TensorFlow estimator's constructor. I believe the cause is due to implementation problems in the S3 FileSystem: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/s3/s3_file_system.cc

I believe the 1.12 image has some fixes built into it that may reduce the occurrence of "Unable to connect to endpoint" issues.

albertlim commented 5 years ago

@andremoeller got it thanks. Unfortunately right now we aren't using Sagemaker, instead we're spinning up P3 EC2 instances, with the latest Nvidias Volta Deep Learning images, that has tensorflow-gpu==1.12.0+nv

Any recommendations? We just haven't looked into Sagemaker yet.

andremoeller commented 5 years ago

@albertlim

If the problem is the "Unable to connect to endpoint" problem, and you're writing or reading checkpoints to S3, then that TensorFlow installation is probably using the standard S3 FileSystem implementation. SageMaker included some fixes to the checkpoint behavior to make it more robust. Since SageMaker is a managed service, you won't need to maintain your EC2 instances, you can run many jobs at once, distribute training without a cluster manager, do HPO with a managed service, etc.

If you're interested in trying it out, here's how you'd move over to SageMaker: https://sagemaker.readthedocs.io/en/stable/using_tf.html#training-with-tensorflow. Two tips: if you already use the Dataset API, you don't have to read from the "channels", and it's easiest to go through the SageMaker console to create the IAM Role.

If you'd like to stick with EC2, you might want to look into "yas3fs" and mount your own S3 bucket -- I haven't tried it myself, but I've heard of others getting good results: https://github.com/danilop/yas3fs.

albertlim commented 5 years ago

@andremoeller thanks again for the detailed response!

Just to understand, seems like I can do what I'm currently doing on SageMaker? We already have our estimators, input_fn, and data query methods ready to go. It makes external DB calls to get large volumes of data already. All we need is a GPU server.

andremoeller commented 5 years ago

Sure thing, @albertlim

It seems like it -- SageMaker will just run the script you give it in a TensorFlow Docker container, and if you have more dependencies, the docs describe how to add those to your environment in SageMaker.

About the external DB calls and permissions: If the DB is an AWS service, you can query your DB directly from SageMaker as long as the IAM role you give to SageMaker has the permissions to do so. If it's some other DB, you might have to get your credentials into the SageMaker container when it runs your job before you make the query. Getting the data is easiest if it's in S3, in which case SageMaker can download your data before the job starts, and you can just read from local files.

albertlim commented 5 years ago

@andremoeller got it. We're starting to discuss using SageMaker. Until we get the go ahead to do so however, would it be possible to share the commit or pull request that shows the fix?

andremoeller commented 5 years ago

@albertlim

Unfortunately, I won't be able to share the diff with you. I can tell you that it replaces the bare S3 client with the TransferManager:

https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/examples-s3-transfermanager.html

Among other things, this lets the S3 plugin do multi-part uploads, and retry on failed parts, rather than fail atomically.

NKNY commented 5 years ago

Thank you for your suggestion @andremoeller! Indeed, I was already using framework_version='1.12'. From testing it appears that the issue I described above has now disappeared without me changing anything on my end. Thank you for your help!

nikhila0912 commented 4 years ago

@andremoeller Hi, I have recently started using the framework_version = '1.14' and came across this issue. Is it resolved for 1.12 only? Also is the framework_version pointing to the tensorflow version? or is it related to the image?

laurenyu commented 4 years ago

@nikhila0912 could you open a new issue in this repository? thanks!

aws / sagemaker-python-sdk