aws-samples / eks-kubeflow-workshop

Kubeflow workshop on EKS, focused mainly on AWS integration examples. See the Kubeflow website (http://kubeflow.org) for other examples.
Apache License 2.0

AWS access error when running training on Kubernetes pods #22

Closed plaffitte closed 4 years ago

plaffitte commented 4 years ago

Hi all, I'm trying to run the Kubeflow example provided in the workshop (https://eksworkshop.com/advanced/420_kubeflow/training/), but I'm stuck on an error. The pod fails after I create it with the following command:

kubectl create -f mnist-training.yaml

The error is the following:

`Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
32768/29515 [=================================] - 0s 1us/step
40960/29515 [=========================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
26427392/26421880 [==============================] - 0s 0us/step
26435584/26421880 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
16384/5148 [===============================================================================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz
4423680/4422102 [==============================] - 0s 0us/step
4431872/4422102 [==============================] - 0s 0us/step
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py:435: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-12-19 15:48:33.163833: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-12-19 15:48:33.185216: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499990000 Hz
2019-12-19 15:48:33.185530: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3758280 executing computations on platform Host. Devices:
2019-12-19 15:48:33.185554: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
2019-12-19 15:48:33.229507: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing config loader against fileName /root//.aws/config and using profilePrefix = 1
2019-12-19 15:48:33.229532: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing config loader against fileName /root//.aws/credentials and using profilePrefix = 0
2019-12-19 15:48:33.229543: I tensorflow/core/platform/s3/aws_logging.cc:54] Setting provider to read credentials from /root//.aws/credentials for credentials file and /root//.aws/config for the config file , for use with profile default
2019-12-19 15:48:33.229555: I tensorflow/core/platform/s3/aws_logging.cc:54] Creating HttpClient with max connections2 and scheme http
2019-12-19 15:48:33.229567: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing CurlHandleContainer with size 2
2019-12-19 15:48:33.229578: I tensorflow/core/platform/s3/aws_logging.cc:54] Creating Instance with default EC2MetadataClient and refresh rate 900000
2019-12-19 15:48:33.229593: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-12-19 15:48:33.229627: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing CurlHandleContainer with size 25
2019-12-19 15:48:33.229713: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-12-19 15:48:33.229817: I tensorflow/core/platform/s3/aws_logging.cc:54] Pool grown by 2
2019-12-19 15:48:33.229833: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-12-19 15:48:33.244732: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 301
2019-12-19 15:48:33.244762: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2019-12-19 15:48:33.244811: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-12-19 15:48:33.244891: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-12-19 15:48:33.246346: W tensorflow/core/platform/s3/aws_logging.cc:57] Encountered Unknown AWSError PermanentRedirect The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.:
2019-12-19 15:48:33.246373: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2019-12-19 15:48:33.246457: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-12-19 15:48:33.246539: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-12-19 15:48:33.247445: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 301
2019-12-19 15:48:33.247466: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2019-12-19 15:48:33.247500: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-12-19 15:48:33.247575: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-12-19 15:48:33.248619: W tensorflow/core/platform/s3/aws_logging.cc:57] Encountered Unknown AWSError PermanentRedirect The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.:
2019-12-19 15:48:33.248646: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2019-12-19 15:48:33.248684: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-12-19 15:48:33.248754: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-12-19 15:48:33.249817: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 301
2019-12-19 15:48:33.249838: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2019-12-19 15:48:33.249873: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-12-19 15:48:33.249942: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-12-19 15:48:33.250905: W tensorflow/core/platform/s3/aws_logging.cc:57] Encountered Unknown AWSError PermanentRedirect The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.:
2019-12-19 15:48:33.250930: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2019-12-19 15:48:33.250967: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-12-19 15:48:33.251040: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-12-19 15:48:33.251994: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 301
2019-12-19 15:48:33.252014: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2019-12-19 15:48:33.252048: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-12-19 15:48:33.252115: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-12-19 15:48:33.252729: W tensorflow/core/platform/s3/aws_logging.cc:57] Encountered Unknown AWSError PermanentRedirect The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.:
2019-12-19 15:48:33.252754: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2019-12-19 15:48:33.252793: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-12-19 15:48:33.252860: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-12-19 15:48:33.253469: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 301
2019-12-19 15:48:33.253491: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.

train_images.shape: (60000, 28, 28, 1), of float64
test_images.shape: (10000, 28, 28, 1), of float64

Layer (type)                 Output Shape              Param #
Conv1 (Conv2D)               (None, 13, 13, 8)         80
flatten (Flatten)            (None, 1352)              0
Softmax (Dense)              (None, 10)                13530

Total params: 13,610
Trainable params: 13,610
Non-trainable params: 0


Traceback (most recent call last):
  File "mnist.py", line 89, in <module>
    main()
  File "mnist.py", line 82, in main
    model = train(train_images, train_labels, args.epochs, args.model_summary_path)
  File "mnist.py", line 51, in train
    model.fit(train_images, train_labels, epochs=epochs, callbacks=[tensorboard_callback])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training.py", line 880, in fit
    validation_steps=validation_steps)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training_arrays.py", line 215, in model_iteration
    mode=mode)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 106, in configure_callbacks
    callback_list.set_model(callback_model)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 178, in set_model
    callback.set_model(model)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 1010, in set_model
    self._init_writer()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 947, in _init_writer
    self.writer = tf_summary.FileWriter(self.log_dir, K.get_session().graph)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/writer/writer.py", line 367, in __init__
    filename_suffix)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/writer/event_file_writer.py", line 67, in __init__
    gfile.MakeDirs(self._logdir)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 442, in recursive_create_dir
    recursive_create_dir_v2(dirname)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 458, in recursive_create_dir_v2
    pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(path), status)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: : No response body. Response code: 301`

And my config file looks like this:

`apiVersion: v1
kind: Pod
metadata:
  name: mnist-training
  labels:
    app: mnist
    type: training
    framework: tensorflow
spec:
  restartPolicy: OnFailure
  containers:

Does anyone have an idea what this issue relates to?

plaffitte commented 4 years ago

After some searching, I have gathered that the problem seems to be related to the endpoint not being specified properly, as the error message says: "The bucket you are attempting to access must be addressed using the specified endpoint."

It seems that whatever S3 client TensorFlow uses internally is not formatting the S3 endpoint correctly.
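
For anyone hitting the same 301: S3 usually returns `PermanentRedirect` when a request reaches an endpoint that does not serve the bucket, commonly because of a region mismatch. TensorFlow 1.x's S3 filesystem reads its settings from environment variables such as `AWS_REGION`, `S3_ENDPOINT` and `S3_USE_HTTPS`. The sketch below only builds those values for a given region. The helper name `s3_env_for_region` and the region `us-west-2` are made up for illustration, and whether this applies to the workshop manifest is an assumption, not something confirmed in this thread.

```python
import os


def s3_env_for_region(region, use_https=True):
    """Build environment variables read by TensorFlow 1.x's S3 filesystem.

    Hypothetical helper: the regional endpoint pattern
    s3.<region>.amazonaws.com is assumed to match the bucket's region.
    """
    return {
        "AWS_REGION": region,
        "S3_ENDPOINT": "s3.{}.amazonaws.com".format(region),
        "S3_USE_HTTPS": "1" if use_https else "0",
    }


if __name__ == "__main__":
    # "us-west-2" is a placeholder; use the region the bucket lives in.
    os.environ.update(s3_env_for_region("us-west-2"))
    print(os.environ["S3_ENDPOINT"])
```

In a pod spec, the same values would go under the container's `env` entries instead of being set from Python. They need to be in place before TensorFlow first touches an `s3://` path.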

plaffitte commented 4 years ago

Found the issue: the variable pointing to the S3 bucket was wrong. I had written it as s3://<my-bucket-name> when it should have been <my-bucket-name>.s3.amazonaws.com
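
To make the difference between the two forms concrete, here is a minimal sketch that converts an `s3://` URI into a virtual-hosted-style host name. It assumes the standard AWS global endpoint `s3.amazonaws.com`. The function names `split_s3_uri` and `virtual_hosted_host` are hypothetical, and the thread does not say which variable in the manifest expects which form.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key/path' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an s3:// URI: {!r}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


def virtual_hosted_host(uri, endpoint="s3.amazonaws.com"):
    """Return the '<bucket>.<endpoint>' host for an s3:// URI.

    The default endpoint is an assumption; a regional endpoint such as
    s3.<region>.amazonaws.com can be passed instead.
    """
    bucket, _key = split_s3_uri(uri)
    return "{}.{}".format(bucket, endpoint)


if __name__ == "__main__":
    # "my-bucket-name" stands in for the real bucket name.
    print(virtual_hosted_host("s3://my-bucket-name/tensorboard/logs"))
```

A setting that expects a host name gets the second form, while paths handed to TensorFlow's file APIs (for example a TensorBoard log directory) keep the `s3://` form.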