Closed dbgordon closed 4 years ago
From troubleshooting so far this appears to be an issue when the container starts up to run the tensorflow command as it is having issues writing to the S3 bucket after the training.
Thanks @dbgordon for creating this issue. Can you check if CONFIG_FILE has proper values? We use following sed commands to replace environment variables here during Kubeflow install part (https://eksworkshop.com/kubeflow/install/#customize-your-configuration). I’d check the CONFIG_FILE and see if NodeInstaceRole is actually replaced by your ROLE_NAME. You can do ‘echo $ROLE_NAME’ and validate values to confirm. Kubeflow installs fine with default values and I suspect that’s the culprit of the error. To recover, you’ll have to uninstall and re-install with proper values�. Let me know if this helps
Thanks @dalbhanj . I'm not sure which config file you are referring to. I installed Kubeflow into the Cloud9 environment as the workshop instructions suggested don't see a CONFIG_FILE anywhere.
Regardless, we are seeing this problem completely outside of KubeFlow and outside of the lab just going from AWS EKS container (not even the TensorFlow one) to an S3 bucket. So, seems there might be a more fundamental problem going on.
Earlier you mentioned that you guys saw this problem show up when running the workshop at KubeCon. Did it show up with everyone or just some people?
Look for this file. I think its under eksworkshop-eksctl folder 'export CONFIG_FILE=${KF_DIR}/kfctl_aws.0.7.0.yaml'
You can paste the content of the file here The issue is intermittent and not everyone saw it at Kubecon.
Rebuilding since my environment got messed up with all this experimenting.
Regardless, we've made progress. It appears that pulling the credentials from the environment variables results in this failure behavior. If you pull them from a file the problem appears to no longer show.
Look for this file. I think its under eksworkshop-eksctl folder 'export CONFIG_FILE=${KF_DIR}/kfctl_aws.0.7.0.yaml'
You can paste the content of the file here The issue is intermittent and not everyone saw it at Kubecon.
Here is my config file contents. It does not include the replaced values, as you suspected. This is after a second install from scratch:
apiVersion: kfdef.apps.kubeflow.org/v1beta1 kind: KfDef metadata: creationTimestamp: null name: kubeflow-aws namespace: kubeflow spec: applications:
I believe I see why the replacement with my $ROLE_NAME didn't happen. Because, as you can see in the un-altered config file above, the "role" to be replaced (that the sed command is looking for, is set to "eksctl-kubeflow-aws-nodegroup-ng-a2-NodeInstanceRole-xxxxxxx". Note that it is "eksctl-KUBEFLOW. . .".
The sed command we are running in the instructions is: sed -i "s@eksctl-eksworkshop-eksctl-nodegroup-ng-a2-NodeInstanceRole-xxxxxxx@$ROLE_NAME@" ${CONFIG_FILE}
Note: "eksctl-EKSWORKSHOP. . ."
This is unrelated to Kubeflow setup. This chapter (https://eksworkshop.com/kubeflow/training/) is supposed to work on EKS without Kubeflow.
2019-11-24 07:08:50.764524: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 12800 ms before attempting again.
2019-11-24 07:09:03.564645: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-11-24 07:09:03.564756: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-11-24 07:09:03.569469: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 6
2019-11-24 07:09:03.569527: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
Traceback (most recent call last):
File "mnist.py", line 89, in <module>
main()
File "mnist.py", line 82, in main
model = train(train_images, train_labels, args.epochs, args.model_summary_path)
File "mnist.py", line 51, in train
train_images.shape: (60000, 28, 28, 1), of float64
test_images.shape: (10000, 28, 28, 1), of float64
_________________________________________________________________
Layer (type) Output Shape Param #
apiVersion: v1
=================================================================
Conv1 (Conv2D) (None, 13, 13, 8) 80
_________________________________________________________________
flatten (Flatten) (None, 1352) 0
_________________________________________________________________
Softmax (Dense) (None, 10) 13530
=================================================================
Total params: 13,610
Trainable params: 13,610
Non-trainable params: 0
_________________________________________________________________
model.fit(train_images, train_labels, epochs=epochs, callbacks=[tensorboard_callback])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training.py", line 880, in fit
validation_steps=validation_steps)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training_arrays.py", line 215, in model_iteration
mode=mode)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 106, in configure_callbacks
callback_list.set_model(callback_model)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 178, in set_model
callback.set_model(model)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 1010, in set_model
self._init_writer()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 947, in _init_writer
self.writer = tf_summary.FileWriter(self.log_dir, K.get_session().graph)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/writer/writer.py", line 367, in __init__
filename_suffix)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/writer/event_file_writer.py", line 67, in __init__
gfile.MakeDirs(self._logdir)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 442, in recursive_create_dir
recursive_create_dir_v2(dirname)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 458, in recursive_create_dir_v2
pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(path), status)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
We brought the problem in this PR. https://github.com/aws-samples/eks-workshop/pull/522/files#diff-3d8576a7d473c236f821a0d67193b0b2L61-L62
The behavior is a little bit different here.
$ jq -r .AccessKey.AccessKeyId /tmp/create_output.json
AKIAVCDTNILGAGD6IPPZ
$ echo -n 'AKIAVCDTNILGAGD6IPPZ'
AKIAVCDTNILGAGD6IPPZ%
I am using zsh here and %
here means unterminated lines in a command's output.
If we do base64 we can see that, we can see the difference.
$ jq -r .AccessKey.AccessKeyId /tmp/create_output.json | base64
QUtJQVZDRFROSUxHQUdENklQUFoK
$ echo -n 'AKIAVCDTNILGAGD6IPPZ' | base64
QUtJQVZDRFROSUxHQUdENklQUFo=
Check echo doc,
The following option is available:
-n Do not print the trailing newline character. This may also be achieved by appending
`\c' to the end of the string, as is done by iBCS2 compatible systems. Note that this
option as well as the effect of `\c' are implementation-defined in IEEE Std
1003.1-2001 (``POSIX.1'') as amended by Cor. 1-2002. Applications aiming for maximum
portability are strongly encouraged to use printf(1) to suppress the newline charac-
ter.
In kubernetes, we have not to print the trailing new line to create secret. https://kubernetes.io/docs/concepts/configuration/secret/#creating-a-secret-manually
The easiest way is to use -j
to replace -r
option in jq.
--join-output / -j: Like -r but jq won’t print a newline after each output.
@dalbhanj @arun-gupta @otterley
@Jeffwan thanks for bringing this up. Al though, @dbgordon issue seems to be related to Cluster name, because substitution is hard coded to eksworkshop clustername.
Regarding the problem, can we create a new issue to address it? Does this issue only occur when you use zsh? Workshop uses Cloud9 IDE and is OS-agnostic (OS specific instructions are not provided)
Whoops! Sorry about that! (I could have sworn I tested it before making the PR.)
@Jeffwan thanks for bringing this up. Al though, @dbgordon issue seems to be related to Cluster name, because substitution is hard coded to eksworkshop clustername.
Regarding the problem, can we create a new issue to address it? Does this issue only occur when you use zsh? Workshop uses Cloud9 IDE and is OS-agnostic (OS specific instructions are not provided)
@dalbhanj
Most of attendees meet this problem and jq, echo behavior is same in cloud9 or mac. The reason I mention I am using zsh because the display is different, %
is only in zsh
If there's an issue's there. Let me know. We can put PR link there. I hope someone else can double test it.
@otterley Not a problem, this is kind of tricky and not easy to detect.
@Jeffwan Confirmed! You hit the nail on this issue. I was able to reproduce and once I used your PR (https://github.com/aws-samples/eks-workshop/pull/543), I was able to resolve it.
Closing since this is resolved
Working with the ML workshop and in the 2nd module on Model Training, we're running into an error message when trying to start up the training container using the workshop code snippet:
curl -LO https://eksworkshop.com/kubeflow/kubeflow.files/mnist-training.yaml envsubst < mnist-training.yaml | kubectl create -f -
Two of us have seen this in two different settings: when just plain old running through the lab as is, in Cloud9, and also while running the same scenario through GitLab runners.
Error message is:
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz 32768/29515 [=================================] - 0s 0us/step 40960/29515 [=========================================] - 0s 0us/step Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz 26427392/26421880 [==============================] - 0s 0us/step 26435584/26421880 [==============================] - 0s 0us/step Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz 16384/5148 [===============================================================================================] - 0s 0us/step Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz 4423680/4422102 [==============================] - 0s 0us/step 4431872/4422102 [==============================] - 0s 0us/step WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py:435: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer. 2019-11-21 18:30:18.140945: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA 2019-11-21 18:30:18.144825: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2500000000 Hz 2019-11-21 18:30:18.145045: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x31383d0 executing computations on platform Host. Devices: 2019-11-21 18:30:18.145081: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0):,
2019-11-21 18:30:18.193253: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing config loader against fileName /root//.aws/config and using profilePrefix = 1
2019-11-21 18:30:18.193298: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing config loader against fileName /root//.aws/credentials and using profilePrefix = 0
2019-11-21 18:30:18.193310: I tensorflow/core/platform/s3/aws_logging.cc:54] Setting provider to read credentials from /root//.aws/credentials for credentials file and /root//.aws/config for the config file , for use with profile default
2019-11-21 18:30:18.193322: I tensorflow/core/platform/s3/aws_logging.cc:54] Creating HttpClient with max connections2 and scheme http
2019-11-21 18:30:18.193470: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing CurlHandleContainer with size 2
2019-11-21 18:30:18.193489: I tensorflow/core/platform/s3/aws_logging.cc:54] Creating Instance with default EC2MetadataClient and refresh rate 900000
2019-11-21 18:30:18.193507: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-11-21 18:30:18.193553: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing CurlHandleContainer with size 25
2019-11-21 18:30:18.193605: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-11-21 18:30:18.193737: I tensorflow/core/platform/s3/aws_logging.cc:54] Pool grown by 2
2019-11-21 18:30:18.193753: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-11-21 18:30:18.215249: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 400
2019-11-21 18:30:18.215299: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2019-11-21 18:30:18.215433: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-11-21 18:30:18.215593: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-11-21 18:30:18.231459: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 400
2019-11-21 18:30:18.231503: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2019-11-21 18:30:18.231698: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-11-21 18:30:18.231859: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-11-21 18:30:18.249705: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 400
2019-11-21 18:30:18.249750: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2019-11-21 18:30:18.249809: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-11-21 18:30:18.249934: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-11-21 18:30:18.265698: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 400
2019-11-21 18:30:18.265743: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2019-11-21 18:30:18.265800: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-11-21 18:30:18.265894: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-11-21 18:30:18.280835: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 400
2019-11-21 18:30:18.280882: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2019-11-21 18:30:18.280939: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-11-21 18:30:18.281034: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-11-21 18:30:18.299545: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 400
2019-11-21 18:30:18.299592: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2019-11-21 18:30:18.299648: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-11-21 18:30:18.299758: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-11-21 18:30:18.315516: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 400
2019-11-21 18:30:18.315561: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
Traceback (most recent call last):
File "mnist.py", line 89, in
train_images.shape: (60000, 28, 28, 1), of float64 test_images.shape: (10000, 28, 28, 1), of float64
Layer (type) Output Shape Param #
Conv1 (Conv2D) (None, 13, 13, 8) 80
flatten (Flatten) (None, 1352) 0
Softmax (Dense) (None, 10) 13530
Total params: 13,610 Trainable params: 13,610 Non-trainable params: 0
File "mnist.py", line 82, in main model = train(train_images, train_labels, args.epochs, args.model_summary_path) File "mnist.py", line 51, in train model.fit(train_images, train_labels, epochs=epochs, callbacks=[tensorboard_callback]) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training.py", line 880, in fit validation_steps=validation_steps) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training_arrays.py", line 215, in model_iteration mode=mode) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 106, in configure_callbacks callback_list.set_model(callback_model) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 178, in set_model callback.set_model(model) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 1010, in set_model self._init_writer() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 947, in _init_writer self.writer = tf_summary.FileWriter(self.log_dir, K.get_session().graph) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/writer/writer.py", line 367, in init filename_suffix) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/writer/event_file_writer.py", line 67, in init gfile.MakeDirs(self._logdir) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 442, in recursive_create_dir recursive_create_dir_v2(dirname) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 458, in recursive_create_dir_v2 pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(path), status) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in exit c_api.TF_GetCode(self.status.status)) tensorflow.python.framework.errors_impl.UnknownError: : No response body. Response code: 400