aws-samples / eks-workshop

AWS Workshop for Learning EKS
https://eksworkshop.com
MIT No Attribution

training container fails to start - Kubeflow workshop - Model Training module #542

Closed dbgordon closed 4 years ago

dbgordon commented 4 years ago

Working with the ML workshop and in the 2nd module on Model Training, we're running into an error message when trying to start up the training container using the workshop code snippet:

curl -LO https://eksworkshop.com/kubeflow/kubeflow.files/mnist-training.yaml
envsubst < mnist-training.yaml | kubectl create -f -

Two of us have seen this in two different settings: when just plain old running through the lab as is, in Cloud9, and also while running the same scenario through GitLab runners.

Error message is:

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
32768/29515 [=================================] - 0s 0us/step
40960/29515 [=========================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
26427392/26421880 [==============================] - 0s 0us/step
26435584/26421880 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
16384/5148 [===============================================================================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz
4423680/4422102 [==============================] - 0s 0us/step
4431872/4422102 [==============================] - 0s 0us/step
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py:435: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-11-21 18:30:18.140945: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-11-21 18:30:18.144825: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2500000000 Hz
2019-11-21 18:30:18.145045: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x31383d0 executing computations on platform Host. Devices:
2019-11-21 18:30:18.145081: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
2019-11-21 18:30:18.193253: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing config loader against fileName /root//.aws/config and using profilePrefix = 1
2019-11-21 18:30:18.193298: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing config loader against fileName /root//.aws/credentials and using profilePrefix = 0
2019-11-21 18:30:18.193310: I tensorflow/core/platform/s3/aws_logging.cc:54] Setting provider to read credentials from /root//.aws/credentials for credentials file and /root//.aws/config for the config file , for use with profile default
2019-11-21 18:30:18.193322: I tensorflow/core/platform/s3/aws_logging.cc:54] Creating HttpClient with max connections2 and scheme http
2019-11-21 18:30:18.193470: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing CurlHandleContainer with size 2
2019-11-21 18:30:18.193489: I tensorflow/core/platform/s3/aws_logging.cc:54] Creating Instance with default EC2MetadataClient and refresh rate 900000
2019-11-21 18:30:18.193507: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-11-21 18:30:18.193553: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing CurlHandleContainer with size 25
2019-11-21 18:30:18.193605: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-11-21 18:30:18.193737: I tensorflow/core/platform/s3/aws_logging.cc:54] Pool grown by 2
2019-11-21 18:30:18.193753: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-11-21 18:30:18.215249: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 400
2019-11-21 18:30:18.215299: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
[the "Found secret key" / "No response body. Response code: 400" / signature-check retry cycle repeats several more times with later timestamps]

train_images.shape: (60000, 28, 28, 1), of float64
test_images.shape: (10000, 28, 28, 1), of float64
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
Conv1 (Conv2D)               (None, 13, 13, 8)         80
_________________________________________________________________
flatten (Flatten)            (None, 1352)              0
_________________________________________________________________
Softmax (Dense)              (None, 10)                13530
=================================================================
Total params: 13,610
Trainable params: 13,610
Non-trainable params: 0
_________________________________________________________________
Traceback (most recent call last):
  File "mnist.py", line 89, in <module>
    main()
  File "mnist.py", line 82, in main
    model = train(train_images, train_labels, args.epochs, args.model_summary_path)
  File "mnist.py", line 51, in train
    model.fit(train_images, train_labels, epochs=epochs, callbacks=[tensorboard_callback])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training.py", line 880, in fit
    validation_steps=validation_steps)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training_arrays.py", line 215, in model_iteration
    mode=mode)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 106, in configure_callbacks
    callback_list.set_model(callback_model)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 178, in set_model
    callback.set_model(model)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 1010, in set_model
    self._init_writer()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 947, in _init_writer
    self.writer = tf_summary.FileWriter(self.log_dir, K.get_session().graph)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/writer/writer.py", line 367, in __init__
    filename_suffix)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/writer/event_file_writer.py", line 67, in __init__
    gfile.MakeDirs(self._logdir)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 442, in recursive_create_dir
    recursive_create_dir_v2(dirname)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 458, in recursive_create_dir_v2
    pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(path), status)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: : No response body. Response code: 400

dbgordon commented 4 years ago

From troubleshooting so far, this appears to be an issue when the container starts up to run the TensorFlow command: it fails while writing to the S3 bucket after training.

dalbhanj commented 4 years ago

Thanks @dbgordon for creating this issue. Can you check whether CONFIG_FILE has the proper values? We use the following sed commands to replace environment variables during the Kubeflow install step (https://eksworkshop.com/kubeflow/install/#customize-your-configuration). I'd check CONFIG_FILE and verify that NodeInstanceRole was actually replaced by your ROLE_NAME. You can run 'echo $ROLE_NAME' and validate the value to confirm. Kubeflow installs fine with default values, and I suspect that's the culprit of the error. To recover, you'll have to uninstall and re-install with the proper values. Let me know if this helps.
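A quick way to sanity-check the substitution is to rehearse it against a throwaway file. This is just a sketch: the placeholder string matches the workshop's default, but the ROLE_NAME value here is a stand-in, not from a real cluster:

```shell
# Simulate the workshop's sed substitution on a throwaway copy of the config.
CONFIG_FILE=$(mktemp)
echo 'roleArn: eksctl-eksworkshop-eksctl-nodegroup-ng-a2-NodeInstanceRole-xxxxxxx' > "$CONFIG_FILE"
ROLE_NAME="eksctl-eksworkshop-eksctl-nodegroup-ng-a2-NodeInstanceRole-ABC123"  # stand-in value
sed -i "s@eksctl-eksworkshop-eksctl-nodegroup-ng-a2-NodeInstanceRole-xxxxxxx@$ROLE_NAME@" "$CONFIG_FILE"
# If ROLE_NAME now appears in the file, the substitution worked.
if grep -q "$ROLE_NAME" "$CONFIG_FILE"; then echo "substitution OK"; else echo "substitution MISSING"; fi
```

Running the same grep against your real CONFIG_FILE shows immediately whether the placeholder was ever replaced.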

dbgordon commented 4 years ago

Thanks @dalbhanj. I'm not sure which config file you are referring to. I installed Kubeflow into the Cloud9 environment as the workshop instructions suggested, and I don't see a CONFIG_FILE anywhere.

Regardless, we are seeing this problem completely outside of Kubeflow and outside of the lab, just going from an EKS container (not even the TensorFlow one) to an S3 bucket. So it seems there might be a more fundamental problem going on.

Earlier you mentioned that you guys saw this problem show up when running the workshop at KubeCon. Did it show up with everyone or just some people?

dalbhanj commented 4 years ago

Look for this file; I think it's under the eksworkshop-eksctl folder: 'export CONFIG_FILE=${KF_DIR}/kfctl_aws.0.7.0.yaml'

You can paste the contents of the file here. The issue is intermittent, and not everyone saw it at KubeCon.

dbgordon commented 4 years ago

Rebuilding since my environment got messed up with all this experimenting.

Regardless, we've made progress. It appears that pulling the credentials from the environment variables results in this failure behavior. If you pull them from a file, the problem no longer shows up.
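For reference, the file-based path that worked is the standard AWS shared credentials file format; the values below are placeholders, and nothing here is specific to the workshop:

```ini
; ~/.aws/credentials -- placeholder values, not real keys
[default]
aws_access_key_id = AKIA...
aws_secret_access_key = ...
```

The SDK inside the container reads this file (mounted or baked in at /root/.aws/credentials) instead of AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment.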

dbgordon commented 4 years ago

> Look for this file; I think it's under the eksworkshop-eksctl folder: 'export CONFIG_FILE=${KF_DIR}/kfctl_aws.0.7.0.yaml'
>
> You can paste the contents of the file here. The issue is intermittent, and not everyone saw it at KubeCon.

Here is my config file contents. It does not include the replaced values, as you suspected. This is after a second install from scratch:

apiVersion: kfdef.apps.kubeflow.org/v1beta1
kind: KfDef
metadata:
  creationTimestamp: null
  name: kubeflow-aws
  namespace: kubeflow
spec:
  applications:

dbgordon commented 4 years ago

I believe I see why the replacement with my $ROLE_NAME didn't happen. As you can see in the unaltered config file above, the "role" to be replaced (the one the sed command is looking for) is set to "eksctl-kubeflow-aws-nodegroup-ng-a2-NodeInstanceRole-xxxxxxx". Note that it is "eksctl-KUBEFLOW...".

The sed command we are running in the instructions is:

sed -i "s@eksctl-eksworkshop-eksctl-nodegroup-ng-a2-NodeInstanceRole-xxxxxxx@$ROLE_NAME@" ${CONFIG_FILE}

Note: "eksctl-EKSWORKSHOP...". The pattern doesn't match the "eksctl-kubeflow-aws-..." role name in the generated config, so the substitution silently does nothing.
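One way to sidestep the mismatch would be a looser pattern that matches the NodeInstanceRole placeholder regardless of which cluster name eksctl baked in. The character class below is my assumption, not the workshop's actual fix, and the role value is hypothetical:

```shell
# Stand-in config line using the kubeflow-aws placeholder seen in the thread.
CONFIG_FILE=$(mktemp)
echo 'roleArn: eksctl-kubeflow-aws-nodegroup-ng-a2-NodeInstanceRole-xxxxxxx' > "$CONFIG_FILE"
ROLE_NAME="my-actual-node-instance-role"   # hypothetical value
# Match any cluster name between "eksctl-" and "-nodegroup".
sed -i "s@eksctl-[A-Za-z0-9-]*-nodegroup-ng-a2-NodeInstanceRole-xxxxxxx@$ROLE_NAME@" "$CONFIG_FILE"
cat "$CONFIG_FILE"
```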

Jeffwan commented 4 years ago

This is unrelated to Kubeflow setup. This chapter (https://eksworkshop.com/kubeflow/training/) is supposed to work on EKS without Kubeflow.

2019-11-24 07:08:50.764524: W tensorflow/core/platform/s3/aws_logging.cc:57] Request failed, now waiting 12800 ms before attempting again.
2019-11-24 07:09:03.564645: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-11-24 07:09:03.564756: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-11-24 07:09:03.569469: E tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 6
2019-11-24 07:09:03.569527: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
train_images.shape: (60000, 28, 28, 1), of float64
test_images.shape: (10000, 28, 28, 1), of float64
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
Conv1 (Conv2D)               (None, 13, 13, 8)         80
_________________________________________________________________
flatten (Flatten)            (None, 1352)              0
_________________________________________________________________
Softmax (Dense)              (None, 10)                13530
=================================================================
Total params: 13,610
Trainable params: 13,610
Non-trainable params: 0
_________________________________________________________________
Traceback (most recent call last):
  File "mnist.py", line 89, in <module>
    main()
  File "mnist.py", line 82, in main
    model = train(train_images, train_labels, args.epochs, args.model_summary_path)
  File "mnist.py", line 51, in train
    model.fit(train_images, train_labels, epochs=epochs, callbacks=[tensorboard_callback])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training.py", line 880, in fit
    validation_steps=validation_steps)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training_arrays.py", line 215, in model_iteration
    mode=mode)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 106, in configure_callbacks
    callback_list.set_model(callback_model)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 178, in set_model
    callback.set_model(model)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 1010, in set_model
    self._init_writer()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 947, in _init_writer
    self.writer = tf_summary.FileWriter(self.log_dir, K.get_session().graph)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/writer/writer.py", line 367, in __init__
    filename_suffix)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/writer/event_file_writer.py", line 67, in __init__
    gfile.MakeDirs(self._logdir)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 442, in recursive_create_dir
    recursive_create_dir_v2(dirname)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 458, in recursive_create_dir_v2
    pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(path), status)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))

Jeffwan commented 4 years ago

Problem

We introduced the problem in this PR: https://github.com/aws-samples/eks-workshop/pull/522/files#diff-3d8576a7d473c236f821a0d67193b0b2L61-L62

The behavior is a little bit different here.

$ jq -r .AccessKey.AccessKeyId /tmp/create_output.json
AKIAVCDTNILGAGD6IPPZ

$ echo -n 'AKIAVCDTNILGAGD6IPPZ'
AKIAVCDTNILGAGD6IPPZ%

I am using zsh here; the % indicates that the command's output did not end with a newline.

If we pipe both through base64, we can see the difference.

$ jq -r .AccessKey.AccessKeyId /tmp/create_output.json | base64
QUtJQVZDRFROSUxHQUdENklQUFoK

$ echo -n 'AKIAVCDTNILGAGD6IPPZ' | base64
QUtJQVZDRFROSUxHQUdENklQUFo=
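The same difference can be reproduced without jq: printf '%s\n' mimics what jq -r emits (the value plus a newline), and printf '%s' mimics echo -n. A small sketch to make the stray byte visible:

```shell
# Value plus trailing newline (what jq -r emits): the final base64 char K encodes "\n".
printf '%s\n' 'AKIAVCDTNILGAGD6IPPZ' | base64   # QUtJQVZDRFROSUxHQUdENklQUFoK
# Bare value (what echo -n or jq -j emits): 20 bytes, padded with "=" instead.
printf '%s' 'AKIAVCDTNILGAGD6IPPZ' | base64     # QUtJQVZDRFROSUxHQUdENklQUFo=
```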

Check echo doc,

     The following option is available:

     -n    Do not print the trailing newline character.  This may also be achieved by appending
           `\c' to the end of the string, as is done by iBCS2 compatible systems.  Note that this
           option as well as the effect of `\c' are implementation-defined in IEEE Std
           1003.1-2001 (``POSIX.1'') as amended by Cor. 1-2002.  Applications aiming for maximum
           portability are strongly encouraged to use printf(1) to suppress the newline character.

In Kubernetes, we must not include the trailing newline when creating a secret manually. https://kubernetes.io/docs/concepts/configuration/secret/#creating-a-secret-manually

Solution

The easiest fix is to replace the -r option with -j in jq.

--join-output / -j: Like -r but jq won’t print a newline after each output.
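A hypothetical re-creation of the extraction step with the -j fix applied; the JSON below mirrors the shape of /tmp/create_output.json from the thread, with only the access key field filled in:

```shell
# Sample file with the same shape as the workshop's create_output.json.
cat > /tmp/create_output.json <<'EOF'
{"AccessKey": {"AccessKeyId": "AKIAVCDTNILGAGD6IPPZ"}}
EOF
# -j (join-output) emits the raw value with no trailing newline, so base64
# encodes exactly the 20 key bytes -- matching the echo -n result above.
jq -j .AccessKey.AccessKeyId /tmp/create_output.json | base64   # QUtJQVZDRFROSUxHQUdENklQUFo=
```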

@dalbhanj @arun-gupta @otterley

dalbhanj commented 4 years ago

@Jeffwan thanks for bringing this up. Although @dbgordon's issue seems to be related to the cluster name, because the substitution is hard-coded to the eksworkshop cluster name.

Regarding the problem, can we create a new issue to address it? Does this issue only occur when you use zsh? The workshop uses the Cloud9 IDE and is OS-agnostic (OS-specific instructions are not provided).

otterley commented 4 years ago

Whoops! Sorry about that! (I could have sworn I tested it before making the PR.)

Jeffwan commented 4 years ago

> @Jeffwan thanks for bringing this up. Although @dbgordon's issue seems to be related to the cluster name, because the substitution is hard-coded to the eksworkshop cluster name.
>
> Regarding the problem, can we create a new issue to address it? Does this issue only occur when you use zsh? The workshop uses the Cloud9 IDE and is OS-agnostic (OS-specific instructions are not provided).

@dalbhanj Most attendees hit this problem, and the jq/echo behavior is the same in Cloud9 or on a Mac. The reason I mentioned zsh is that the display is different: the % marker only appears in zsh.

If there's an issue for it, let me know and we can put the PR link there. I hope someone else can double-test it.

@otterley Not a problem, this is kind of tricky and not easy to detect.

dalbhanj commented 4 years ago

@Jeffwan Confirmed! You hit the nail on the head with this issue. I was able to reproduce it, and once I used your PR (https://github.com/aws-samples/eks-workshop/pull/543), I was able to resolve it.

dalbhanj commented 4 years ago

Closing since this is resolved.