Closed danheo412 closed 4 years ago
I was told to check that the S3 path was set. I set it, and now I'm getting this error:
workshop:~/environment/eksworkshop-eksctl $ kubectl logs mnist-training -f
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
32768/29515 [=================================] - 0s 0us/step
40960/29515 [=========================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
26427392/26421880 [==============================] - 0s 0us/step
26435584/26421880 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
16384/5148 [===============================================================================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz
4423680/4422102 [==============================] - 0s 0us/step
4431872/4422102 [==============================] - 0s 0us/step
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py:435: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-11-26 07:17:23.914719: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-11-26 07:17:23.939677: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2019-11-26 07:17:23.939857: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3ee9e00 executing computations on platform Host. Devices:
2019-11-26 07:17:23.939880: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-11-26 07:17:23.984441: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing config loader against fileName /root//.aws/config and using profilePrefix = 1
2019-11-26 07:17:23.984475: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing config loader against fileName /root//.aws/credentials and using profilePrefix = 0
2019-11-26 07:17:23.984489: I tensorflow/core/platform/s3/aws_logging.cc:54] Setting provider to read credentials from /root//.aws/credentials for credentials file and /root//.aws/config for the config file , for use with profile default
2019-11-26 07:17:23.984502: I tensorflow/core/platform/s3/aws_logging.cc:54] Creating HttpClient with max connections2 and scheme http
2019-11-26 07:17:23.984521: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing CurlHandleContainer with size 2
2019-11-26 07:17:23.984536: I tensorflow/core/platform/s3/aws_logging.cc:54] Creating Instance with default EC2MetadataClient and refresh rate 900000
2019-11-26 07:17:23.984555: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-11-26 07:17:23.984598: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing CurlHandleContainer with size 25
2019-11-26 07:17:23.984650: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-11-26 07:17:23.984776: I tensorflow/core/platform/s3/aws_logging.cc:54] Pool grown by 2
2019-11-26 07:17:23.984793: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-11-26 07:17:24.005064: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 400
2019-11-26 07:17:24.005103: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2019-11-26 07:17:24.005273: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-11-26 07:17:24.005416: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-11-26 07:17:24.019755: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 400
2019-11-26 07:17:24.019789: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2019-11-26 07:17:24.019905: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-11-26 07:17:24.020041: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-11-26 07:17:24.033487: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 400
2019-11-26 07:17:24.033521: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2019-11-26 07:17:24.033572: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-11-26 07:17:24.033664: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-11-26 07:17:24.045873: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 400
2019-11-26 07:17:24.045905: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2019-11-26 07:17:24.045955: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-11-26 07:17:24.046048: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-11-26 07:17:24.058349: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 400
2019-11-26 07:17:24.058382: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2019-11-26 07:17:24.058430: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-11-26 07:17:24.058521: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-11-26 07:17:24.070959: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 400
2019-11-26 07:17:24.070990: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2019-11-26 07:17:24.071036: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2019-11-26 07:17:24.071126: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2019-11-26 07:17:24.082978: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 400
2019-11-26 07:17:24.083010: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
Traceback (most recent call last):
File "mnist.py", line 89, in <module>
train_images.shape: (60000, 28, 28, 1), of float64
test_images.shape: (10000, 28, 28, 1), of float64
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
Conv1 (Conv2D) (None, 13, 13, 8) 80
_________________________________________________________________
flatten (Flatten) (None, 1352) 0
_________________________________________________________________
Softmax (Dense) (None, 10) 13530
=================================================================
Total params: 13,610
Trainable params: 13,610
Non-trainable params: 0
_________________________________________________________________
main()
File "mnist.py", line 82, in main
model = train(train_images, train_labels, args.epochs, args.model_summary_path)
File "mnist.py", line 51, in train
model.fit(train_images, train_labels, epochs=epochs, callbacks=[tensorboard_callback])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training.py", line 880, in fit
validation_steps=validation_steps)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training_arrays.py", line 215, in model_iteration
mode=mode)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 106, in configure_callbacks
callback_list.set_model(callback_model)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 178, in set_model
callback.set_model(model)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 1010, in set_model
self._init_writer()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 947, in _init_writer
self.writer = tf_summary.FileWriter(self.log_dir, K.get_session().graph)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/writer/writer.py", line 367, in __init__
filename_suffix)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/writer/event_file_writer.py", line 67, in __init__
gfile.MakeDirs(self._logdir)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 442, in recursive_create_dir
recursive_create_dir_v2(dirname)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 458, in recursive_create_dir_v2
pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(path), status)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: : No response body. Response code: 400
Following other posts, I'm posting the content of the kfctl_aws.0.7.0.yaml file:
workshop:~/environment/eksworkshop-eksctl $ cat kfctl_aws.0.7.0.yaml
apiVersion: kfdef.apps.kubeflow.org/v1beta1
kind: KfDef
metadata:
  creationTimestamp: null
  name: eksworkshop-eksctl
  namespace: kubeflow
spec:
  applications:
  - kustomizeConfig:
      parameters:
      - name: namespace
        value: istio-system
      repoRef:
        name: manifests
        path: istio/istio-crds
    name: istio-crds
  - kustomizeConfig:
      parameters:
      - name: namespace
        value: istio-system
      repoRef:
        name: manifests
        path: istio/istio-install
    name: istio-install
  - kustomizeConfig:
      parameters:
      - name: clusterRbacConfig
        value: "OFF"
      repoRef:
        name: manifests
        path: istio/istio
    name: istio
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: application/application-crds
    name: application-crds
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: application/application
    name: application
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: metacontroller
    name: metacontroller
  - kustomizeConfig:
      overlays:
      - istio
      - application
      repoRef:
        name: manifests
        path: argo
    name: argo
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: kubeflow-roles
    name: kubeflow-roles
  - kustomizeConfig:
      overlays:
      - istio
      - application
      repoRef:
        name: manifests
        path: common/centraldashboard
    name: centraldashboard
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: admission-webhook/webhook
    name: webhook
  - kustomizeConfig:
      overlays:
      - application
      parameters:
      - name: webhookNamePrefix
        value: admission-webhook-
      repoRef:
        name: manifests
        path: admission-webhook/bootstrap
    name: bootstrap
  - kustomizeConfig:
      overlays:
      - istio
      - application
      repoRef:
        name: manifests
        path: jupyter/jupyter-web-app
    name: jupyter-web-app
  - kustomizeConfig:
      overlays:
      - istio
      repoRef:
        name: manifests
        path: metadata
    name: metadata
  - kustomizeConfig:
      overlays:
      - istio
      - application
      repoRef:
        name: manifests
        path: jupyter/notebook-controller
    name: notebook-controller
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pytorch-job/pytorch-job-crds
    name: pytorch-job-crds
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pytorch-job/pytorch-operator
    name: pytorch-operator
  - kustomizeConfig:
      overlays:
      - application
      parameters:
      - name: usageId
        value: "144553881180253599"
      - name: reportUsage
        value: "true"
      repoRef:
        name: manifests
        path: common/spartakus
    name: spartakus
  - kustomizeConfig:
      overlays:
      - istio
      repoRef:
        name: manifests
        path: tensorboard
    name: tensorboard
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: tf-training/tf-job-crds
    name: tf-job-crds
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: tf-training/tf-job-operator
    name: tf-job-operator
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: katib/katib-crds
    name: katib-crds
  - kustomizeConfig:
      overlays:
      - application
      - istio
      repoRef:
        name: manifests
        path: katib/katib-controller
    name: katib-controller
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pipeline/api-service
    name: api-service
  - kustomizeConfig:
      overlays:
      - application
      parameters:
      - name: minioPvcName
        value: minio-pv-claim
      repoRef:
        name: manifests
        path: pipeline/minio
    name: minio
  - kustomizeConfig:
      overlays:
      - application
      parameters:
      - name: mysqlPvcName
        value: mysql-pv-claim
      repoRef:
        name: manifests
        path: pipeline/mysql
    name: mysql
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pipeline/persistent-agent
    name: persistent-agent
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pipeline/pipelines-runner
    name: pipelines-runner
  - kustomizeConfig:
      overlays:
      - istio
      - application
      repoRef:
        name: manifests
        path: pipeline/pipelines-ui
    name: pipelines-ui
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pipeline/pipelines-viewer
    name: pipelines-viewer
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pipeline/scheduledworkflow
    name: scheduledworkflow
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pipeline/pipeline-visualization-service
    name: pipeline-visualization-service
  - kustomizeConfig:
      overlays:
      - application
      - istio
      repoRef:
        name: manifests
        path: profiles
    name: profiles
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: seldon/seldon-core-operator
    name: seldon-core
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: mpi-job/mpi-operator
    name: mpi-operator
  - kustomizeConfig:
      parameters:
      - name: namespace
        value: istio-system
      repoRef:
        name: manifests
        path: aws/istio-ingress
    name: istio-ingress
  - kustomizeConfig:
      overlays:
      - application
      parameters:
      - name: clusterName
        value: eksworkshop-eksctl
      repoRef:
        name: manifests
        path: aws/aws-alb-ingress-controller
    name: aws-alb-ingress-controller
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: aws/nvidia-device-plugin
    name: nvidia-device-plugin
  plugins:
  - kind: KfAwsPlugin
    metadata:
      creationTimestamp: null
      name: aws
    spec:
      auth:
        basicAuth:
          password:
            name: password
          username: admin
      region: us-west-2
      roles:
      - eksctl-eksworkshop-eksctl-nodegro-NodeInstanceRole-1HY8SCMLKFYS5
  repos:
  - name: manifests
    uri: https://github.com/kubeflow/manifests/archive/v0.7-branch.tar.gz
  version: master
status:
  reposCache:
  - localPath: '"/home/ec2-user/environment/eksworkshop-eksctl/.cache/manifests/manifests-0.7-branch"'
    name: manifests
I'm also curious what the best way to debug this issue is. For example, how do I make sense of this error stack?
File "mnist.py", line 82, in main
model = train(train_images, train_labels, args.epochs, args.model_summary_path)
File "mnist.py", line 51, in train
model.fit(train_images, train_labels, epochs=epochs, callbacks=[tensorboard_callback])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training.py", line 880, in fit
validation_steps=validation_steps)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training_arrays.py", line 215, in model_iteration
mode=mode)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 106, in configure_callbacks
callback_list.set_model(callback_model)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 178, in set_model
callback.set_model(model)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 1010, in set_model
self._init_writer()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 947, in _init_writer
self.writer = tf_summary.FileWriter(self.log_dir, K.get_session().graph)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/writer/writer.py", line 367, in __init__
filename_suffix)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/writer/event_file_writer.py", line 67, in __init__
gfile.MakeDirs(self._logdir)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 442, in recursive_create_dir
recursive_create_dir_v2(dirname)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 458, in recursive_create_dir_v2
pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(path), status)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: : No response body. Response code: 400
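On reading the stack: a Python traceback lists calls from outermost (top) to innermost (bottom), and the last line is the error itself. A useful habit is to start at the bottom and walk upward until you reach a frame in your own file. Here that gives: `mnist.py` line 51 calls `model.fit` with the TensorBoard callback, the callback's `_init_writer` opens a `FileWriter` on its log directory, and `gfile.MakeDirs` on that directory fails with the S3 400. So the failure is in creating the TensorBoard log directory, not in the training math or the container image. The sketch below automates that reading with the standard library only; the `summarize` helper and the `dist-packages` marker used to tell library frames from your own are illustrative choices, not part of the workshop.

```python
import re

# A few frames copied from the traceback above (outermost call first).
TRACEBACK = '''\
File "mnist.py", line 51, in train
model.fit(train_images, train_labels, epochs=epochs, callbacks=[tensorboard_callback])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/callbacks.py", line 947, in _init_writer
self.writer = tf_summary.FileWriter(self.log_dir, K.get_session().graph)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/writer/event_file_writer.py", line 67, in __init__
gfile.MakeDirs(self._logdir)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 458, in recursive_create_dir_v2
pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(path), status)
tensorflow.python.framework.errors_impl.UnknownError: : No response body. Response code: 400
'''

# Matches the 'File "...", line N, in func' header of each frame.
FRAME = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')


def parse_frames(text):
    """Return (path, line, func) tuples in call order, outermost first."""
    return [(m.group("path"), int(m.group("line")), m.group("func"))
            for m in FRAME.finditer(text)]


def summarize(text, library_marker="dist-packages"):
    """Pick out the error line, the last frame in your own code, and the deepest frame."""
    frames = parse_frames(text)
    # Hypothetical heuristic: anything installed under dist-packages is library code.
    own = [f for f in frames if library_marker not in f[0]]
    return {
        "error": text.strip().splitlines()[-1],
        "last_own_frame": own[-1] if own else None,
        "deepest_frame": frames[-1] if frames else None,
    }


if __name__ == "__main__":
    for key, value in summarize(TRACEBACK).items():
        print(key, "->", value)
```

The "last own frame" tells you which of your calls triggered the problem, and the "deepest frame" tells you what the library was doing when it failed; together they usually narrow the search to one argument or setting (here, the log directory path handed to the TensorBoard callback).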
Hi @danheo412, thanks for the details. This PR fixes the issue you are experiencing: https://github.com/aws-samples/eks-workshop/pull/543/files
The PR has been merged; it should show up on the main workshop site in a few minutes.
Let me know if the fix resolves the issue
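This thread does not spell out what the PR changed, so the following is only a generic pre-flight check and not the fix itself. A 400 with no response body from S3 is often a region problem, and TensorFlow's S3 filesystem can take its region from the `AWS_REGION` (or `S3_REGION`) environment variable; treating that as the relevant setting in this case is an assumption. A minimal sketch that could be run at the top of the training script to see whether the container has a region set (the `s3_region_report` helper is hypothetical):

```python
import os

# Assumption: these are the environment variables the S3 client in this
# TensorFlow build consults for the bucket region.
REGION_VARS = ("AWS_REGION", "S3_REGION")


def s3_region_report(environ=None):
    """Report which region variables are set to a non-empty value."""
    env = os.environ if environ is None else environ
    found = {name: env[name] for name in REGION_VARS if env.get(name)}
    return {"region_set": bool(found), "values": found}


if __name__ == "__main__":
    print(s3_region_report())
```

If `region_set` comes back `False`, or the value does not match the region the bucket was created in, that is worth ruling out before looking at the image or IAM permissions.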
Closing since this is resolved
Hi, I was following the workshop on ML with EKS and Kubeflow, but I ran into a blocker during the steps on training a model. I followed the steps exactly, but the pods running the training job keep failing and I got the output below. I tried it three times and it fails the same way each time. I'm wondering if there is something wrong with the image, or with access control? I'm not sure at all. Could you please help me debug this? The instructions are from: https://eksworkshop.com/kubeflow/training/