Closed charlesa101 closed 4 years ago
@charlesa101 Thanks for trying it out. I am assuming you have replaced the input and output buckets and the role ARN.
Would you please run the following commands and provide the output?
kubectl get trainingjobs xgboost-mnist
kubectl describe trainingjob xgboost-mnist
@gautamkmr, here you go, thank you! Yes, I have my own bucket and SageMaker executor role.
```
NAME STATUS SECONDARY-STATUS CREATION-TIME SAGEMAKER-JOB-NAME
xgboost-mnist 2020-03-09T16:51:08Z
```
```
kubectl describe TrainingJob
Name: xgboost-mnist
Namespace: default
Labels: <none>
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"sagemaker.aws.amazon.com/v1","kind":"TrainingJob","metadata":{"annotations":{},"name":"xgboost-mnist","namespace":"default"...
API Version: sagemaker.aws.amazon.com/v1
Kind: TrainingJob
Metadata:
Creation Timestamp: 2020-03-09T06:58:17Z
Generation: 1
Resource Version: 117181
Self Link: /apis/sagemaker.aws.amazon.com/v1/namespaces/default/trainingjobs/xgboost-mnist
UID: 5a907178-61d3-11ea-b461-02efd6507006
Spec:
Algorithm Specification:
Training Image: 825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest
Training Input Mode: File
Hyper Parameters:
Name: max_depth
Value: 5
Name: eta
Value: 0.2
Name: gamma
Value: 4
Name: min_child_weight
Value: 6
Name: silent
Value: 0
Name: objective
Value: multi:softmax
Name: num_class
Value: 10
Name: num_round
Value: 10
Input Data Config:
Channel Name: train
Compression Type: None
Content Type: text/csv
Data Source:
S 3 Data Source:
S 3 Data Distribution Type: FullyReplicated
S 3 Data Type: S3Prefix
S 3 Uri: s3://<MY-BUCKET>/xgboost-mnist/train/
Channel Name: validation
Compression Type: None
Content Type: text/csv
Data Source:
S 3 Data Source:
S 3 Data Distribution Type: FullyReplicated
S 3 Data Type: S3Prefix
S 3 Uri: s3://<MY-BUCKET>/xgboost-mnist/validation/
Output Data Config:
S 3 Output Path: s3://<MY-BUCKET>/xgboost-mnist/models/
Region: us-east-2
Resource Config:
Instance Count: 1
Instance Type: ml.m4.xlarge
Volume Size In GB: 5
Role Arn: arn:aws:iam::<ACCOUNT>:role/sagemaker_execution_role
Stopping Condition:
Max Runtime In Seconds: 86400
```
@charlesa101 Thanks for providing the output. It appears that the operator is not running successfully on your k8s cluster. You can verify that with:
kubectl get pods -A | grep -i sagemaker
You can follow the steps from here to install the operator. Let us know if you face any issues.
yeah that's what i noticed as well now
```
NAME READY STATUS RESTARTS AGE
sagemaker-k8s-operator-controller-manager-5858fd7b8d-h89s8 0/2 Pending 0 24h
```
Name: sagemaker-k8s-operator-controller-manager-5858fd7b8d-h89s8
Namespace: sagemaker-k8s-operator-system
Priority: 0
PriorityClassName: <none>
Node: <none>
Labels: control-plane=controller-manager
pod-template-hash=5858fd7b8d
Annotations: kubernetes.io/psp: eks.privileged
Status: Pending
IP:
Controlled By: ReplicaSet/sagemaker-k8s-operator-controller-manager-5858fd7b8d
Containers:
kube-rbac-proxy:
Image: gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0
Port: 8443/TCP
Host Port: 0/TCP
Args:
--secure-listen-address=0.0.0.0:8443
--upstream=http://127.0.0.1:8080/
--logtostderr=true
--v=10
Environment:
AWS_ROLE_ARN: arn:aws:iam::123456789012:role/DELETE_ME
AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
Mounts:
/var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from sagemaker-k8s-operator-default-token-rwdkn (ro)
manager:
Image: 957583890962.dkr.ecr.us-east-1.amazonaws.com/amazon-sagemaker-operator-for-k8s:v1
Port: <none>
Host Port: <none>
Command:
/manager
Args:
--metrics-addr=127.0.0.1:8080
Limits:
cpu: 100m
memory: 30Mi
Requests:
cpu: 100m
memory: 20Mi
Environment:
AWS_DEFAULT_SAGEMAKER_ENDPOINT:
AWS_ROLE_ARN: arn:aws:iam::123456789012:role/DELETE_ME
AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
Mounts:
/var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from sagemaker-k8s-operator-default-token-rwdkn (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
aws-iam-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 86400
sagemaker-k8s-operator-default-token-rwdkn:
Type: Secret (a volume populated by a Secret)
SecretName: sagemaker-k8s-operator-default-token-rwdkn
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 64s (x1378 over 34h) default-scheduler no nodes available to schedule pods
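One detail in the describe output above may be worth a second look: both containers show `AWS_ROLE_ARN: arn:aws:iam::123456789012:role/DELETE_ME`, which reads like a template placeholder that was never replaced. Whether that is related to the problems in this thread is an assumption, since nothing here confirms it. A minimal Python sketch for spotting such values in saved `kubectl describe pod` output follows. The helper name and the heuristic are hypothetical.

```python
import re
from typing import List

# Matches lines such as "AWS_ROLE_ARN: arn:aws:iam::123456789012:role/DELETE_ME".
ROLE_ARN_LINE = re.compile(r"AWS_ROLE_ARN:[ \t]*(\S+)")


def placeholder_role_arns(describe_output: str) -> List[str]:
    """Return AWS_ROLE_ARN values that still look like template placeholders."""
    suspicious = []
    for match in ROLE_ARN_LINE.finditer(describe_output):
        arn = match.group(1)
        # Heuristic (an assumption): the output in this thread uses the dummy
        # account 123456789012 and the role name DELETE_ME.
        if "DELETE_ME" in arn or ":123456789012:" in arn:
            suspicious.append(arn)
    return suspicious


if __name__ == "__main__":
    sample = "AWS_ROLE_ARN: arn:aws:iam::123456789012:role/DELETE_ME\n"
    print(placeholder_role_arns(sample))
```

An empty result only means this heuristic found nothing. It does not prove the role is configured correctly.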
My EKS/ECR is in us-east-2, but it seems all the CRD artifacts are coming from us-east-1. Could that be the issue?
EKS can pull the image from another region too. In your case, it seems that you don't have any worker nodes associated with the cluster. At least that's what the message below says.
Warning FailedScheduling 64s (x1378 over 34h) default-scheduler no nodes available to schedule pods
Can you run the following?
kubectl get node
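The check suggested above can be sketched in Python. The sketch assumes the output of `kubectl get node` has been captured as text with the default columns (NAME STATUS ROLES AGE VERSION). The function name is hypothetical, and the logic only looks at the STATUS column, so it ignores taints and resource pressure, which can also block scheduling.

```python
from typing import List


def schedulable_nodes(get_node_output: str) -> List[str]:
    """Return names of nodes whose STATUS is Ready and not cordoned."""
    ready = []
    for line in get_node_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 2 or fields[0] == "NAME":
            continue
        # STATUS can be compound, e.g. "Ready,SchedulingDisabled".
        statuses = fields[1].split(",")
        if "Ready" in statuses and "SchedulingDisabled" not in statuses:
            ready.append(fields[0])
    return ready


if __name__ == "__main__":
    sample = (
        "NAME STATUS ROLES AGE VERSION\n"
        "ip-172-16-116-51.us-east-2.compute.internal Ready <none> 5h47m v1.14.8-eks-b8860f\n"
    )
    print(schedulable_nodes(sample))
```

An empty list would be consistent with the `no nodes available to schedule pods` event.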
@charlesa101 Did you get a chance to review it again?
NAME STATUS ROLES AGE VERSION
ip-172-16-116-51.us-east-2.compute.internal Ready <none> 5h47m v1.14.8-eks-b8860f
ip-172-16-121-255.us-east-2.compute.internal Ready <none> 5h47m v1.14.8-eks-b8860f
ip-172-16-137-197.us-east-2.compute.internal Ready <none> 5h47m v1.14.8-eks-b8860f
Yeah, I did. I recreated the cluster, but it's still the same issue.
@charlesa101 In the previous describe output of the pod, it appears that the cluster did not have any worker nodes available (no nodes available to schedule pods).
But based on the recent output, it appears that you have three worker nodes available.
NAME STATUS ROLES AGE VERSION
ip-172-16-116-51.us-east-2.compute.internal Ready <none> 5h47m v1.14.8-eks-b8860f
ip-172-16-121-255.us-east-2.compute.internal Ready <none> 5h47m v1.14.8-eks-b8860f
ip-172-16-137-197.us-east-2.compute.internal Ready <none> 5h47m v1.14.8-eks-b8860f
Could you please describe each of these nodes and the operator pod?
# Describe the nodes, assuming the node names are the same as in your previous comment.
kubectl describe node ip-172-16-116-51.us-east-2.compute.internal
kubectl describe node ip-172-16-121-255.us-east-2.compute.internal
kubectl describe node ip-172-16-137-197.us-east-2.compute.internal
# Get the operator pod name
kubectl get pods -A | grep -i sagemaker
kubectl describe pod <put the pod name here> -n sagemaker-k8s-operator-system
If the operator has been deployed successfully and the training job is still not running, please attach the output of describe trainingjob as well.
kubectl describe trainingjob xgboost-mnist
I checked the operator pod. Here is the log, @gautamkmr:
kubectl logs -f sagemaker-k8s-operator-controller-manager-5858fd7b8d-2dk5c -n sagemaker-k8s-operator-system manager
2020-03-15T18:09:13.864Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": "127.0.0.1:8080"}
2020-03-15T18:09:13.865Z INFO controller-runtime.controller Starting EventSource {"controller": "trainingjob", "source": "kind source: /, Kind="}
2020-03-15T18:09:13.865Z INFO controller-runtime.controller Starting EventSource {"controller": "hyperparametertuningjob", "source": "kind source: /, Kind="}
2020-03-15T18:09:13.865Z INFO controller-runtime.controller Starting EventSource {"controller": "hostingdeployment", "source": "kind source: /, Kind="}
2020-03-15T18:09:13.866Z INFO controller-runtime.controller Starting EventSource {"controller": "model", "source": "kind source: /, Kind="}
2020-03-15T18:09:13.866Z INFO controller-runtime.controller Starting EventSource {"controller": "endpointconfig", "source": "kind source: /, Kind="}
2020-03-15T18:09:13.866Z INFO controller-runtime.controller Starting EventSource {"controller": "batchtransformjob", "source": "kind source: /, Kind="}
2020-03-15T18:09:13.866Z INFO setup starting manager
2020-03-15T18:09:13.866Z INFO controller-runtime.manager starting metrics server {"path": "/metrics"}
2020-03-15T18:09:14.066Z INFO controller-runtime.controller Starting Controller {"controller": "trainingjob"}
2020-03-15T18:09:14.066Z INFO controller-runtime.controller Starting Controller {"controller": "model"}
2020-03-15T18:09:14.067Z INFO controller-runtime.controller Starting Controller {"controller": "batchtransformjob"}
2020-03-15T18:09:14.067Z INFO controller-runtime.controller Starting Controller {"controller": "hostingdeployment"}
2020-03-15T18:09:14.066Z INFO controller-runtime.controller Starting Controller {"controller": "endpointconfig"}
2020-03-15T18:09:14.067Z INFO controller-runtime.controller Starting Controller {"controller": "hyperparametertuningjob"}
2020-03-15T18:09:14.167Z INFO controller-runtime.controller Starting workers {"controller": "trainingjob", "worker count": 1}
2020-03-15T18:09:14.167Z INFO controller-runtime.controller Starting workers {"controller": "model", "worker count": 1}
2020-03-15T18:09:14.167Z INFO controller-runtime.controller Starting workers {"controller": "endpointconfig", "worker count": 1}
2020-03-15T18:09:14.167Z INFO controller-runtime.controller Starting workers {"controller": "batchtransformjob", "worker count": 1}
2020-03-15T18:09:14.167Z INFO controller-runtime.controller Starting workers {"controller": "hostingdeployment", "worker count": 1}
2020-03-15T18:09:14.167Z INFO controller-runtime.controller Starting workers {"controller": "hyperparametertuningjob", "worker count": 1}
2020-03-15T19:09:19.962Z INFO controllers.TrainingJob Getting resource {"trainingjob": "default/xgboost-mnist"}
2020-03-15T19:09:19.962Z INFO controllers.TrainingJob Job status is empty, setting to intermediate status {"trainingjob": "default/xgboost-mnist", "status": "SynchronizingK8sJobWithSageMaker"}
2020-03-15T19:09:19.963Z INFO controllers.TrainingJob Updating job status {"trainingjob": "default/xgboost-mnist", "new-status": {"trainingJobStatus":"SynchronizingK8sJobWithSageMaker","lastCheckTime":"2020-03-15T19:09:19Z"}}
2020-03-15T19:09:19.976Z INFO controllers.TrainingJob Getting resource {"trainingjob": "default/xgboost-mnist"}
2020-03-15T19:09:19.976Z INFO controllers.TrainingJob Adding generated name to spec {"trainingjob": "default/xgboost-mnist", "new-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444"}
2020-03-15T19:09:19.982Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "trainingjob", "request": "default/xgboost-mnist"}
2020-03-15T19:09:19.983Z INFO controllers.TrainingJob Getting resource {"trainingjob": "default/xgboost-mnist"}
2020-03-15T19:09:19.983Z INFO controllers.TrainingJob Loaded AWS config {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2"}
2020-03-15T19:09:19.983Z INFO controllers.TrainingJob Calling SM API DescribeTrainingJob {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2"}
2020-03-15T19:09:20.916Z ERROR controllers.TrainingJob.handleSageMakerApiError Handling unrecoverable sagemaker API error {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2", "error": "UnrecognizedClientException: The security token included in the request is invalid.\n\tstatus code: 400, request id: 01ea5be5-6bd5-4bae-b79e-2bc8d86338ee"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
go.amzn.com/sagemaker/sagemaker-k8s-operator/controllers/trainingjob.(*TrainingJobReconciler).handleSageMakerApiError
/workspace/controllers/trainingjob/trainingjob_controller.go:396
go.amzn.com/sagemaker/sagemaker-k8s-operator/controllers/trainingjob.(*TrainingJobReconciler).Reconcile
/workspace/controllers/trainingjob/trainingjob_controller.go:172
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:216
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:88
2020-03-15T19:09:20.916Z INFO controllers.TrainingJob.handleSageMakerApiError Updating job status {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2", "new-status": {"trainingJobStatus":"Failed","additional":"UnrecognizedClientException: The security token included in the request is invalid.\n\tstatus code: 400, request id: 01ea5be5-6bd5-4bae-b79e-2bc8d86338ee","lastCheckTime":"2020-03-15T19:09:20Z","cloudWatchLogUrl":"https://us-east-2.console.aws.amazon.com/cloudwatch/home?region=us-east-2#logStream:group=/aws/sagemaker/TrainingJobs;prefix=xgboost-mnist-792eb47166f011ea88d202c3652bf444;streamFilter=typeLogStreamPrefix","sageMakerTrainingJobName":"xgboost-mnist-792eb47166f011ea88d202c3652bf444"}}
2020-03-15T19:09:20.924Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "trainingjob", "request": "default/xgboost-mnist"}
2020-03-15T19:11:41.623Z INFO controllers.TrainingJob Getting resource {"trainingjob": "default/xgboost-mnist"}
2020-03-15T19:11:41.623Z INFO controllers.TrainingJob Loaded AWS config {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2"}
2020-03-15T19:11:41.623Z INFO controllers.TrainingJob Calling SM API DescribeTrainingJob {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2"}
2020-03-15T19:11:42.150Z ERROR controllers.TrainingJob.handleSageMakerApiError Handling unrecoverable sagemaker API error {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2", "error": "UnrecognizedClientException: The security token included in the request is invalid.\n\tstatus code: 400, request id: 7145c885-b685-4663-8dd3-6c212ce574b2"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
go.amzn.com/sagemaker/sagemaker-k8s-operator/controllers/trainingjob.(*TrainingJobReconciler).handleSageMakerApiError
/workspace/controllers/trainingjob/trainingjob_controller.go:396
go.amzn.com/sagemaker/sagemaker-k8s-operator/controllers/trainingjob.(*TrainingJobReconciler).Reconcile
/workspace/controllers/trainingjob/trainingjob_controller.go:172
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:216
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:88
2020-03-15T19:11:42.150Z INFO controllers.TrainingJob.handleSageMakerApiError Updating job status {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2", "new-status": {"trainingJobStatus":"Failed","additional":"UnrecognizedClientException: The security token included in the request is invalid.\n\tstatus code: 400, request id: 7145c885-b685-4663-8dd3-6c212ce574b2","lastCheckTime":"2020-03-15T19:11:42Z","cloudWatchLogUrl":"https://us-east-2.console.aws.amazon.com/cloudwatch/home?region=us-east-2#logStream:group=/aws/sagemaker/TrainingJobs;prefix=xgboost-mnist-792eb47166f011ea88d202c3652bf444;streamFilter=typeLogStreamPrefix","sageMakerTrainingJobName":"xgboost-mnist-792eb47166f011ea88d202c3652bf444"}}
2020-03-15T19:11:42.159Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "trainingjob", "request": "default/xgboost-mnist"}
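The log above is long, and the relevant part is the pair of ERROR entries. As a rough aid, here is a minimal Python sketch that pulls the job name and the first line of the error out of a saved copy of the manager log. It assumes each entry has the shape seen above: timestamp, level, logger, message, then a JSON object. The function name is hypothetical.

```python
import json
from typing import Dict, List


def sagemaker_api_errors(log_text: str) -> List[Dict[str, str]]:
    """Collect time, job name, and error summary from ERROR log entries."""
    errors = []
    for line in log_text.splitlines():
        parts = line.split(None, 2)  # timestamp, level, remainder
        # Stack-trace lines and non-ERROR entries are skipped here.
        if len(parts) < 3 or parts[1] != "ERROR":
            continue
        brace = parts[2].find("{")
        if brace == -1:
            continue
        try:
            payload = json.loads(parts[2][brace:])
        except json.JSONDecodeError:
            continue
        error_text = payload.get("error", "")
        errors.append({
            "time": parts[0],
            "job": payload.get("training-job-name", ""),
            # Keep only the first line; the rest is status code and request id.
            "error": error_text.splitlines()[0] if error_text else "",
        })
    return errors
```

Run against the log above, this should surface the two `UnrecognizedClientException` entries without the surrounding stack traces.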
@charlesa101 Thanks for sharing the log. You are on the right track. I think the issue now is that the operator pod is unable to retrieve credentials from the IAM service to talk to SageMaker.
"error": "UnrecognizedClientException: The security token included in the request is invalid.\n
Could you please check your trust.json? The trust policy has three places where you need to update the cluster region and OIDC ID, and one place where you need to add your AWS account number.
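To make the "three places and one place" concrete, here is a minimal Python sketch that assembles a trust policy of the usual IAM-roles-for-service-accounts shape. It is an illustration under assumptions, not the exact file from the install guide. The default namespace is taken from the pod output earlier in this thread, and the service account name is inferred from the token secret name `sagemaker-k8s-operator-default-token-rwdkn`. The function name and the sample account and OIDC values are hypothetical.

```python
import json


def build_trust_policy(account_id, region, oidc_id,
                       namespace="sagemaker-k8s-operator-system",
                       service_account="sagemaker-k8s-operator-default"):
    """Build a trust policy dict for an EKS OIDC provider (sketch)."""
    # Region and OIDC ID are combined once here and reused three times below.
    provider = f"oidc.eks.{region}.amazonaws.com/id/{oidc_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                # The only place the AWS account number appears.
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                f"{provider}:aud": "sts.amazonaws.com",
                f"{provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
            }},
        }],
    }


if __name__ == "__main__":
    # Hypothetical values; substitute your own account, region, and OIDC ID.
    policy = build_trust_policy("111122223333", "us-east-2", "EXAMPLEOIDCID")
    print(json.dumps(policy, indent=2))
```

Generating the file from three variables makes it harder to update the region or OIDC ID in one spot and miss the other two.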
Hi @charlesa101
Closing this issue since there has been no activity in 90 days. Please re-open if you still need help
Thanks
Hi, I'm having the exact same issue, except that my pod is running fine. I set up my k8s cluster using Terraform with 1 master node and 1 worker node. When I submit the training job, there is no status, job name, or anything else. I tried all the commands above, and it looks like the scheduler was able to assign the pods to the worker node. Any help would be appreciated! Please see the outputs of the commands below:
ubuntu@ip-172-31-35-229:/imvaria/repos/model-training$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system aws-node-67tgx 1/1 Running 0 2d18h
kube-system aws-node-k2q7z 1/1 Running 0 2d18h
kube-system coredns-85d5b4454c-cwfvj 1/1 Running 0 2d18h
kube-system coredns-85d5b4454c-x5ld9 1/1 Running 0 2d18h
kube-system kube-proxy-54vm5 1/1 Running 0 2d18h
kube-system kube-proxy-r8j7j 1/1 Running 0 2d18h
kube-system metrics-server-64cf6869bd-6nppx 1/1 Running 0 2d18h
sagemaker-jobs sagemaker-k8s-operator-controller-manager-855f498957-fhkvv 2/2 Running 0 2d18h
ubuntu@ip-172-31-35-229:/imvaria/repos/model-training$ kubectl describe pod sagemaker-k8s-operator-controller-manager-855f498957-fhkvv -n sagemaker-jobs
Name: sagemaker-k8s-operator-controller-manager-855f498957-fhkvv
Namespace: sagemaker-jobs
Priority: 0
Node: ip-10-0-1-245.us-west-2.compute.internal/10.0.1.245
Start Time: Fri, 24 Jun 2022 22:26:03 +0000
Labels: control-plane=controller-manager
pod-template-hash=855f498957
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: 10.0.1.144
IPs:
IP: 10.0.1.144
Controlled By: ReplicaSet/sagemaker-k8s-operator-controller-manager-855f498957
Containers:
manager:
Container ID: docker://d8fc52b3e20a050999d3f24ab914f1d865a84a168a8b038f3fa81ce59cccbced
Image: 957583890962.dkr.ecr.us-east-1.amazonaws.com/amazon-sagemaker-operator-for-k8s:v1
Image ID: docker-pullable://957583890962.dkr.ecr.us-east-1.amazonaws.com/amazon-sagemaker-operator-for-k8s@sha256:94ffbba68954249b1724fdb43f1e8ab13547114555b4a217849687d566191e23
Port: <none>
Host Port: <none>
Command:
/manager
Args:
--metrics-addr=127.0.0.1:8080
--namespace=sagemaker-jobs
State: Running
Started: Fri, 24 Jun 2022 22:26:09 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 100m
memory: 30Mi
Requests:
cpu: 100m
memory: 20Mi
Environment:
AWS_DEFAULT_SAGEMAKER_ENDPOINT:
AWS_DEFAULT_REGION: us-west-2
AWS_REGION: us-west-2
AWS_ROLE_ARN: arn:aws:iam::438029713005:role/model-training-sagemaker-role20220624222338450100000009
AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
Mounts:
/var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6j8rt (ro)
kube-rbac-proxy:
Container ID: docker://4ecdaa395fdc70d5cead609465dbf21f6e11771a80ad5db0a6125053ab08b9d3
Image: gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0
Image ID: docker-pullable://gcr.io/kubebuilder/kube-rbac-proxy@sha256:297896d96b827bbcb1abd696da1b2d81cab88359ac34cce0e8281f266b4e08de
Port: 8443/TCP
Host Port: 0/TCP
Args:
--secure-listen-address=0.0.0.0:8443
--upstream=http://127.0.0.1:8080/
--logtostderr=true
--v=10
State: Running
Started: Fri, 24 Jun 2022 22:26:11 +0000
Ready: True
Restart Count: 0
Environment:
AWS_DEFAULT_REGION: us-west-2
AWS_REGION: us-west-2
AWS_ROLE_ARN: arn:aws:iam::438029713005:role/model-training-sagemaker-role20220624222338450100000009
AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
Mounts:
/var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6j8rt (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
aws-iam-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 86400
kube-api-access-6j8rt:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
ubuntu@ip-172-31-35-229:/imvaria/repos/model-training$ kubectl logs sagemaker-k8s-operator-controller-manager-855f498957-fhkvv manager -n sagemaker-jobs
I0624 22:26:11.339445 1 request.go:621] Throttling request took 1.046981399s, request: GET:https://172.20.0.1:443/apis/extensions/v1beta1?timeout=32s
2022-06-24T22:26:12.443Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": "127.0.0.1:8080"}
2022-06-24T22:26:12.443Z INFO Starting manager in the namespace: sagemaker-jobs
2022-06-24T22:26:12.443Z INFO setup starting manager
2022-06-24T22:26:12.444Z INFO controller-runtime.manager starting metrics server {"path": "/metrics"}
2022-06-24T22:26:12.444Z INFO controller Starting EventSource {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "EndpointConfig", "controller": "endpointconfig", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.444Z INFO controller Starting EventSource {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "BatchTransformJob", "controller": "batchtransformjob", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.445Z INFO controller Starting EventSource {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HostingAutoscalingPolicy", "controller": "hostingautoscalingpolicy", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.444Z INFO controller Starting EventSource {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "Model", "controller": "model", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.444Z INFO controller Starting EventSource {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "TrainingJob", "controller": "trainingjob", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.445Z INFO controller Starting EventSource {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "ProcessingJob", "controller": "processingjob", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.446Z INFO controller Starting EventSource {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HyperparameterTuningJob", "controller": "hyperparametertuningjob", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.446Z INFO controller Starting EventSource {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HostingDeployment", "controller": "hostingdeployment", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.665Z INFO controller Starting Controller {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "Model", "controller": "model"}
2022-06-24T22:26:12.666Z INFO controller Starting Controller {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HostingAutoscalingPolicy", "controller": "hostingautoscalingpolicy"}
2022-06-24T22:26:12.666Z INFO controller Starting Controller {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "EndpointConfig", "controller": "endpointconfig"}
2022-06-24T22:26:12.666Z INFO controller Starting Controller {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "BatchTransformJob", "controller": "batchtransformjob"}
2022-06-24T22:26:12.666Z INFO controller Starting Controller {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "ProcessingJob", "controller": "processingjob"}
2022-06-24T22:26:12.666Z INFO controller Starting Controller {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HyperparameterTuningJob", "controller": "hyperparametertuningjob"}
2022-06-24T22:26:12.666Z INFO controller Starting Controller {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "TrainingJob", "controller": "trainingjob"}
2022-06-24T22:26:12.746Z INFO controller Starting Controller {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HostingDeployment", "controller": "hostingdeployment"}
2022-06-24T22:26:12.747Z INFO controller Starting workers {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HostingDeployment", "controller": "hostingdeployment", "worker count": 1}
2022-06-24T22:26:12.766Z INFO controller Starting workers {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "Model", "controller": "model", "worker count": 1}
2022-06-24T22:26:12.766Z INFO controller Starting workers {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "EndpointConfig", "controller": "endpointconfig", "worker count": 1}
2022-06-24T22:26:12.766Z INFO controller Starting workers {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HostingAutoscalingPolicy", "controller": "hostingautoscalingpolicy", "worker count": 1}
2022-06-24T22:26:12.766Z INFO controller Starting workers {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "ProcessingJob", "controller": "processingjob", "worker count": 1}
2022-06-24T22:26:12.766Z INFO controller Starting workers {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "BatchTransformJob", "controller": "batchtransformjob", "worker count": 1}
2022-06-24T22:26:12.766Z INFO controller Starting workers {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "TrainingJob", "controller": "trainingjob", "worker count": 1}
2022-06-24T22:26:12.766Z INFO controller Starting workers {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HyperparameterTuningJob", "controller": "hyperparametertuningjob", "worker count": 1}
ubuntu@ip-172-31-35-229:/imvaria/repos/model-training$ kubectl get trainingjobs
NAME STATUS SECONDARY-STATUS CREATION-TIME SAGEMAKER-JOB-NAME
osic-test-run 2022-06-24T22:38:13Z
ubuntu@ip-172-31-35-229:/imvaria/repos/model-training$ kubectl describe trainingjob osic-test-run
Name: osic-test-run
Namespace: default
Labels: <none>
Annotations: <none>
API Version: sagemaker.aws.amazon.com/v1
Kind: TrainingJob
Metadata:
Creation Timestamp: 2022-06-24T22:38:13Z
Generation: 1
Managed Fields:
API Version: sagemaker.aws.amazon.com/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:kubectl.kubernetes.io/last-applied-configuration:
f:spec:
.:
f:algorithmSpecification:
.:
f:trainingImage:
f:trainingInputMode:
f:inputDataConfig:
f:outputDataConfig:
.:
f:s3OutputPath:
f:region:
f:resourceConfig:
.:
f:instanceCount:
f:instanceType:
f:volumeSizeInGB:
f:roleArn:
f:stoppingCondition:
.:
f:maxRuntimeInSeconds:
f:trainingJobName:
Manager: kubectl-client-side-apply
Operation: Update
Time: 2022-06-24T22:38:13Z
Resource Version: 3182
UID: 0a0880c0-baf9-4f1a-8aa3-37480520c3e2
Spec:
Algorithm Specification:
Training Image: 438029713005.dkr.ecr.us-west-2.amazonaws.com/model-training:latest
Training Input Mode: File
Input Data Config:
Channel Name: train
Compression Type: None
Data Source:
s3DataSource:
s3DataDistributionType: FullyReplicated
s3DataType: S3Prefix
s3Uri: s3://osic-full-including-override
Output Data Config:
s3OutputPath: s3://osic-full-including-override/experiments
Region: us-west-2
Resource Config:
Instance Count: 1
Instance Type: ml.p3.2xlarge
Volume Size In GB: 500
Role Arn: arn:aws:iam::438029713005:role/model-training-sagemaker-role20220624222338450100000009
Stopping Condition:
Max Runtime In Seconds: 900
Training Job Name: osic-test-run
Events: <none>
please let me know if you need to see anything else!
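One thing that stands out in the outputs above: the manager container is started with `--namespace=sagemaker-jobs` and logs `Starting manager in the namespace: sagemaker-jobs`, while `osic-test-run` shows `Namespace: default`. The manager log also contains no entries for that job. Whether this mismatch is the actual cause is an assumption, since the thread does not confirm it. A minimal Python sketch for comparing the two values from saved `kubectl describe` output follows. All function names are hypothetical.

```python
import re
from typing import Optional


def operator_watch_namespace(pod_describe: str) -> Optional[str]:
    """Return the value of the manager's --namespace argument, if present."""
    match = re.search(r"--namespace=(\S+)", pod_describe)
    return match.group(1) if match else None


def resource_namespace(resource_describe: str) -> Optional[str]:
    """Return the first 'Namespace:' value in describe output, if present."""
    match = re.search(r"^[ \t]*Namespace:[ \t]*(\S+)", resource_describe, re.MULTILINE)
    return match.group(1) if match else None


def namespace_mismatch(pod_describe: str, job_describe: str) -> bool:
    """True when the operator is scoped to a namespace other than the job's."""
    watched = operator_watch_namespace(pod_describe)
    job_ns = resource_namespace(job_describe)
    # Assumption: without a --namespace argument the operator is not
    # restricted to one namespace, so that case is not flagged.
    return watched is not None and job_ns is not None and watched != job_ns


if __name__ == "__main__":
    pod = "Args:\n  --metrics-addr=127.0.0.1:8080\n  --namespace=sagemaker-jobs\n"
    job = "Name: osic-test-run\nNamespace: default\n"
    print(namespace_mismatch(pod, job))
```

If the check flags a mismatch, one thing to try is creating the job in the namespace named by `--namespace`.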
Deployed the sample MNIST training job, but it does not seem to be getting invoked on SageMaker.