aws / amazon-sagemaker-operator-for-k8s

Amazon SageMaker operator for Kubernetes
https://docs.aws.amazon.com/sagemaker/latest/dg/kubernetes-sagemaker-operators.html
Apache License 2.0

unable to kick off the sagemaker job #99

Closed charlesa101 closed 4 years ago

charlesa101 commented 4 years ago

I deployed the sample MNIST training job, but it does not seem to be getting invoked on SageMaker.


```
kubectl describe TrainingJob
Name:         xgboost-mnist
Namespace:    default
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"sagemaker.aws.amazon.com/v1","kind":"TrainingJob","metadata":{"annotations":{},"name":"xgboost-mnist","namespace":"default"...
API Version:  sagemaker.aws.amazon.com/v1
Kind:         TrainingJob
Metadata:
  Creation Timestamp:  2020-03-09T06:58:17Z
  Generation:          1
  Resource Version:    117181
  Self Link:           /apis/sagemaker.aws.amazon.com/v1/namespaces/default/trainingjobs/xgboost-mnist
  UID:                 5a907178-61d3-11ea-b461-02efd6507006
Spec:
  Algorithm Specification:
    Training Image:       825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest
    Training Input Mode:  File
  Hyper Parameters:
    Name:   max_depth
    Value:  5
    Name:   eta
    Value:  0.2
    Name:   gamma
    Value:  4
    Name:   min_child_weight
    Value:  6
    Name:   silent
    Value:  0
    Name:   objective
    Value:  multi:softmax
    Name:   num_class
    Value:  10
    Name:   num_round
    Value:  10
  Input Data Config:
    Channel Name:      train
    Compression Type:  None
    Content Type:      text/csv
    Data Source:
      S 3 Data Source:
        S 3 Data Distribution Type:  FullyReplicated
        S 3 Data Type:               S3Prefix
        S 3 Uri:                     s3://<MY-BUCKET>/xgboost-mnist/train/
    Channel Name:                    validation
    Compression Type:                None
    Content Type:                    text/csv
    Data Source:
      S 3 Data Source:
        S 3 Data Distribution Type:  FullyReplicated
        S 3 Data Type:               S3Prefix
        S 3 Uri:                     s3://<MY-BUCKET>/xgboost-mnist/validation/
  Output Data Config:
    S 3 Output Path:  s3://<MY-BUCKET>/xgboost-mnist/models/
  Region:             us-east-2
  Resource Config:
    Instance Count:     1
    Instance Type:      ml.m4.xlarge
    Volume Size In GB:  5
  Role Arn:             arn:aws:iam::<ACCOUNT>:role/sagemaker_execution_role
  Stopping Condition:
    Max Runtime In Seconds:  86400
```

goswamig commented 4 years ago

@charlesa101 Thanks for trying it out. I am assuming you have replaced the input and output buckets and the role ARN.

Would you please run the following commands and provide the output?

kubectl get trainingjobs xgboost-mnist
kubectl describe trainingjob xgboost-mnist
charlesa101 commented 4 years ago

@gautamkmr, here you go, thank you! Yeah, I have my own bucket and SageMaker execution role.


```
NAME            STATUS   SECONDARY-STATUS   CREATION-TIME          SAGEMAKER-JOB-NAME
xgboost-mnist                               2020-03-09T16:51:08Z
```

```
kubectl describe TrainingJob
Name:         xgboost-mnist
Namespace:    default
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"sagemaker.aws.amazon.com/v1","kind":"TrainingJob","metadata":{"annotations":{},"name":"xgboost-mnist","namespace":"default"...
API Version:  sagemaker.aws.amazon.com/v1
Kind:         TrainingJob
Metadata:
  Creation Timestamp:  2020-03-09T06:58:17Z
  Generation:          1
  Resource Version:    117181
  Self Link:           /apis/sagemaker.aws.amazon.com/v1/namespaces/default/trainingjobs/xgboost-mnist
  UID:                 5a907178-61d3-11ea-b461-02efd6507006
Spec:
  Algorithm Specification:
    Training Image:       825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest
    Training Input Mode:  File
  Hyper Parameters:
    Name:   max_depth
    Value:  5
    Name:   eta
    Value:  0.2
    Name:   gamma
    Value:  4
    Name:   min_child_weight
    Value:  6
    Name:   silent
    Value:  0
    Name:   objective
    Value:  multi:softmax
    Name:   num_class
    Value:  10
    Name:   num_round
    Value:  10
  Input Data Config:
    Channel Name:      train
    Compression Type:  None
    Content Type:      text/csv
    Data Source:
      S 3 Data Source:
        S 3 Data Distribution Type:  FullyReplicated
        S 3 Data Type:               S3Prefix
        S 3 Uri:                     s3://<MY-BUCKET>/xgboost-mnist/train/
    Channel Name:                    validation
    Compression Type:                None
    Content Type:                    text/csv
    Data Source:
      S 3 Data Source:
        S 3 Data Distribution Type:  FullyReplicated
        S 3 Data Type:               S3Prefix
        S 3 Uri:                     s3://<MY-BUCKET>/xgboost-mnist/validation/
  Output Data Config:
    S 3 Output Path:  s3://<MY-BUCKET>/xgboost-mnist/models/
  Region:             us-east-2
  Resource Config:
    Instance Count:     1
    Instance Type:      ml.m4.xlarge
    Volume Size In GB:  5
  Role Arn:             arn:aws:iam::<ACCOUNT>:role/sagemaker_execution_role
  Stopping Condition:
    Max Runtime In Seconds:  86400
```

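
The kubectl get trainingjobs output in the comment above has empty STATUS and SAGEMAKER-JOB-NAME cells, which a plain whitespace split would silently drop. A minimal sketch of reading such a table by header offsets follows. It is not part of the operator; the function names are hypothetical, and it assumes the column alignment printed by kubectl is preserved in the pasted text.

```python
import re


def parse_fixed_width_table(text):
    """Parse a kubectl-style table into dicts, keeping empty cells as ''."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0]
    # Each column starts where its header word starts.
    starts = [m.start() for m in re.finditer(r"\S+", header)]
    names = header.split()
    rows = []
    for ln in lines[1:]:
        row = {}
        for i, name in enumerate(names):
            if i + 1 < len(starts):
                cell = ln[starts[i]:starts[i + 1]]
            else:
                cell = ln[starts[i]:]  # last column runs to end of line
            row[name] = cell.strip()
        rows.append(row)
    return rows


def jobs_without_status(rows):
    """Names of rows whose STATUS cell is empty (hypothetical helper)."""
    return [r["NAME"] for r in rows if not r.get("STATUS")]


if __name__ == "__main__":
    # Rebuild the aligned sample from the thread; ljust keeps the offsets exact.
    header = ("NAME".ljust(16) + "STATUS".ljust(9) + "SECONDARY-STATUS".ljust(19)
              + "CREATION-TIME".ljust(23) + "SAGEMAKER-JOB-NAME")
    row = "xgboost-mnist".ljust(44) + "2020-03-09T16:51:08Z"
    print(jobs_without_status(parse_fixed_width_table(header + "\n" + row)))
```

An empty STATUS only says that nothing has written a status to the resource yet; it does not by itself say why.
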
goswamig commented 4 years ago

@charlesa101 Thanks for providing the output. It appears that the operator is not running successfully on your k8s cluster. You can verify that with:

 kubectl get pods -A | grep -i sagemaker

You can follow the steps from here to install the operator. Let us know if you face any issues.

charlesa101 commented 4 years ago

yeah that's what i noticed as well now


```
NAME                                                         READY   STATUS    RESTARTS   AGE
sagemaker-k8s-operator-controller-manager-5858fd7b8d-h89s8   0/2     Pending   0          24h
```

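
The READY and STATUS columns above are what show the operator pod is not running. As a rough sketch, the same check can be scripted over pasted kubectl get pods output. The function name is hypothetical, and it assumes the NAME, READY and STATUS columns contain no embedded spaces. It also flags pods in a Completed state, which may or may not be what you want.

```python
def find_unready_pods(table_text):
    """Return (name, ready, status) for pods that are not fully ready and Running."""
    lines = [ln for ln in table_text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    # Locate columns by header name so an optional NAMESPACE column is tolerated.
    name_i = header.index("NAME")
    ready_i = header.index("READY")
    status_i = header.index("STATUS")
    unready = []
    for row in lines[1:]:
        cells = row.split()
        if len(cells) <= max(name_i, ready_i, status_i):
            continue  # skip malformed or truncated rows
        ready, _, total = cells[ready_i].partition("/")
        if cells[status_i] != "Running" or ready != total:
            unready.append((cells[name_i], cells[ready_i], cells[status_i]))
    return unready


if __name__ == "__main__":
    sample = (
        "NAME                                                         READY   STATUS    RESTARTS   AGE\n"
        "sagemaker-k8s-operator-controller-manager-5858fd7b8d-h89s8   0/2     Pending   0          24h\n"
    )
    for name, ready, status in find_unready_pods(sample):
        print(f"{name}: {ready} {status}")
```
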
charlesa101 commented 4 years ago

Name:               sagemaker-k8s-operator-controller-manager-5858fd7b8d-h89s8
Namespace:          sagemaker-k8s-operator-system
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             control-plane=controller-manager
                    pod-template-hash=5858fd7b8d
Annotations:        kubernetes.io/psp: eks.privileged
Status:             Pending
IP:                 
Controlled By:      ReplicaSet/sagemaker-k8s-operator-controller-manager-5858fd7b8d
Containers:
  kube-rbac-proxy:
    Image:      gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0
    Port:       8443/TCP
    Host Port:  0/TCP
    Args:
      --secure-listen-address=0.0.0.0:8443
      --upstream=http://127.0.0.1:8080/
      --logtostderr=true
      --v=10
    Environment:
      AWS_ROLE_ARN:                 arn:aws:iam::123456789012:role/DELETE_ME
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from sagemaker-k8s-operator-default-token-rwdkn (ro)
  manager:
    Image:      957583890962.dkr.ecr.us-east-1.amazonaws.com/amazon-sagemaker-operator-for-k8s:v1
    Port:       <none>
    Host Port:  <none>
    Command:
      /manager
    Args:
      --metrics-addr=127.0.0.1:8080
    Limits:
      cpu:     100m
      memory:  30Mi
    Requests:
      cpu:     100m
      memory:  20Mi
    Environment:
      AWS_DEFAULT_SAGEMAKER_ENDPOINT:  
      AWS_ROLE_ARN:                    arn:aws:iam::123456789012:role/DELETE_ME
      AWS_WEB_IDENTITY_TOKEN_FILE:     /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from sagemaker-k8s-operator-default-token-rwdkn (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  sagemaker-k8s-operator-default-token-rwdkn:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  sagemaker-k8s-operator-default-token-rwdkn
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  64s (x1378 over 34h)  default-scheduler  no nodes available to schedule pods
charlesa101 commented 4 years ago

My EKS/ECR is in us-east-2, but it seems all the CRD artifacts are coming from us-east-1. Could that be the issue?

goswamig commented 4 years ago

EKS can pull the image from another region too. In your case, it seems that you don't have any worker nodes associated with the cluster. At least that's what the message below says.

  Warning  FailedScheduling  64s (x1378 over 34h)  default-scheduler  no nodes available to schedule pods

Can you run the following?

kubectl get node
goswamig commented 4 years ago

@charlesa101 did you get a chance to review it again?

charlesa101 commented 4 years ago

NAME                                           STATUS   ROLES    AGE     VERSION
ip-172-16-116-51.us-east-2.compute.internal    Ready    <none>   5h47m   v1.14.8-eks-b8860f
ip-172-16-121-255.us-east-2.compute.internal   Ready    <none>   5h47m   v1.14.8-eks-b8860f
ip-172-16-137-197.us-east-2.compute.internal   Ready    <none>   5h47m   v1.14.8-eks-b8860f
charlesa101 commented 4 years ago

Yeah, I did. I recreated the cluster, but it is still the same issue.

goswamig commented 4 years ago

@charlesa101 In the previous describe output of the pod, it appears that the cluster did not have any worker nodes available (no nodes available to schedule pods).

But based on the recent output, it appears that you have three worker nodes available.

NAME                                           STATUS   ROLES    AGE     VERSION
ip-172-16-116-51.us-east-2.compute.internal    Ready    <none>   5h47m   v1.14.8-eks-b8860f
ip-172-16-121-255.us-east-2.compute.internal   Ready    <none>   5h47m   v1.14.8-eks-b8860f
ip-172-16-137-197.us-east-2.compute.internal   Ready    <none>   5h47m   v1.14.8-eks-b8860f

Could you please describe each of these nodes and the operator pod?

# Describe the nodes, assuming the node names are the same as in your previous comment.
kubectl describe node ip-172-16-116-51.us-east-2.compute.internal
kubectl describe node ip-172-16-121-255.us-east-2.compute.internal
kubectl describe node ip-172-16-137-197.us-east-2.compute.internal
# Get the operator pod name
kubectl get pods -A | grep -i sagemaker
kubectl describe pod <put the pod name here> -n sagemaker-k8s-operator-system

If the operator has been deployed successfully and the trainingjob is still not running, please attach the output of describe trainingjob as well.

kubectl describe trainingjob xgboost-mnist
charlesa101 commented 4 years ago

I checked the operator pod; here is the log, @gautamkmr.

kubectl logs -f sagemaker-k8s-operator-controller-manager-5858fd7b8d-2dk5c  -n sagemaker-k8s-operator-system manager
2020-03-15T18:09:13.864Z        INFO    controller-runtime.metrics      metrics server is starting to listen    {"addr": "127.0.0.1:8080"}
2020-03-15T18:09:13.865Z        INFO    controller-runtime.controller   Starting EventSource    {"controller": "trainingjob", "source": "kind source: /, Kind="}
2020-03-15T18:09:13.865Z        INFO    controller-runtime.controller   Starting EventSource    {"controller": "hyperparametertuningjob", "source": "kind source: /, Kind="}
2020-03-15T18:09:13.865Z        INFO    controller-runtime.controller   Starting EventSource    {"controller": "hostingdeployment", "source": "kind source: /, Kind="}
2020-03-15T18:09:13.866Z        INFO    controller-runtime.controller   Starting EventSource    {"controller": "model", "source": "kind source: /, Kind="}
2020-03-15T18:09:13.866Z        INFO    controller-runtime.controller   Starting EventSource    {"controller": "endpointconfig", "source": "kind source: /, Kind="}
2020-03-15T18:09:13.866Z        INFO    controller-runtime.controller   Starting EventSource    {"controller": "batchtransformjob", "source": "kind source: /, Kind="}
2020-03-15T18:09:13.866Z        INFO    setup   starting manager
2020-03-15T18:09:13.866Z        INFO    controller-runtime.manager      starting metrics server {"path": "/metrics"}
2020-03-15T18:09:14.066Z        INFO    controller-runtime.controller   Starting Controller     {"controller": "trainingjob"}
2020-03-15T18:09:14.066Z        INFO    controller-runtime.controller   Starting Controller     {"controller": "model"}
2020-03-15T18:09:14.067Z        INFO    controller-runtime.controller   Starting Controller     {"controller": "batchtransformjob"}
2020-03-15T18:09:14.067Z        INFO    controller-runtime.controller   Starting Controller     {"controller": "hostingdeployment"}
2020-03-15T18:09:14.066Z        INFO    controller-runtime.controller   Starting Controller     {"controller": "endpointconfig"}
2020-03-15T18:09:14.067Z        INFO    controller-runtime.controller   Starting Controller     {"controller": "hyperparametertuningjob"}
2020-03-15T18:09:14.167Z        INFO    controller-runtime.controller   Starting workers        {"controller": "trainingjob", "worker count": 1}
2020-03-15T18:09:14.167Z        INFO    controller-runtime.controller   Starting workers        {"controller": "model", "worker count": 1}
2020-03-15T18:09:14.167Z        INFO    controller-runtime.controller   Starting workers        {"controller": "endpointconfig", "worker count": 1}
2020-03-15T18:09:14.167Z        INFO    controller-runtime.controller   Starting workers        {"controller": "batchtransformjob", "worker count": 1}
2020-03-15T18:09:14.167Z        INFO    controller-runtime.controller   Starting workers        {"controller": "hostingdeployment", "worker count": 1}
2020-03-15T18:09:14.167Z        INFO    controller-runtime.controller   Starting workers        {"controller": "hyperparametertuningjob", "worker count": 1}
2020-03-15T19:09:19.962Z        INFO    controllers.TrainingJob Getting resource        {"trainingjob": "default/xgboost-mnist"}
2020-03-15T19:09:19.962Z        INFO    controllers.TrainingJob Job status is empty, setting to intermediate status     {"trainingjob": "default/xgboost-mnist", "status": "SynchronizingK8sJobWithSageMaker"}
2020-03-15T19:09:19.963Z        INFO    controllers.TrainingJob Updating job status     {"trainingjob": "default/xgboost-mnist", "new-status": {"trainingJobStatus":"SynchronizingK8sJobWithSageMaker","lastCheckTime":"2020-03-15T19:09:19Z"}}
2020-03-15T19:09:19.976Z        INFO    controllers.TrainingJob Getting resource        {"trainingjob": "default/xgboost-mnist"}
2020-03-15T19:09:19.976Z        INFO    controllers.TrainingJob Adding generated name to spec   {"trainingjob": "default/xgboost-mnist", "new-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444"}
2020-03-15T19:09:19.982Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "trainingjob", "request": "default/xgboost-mnist"}
2020-03-15T19:09:19.983Z        INFO    controllers.TrainingJob Getting resource        {"trainingjob": "default/xgboost-mnist"}
2020-03-15T19:09:19.983Z        INFO    controllers.TrainingJob Loaded AWS config       {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2"}
2020-03-15T19:09:19.983Z        INFO    controllers.TrainingJob Calling SM API DescribeTrainingJob      {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2"}
2020-03-15T19:09:20.916Z        ERROR   controllers.TrainingJob.handleSageMakerApiError Handling unrecoverable sagemaker API error      {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2", "error": "UnrecognizedClientException: The security token included in the request is invalid.\n\tstatus code: 400, request id: 01ea5be5-6bd5-4bae-b79e-2bc8d86338ee"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
go.amzn.com/sagemaker/sagemaker-k8s-operator/controllers/trainingjob.(*TrainingJobReconciler).handleSageMakerApiError
        /workspace/controllers/trainingjob/trainingjob_controller.go:396
go.amzn.com/sagemaker/sagemaker-k8s-operator/controllers/trainingjob.(*TrainingJobReconciler).Reconcile
        /workspace/controllers/trainingjob/trainingjob_controller.go:172
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:216
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
        /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
        /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:88
2020-03-15T19:09:20.916Z        INFO    controllers.TrainingJob.handleSageMakerApiError Updating job status     {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2", "new-status": {"trainingJobStatus":"Failed","additional":"UnrecognizedClientException: The security token included in the request is invalid.\n\tstatus code: 400, request id: 01ea5be5-6bd5-4bae-b79e-2bc8d86338ee","lastCheckTime":"2020-03-15T19:09:20Z","cloudWatchLogUrl":"https://us-east-2.console.aws.amazon.com/cloudwatch/home?region=us-east-2#logStream:group=/aws/sagemaker/TrainingJobs;prefix=xgboost-mnist-792eb47166f011ea88d202c3652bf444;streamFilter=typeLogStreamPrefix","sageMakerTrainingJobName":"xgboost-mnist-792eb47166f011ea88d202c3652bf444"}}
2020-03-15T19:09:20.924Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "trainingjob", "request": "default/xgboost-mnist"}
2020-03-15T19:11:41.623Z        INFO    controllers.TrainingJob Getting resource        {"trainingjob": "default/xgboost-mnist"}
2020-03-15T19:11:41.623Z        INFO    controllers.TrainingJob Loaded AWS config       {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2"}
2020-03-15T19:11:41.623Z        INFO    controllers.TrainingJob Calling SM API DescribeTrainingJob      {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2"}
2020-03-15T19:11:42.150Z        ERROR   controllers.TrainingJob.handleSageMakerApiError Handling unrecoverable sagemaker API error      {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2", "error": "UnrecognizedClientException: The security token included in the request is invalid.\n\tstatus code: 400, request id: 7145c885-b685-4663-8dd3-6c212ce574b2"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
go.amzn.com/sagemaker/sagemaker-k8s-operator/controllers/trainingjob.(*TrainingJobReconciler).handleSageMakerApiError
        /workspace/controllers/trainingjob/trainingjob_controller.go:396
go.amzn.com/sagemaker/sagemaker-k8s-operator/controllers/trainingjob.(*TrainingJobReconciler).Reconcile
        /workspace/controllers/trainingjob/trainingjob_controller.go:172
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:216
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
        /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
        /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:88
2020-03-15T19:11:42.150Z        INFO    controllers.TrainingJob.handleSageMakerApiError Updating job status     {"trainingjob": "default/xgboost-mnist", "training-job-name": "xgboost-mnist-792eb47166f011ea88d202c3652bf444", "aws-region": "us-east-2", "new-status": {"trainingJobStatus":"Failed","additional":"UnrecognizedClientException: The security token included in the request is invalid.\n\tstatus code: 400, request id: 7145c885-b685-4663-8dd3-6c212ce574b2","lastCheckTime":"2020-03-15T19:11:42Z","cloudWatchLogUrl":"https://us-east-2.console.aws.amazon.com/cloudwatch/home?region=us-east-2#logStream:group=/aws/sagemaker/TrainingJobs;prefix=xgboost-mnist-792eb47166f011ea88d202c3652bf444;streamFilter=typeLogStreamPrefix","sageMakerTrainingJobName":"xgboost-mnist-792eb47166f011ea88d202c3652bf444"}}
2020-03-15T19:11:42.159Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "trainingjob", "request": "default/xgboost-mnist"}
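
In a log like the one above, the useful part is the "error" field inside the JSON payload of each ERROR line; the Go stack-trace lines in between carry no payload. A minimal sketch for pulling those fields out follows. The function names are hypothetical, and it assumes each structured line has the shape "timestamp level logger message {json}" as seen in this thread.

```python
import json


def parse_operator_log_line(line):
    """Split one operator log line into (timestamp, level, fields), or None."""
    brace = line.find("{")
    if brace == -1:
        return None  # no JSON payload, e.g. a stack-trace line
    head = line[:brace].split()
    if len(head) < 2:
        return None
    try:
        fields = json.loads(line[brace:])
    except json.JSONDecodeError:
        return None
    return head[0], head[1], fields


def collect_errors(log_text):
    """Return (timestamp, error) pairs for ERROR lines that carry an 'error' field."""
    errors = []
    for line in log_text.splitlines():
        parsed = parse_operator_log_line(line)
        if parsed is None:
            continue
        timestamp, level, fields = parsed
        if level == "ERROR" and "error" in fields:
            errors.append((timestamp, fields["error"]))
    return errors


if __name__ == "__main__":
    # One ERROR line from the log above, shortened to a few fields.
    sample = (
        r'2020-03-15T19:09:20.916Z        ERROR   controllers.TrainingJob.handleSageMakerApiError '
        r'Handling unrecoverable sagemaker API error      '
        r'{"trainingjob": "default/xgboost-mnist", "aws-region": "us-east-2", '
        r'"error": "UnrecognizedClientException: The security token included in the request is invalid."}'
    )
    for timestamp, error in collect_errors(sample):
        print(timestamp, error)
```
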
goswamig commented 4 years ago

@charlesa101 Thanks for sharing the log. You are on the right track. I think the issue now is that the operator pod is unable to retrieve credentials from the IAM service to talk to SageMaker.

"error": "UnrecognizedClientException: The security token included in the request is invalid.\n

Could you please check your trust.json? The trust policy has three places where you need to update the cluster region and OIDC ID, and one place where you add your AWS account number.
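
To make the "three places and one place" concrete, here is a sketch that builds a trust policy of that shape. It assumes the usual layout for an EKS OIDC provider with IAM roles for service accounts; the function name is hypothetical, and the default namespace and service account name are inferred from the pod description earlier in this thread rather than confirmed. Compare the result against your own trust.json instead of treating it as authoritative.

```python
import json


def build_trust_policy(account_id, region, oidc_id,
                       namespace="sagemaker-k8s-operator-system",
                       service_account="sagemaker-k8s-operator-default"):
    """Build an IAM trust policy dict for an EKS OIDC provider (illustrative only)."""
    # The provider path carries both the cluster region and the OIDC ID.
    provider = f"oidc.eks.{region}.amazonaws.com/id/{oidc_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    # Region/OIDC ID place 1, and the only place for the account number.
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Region/OIDC ID places 2 and 3.
                        f"{provider}:aud": "sts.amazonaws.com",
                        f"{provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values; substitute your own account number, region and OIDC ID.
    policy = build_trust_policy("123456789012", "us-east-2", "EXAMPLEOIDCID")
    print(json.dumps(policy, indent=2))
```

If the region or OIDC ID differs between any of the three places, or the namespace does not match where the operator actually runs, the role cannot be assumed from the pod.
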

surajkota commented 4 years ago

Hi @charlesa101

Closing this issue since there has been no activity in 90 days. Please re-open if you still need help.

Thanks

angadkalra commented 2 years ago

Hi, I'm having the exact same issue, except that my pod is running fine. I set up my k8s cluster using Terraform with 1 master node and 1 worker node. When I submit the trainingjob, there is no status, job name, or anything else. I tried all the commands above, and it looks like the scheduler was able to assign the pods to the worker node. Any help would be appreciated! Please see the outputs of the commands below:

ubuntu@ip-172-31-35-229:/imvaria/repos/model-training$ kubectl get pods -A
NAMESPACE        NAME                                                         READY   STATUS    RESTARTS   AGE
kube-system      aws-node-67tgx                                               1/1     Running   0          2d18h
kube-system      aws-node-k2q7z                                               1/1     Running   0          2d18h
kube-system      coredns-85d5b4454c-cwfvj                                     1/1     Running   0          2d18h
kube-system      coredns-85d5b4454c-x5ld9                                     1/1     Running   0          2d18h
kube-system      kube-proxy-54vm5                                             1/1     Running   0          2d18h
kube-system      kube-proxy-r8j7j                                             1/1     Running   0          2d18h
kube-system      metrics-server-64cf6869bd-6nppx                              1/1     Running   0          2d18h
sagemaker-jobs   sagemaker-k8s-operator-controller-manager-855f498957-fhkvv   2/2     Running   0          2d18h
ubuntu@ip-172-31-35-229:/imvaria/repos/model-training$ kubectl describe pod sagemaker-k8s-operator-controller-manager-855f498957-fhkvv -n sagemaker-jobs
Name:         sagemaker-k8s-operator-controller-manager-855f498957-fhkvv
Namespace:    sagemaker-jobs
Priority:     0
Node:         ip-10-0-1-245.us-west-2.compute.internal/10.0.1.245
Start Time:   Fri, 24 Jun 2022 22:26:03 +0000
Labels:       control-plane=controller-manager
              pod-template-hash=855f498957
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Running
IP:           10.0.1.144
IPs:
  IP:           10.0.1.144
Controlled By:  ReplicaSet/sagemaker-k8s-operator-controller-manager-855f498957
Containers:
  manager:
    Container ID:  docker://d8fc52b3e20a050999d3f24ab914f1d865a84a168a8b038f3fa81ce59cccbced
    Image:         957583890962.dkr.ecr.us-east-1.amazonaws.com/amazon-sagemaker-operator-for-k8s:v1
    Image ID:      docker-pullable://957583890962.dkr.ecr.us-east-1.amazonaws.com/amazon-sagemaker-operator-for-k8s@sha256:94ffbba68954249b1724fdb43f1e8ab13547114555b4a217849687d566191e23
    Port:          <none>
    Host Port:     <none>
    Command:
      /manager
    Args:
      --metrics-addr=127.0.0.1:8080
      --namespace=sagemaker-jobs
    State:          Running
      Started:      Fri, 24 Jun 2022 22:26:09 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  30Mi
    Requests:
      cpu:     100m
      memory:  20Mi
    Environment:
      AWS_DEFAULT_SAGEMAKER_ENDPOINT:
      AWS_DEFAULT_REGION:              us-west-2
      AWS_REGION:                      us-west-2
      AWS_ROLE_ARN:                    arn:aws:iam::438029713005:role/model-training-sagemaker-role20220624222338450100000009
      AWS_WEB_IDENTITY_TOKEN_FILE:     /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6j8rt (ro)
  kube-rbac-proxy:
    Container ID:  docker://4ecdaa395fdc70d5cead609465dbf21f6e11771a80ad5db0a6125053ab08b9d3
    Image:         gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0
    Image ID:      docker-pullable://gcr.io/kubebuilder/kube-rbac-proxy@sha256:297896d96b827bbcb1abd696da1b2d81cab88359ac34cce0e8281f266b4e08de
    Port:          8443/TCP
    Host Port:     0/TCP
    Args:
      --secure-listen-address=0.0.0.0:8443
      --upstream=http://127.0.0.1:8080/
      --logtostderr=true
      --v=10
    State:          Running
      Started:      Fri, 24 Jun 2022 22:26:11 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      AWS_DEFAULT_REGION:           us-west-2
      AWS_REGION:                   us-west-2
      AWS_ROLE_ARN:                 arn:aws:iam::438029713005:role/model-training-sagemaker-role20220624222338450100000009
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6j8rt (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  kube-api-access-6j8rt:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>
ubuntu@ip-172-31-35-229:/imvaria/repos/model-training$ kubectl logs sagemaker-k8s-operator-controller-manager-855f498957-fhkvv manager -n sagemaker-jobs
I0624 22:26:11.339445       1 request.go:621] Throttling request took 1.046981399s, request: GET:https://172.20.0.1:443/apis/extensions/v1beta1?timeout=32s
2022-06-24T22:26:12.443Z        INFO    controller-runtime.metrics      metrics server is starting to listen    {"addr": "127.0.0.1:8080"}
2022-06-24T22:26:12.443Z        INFO    Starting manager in the namespace:      sagemaker-jobs
2022-06-24T22:26:12.443Z        INFO    setup   starting manager
2022-06-24T22:26:12.444Z        INFO    controller-runtime.manager      starting metrics server {"path": "/metrics"}
2022-06-24T22:26:12.444Z        INFO    controller      Starting EventSource    {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "EndpointConfig", "controller": "endpointconfig", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.444Z        INFO    controller      Starting EventSource    {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "BatchTransformJob", "controller": "batchtransformjob", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.445Z        INFO    controller      Starting EventSource    {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HostingAutoscalingPolicy", "controller": "hostingautoscalingpolicy", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.444Z        INFO    controller      Starting EventSource    {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "Model", "controller": "model", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.444Z        INFO    controller      Starting EventSource    {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "TrainingJob", "controller": "trainingjob", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.445Z        INFO    controller      Starting EventSource    {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "ProcessingJob", "controller": "processingjob", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.446Z        INFO    controller      Starting EventSource    {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HyperparameterTuningJob", "controller": "hyperparametertuningjob", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.446Z        INFO    controller      Starting EventSource    {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HostingDeployment", "controller": "hostingdeployment", "source": "kind source: /, Kind="}
2022-06-24T22:26:12.665Z        INFO    controller      Starting Controller     {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "Model", "controller": "model"}
2022-06-24T22:26:12.666Z        INFO    controller      Starting Controller     {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HostingAutoscalingPolicy", "controller": "hostingautoscalingpolicy"}
2022-06-24T22:26:12.666Z        INFO    controller      Starting Controller     {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "EndpointConfig", "controller": "endpointconfig"}
2022-06-24T22:26:12.666Z        INFO    controller      Starting Controller     {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "BatchTransformJob", "controller": "batchtransformjob"}
2022-06-24T22:26:12.666Z        INFO    controller      Starting Controller     {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "ProcessingJob", "controller": "processingjob"}
2022-06-24T22:26:12.666Z        INFO    controller      Starting Controller     {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HyperparameterTuningJob", "controller": "hyperparametertuningjob"}
2022-06-24T22:26:12.666Z        INFO    controller      Starting Controller     {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "TrainingJob", "controller": "trainingjob"}
2022-06-24T22:26:12.746Z        INFO    controller      Starting Controller     {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HostingDeployment", "controller": "hostingdeployment"}
2022-06-24T22:26:12.747Z        INFO    controller      Starting workers        {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HostingDeployment", "controller": "hostingdeployment", "worker count": 1}
2022-06-24T22:26:12.766Z        INFO    controller      Starting workers        {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "Model", "controller": "model", "worker count": 1}
2022-06-24T22:26:12.766Z        INFO    controller      Starting workers        {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "EndpointConfig", "controller": "endpointconfig", "worker count": 1}
2022-06-24T22:26:12.766Z        INFO    controller      Starting workers        {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HostingAutoscalingPolicy", "controller": "hostingautoscalingpolicy", "worker count": 1}
2022-06-24T22:26:12.766Z        INFO    controller      Starting workers        {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "ProcessingJob", "controller": "processingjob", "worker count": 1}
2022-06-24T22:26:12.766Z        INFO    controller      Starting workers        {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "BatchTransformJob", "controller": "batchtransformjob", "worker count": 1}
2022-06-24T22:26:12.766Z        INFO    controller      Starting workers        {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "TrainingJob", "controller": "trainingjob", "worker count": 1}
2022-06-24T22:26:12.766Z        INFO    controller      Starting workers        {"reconcilerGroup": "sagemaker.aws.amazon.com", "reconcilerKind": "HyperparameterTuningJob", "controller": "hyperparametertuningjob", "worker count": 1}
ubuntu@ip-172-31-35-229:/imvaria/repos/model-training$ kubectl get trainingjobs
NAME            STATUS   SECONDARY-STATUS   CREATION-TIME          SAGEMAKER-JOB-NAME
osic-test-run                               2022-06-24T22:38:13Z  
ubuntu@ip-172-31-35-229:/imvaria/repos/model-training$ kubectl describe trainingjob osic-test-run
Name:         osic-test-run
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  sagemaker.aws.amazon.com/v1
Kind:         TrainingJob
Metadata:
  Creation Timestamp:  2022-06-24T22:38:13Z
  Generation:          1
  Managed Fields:
    API Version:  sagemaker.aws.amazon.com/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:algorithmSpecification:
          .:
          f:trainingImage:
          f:trainingInputMode:
        f:inputDataConfig:
        f:outputDataConfig:
          .:
          f:s3OutputPath:
        f:region:
        f:resourceConfig:
          .:
          f:instanceCount:
          f:instanceType:
          f:volumeSizeInGB:
        f:roleArn:
        f:stoppingCondition:
          .:
          f:maxRuntimeInSeconds:
        f:trainingJobName:
    Manager:         kubectl-client-side-apply
    Operation:       Update
    Time:            2022-06-24T22:38:13Z
  Resource Version:  3182
  UID:               0a0880c0-baf9-4f1a-8aa3-37480520c3e2
Spec:
  Algorithm Specification:
    Training Image:       438029713005.dkr.ecr.us-west-2.amazonaws.com/model-training:latest
    Training Input Mode:  File
  Input Data Config:
    Channel Name:      train
    Compression Type:  None
    Data Source:
      s3DataSource:
        s3DataDistributionType:  FullyReplicated
        s3DataType:              S3Prefix
        s3Uri:                   s3://osic-full-including-override
  Output Data Config:
    s3OutputPath:  s3://osic-full-including-override/experiments
  Region:          us-west-2
  Resource Config:
    Instance Count:     1
    Instance Type:      ml.p3.2xlarge
    Volume Size In GB:  500
  Role Arn:             arn:aws:iam::438029713005:role/model-training-sagemaker-role20220624222338450100000009
  Stopping Condition:
    Max Runtime In Seconds:  900
  Training Job Name:         osic-test-run
Events:                      <none>

Please let me know if you need to see anything else!