Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/

Kubernetes deployment passes its own resource requests and ignores InstanceType configuration #1900

Open alipek opened 1 year ago

alipek commented 1 year ago

I am creating an InstanceType configuration in an AKS cluster, with autoscaling enabled for the node pools gpuproc2 and cpuproc2. When a deployment is created, spec.resources.requests from the defined InstanceType is ignored and the pods are created with other values. The values look hardcoded: cpu 100m and memory 512Mi. Until this works properly, I can't get pods allocated on different nodes of the same node pool.

---
apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceType
metadata:
  name: standard-nc4
spec:
  nodeSelector:
    agentpool: gpuproc2
  resources:
    limits:
      cpu: "1"
      memory: "20Gi"
    requests:
      cpu: "700m"
      memory: "1900Mi"
---
apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceType
metadata:
  name: standard-cpuproc
spec:
  nodeSelector:
    agentpool: cpuproc2
  resources:
    limits:
      cpu: "3"
      memory: "15Gi"
    requests:
      cpu: "1100m"
      memory: "1500Mi"
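
To illustrate why the ignored requests matter: the Kubernetes scheduler places pods by their requests, not their limits, so the request values decide how many replicas are packed onto one node. Below is a rough sketch of that arithmetic. The node allocatable figures in `NODE_ALLOCATABLE` are hypothetical (not measured on this cluster), the helper names are made up for illustration, and only the main container is counted.

```python
# Sketch: estimate how many pods fit on one node when packing by requests.
# NODE_ALLOCATABLE is a hypothetical value, not taken from the cluster in this issue.

def parse_cpu(quantity: str) -> int:
    """Return a CPU quantity in millicores ("700m" -> 700, "1" -> 1000)."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def parse_mem(quantity: str) -> int:
    """Return a memory quantity in MiB; only the Mi and Gi suffixes are handled."""
    if quantity.endswith("Gi"):
        return int(quantity[:-2]) * 1024
    if quantity.endswith("Mi"):
        return int(quantity[:-2])
    raise ValueError(f"unsupported memory quantity: {quantity}")


def pods_per_node(allocatable: dict, requests: dict) -> int:
    """Pods that fit on one node, limited by whichever resource runs out first."""
    by_cpu = parse_cpu(allocatable["cpu"]) // parse_cpu(requests["cpu"])
    by_mem = parse_mem(allocatable["memory"]) // parse_mem(requests["memory"])
    return min(by_cpu, by_mem)


NODE_ALLOCATABLE = {"cpu": "4", "memory": "24Gi"}            # assumption
INSTANCE_TYPE_REQUESTS = {"cpu": "700m", "memory": "1900Mi"}  # from standard-nc4
OBSERVED_REQUESTS = {"cpu": "100m", "memory": "512Mi"}        # values reported above

print(pods_per_node(NODE_ALLOCATABLE, INSTANCE_TYPE_REQUESTS))
print(pods_per_node(NODE_ALLOCATABLE, OBSERVED_REQUESTS))
```

Under these assumed node sizes, the InstanceType requests would allow 5 replicas per node while the 100m/512Mi requests allow 40. The cluster autoscaler adds nodes only when pods cannot be scheduled, so with the smaller requests the replicas keep landing on the same node.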

Pod created by the Azure ML workspace deployment:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2023-04-19T04:06:00Z"
  generateName: deployname-134205-model-name-79c786f5c-
  labels:
    azuremlappname: deployname-134205-model-name
    isazuremlapp: "true"
    ml.azure.com/compute: aksnemlflowprod
    ml.azure.com/deployment-name: deployname-134205
    ml.azure.com/endpoint-name: model-name
    ml.azure.com/identity: deployname-134205-model-name
    ml.azure.com/resource-group: smarteye-ml-ne-prod
    ml.azure.com/scrape-metrics: "true"
    ml.azure.com/subscription-id: 89f996fd-5276-46d1-87c6-64544e672483
    ml.azure.com/workspace: ne-mlflow-prod
    pod-template-hash: 79c786f5c
  name: deployname-134205-model-name-79c786f5c-sd2ch
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: deployname-134205-model-name-79c786f5c
    uid: b72dae92-2086-44fa-899b-9aadedc0a1ec
  resourceVersion: "29057046"
  uid: c6ff4522-57a9-4cb9-963e-f8064a9daac8
spec:
  automountServiceAccountToken: true
  containers:
  - command:
    - runsvdir
    - /var/runit
    env:
    - name: AML_APP_ROOT
      value: /var/azureml-app/product-detection
    - name: AZUREML_ENTRY_SCRIPT
      value: azureml_score.py
    - name: AZUREML_MODEL_DIR
      value: /var/azureml-app/azureml-models/model_name/5
    - name: AZURE_STORAGE_CONNECTION_STRING
      value: SharedAccessSignature=;BlobEndpoint=https://account.blob.core.windows.net/;
    - name: CELERY_WORKER_QUEUE
      value: queue_name
    - name: CLASSIFICATION_MODEL_NAME
      value: model_name
    - name: CLASSIFICATION_MODEL_ROOT
      value: /opt/project/models
    - name: DEFAULT_CONSUMER
      value: InferenceDetector
    - name: DEFAULT_CONSUMER_MODULE
      value: model_extractor.queue_consumer
    - name: DETECTOR_MODEL_NAME
      value: name_of_package
    - name: DETECTOR_MODEL_PACKAGE
      value: https://url.tld/file
    - name: DETECTOR_MODEL_ROOT
      value: /opt/project/models
    - name: GPU_MEMORY_LIMIT
      value: "4096"
    - name: MLFLOW_MODEL_FOLDER
      value: /var/azureml-app/azureml-models
    - name: PREDICT_BATCH_SIZE
      value: "4"
    - name: SERVICE_NAME
      value: model-name
    - name: SERVICE_PATH_PREFIX
      value: api/v1/endpoint/model-name
    - name: TF_GPU_ALLOCATOR
      value: cuda_malloc_async
    - name: MSI_SECRET
      valueFrom:
        secretKeyRef:
          key: secret
          name: deployname-134205-model-name-sidecar
    - name: IDENTITY_HEADER
      valueFrom:
        secretKeyRef:
          key: secret
          name: deployname-134205-model-name-sidecar
    - name: MSI_ENDPOINT
      value: http://localhost:9999/token
    - name: IDENTITY_ENDPOINT
      value: http://localhost:9999/token
    image: registry.azurecr.io/image/path:v3.6.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 30
      httpGet:
        path: /
        port: 5001
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 180
    name: inference-server
    ports:
    - containerPort: 5001
      protocol: TCP
    readinessProbe:
      failureThreshold: 30
      httpGet:
        path: /
        port: 5001
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 2
    resources:
      limits:
        cpu: "1"
        memory: 20Gi
      requests:
        cpu: 100m
        memory: 512Mi
    securityContext:
      privileged: false
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/azureml-app
      name: model-mount-0
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-wjf8j
      readOnly: true
  - args:
    - --identity
    - /identity/identity_secret.json
    - --token
    - /identity/sidecar_token
    - --ca
    - /identity/ca.crt
    - --remote-host
    - https://amlarc-identity-proxy-service.azureml.svc.cluster.local
    - --port
    - "9999"
    - --path
    - /token
    - --remote-path
    - /token
    env:
    - name: MSI_SECRET
      valueFrom:
        secretKeyRef:
          key: secret
          name: deployname-134205-model-name-sidecar
    image: mcr.microsoft.com/azureml/amlarc/docker/identity-sidecar:1.1.25
    imagePullPolicy: IfNotPresent
    name: identity-sidecar
    resources:
      limits:
        cpu: 100m
        memory: 50Mi
      requests:
        cpu: 100m
        memory: 50Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /identity
      name: identity-secret-volume
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-wjf8j
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: deployname-134205-model-name-registry.azurecr.io
  initContainers:
  - env:
    - name: STORAGE_MANIFEST_URL
      value: https://account.blob.core.windows.net/azureml/model-name_deployname-134205-model-name_model_config_map.json_f455169bb7e34736bc0c618741641388
    - name: STORAGE_DOWNLOAD_PATH
      value: /var/azureml-app
    - name: STORAGE_CREDENTIAL_CLIENTID
      value: uuid
    - name: STORAGE_CREDENTIAL_TYPE
      value: Amlarc
    - name: STORAGE_CREDENTIAL_ENDPOINT
      value: https://amlarc-identity-proxy-service.azureml.svc.cluster.local/token
    - name: STORAGE_CREDENTIAL_TOKENFILE
      value: /identity/sidecar_token
    image: mcr.microsoft.com/mir/mir-storageinitializer:46571814.1631244300887
    imagePullPolicy: IfNotPresent
    name: storageinitializer-modeldata
    resources:
      limits:
        cpu: 100m
        memory: 500Mi
      requests:
        cpu: 100m
        memory: 500Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/azureml-app
      name: model-mount-0
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-wjf8j
      readOnly: true
    - mountPath: /identity
      name: identity-secret-volume
      readOnly: true
  nodeName: aks-gpuproc2-28077253-vmss00000n
  nodeSelector:
    agentpool: gpuproc2
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: deployname-134205-model-name
  serviceAccountName: deployname-134205-model-name
  terminationGracePeriodSeconds: 30
  tolerations:
  - key: ml.azure.com/amlarc
    operator: Equal
    value: "true"
  - key: ml.azure.com/amlarc-workload
    operator: Equal
    value: "true"
  - key: ml.azure.com/resource-group
    operator: Equal
    value: smarteye-ml-ne-prod
  - key: ml.azure.com/workspace
    operator: Equal
    value: ne-mlflow-prod
  - key: ml.azure.com/compute
    operator: Equal
    value: aksnemlflowprod
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - emptyDir: {}
    name: model-mount-0
  - name: kube-api-access-wjf8j
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
  - name: identity-secret-volume
    secret:
      defaultMode: 420
      items:
      - key: identityStore
        path: identity_secret.json
      - key: token
        path: sidecar_token
      - key: ca
        path: ca.crt
      secretName: deployname-134205-model-name-sidecar
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-04-19T04:06:16Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-04-19T04:06:41Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-04-19T04:06:41Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-04-19T04:06:00Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://c6cc35cc482f7bdab5a6a5c1e5a2fb46ac14e82eac9a4e2684626cc05e175139
    image: mcr.microsoft.com/azureml/amlarc/docker/identity-sidecar:1.1.25
    imageID: mcr.microsoft.com/azureml/amlarc/docker/identity-sidecar@sha256:2aaae7af6e0bd3e5847c55feca123cf50d968fee52616f7e412b0a76b51ce035
    lastState: {}
    name: identity-sidecar
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2023-04-19T04:06:16Z"
  - containerID: containerd://8dc55c77556ee15f64321dd0c2967eef6002fdbbb92eccf1c0c9d76fc168f5a9
    image: registry.azurecr.io/image/path:v3.6.2
    imageID: registry.azurecr.io/image/path@sha256:8736a082e2f7698e6c6e5a055e9a17585d15bac3f57b833d889977d5dd2142e3
    lastState: {}
    name: inference-server
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2023-04-19T04:06:16Z"
  hostIP: 10.224.0.4
  initContainerStatuses:
  - containerID: containerd://4bfced7cb46a4f6e855cd613dbe87f54d226bff8871f6a3295b76dd49b116afb
    image: mcr.microsoft.com/mir/mir-storageinitializer:46571814.1631244300887
    imageID: mcr.microsoft.com/mir/mir-storageinitializer@sha256:c5dc758e64d7cf4a571c3877f9195eabd97e752f23ab01965e7c40abac95b83e
    lastState: {}
    name: storageinitializer-modeldata
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://4bfced7cb46a4f6e855cd613dbe87f54d226bff8871f6a3295b76dd49b116afb
        exitCode: 0
        finishedAt: "2023-04-19T04:06:16Z"
        reason: Completed
        startedAt: "2023-04-19T04:06:01Z"
  phase: Running
  podIP: 10.244.0.17
  podIPs:
  - ip: 10.244.0.17
  qosClass: Burstable
  startTime: "2023-04-19T04:06:00Z"
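
For reference, here is a minimal sketch of how the mismatch can be checked. The two dictionaries are copied by hand from the `standard-nc4` InstanceType and from the `inference-server` container in the pod above. The function name `resource_mismatches` is made up for illustration, and the comparison is a plain string comparison, so equivalent quantities written differently (for example `1` and `1000m`) would be reported as different.

```python
# Sketch: compare the resources declared in the InstanceType with the
# resources that ended up on the pod's main container.
# Values are copied by hand from the YAML in this issue.

INSTANCE_TYPE_RESOURCES = {
    "limits": {"cpu": "1", "memory": "20Gi"},
    "requests": {"cpu": "700m", "memory": "1900Mi"},
}

POD_CONTAINER_RESOURCES = {
    "limits": {"cpu": "1", "memory": "20Gi"},
    "requests": {"cpu": "100m", "memory": "512Mi"},
}


def resource_mismatches(expected: dict, actual: dict) -> list:
    """Return (field, expected, actual) tuples for every value that differs."""
    found = []
    for section in ("limits", "requests"):
        for key, want in expected.get(section, {}).items():
            got = actual.get(section, {}).get(key)
            if got != want:  # string comparison only, no unit normalisation
                found.append((f"{section}.{key}", want, got))
    return found


for field, want, got in resource_mismatches(
    INSTANCE_TYPE_RESOURCES, POD_CONTAINER_RESOURCES
):
    print(f"{field}: InstanceType={want} pod={got}")
```

With the values from this issue, the limits match the InstanceType and only `requests.cpu` and `requests.memory` differ.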