aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Process Cannot Bind to Neuroncore #697

Closed. ndenStanford closed this issue 10 months ago.

ndenStanford commented 1 year ago

I attempted to deploy a torchserve application on an EKS cluster. In the deployment specification, I requested 1 Neuron device:

resources:
            limits:
              aws.amazon.com/neuron: 1  # desired number of Inferentia devices.

However, I noticed that sometimes the torchserve process did not get bound to the NeuronCores successfully. When I ran neuron-ls, the following is the output from a pod that initialized correctly (note the two PIDs bound to the device):

model-server@textsum-mbart-inf2-v1-869d46f786-b8d77:~$ neuron-ls
instance-type: inf2.xlarge
instance-id: i-085a33ef2e3c6a9bf
+--------+--------+--------+---------+-------+---------+
| NEURON | NEURON | NEURON |   PCI   |  PID  | RUNTIME |
| DEVICE | CORES  | MEMORY |   BDF   |       | VERSION |
+--------+--------+--------+---------+-------+---------+
| 0      | 2      | 32 GB  | 00:1f.0 | 21626 | 2.12.23 |
|        |        |        |         | 21627 | 2.12.23 |
+--------+--------+--------+---------+-------+---------+

However, when the pod did not initialize correctly, this is the output:

model-server@textsum-mbart-inf2-v1-869d46f786-mhqrp:~$ neuron-ls
instance-type: inf2.xlarge
instance-id: i-02646bfa0ed5c5bd8
+--------+--------+--------+---------+-------+---------+
| NEURON | NEURON | NEURON |   PCI   |  PID  | RUNTIME |
| DEVICE | CORES  | MEMORY |   BDF   |       | VERSION |
+--------+--------+--------+---------+-------+---------+
| 0      | 2      | 32 GB  | 00:1f.0 | 14146 | 2.12.23 |
+--------+--------+--------+---------+-------+---------+

Sometimes restarting the pod helps, sometimes it does not. I'm having trouble deciding where to start debugging this, since the pod's behavior is not deterministic.

This is the Neuron device plugin specification:

# https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: neuron-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name:  neuron-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: neuron-device-plugin-ds
    spec:
      serviceAccount: neuron-device-plugin
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - key: aws.amazon.com/neuron
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "node.kubernetes.io/instance-type"
                    operator: In
                    values:
                      - inf1.xlarge
                      - inf1.2xlarge
                      - inf1.6xlarge
                      - inf1.24xlarge
                      - inf2.xlarge
                      - inf2.4xlarge
                      - inf2.8xlarge
                      - inf2.24xlarge
                      - inf2.48xlarge
                      - trn1.2xlarge
                      - trn1.32xlarge
                      - trn1n.32xlarge
              - matchExpressions:
                  - key: "node.kubernetes.io/instance-type"
                    operator: In
                    values:
                      - inf1.xlarge
                      - inf1.2xlarge
                      - inf1.6xlarge
                      - inf1.24xlarge
                      - inf2.xlarge
                      - inf2.4xlarge
                      - inf2.8xlarge
                      - inf2.24xlarge
                      - inf2.48xlarge
                      - trn1.2xlarge
                      - trn1.32xlarge
                      - trn1n.32xlarge
      containers:
        #Device Plugin containers are available both in us-east and us-west ecr
        #repos
      - image: 790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin@sha256:48b29bf06338901bb621533b99912e5dc53084ea963ddba4f6e0b1cda29a2f04
        imagePullPolicy: Always
        name: neuron-device-plugin
        env:
        - name: KUBECONFIG
          value: /etc/kubernetes/kubelet.conf
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
          - name: infa-map
            mountPath: /run
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: infa-map
          hostPath:
            path: /run
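
For reference, this is roughly how I have been checking whether the plugin is advertising Neuron devices on the node and whether they get allocated to pods (the node name below is a placeholder):

# Placeholder node name; compare the advertised aws.amazon.com/neuron capacity
# against the amount currently allocated to pods on that node.
kubectl describe node <node-name> | grep -A2 aws.amazon.com/neuron

# Device plugin logs for the node, in case the allocation itself is failing.
kubectl logs -n kube-system -l name=neuron-device-plugin-ds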

Any clarification or help is appreciated. Thanks!

aws-donkrets commented 1 year ago

ndenStanford - Sorry you are having an issue with your inf2 instance. We are taking a look at the information you reported to determine next steps.

ndenStanford commented 1 year ago

Hello Donkrets,

Thanks for your response. The issue occurs with inf2.

james-aws commented 1 year ago

hi @ndenStanford, to understand and reproduce this issue it would be helpful to know more about how you are deploying your torchserve application; specifically, I'd want to see your app's deployment manifest and more details about your cluster configuration. If you'd prefer not to share details publicly, please send a message to aws-neuron-support@amazon.com and we will help you out.

ndenStanford commented 1 year ago

Cluster details:

- EKS 1.24
- AMI: ami-06bf8e441ff8de6c6
- Instance type: inf2.xlarge
- Neuron device plugin: 790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin@sha256:48b29bf06338901bb621533b99912e5dc53084ea963ddba4f6e0b1cda29a2f04 (version 1.9.3.0)

We can't use the 2.12.5.0 Neuron device plugin because other existing services are incompatible with it.

We trigger this user data script on the node when it is first initialized:

# Configure Linux for Neuron repository updates
sudo tee /etc/yum.repos.d/neuron.repo > /dev/null <<EOF
[neuron]
name=Neuron YUM Repository
baseurl=https://yum.repos.neuron.amazonaws.com/
enabled=1
metadata_expire=0
EOF
sudo rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB

# Update OS packages
sudo yum update -y

################################################################################################################
# To install or update to Neuron versions 1.19.1 and newer from previous releases:
# - DO NOT skip 'aws-neuron-dkms' install or upgrade step, you MUST install or upgrade to latest Neuron driver
################################################################################################################

# Install OS headers
sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) -y

sudo yum remove aws-neuron-dkms -y

# install Neuron Driver
sudo yum install aws-neuronx-dkms -y

# Install Neuron Tools 
sudo yum install aws-neuronx-tools -y

# Add PATH
export PATH=/opt/aws/neuron/bin:$PATH
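
As a sanity check (not part of our actual user data), one could confirm after this script runs that the Neuron driver is loaded and the devices are visible before any workload is scheduled, for example:

# Confirm the Neuron kernel module is loaded and the devices show up.
lsmod | grep neuron
/opt/aws/neuron/bin/neuron-ls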

This is the deployment manifest:

---
apiVersion: v1
kind: Service
metadata:
  name: textsum-bart-es-inf2
  labels:
    app: textsum-bart-es-inf2-v1
spec:
  selector:
    app: textsum-bart-es-inf2-v1
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: textsum-bart-es-inf2-v1
  labels:
    app: textsum-bart-es-inf2-v1
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 0
  selector:
    matchLabels:
      app: textsum-bart-es-inf2-v1
  template:
    metadata:
      labels:
        app: textsum-bart-es-inf2-v1
      annotations:
        linkerd.io/inject: enabled
    spec:
      volumes:
        - name: sock
          emptyDir: {}
        - name: model-file
          hostPath:
            path: /opt/models/model-store
            type: Directory
      nodeSelector:
        workertype: pretrained_summarization_bart_es
      containers:
        - name: textsum-bart-es-inf2
          image: 484375727565.dkr.ecr.us-east-1.amazonaws.com/summarization_model:v1
          ports:
            - containerPort: 8080
          imagePullPolicy: Always # If need to refresh the container
          args:
            - serve
            - --models=neuron_summarizer_bart_es.mar
          livenessProbe:
            exec:
              command:
              - /bin/sh
              - -c
              - /home/model-server/liveness.sh
            initialDelaySeconds: 60
            periodSeconds: 15
            failureThreshold: 1
          readinessProbe:
            httpGet:
              path: /ping
              port: 8080
            initialDelaySeconds: 45
            periodSeconds: 5

          volumeMounts:
            - name: model-file
              mountPath: "/home/model-server/model-store"

          resources:
            limits:
              aws.amazon.com/neuron: 1  # desired number of Inferentia devices.

          securityContext:
            capabilities:
              add:
                - IPC_LOCK

james-aws commented 1 year ago

thank you, @ndenStanford. Looking this over to determine next steps.

james-aws commented 1 year ago

@ndenStanford One possibility is that the torchserve process is crashing, rather than failing to bind to NeuronCores. If it turns out to be an issue with torchserve crashing, it would be helpful to look at the logs for the problematic pod. If you'd like to share logs, it would be easiest to do so through aws-neuron-support@amazon.com. It would also be helpful if you ran neuron-ls -a to see the process names matching the PIDs.
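
For example, that information could be gathered roughly like this (pod and container names are placeholders; the exact flags may vary with your setup):

# torchserve logs from the problematic pod; add --previous if it has restarted.
kubectl logs <pod-name> -c <container-name>

# Extended neuron-ls output inside the pod, including the process names that
# match the reported PIDs.
kubectl exec -it <pod-name> -c <container-name> -- neuron-ls -a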

ndenStanford commented 1 year ago

Hello James. I sent details to the e-mail you mentioned. The subject name is "Process Cannot Bind to Neuroncore #697". Please let me know if you need any more details or clarification. Thanks!

james-aws commented 10 months ago

@ndenStanford this issue is resolved in Neuron 2.12.1

Release notes: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html#id4

Details:

The issue is a contiguous physical page (memory) allocation failure during Neuron runtime initialization, which causes a kernel taint. Basically, Neuron asks for a chunk of contiguous physical memory during initialization and the kernel doesn't have any available.

Why is this happening?

AI containers and their associated data have become so large that they cause the kernel's page cache to fragment host physical memory during container deployment, reducing the number of large contiguous memory buffers available to drivers and the kernel. This has only been observed on memory-constrained systems.

While the condition is temporary since the kernel reclaims pages from the page cache and compacts memory to replenish the contiguous memory buffers, the condition can persist long enough to prevent model load if a model is loaded immediately after the container is deployed.
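
As an illustration, this condition can be observed on the affected node with standard Linux interfaces (a sketch, not official guidance; dmesg may require sudo):

# Free blocks per allocation order for each memory zone; mostly-zero columns
# on the right-hand side indicate fragmented physical memory.
cat /proc/buddyinfo

# Page allocation failures or a taint reported by the driver show up here.
dmesg | grep -iE "page allocation failure|taint"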

Solution:

The Neuron driver has a reserve pool of physically contiguous memory buffers. We've reduced the size of our contiguous allocations so that the allocation requests can always be satisfied from the reserve pool.
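
To pick up the fix, upgrading the Neuron driver on the node to a release containing it (Neuron 2.12.1 or later) should be enough; a minimal sketch, assuming the same yum packages used in the user data script above and that the kernel module is named neuron:

# Upgrade the Neuron driver and tools, then confirm the loaded driver version.
sudo yum update aws-neuronx-dkms aws-neuronx-tools -y
modinfo neuron | grep -i ^version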