Closed ndenStanford closed 10 months ago
ndenStanford - Sorry you are having an issue with your inf2 instance. We are taking a look at the information you reported to determine next steps.
Hello Donkrets,
Thanks for your response. The issue occurs with inf2.
hi @ndenStanford, to understand and reproduce this issue it would be helpful to know more about how you are deploying your torchserve application, specifically I'd want your app's deployment manifest and more details about your cluster configuration. If you'd prefer to not share details publicly, please send a message to aws-neuron-support@amazon.com and we will help you out.
Cluster details
EKS 1.24
AMI: ami-06bf8e441ff8de6c6
Instance type: inf2.xlarge
Neuron device plugin: 790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin@sha256:48b29bf06338901bb621533b99912e5dc53084ea963ddba4f6e0b1cda29a2f04 (version 1.9.3.0)
We can't use the 2.12.5.0 Neuron device plugin because it is incompatible with other existing services.
We run this userdata script on the node when it is first initialized:
# Configure Linux for Neuron repository updates
sudo tee /etc/yum.repos.d/neuron.repo > /dev/null <<EOF
[neuron]
name=Neuron YUM Repository
baseurl=https://yum.repos.neuron.amazonaws.com/
enabled=1
metadata_expire=0
EOF
sudo rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB
# Update OS packages
sudo yum update -y
################################################################################################################
# To install or update to Neuron versions 1.19.1 and newer from previous releases:
# - DO NOT skip 'aws-neuron-dkms' install or upgrade step, you MUST install or upgrade to latest Neuron driver
################################################################################################################
# Install OS headers
sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) -y
sudo yum remove aws-neuron-dkms -y
# install Neuron Driver
sudo yum install aws-neuronx-dkms -y
# Install Neuron Tools
sudo yum install aws-neuronx-tools -y
# Add PATH
export PATH=/opt/aws/neuron/bin:$PATH
This is the deployment manifest:
---
apiVersion: v1
kind: Service
metadata:
  name: textsum-bart-es-inf2
  labels:
    app: textsum-bart-es-inf2-v1
spec:
  selector:
    app: textsum-bart-es-inf2-v1
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: textsum-bart-es-inf2-v1
  labels:
    app: textsum-bart-es-inf2-v1
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 0
  selector:
    matchLabels:
      app: textsum-bart-es-inf2-v1
  template:
    metadata:
      labels:
        app: textsum-bart-es-inf2-v1
      annotations:
        linkerd.io/inject: enabled
    spec:
      volumes:
        - name: sock
          emptyDir: {}
        - name: model-file
          hostPath:
            path: /opt/models/model-store
            type: Directory
      nodeSelector:
        workertype: pretrained_summarization_bart_es
      containers:
        - name: textsum-bart-es-inf2
          image: 484375727565.dkr.ecr.us-east-1.amazonaws.com/summarization_model:v1
          ports:
            - containerPort: 8080
          imagePullPolicy: Always # If need to refresh the container
          args:
            - serve
            - --models=neuron_summarizer_bart_es.mar
          livenessProbe:
            exec:
              command:
                - /bin/sh
                - -c
                - /home/model-server/liveness.sh
            initialDelaySeconds: 60
            periodSeconds: 15
            failureThreshold: 1
          readinessProbe:
            httpGet:
              path: /ping
              port: 8080
            initialDelaySeconds: 45
            periodSeconds: 5
          volumeMounts:
            - name: model-file
              mountPath: "/home/model-server/model-store"
          resources:
            limits:
              aws.amazon.com/neuron: 1 # desired number of Inferentia devices.
          securityContext:
            capabilities:
              add:
                - IPC_LOCK
thank you, @ndenStanford. Looking this over to determine next steps.
@ndenStanford One possibility is that the TorchServe process is crashing, rather than failing to bind to NeuronCores.
If it turns out that TorchServe is crashing, it would be helpful to look at the logs for the problematic pod. If you'd like to share logs, it would be easiest to do so through aws-neuron-support@amazon.com.
It would also be helpful if you ran neuron-ls -a to see the process names matching the PIDs.
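If the PIDs are already known and you only need to cross-check them, the names can also be read straight from procfs. The following is a minimal sketch, not part of the Neuron tooling: `pid_info` is a hypothetical helper name, and it assumes you run it in the same PID namespace as the process (for example, on the node itself for host PIDs).

```shell
#!/bin/sh
# Hypothetical helper: print PID, command name, and full command line
# for one process, using only the standard Linux /proc interface.
pid_info() {
    pid="$1"
    if [ ! -r "/proc/$pid/comm" ]; then
        echo "pid $pid: not found in this PID namespace" >&2
        return 1
    fi
    name=$(cat "/proc/$pid/comm")
    # /proc/<pid>/cmdline is NUL-separated; convert the NULs to spaces.
    cmdline=$(tr '\0' ' ' < "/proc/$pid/cmdline")
    printf '%s\t%s\t%s\n' "$pid" "$name" "$cmdline"
}
```

For example, `pid_info 1234` prints one tab-separated line for PID 1234, or returns a non-zero status if that PID is not visible from where the command is run.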
Hello James. I sent details to the e-mail you mentioned. The subject name is "Process Cannot Bind to Neuroncore #697". Please let me know if you need any more details or clarification. Thanks!
@ndenStanford this issue is resolved in Neuron 2.12.1
Release notes: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html#id4
Details:
The issue is a contiguous physical page (memory) allocation failure during Neuron runtime initialization, which causes a kernel taint. Basically, Neuron asks for a chunk of contiguous physical memory during initialization and the kernel doesn't have one available.
Why is this happening?
AI containers and their associated data are getting so large that they cause the kernel's page cache to fragment host physical memory during container deployment, reducing the number of large contiguous memory buffers available for drivers and the kernel to use. This has only been observed on memory-constrained systems.
While the condition is temporary since the kernel reclaims pages from the page cache and compacts memory to replenish the contiguous memory buffers, the condition can persist long enough to prevent model load if a model is loaded immediately after the container is deployed.
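One way to observe this kind of fragmentation on a node is the standard Linux `/proc/buddyinfo` file, which lists, per memory zone, how many free blocks exist at each allocation order (order k is 2^k contiguous pages). The sketch below sums the free blocks at or above a chosen order. The helper name `high_order_free` and the order passed to it are assumptions for illustration; the thread does not say which allocation order the Neuron driver requests.

```shell
#!/bin/sh
# Hypothetical helper: for each zone, count free blocks of at least
# MIN_ORDER from buddyinfo-formatted input (default: /proc/buddyinfo).
# Line format: "Node 0, zone Normal <order0> <order1> ... <orderN>"
high_order_free() {
    min_order="$1"
    file="${2:-/proc/buddyinfo}"
    awk -v min="$min_order" '
        $1 == "Node" && $3 == "zone" {
            total = 0
            # Field 5 holds the order-0 count, so order k is field 5 + k.
            for (i = 5 + min; i <= NF; i++) total += $i
            printf "node %d zone %s: %d free blocks of order >= %d\n", $2 + 0, $4, total, min
        }' "$file"
}
```

Running `high_order_free 4` on the node right after a container is deployed, and again a minute or two later, should show whether large contiguous blocks are scarce at first and then recover as the kernel reclaims and compacts memory.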
Solution:
The Neuron driver has a reserve pool of physically contiguous memory buffers. We've reduced the size of our contiguous allocations so that the allocation requests can always be satisfied from the reserve pool.
I attempted to deploy a TorchServe application on an EKS cluster. In the deployment specification, I requested 1 Neuron device.
However, I noticed that the TorchServe process sometimes does not bind to the NeuronCore successfully. When I run neuron-ls, this is the response from a pod that initialized correctly:
However, when the pod does not initialize correctly, this is the response:
Sometimes restarting the pod helps, sometimes not. I'm having trouble figuring out where to start debugging, since the behavior of the pod is not deterministic.
This is the neuron device plugin specification
Any clarification or help is appreciated. Thanks!