Open leandregagnonlewis opened 7 months ago
I have the same issue if I try to increase the size of /dev/shm using an emptyDir volume with medium: Memory. This is necessary for increasing the shared memory that postgres uses. For now I have removed medium: Memory for use with microk8s and only use it for the production Kubernetes cluster.
I had a really hard time finding out that this was the issue because my pod just failed without any proper error message.
Summary
I am running microk8s on a single Ubuntu VM with 32 Gi of RAM, so memory is not an issue on the machine side. I am trying to deploy a single replica of NVIDIA Triton Inference Server, which serves ML models. I am migrating from EKS to an on-prem solution and I am using exactly the same deployment config I used on EKS.
The pod starts as usual. I can see in the logs that the files on S3 are downloaded properly, so the problem is not with the credentials. But after a few seconds the pod crashes without any indication of the cause.
What Should Happen Instead?
The server should become healthy and wait for inference requests to serve. I have tried deploying the server with docker on the same VM and it worked flawlessly, so I guess the problem is with microk8s.
Here is my compose.yml
Here are the logs. I have indicated the point where it crashes when I run it on microk8s.
Reproduction Steps
Introspection Report
inspection-report-20240209_114600.tar.gz
Can you suggest a fix?
Not sure, but I think this might be related to the high memory usage. Triton needs to access memory at /dev/shm, so even on EKS I needed to use the emptyDir strategy to mount memory at this path; otherwise I got the same kind of crash. It is as if the strategy is not working in microk8s.
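For reference, a minimal sketch of what the emptyDir strategy looks like in a pod spec. This is not copied from the actual deployment: the pod and volume names (triton, dshm), the image placeholder and the sizeLimit value are assumptions for illustration only.

```yaml
# Illustrative sketch only: names, image and sizes are assumptions,
# not the values from the real deployment.
apiVersion: v1
kind: Pod
metadata:
  name: triton                      # hypothetical pod name
spec:
  containers:
    - name: triton
      image: "<your-triton-image>"  # placeholder, fill in the real image
      volumeMounts:
        - name: dshm                # must match the volume name below
          mountPath: /dev/shm       # replaces the default shared memory mount
  volumes:
    - name: dshm                    # hypothetical volume name
      emptyDir:
        medium: Memory              # back the volume with tmpfs (RAM)
        sizeLimit: 1Gi              # optional cap, value assumed
```

Data written to a memory-backed emptyDir counts against the container's memory limit, so it may be worth checking whether the limits set on the pod leave enough room for what Triton places in /dev/shm.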
Are you interested in contributing with a fix?
Sure