I'm starting a bunch of containers to grade student work / generate feedback forms / etc, and I'm seeing this error periodically when I try to start a docker container:
500 Server Error: Internal Server Error ("OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:319: getting the final child\'s pid from pipe caused \\"EOF\\"": unknown")
This surfaces as a DockerAPIError and gets correctly reported to the instructor. But:
- It's not deterministic -- if I run the exact same command again a second time, it doesn't happen.
- It only seems to happen when I'm creating lots of Docker containers (about 177 of them, 6 running at a time); see the sketch after this list.
- Adding a 0.25s sleep between container starts seemed (?) to reduce the overall # of these errors but didn't get rid of them.
- Reducing the memory per container to 1GB and the # of concurrent containers to 4 didn't seem to do much.
- After testing for a while, it seems that once container creation starts throwing the above error, it throws it more and more often.
- Restarting containerd / docker does not help. Rebooting the VM resolves the issue (not sure if permanently...).
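For reference, here's a minimal sketch of how the containers get started. This assumes docker-py; the image name, command, and constants are placeholders mirroring the numbers above, not the actual grading code, and the 6-at-a-time concurrency cap is handled elsewhere:

```python
import time
import docker

client = docker.from_env()

IMAGE = "grading-image:latest"   # placeholder image name
MEM_LIMIT = "1g"                 # per-container memory limit from the experiments above
SLEEP_BETWEEN_STARTS = 0.25      # the small delay that seemed to reduce (not eliminate) the errors

def start_grading_container(submission_id):
    """Start one detached container. docker-py raises docker.errors.APIError
    when the daemon returns the 500 / "OCI runtime create failed" error above."""
    return client.containers.run(
        IMAGE,
        command=["python", "grade.py", str(submission_id)],  # placeholder command
        mem_limit=MEM_LIMIT,
        detach=True,
    )

for submission_id in range(177):
    try:
        start_grading_container(submission_id)
    except docker.errors.APIError as exc:
        # The intermittent failure: rerunning the same start usually succeeds.
        print(f"failed to start container for submission {submission_id}: {exc}")
    time.sleep(SLEEP_BETWEEN_STARTS)
```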
Still experimenting to figure out what's going on here.
Current status of this: I tried adding a 10-second pause after each thrown error plus up to 5 attempts at starting each container. It does seem to help a lot at first (when an attempt fails, it usually fails once or twice and then succeeds...), but after a while starting containers fails more and more, and eventually even a 10-second pause between starting containers doesn't help.
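The retry workaround looks roughly like this. Again a sketch under the same docker-py assumption, with the 5-attempt / 10-second numbers from the paragraph above rather than anything tuned:

```python
import time
import docker

client = docker.from_env()

MAX_ATTEMPTS = 5        # attempts per container before giving up
PAUSE_AFTER_ERROR = 10  # seconds to wait after a failed attempt

def start_with_retries(image, **run_kwargs):
    """Retry container creation a few times, pausing after each APIError.
    If every attempt fails, re-raise so the error still reaches the instructor."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return client.containers.run(image, detach=True, **run_kwargs)
        except docker.errors.APIError as exc:
            if attempt == MAX_ATTEMPTS:
                raise
            print(f"attempt {attempt} failed ({exc}); pausing {PAUSE_AFTER_ERROR}s")
            time.sleep(PAUSE_AFTER_ERROR)
```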
Here's /var/log/messages for what looks like one failed container startup...
Based on the runc page allocation failure, I assume this must be due to some memory allocation problem.
This looks similar: https://community.gravitational.com/t/pods-memory-allocation-failures/675
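Not a fix, but a quick way to test that theory: a small sketch (assuming the traditional syslog timestamp format in /var/log/messages) that counts "page allocation failure" lines per hour, so they can be lined up against when container starts begin to fail:

```python
from collections import Counter

# Count kernel page allocation failures per hour from /var/log/messages.
counts = Counter()
with open("/var/log/messages", errors="replace") as log:
    for line in log:
        if "page allocation failure" in line:
            # Syslog lines start like "Mar  3 14:05:01 host kernel: ...";
            # the first 9 characters give month, day, and hour.
            counts[line[:9]] += 1

for bucket, n in sorted(counts.items()):
    print(bucket, n)
```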