GitGuruGangsta opened 2 months ago
As I encountered a similar situation, I am curious: did you eventually wait for about 15 minutes to see if the connection comes up late? How much CPU is the container consuming?
We tried some things with configuration and firewalls, and yes, it ran for more than 15 minutes several times, but it didn't work. Incoming packets arrived at the container, but we could not see any outbound TCP packets. Both the alpine and debian images in their newest versions rely on kernel 6.1+, so that should not make a difference on a node running kernel 5.14; it shouldn't be something like a kernel incompatibility problem. Maybe some strange security measure of Red Hat Linux (Rocky Linux) prevents outbound SSH, but only from containers?
But the debian image works fine, so we are using that one now.
I was asking because what I found out is that the issue only really seems to happen if your host OS supports large FD numbers.
After login and chrooting, OpenSSH tries to close all file descriptors up to the largest FD number in use. On Linux, however, finding the largest used FD requires the process to either access the /proc filesystem or have the libproc library available. If neither is the case, then ALL file descriptors up to the largest possible FD number are attempted to be closed.
For some reason, the sshd in the alpine image cannot access FD information in the /proc filesystem. Also, the image does not contain libproc. So OpenSSH tries to close all FDs (see below).
The debian image has libproc available, so OpenSSH simply uses that to find the largest used FD and only closes a handful of FDs.
If you are using the alpine image on a host with a low maximum file descriptor number, the issue might never occur for you. But if you are on a host with billions of file descriptor numbers available, OpenSSH tries to close all those billions of file descriptors after authentication and before presenting the prompt, resulting in ridiculous waiting times (in my case about 15 minutes).
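If anyone wants to verify whether they are affected, here is a minimal sketch to compare the open-file limit sshd would see inside the container with the one on the host (the image tag and the `--entrypoint` override are only for illustration):

```sh
# Soft and hard open-file limits inside the alpine image
# (overriding the entrypoint just to get a shell; adjust the image tag as needed)
docker run --rm --entrypoint sh atmoz/sftp:alpine -c 'ulimit -n; ulimit -Hn'

# The same check on the host, for comparison
ulimit -n; ulimit -Hn
```

If the hard limit inside the container is in the billions, this FD-closing loop is the likely cause of the hang.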
We had the same issue after switching from CentOS to AlmaLinux; for us this was a massive issue as we have around 50 SFTP pods for clients. We had to disable chroot to get them working while I tried to figure out how to fix it. Info came from here.
I copied the code from this repo so that I could try things and build our own images. I spent hours trying to find a way to reduce the max open file limit, but the pod (running in Kubernetes) ignored the limits.conf file I added, and also the sysctl.conf file; it just refused to reduce the ridiculous max open file limit of 1073741816.
Eventually I tried adding this to the bottom of the entrypoint file, just above the call to run sshd:

```sh
log "Setting ulimit to 512"
ulimit -n 512
```
It worked!!!!!! I was so relieved, although now I'm kicking myself that the answer was so simple!!
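For anyone running plain Docker rather than Kubernetes, a possible alternative is to cap the limit from the outside with `--ulimit` instead of editing the entrypoint. A sketch, assuming the atmoz/sftp:alpine image and an example user spec:

```sh
# Cap the container's open-file limit (soft:hard) at launch time
# instead of patching the entrypoint; 512 is just the value that worked above
docker run -d --ulimit nofile=512:512 -p 2222:22 atmoz/sftp:alpine foo:pass:1001
```

As far as I know, Kubernetes has no equivalent pod-level setting, which is presumably why the entrypoint change was needed for the pods.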
We actually had this issue with the Debian image as well, so it's strange that you don't, but maybe our file limit is higher. I stuck with the Alpine image because we want the images as small as possible, but I don't see any reason the same fix wouldn't work for Debian.
Maybe an environment variable could be added to this repo to 'optionally' allow the setting of a specified open file limit?
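Something like this hypothetical snippet near the end of the entrypoint could do it (the variable name SFTP_ULIMIT_NOFILE is made up here, not an existing option of this repo):

```sh
# Hypothetical: only lower the open-file limit if the user asked for it via an env var
if [ -n "$SFTP_ULIMIT_NOFILE" ]; then
    log "Setting open file limit to $SFTP_ULIMIT_NOFILE"
    ulimit -n "$SFTP_ULIMIT_NOFILE"
fi
```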
I tried adding the ulimit command to a script in /etc/sftp.d, which are run automatically anyway, but it didn't work; I guess because it was a different bash process, I don't know.
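If those scripts are executed as separate child processes, that would explain it: a ulimit set in a child does not propagate back to the shell that later starts sshd. A quick sketch of the effect:

```sh
# ulimit only affects the current shell and its children, so setting it in a
# separately executed script (a child process) leaves the parent's limit unchanged
sh -c 'ulimit -n 512; ulimit -n'   # prints 512, but only inside that child shell
ulimit -n                          # the parent still shows the original (huge) limit
```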
On connection, the container terminal shows: "Accepted password for from 10.0.10.150 port 42379 ssh2"
But no connection is established. It seems that outbound connections are prevented.
This appears only on alpine images (for a year-old image and also for the newest one) and, we assume, only with a newer Linux kernel version on the host.
The debian images work fine, no matter which Linux kernel is installed on the container host.
The problem has occurred since we updated our Kubernetes cluster from older CentOS nodes to Rocky Linux (5.14.0-427.16.1.el9_4.x86_64), using the alpine images.