nixops uses select()

nh2 commented 6 years ago

Nixops currently uses the select() syscall, which will fail when any fd given to it is a number bigger than 1023.

I suspect this can lead to random failures, as there's no code in nixops that guarantees that fewer than 1024 files are open at any given time.
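To make the limit concrete, here is a minimal sketch of how the problem surfaces in Python. It assumes CPython on Linux or macOS, where `FD_SETSIZE` is typically 1024; the helper name `select_rejects` is made up for illustration and is not part of nixops. CPython checks the fd number before making the syscall, so the descriptor does not even need to be open for the error to appear.

```python
import select


def select_rejects(fd_number):
    """Return True if select.select() refuses this fd number outright."""
    try:
        select.select([fd_number], [], [], 0)
    except ValueError:
        # CPython raises ValueError ("filedescriptor out of range in select()")
        # when the number does not fit into an fd_set.
        return True
    except OSError:
        # The number is in range; the fd is simply not open (EBADF).
        return False
    return False


if __name__ == "__main__":
    print(select_rejects(0))     # small fd numbers are accepted
    print(select_rejects(1024))  # assumed to be rejected where FD_SETSIZE is 1024
```

Note that what matters is the numeric value of the descriptor, not how many descriptors are passed in a single call.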

The simplest fix is probably to use poll() instead.

nh2 commented 6 years ago

poll() isn't supported on all platforms, but nixops's installation.xml says it runs only on Linux and Mac OS, both of which support poll(), so we can probably trivially switch to that.

poll() was completely broken on Mac OS for a while, but apparently it has worked since January 2017, so if it's OK to require that version for nixops, we can ditch select().
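As a rough idea of what such a switch could look like, the sketch below waits for readable descriptors with `select.poll()` from the Python standard library. It is not the code from the PR; the function name `wait_readable` and its `timeout_ms` parameter are assumptions made for this example.

```python
import os
import select


def wait_readable(fds, timeout_ms=None):
    """Return the fds from `fds` that are readable, using poll() rather than select().

    timeout_ms=None blocks until something is ready; 0 returns immediately.
    """
    poller = select.poll()
    for fd in fds:
        poller.register(fd, select.POLLIN)

    ready = []
    for fd, event in poller.poll(timeout_ms):
        # Treat hangup and error like "readable", so the caller's next read()
        # sees EOF or the error, similar to what select() would report.
        if event & (select.POLLIN | select.POLLHUP | select.POLLERR):
            ready.append(fd)
    return ready


if __name__ == "__main__":
    read_end, write_end = os.pipe()
    print(wait_readable([read_end], timeout_ms=0))  # nothing written yet
    os.write(write_end, b"hello")
    print(wait_readable([read_end], timeout_ms=0))  # now contains read_end
    os.close(read_end)
    os.close(write_end)
```

Unlike select(), poll() takes the descriptors as a list of numbers rather than a fixed-size bit set, so it has no equivalent of the 1024 cutoff.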

nh2 commented 6 years ago

PR in #799

cleverca22 commented 6 years ago

According to ulimit -a on a NixOS machine, the maximum number of open files is 1024. I recently ran into a problem with Chromium trying to open more than 1024 sockets when loading fonts and had to bump that up on my own machine.

nh2 commented 6 years ago

On my machine, ulimit -n is 10000; for many use cases ulimit has to be bumped as @cleverca22 says, so I think we can't rely on the 1024 in nixops code.
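The same limit can be read from inside a Python process with the standard `resource` module, which is one way to check whether descriptors above 1023 are possible at all. This is only a sketch: `ASSUMED_FD_SETSIZE` and both function names are hypothetical, and the value 1024 is an assumption about the platform, since Python does not expose `FD_SETSIZE`.

```python
import resource

# Assumption: FD_SETSIZE is 1024, as on typical Linux and macOS builds.
ASSUMED_FD_SETSIZE = 1024


def nofile_limits():
    """Return the (soft, hard) open-file limits, i.e. what `ulimit -n` reports."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def select_can_overflow():
    """True if this process may be handed an fd number that select() cannot take."""
    soft, _hard = nofile_limits()
    # RLIM_INFINITY means "no limit", so any fd number is possible.
    return soft == resource.RLIM_INFINITY or soft > ASSUMED_FD_SETSIZE


if __name__ == "__main__":
    print(nofile_limits())
    print(select_can_overflow())
```

With a soft limit of exactly 1024 the highest possible descriptor is 1023, so select() is safe only by coincidence; any higher limit removes that guarantee.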

domenkozar commented 6 years ago

@nh2 correct me if I'm wrong, but select would only fail if more than 1023 descriptors would be used, given we have only stdout/stderr this should never happen?

ip1981 commented 6 years ago

Were there any real issues?

ip1981 commented 6 years ago

I'd consider many open descriptors as a bug. And allowing more of them is not a solution.

ip1981 commented 6 years ago

This might be a better way https://pypi.python.org/pypi/pyev. Portable.

nh2 commented 6 years ago

@domenkozar

> select would only fail if more than 1023 descriptors would be used

Yes.

> given we have only stdout/stderr this should never happen?

No. You have no control over how many file descriptors the Python interpreter or the libraries you use open.

Consider:

niklas@ares ~ % ps aux | grep nixops
niklas   23116  3.0  0.2 585452 70592 pts/14   SNl+ 23:06   0:01 /nix/store/cs3g9k4vgazvlv25kp3akscdvdjc8675-python-2.7.14/bin/python2.7 /nix/store/50yf5xhygabbmsydaab104lyqb9dvv6g-nixops-1.5.2pre0_abcdef/bin/..nixops-wrapped-wrapped deploy -d mydeployment
niklas@ares ~ % ls -lah /proc/23116/fd
total 0
dr-x------ 2 niklas niklas  0 Dec 22 23:06 ./
dr-xr-xr-x 9 niklas niklas  0 Dec 22 23:06 ../
lrwx------ 1 niklas niklas 64 Dec 22 23:06 0 -> /dev/pts/14
lrwx------ 1 niklas niklas 64 Dec 22 23:06 1 -> /dev/pts/14
lrwx------ 1 niklas niklas 64 Dec 22 23:08 10 -> socket:[73432261]
lrwx------ 1 niklas niklas 64 Dec 22 23:08 11 -> socket:[73430996]
lrwx------ 1 niklas niklas 64 Dec 22 23:08 12 -> socket:[73431020]
lrwx------ 1 niklas niklas 64 Dec 22 23:08 13 -> socket:[73432098]
lrwx------ 1 niklas niklas 64 Dec 22 23:08 14 -> socket:[73426848]
lrwx------ 1 niklas niklas 64 Dec 22 23:08 15 -> socket:[73430085]
lrwx------ 1 niklas niklas 64 Dec 22 23:08 16 -> socket:[73432125]
lrwx------ 1 niklas niklas 64 Dec 22 23:08 17 -> socket:[73429130]
lrwx------ 1 niklas niklas 64 Dec 22 23:06 2 -> /dev/pts/14
lrwx------ 1 niklas niklas 64 Dec 22 23:08 20 -> socket:[73430109]
lrwx------ 1 niklas niklas 64 Dec 22 23:08 21 -> socket:[73428544]
lrwx------ 1 niklas niklas 64 Dec 22 23:08 22 -> socket:[73429163]
lrwx------ 1 niklas niklas 64 Dec 22 23:06 3 -> socket:[73427843]
lrwx------ 1 niklas niklas 64 Dec 22 23:06 4 -> /home/niklas/.../localstate.nixops
lrwx------ 1 niklas niklas 64 Dec 22 23:08 5 -> /home/niklas/.../localstate.nixops-wal
lr-x------ 1 niklas niklas 64 Dec 22 23:08 6 -> /dev/urandom
lr-x------ 1 niklas niklas 64 Dec 22 23:08 7 -> /dev/null
lrwx------ 1 niklas niklas 64 Dec 22 23:08 8 -> /home/niklas/.../localstate.nixops-shm
l-wx------ 1 niklas niklas 64 Dec 22 23:08 9 -> /home/niklas/.nixops/locks/0fecdb3a-0779-11e7-9874-0242e736c6a1
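The same inspection can be done from Python by listing `/proc/<pid>/fd`. The sketch below is Linux-only, and the function name `open_fd_numbers` is made up for this example.

```python
import os


def open_fd_numbers(pid="self"):
    """Return the sorted fd numbers listed in /proc/<pid>/fd (Linux only)."""
    # Listing the directory briefly opens one extra descriptor for the
    # directory itself, so that one may show up in the result as well.
    return sorted(int(name) for name in os.listdir("/proc/%s/fd" % pid))


if __name__ == "__main__":
    fds = open_fd_numbers()
    print(len(fds), "descriptors open; highest number is", fds[-1])
```

For select(), the highest number in that list is the value to watch, since new descriptors normally get the lowest free number and the numbers climb as more connections stay open.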

nh2 commented 6 years ago

> I'd consider many open descriptors as a bug.

@ip1981 It's not.

There will be at least one FD open for each machine nixops connects to. If my cluster has more than ~1000 machines, this will magically break.

Arbitrarily low limits on inputs controlled by users don't make for useful tools. That's one of the reasons why poll() was added to POSIX.
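The large-cluster case can be imitated without a thousand machines by duplicating one descriptor onto a number of 1024 or above and handing it to both calls. This is a sketch under assumptions: CPython on Linux or macOS with `FD_SETSIZE` of 1024, a hard open-file limit high enough to allow the duplicate, and a made-up function name `compare_on_high_fd`.

```python
import fcntl
import os
import resource
import select


def compare_on_high_fd(min_fd=1024):
    """Return (select_ok, poll_ok) for a readable fd numbered >= min_fd.

    Returns None if the hard open-file limit is too low to create such an fd.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = min_fd + 16
    unlimited = resource.RLIM_INFINITY
    if soft != unlimited and soft < needed:
        if hard != unlimited and hard < needed:
            return None
        # Raise only the soft limit, which an unprivileged process may do.
        resource.setrlimit(resource.RLIMIT_NOFILE, (needed, hard))

    read_end, write_end = os.pipe()
    fds = [read_end, write_end]
    try:
        # Duplicate the read end onto the lowest free number >= min_fd.
        high = fcntl.fcntl(read_end, fcntl.F_DUPFD, min_fd)
        fds.append(high)
        os.write(write_end, b"x")

        try:
            select.select([high], [], [], 0)
            select_ok = True
        except ValueError:
            select_ok = False

        poller = select.poll()
        poller.register(high, select.POLLIN)
        poll_ok = any(fd == high for fd, _event in poller.poll(0))
        return select_ok, poll_ok
    finally:
        for fd in fds:
            os.close(fd)


if __name__ == "__main__":
    print(compare_on_high_fd())
```

Under the stated assumptions the select() call is expected to be rejected while poll() reports the descriptor as readable, which is the behaviour difference this issue is about.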

coretemp commented 6 years ago

@AmineChikhaoui Can we merge this? Also, can we give more people commit access to this repository?

I would likely contribute more if it didn't look like abandonware, and I can imagine many people are frustrated that patches take so long to be picked up.

ip1981 commented 6 years ago

> There will be at least one FD open for each machine nixops connects to. If my cluster has more than ~1000 machines, this will magically break.

Then the deploy host would be the bottleneck :)

nh2 commented 6 years ago

> > There will be at least one FD open for each machine nixops connects to. If my cluster has more than ~1000 machines, this will magically break.
>
> Then the deploy host would be the bottleneck :)

Not necessarily: you can have 1000s of TCP connections open without problems, and when you deploy with nixops, the target machines can fetch from substituters (such as cache.nixos.org), so the heavy data doesn't have to flow through the deploy host.

In any case, things being slow because you make them do heavy lifting is OK/expected, while things breaking due to arbitrary and low limits isn't.

NixOS / nixops

nixops uses select() #798