Open nh2 opened 6 years ago
poll() isn't supported on all platforms, but nixops's installation.xml says it runs only on Linux and Mac OS, both of which support poll(), so we can probably trivially switch to that.
poll() was completely broken on Mac OS for a while, but apparently it has worked since January 2017, so if it's OK to require that version for nixops, we can ditch select().
PR in #799
according to ulimit -a, on a nixos machine, the max number of open files is 1024
I recently ran into a problem with chromium trying to open more than 1024 sockets when loading fonts and had to bump that up on my own machine
On my machine, ulimit -n is 10000; for many use cases ulimit has to be bumped as @cleverca22 says, so I think we can't rely on the 1024 in nixops code.
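To see which limit applies to a given Python process, the standard `resource` module can be queried directly. The following is a minimal sketch; the function names `fd_limits` and `may_exceed_select_range` are hypothetical and not part of nixops, and the 1024 threshold assumes the usual `FD_SETSIZE` on Linux and Mac OS.

```python
import resource


def fd_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def may_exceed_select_range(fd_setsize=1024):
    """Return True if this process may be handed an fd number >= fd_setsize.

    fd_setsize=1024 is an assumption (the common FD_SETSIZE value).
    Descriptors are numbered 0..soft-1, so a soft limit of exactly 1024
    still keeps every fd within the range select() can handle.
    """
    soft, _hard = fd_limits()
    return soft == resource.RLIM_INFINITY or soft > fd_setsize
```

With a soft limit of 10000, as reported above, `may_exceed_select_range()` would return True, meaning nothing stops the process from receiving descriptors that select() cannot handle.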
@nh2 correct me if I'm wrong, but select would only fail if more than 1023 descriptors would be used, given we have only stdout/stderr this should never happen?
Were there any real issues?
I'd consider many open descriptors as a bug. And allowing more of them is not a solution.
This might be a better way https://pypi.python.org/pypi/pyev. Portable.
@domenkozar
select would only fail if more than 1023 descriptors would be used
Yes.
given we have only stdout/stderr this should never happen?
No. You have no control over how many file descriptors the Python interpreter or the libraries you use open.
Consider:
niklas@ares ~ % ps aux | grep nixops
niklas 23116 3.0 0.2 585452 70592 pts/14 SNl+ 23:06 0:01 /nix/store/cs3g9k4vgazvlv25kp3akscdvdjc8675-python-2.7.14/bin/python2.7 /nix/store/50yf5xhygabbmsydaab104lyqb9dvv6g-nixops-1.5.2pre0_abcdef/bin/..nixops-wrapped-wrapped deploy -d mydeployment
niklas@ares ~ % ls -lah /proc/23116/fd
total 0
dr-x------ 2 niklas niklas 0 Dec 22 23:06 ./
dr-xr-xr-x 9 niklas niklas 0 Dec 22 23:06 ../
lrwx------ 1 niklas niklas 64 Dec 22 23:06 0 -> /dev/pts/14
lrwx------ 1 niklas niklas 64 Dec 22 23:06 1 -> /dev/pts/14
lrwx------ 1 niklas niklas 64 Dec 22 23:08 10 -> socket:[73432261]
lrwx------ 1 niklas niklas 64 Dec 22 23:08 11 -> socket:[73430996]
lrwx------ 1 niklas niklas 64 Dec 22 23:08 12 -> socket:[73431020]
lrwx------ 1 niklas niklas 64 Dec 22 23:08 13 -> socket:[73432098]
lrwx------ 1 niklas niklas 64 Dec 22 23:08 14 -> socket:[73426848]
lrwx------ 1 niklas niklas 64 Dec 22 23:08 15 -> socket:[73430085]
lrwx------ 1 niklas niklas 64 Dec 22 23:08 16 -> socket:[73432125]
lrwx------ 1 niklas niklas 64 Dec 22 23:08 17 -> socket:[73429130]
lrwx------ 1 niklas niklas 64 Dec 22 23:06 2 -> /dev/pts/14
lrwx------ 1 niklas niklas 64 Dec 22 23:08 20 -> socket:[73430109]
lrwx------ 1 niklas niklas 64 Dec 22 23:08 21 -> socket:[73428544]
lrwx------ 1 niklas niklas 64 Dec 22 23:08 22 -> socket:[73429163]
lrwx------ 1 niklas niklas 64 Dec 22 23:06 3 -> socket:[73427843]
lrwx------ 1 niklas niklas 64 Dec 22 23:06 4 -> /home/niklas/.../localstate.nixops
lrwx------ 1 niklas niklas 64 Dec 22 23:08 5 -> /home/niklas/.../localstate.nixops-wal
lr-x------ 1 niklas niklas 64 Dec 22 23:08 6 -> /dev/urandom
lr-x------ 1 niklas niklas 64 Dec 22 23:08 7 -> /dev/null
lrwx------ 1 niklas niklas 64 Dec 22 23:08 8 -> /home/niklas/.../localstate.nixops-shm
l-wx------ 1 niklas niklas 64 Dec 22 23:08 9 -> /home/niklas/.nixops/locks/0fecdb3a-0779-11e7-9874-0242e736c6a1
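The same inspection can be done from inside a Python process. This is a rough, Linux-only sketch that reads `/proc/self/fd`; the helper name `open_fds` is hypothetical, and on systems without that directory it simply returns an empty list.

```python
import os


def open_fds(fd_dir="/proc/self/fd"):
    """Return a sorted list of (fd, target) pairs for this process.

    Assumes a Linux-style /proc; returns [] if fd_dir does not exist.
    """
    if not os.path.isdir(fd_dir):
        return []
    fds = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor that listdir() used for the directory itself
            # is already closed by now, so its entry no longer resolves.
            continue
        fds.append((int(name), target))
    return sorted(fds)
```

Logging `max(fd for fd, _ in open_fds())` at a few points during a deploy would show how close the process gets to descriptor number 1023.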
I'd consider many open descriptors as a bug.
@ip1981 It's not.
There will be at least one FD open for each machine nixops connects to. If my cluster has more than ~1000 machines, this will magically break.
Arbitrary low limits on inputs controlled by users don't make for useful tools. That's (one of the reasons) why poll() was added to POSIX.
@AmineChikhaoui Can we merge this? Also, can we give more people commit access to this repository?
I would likely contribute more if it didn't look like abandonware and I can imagine many people are frustrated that patches take so long to be picked up.
There will be at least one FD open for each machine nixops connects to. If my cluster has more than ~1000 machines, this will magically break.
Then the deploy host would be the bottleneck :)
Not necessarily: you can have 1000s of TCP connections open without problem, and when you deploy with nixops, the target machines can fetch from substituters (such as cache.nixos.org), so the heavy data doesn't have to flow through the deploy host.
In any case, things being slow because you make them do heavy lifting is OK/expected, while things breaking due to arbitrary and low limits isn't.
Nixops currently uses the select() syscall, which will fail when any fd given to it is a number bigger than 1023. I suspect this can lead to random failures, as there's no code in nixops that guarantees that fewer than 1024 files are open at any given time.
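The limit is on the descriptor number, not on how many descriptors are passed in. A small sketch of how this shows up in Python 3 (an assumption, since the deployment shown in this thread runs Python 2.7): CPython rejects any fd number at or above `FD_SETSIZE` with a `ValueError` before calling select() at all. The helper name `select_accepts` is hypothetical, and the 1024 boundary assumes the usual `FD_SETSIZE` on Linux and Mac OS.

```python
import select


def select_accepts(fd):
    """Return True if select() can represent this fd number at all."""
    try:
        select.select([fd], [], [], 0)  # timeout 0: never blocks
    except ValueError:
        # Raised by CPython when the fd number is >= FD_SETSIZE.
        return False
    except OSError:
        # Number is in range, but the descriptor is not open (EBADF).
        return True
    return True
```

Under these assumptions `select_accepts(1023)` is True and `select_accepts(1024)` is False, even if only a handful of descriptors are actually open.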
The simplest fix is probably to use poll() instead.
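As an illustration of what such a switch could look like, here is a minimal sketch of a poll()-based read loop using Python's `select.poll`. It is not the code from the PR; the function name `read_all_with_poll` and its parameters are hypothetical, and it assumes the descriptors are pipes or sockets that eventually reach EOF.

```python
import os
import select


def read_all_with_poll(fds, timeout_ms=1000):
    """Read every fd in `fds` until EOF using poll(); return {fd: bytes}.

    poll() takes a list of descriptors rather than a fixed-size bitmap,
    so descriptor numbers above 1023 are not a problem.
    """
    poller = select.poll()
    buffers = {fd: b"" for fd in fds}
    for fd in fds:
        poller.register(fd, select.POLLIN)
    remaining = set(fds)
    while remaining:
        for fd, event in poller.poll(timeout_ms):
            if event & (select.POLLIN | select.POLLHUP):
                chunk = os.read(fd, 4096)
                if chunk:
                    buffers[fd] += chunk
                    continue
            # Empty read (EOF) or an error event: stop watching this fd.
            poller.unregister(fd)
            remaining.discard(fd)
    return buffers
```

A caller would pass, for example, the `fileno()` of a subprocess's stdout and stderr pipes. The loop only ends once every descriptor has reported EOF or an error, so a real implementation would probably also want an overall deadline.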