GNS3 / gns3-server

GNS3 server
GNU General Public License v3.0

Error while looking for uBridge version: [Errno 24] Too many open files #2214

Closed itspa1 closed 1 year ago

itspa1 commented 1 year ago

GNS3 Version: GNS3 v2.2.37 Operating System: Linux Debian 11

Hi,

I have a GNS3 project that uses Docker to simulate clients performing actions on the network. I'm using a Cisco c7200 router with DHCP enabled on its gigabit interface. The clients are just Ubuntu Docker images with Python installed; they start up and run a script to simulate the network interactions.

I've run several simulations with 100 of these clients on the network with no issues. However, when I try to set up 250 clients, the clients get created, but when I try to start them, after about 145 clients I get the error "Error while looking for uBridge version: [Errno 24] Too many open files".

The server infrastructure we're using is very powerful (320 GB RAM and an 80-core CPU), so I don't think resources are the issue. My first thought was the ulimit, which in my case was set to 1024. I raised it to 4096 for my user and re-ran, but got the same error. I then raised the open-files ulimit for all users to 4096, but it still results in the same error.

Am I missing something, or does the ulimit need to be way higher?
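For reference, the limit a process actually runs with can be checked (and, up to the hard limit, raised) from Python itself via the standard resource module. A minimal sketch, not GNS3 code:

```python
import resource

# Inspect the per-process open-file limit (what "ulimit -n" reports).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")

# A process may raise its own soft limit up to the hard limit without root.
# Raising the hard limit itself (or the limit of an already-running daemon
# such as gns3server) needs root or a systemd/PAM limits change, which is
# why a shell ulimit tweak alone often has no effect on the server process.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

This also explains why raising the ulimit in a login shell may not reach a daemon started by systemd.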

Thanks, Pavan

grossmj commented 1 year ago

Do you have the same problem if you start 50 clients first, then 50 more, etc.?

Also, try this command: sysctl -w fs.file-max=100000

itspa1 commented 1 year ago

Hi @grossmj, thanks for the response and sorry about the late reply. Yes, starting 50 clients first and then 50 more gives the same error. I've already set fs.file-max to more than 100000, but I still get the same issue.

Also, for more context: I start the clients individually, one after the other.

spikefishjohn commented 1 year ago

Run ps -axwu | grep ubridge to find out what user ubridge is running as, then su - UBRIDGEUSER and ulimit -a | grep open to check that user's open-files limit.

itspa1 commented 1 year ago

Hi @spikefishjohn, ubridge is running as the local user that is using GNS3, i.e. me, and my ulimit is the latest value I was testing with, i.e. 65535.

spikefishjohn commented 1 year ago

oh my bad. I didn't scroll up.

spikefishjohn commented 1 year ago

I read over the post again. I'm not sure what process is throwing the error now.

I'm assuming the error is being printed by the GNS3 client? Is that error logged by the gns3server process as well, or is it just being relayed? I was thinking it was ubridge that was throwing the error, but now I'm thinking it's the gns3server process.

Can you run lsof -p on the PID of the gns3server process and pipe it to wc -l?

also is there anything in dmesg output about max file limit?

itspa1 commented 1 year ago

So, in the current run I have going, with 50 clients running a simulation, the output of lsof -p on the gns3server PID piped to wc -l is 434.
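For what it's worth, the same count can be taken without lsof by listing the process's descriptor directory under /proc. A Linux-only sketch (count_open_fds is a hypothetical helper, not part of GNS3):

```python
import os

def count_open_fds(pid: int) -> int:
    # Each entry in /proc/<pid>/fd is one open descriptor; this is close
    # to `lsof -p <pid> | wc -l`, minus lsof's extra memory-map rows.
    return len(os.listdir(f"/proc/{pid}/fd"))

print(count_open_fds(os.getpid()))
```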

Also, there is some related output in dmesg about gns3server being killed:

[452836.367225] Out of memory: Killed process 6692 (gns3server) total-vm:330507920kB, anon-rss:326676772kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:641456kB oom_score_adj:0
[452848.922760] oom_reaper: reaped process 6692 (gns3server), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
                  active_file:544 inactive_file:0 isolated_file:0

I assume this was from an old run, which was killed because of the file limits being hit.

grossmj commented 1 year ago

Like @spikefishjohn, I suspect a problem with the gns3server itself. The issue seems to happen when we check the uBridge version in https://github.com/GNS3/gns3-server/blob/master/gns3server/ubridge/hypervisor.py#L133L153

uBridge is not supposed to open files when you check its version. I think the issue is with the call to subprocess_check_output(), since we use a pipe to read stdout. I still need to investigate this.
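The suspicion is easy to demonstrate in isolation: a Popen whose stdout pipe is never drained or closed keeps a descriptor open in the parent until garbage collection gets to it. A minimal sketch of the failure mode and the fix — not the actual GNS3 code path:

```python
import os
import subprocess

def fd_count() -> int:
    # Open descriptors in this process (Linux only).
    return len(os.listdir("/proc/self/fd"))

before = fd_count()

# Leaky pattern: stdout pipes that are never read or closed. Each child
# leaves one read-end descriptor open in the parent.
procs = [subprocess.Popen(["true"], stdout=subprocess.PIPE) for _ in range(5)]
leaked = fd_count() - before
print(f"descriptors held by unclosed pipes: {leaked}")

# Fix: communicate() (or a `with subprocess.Popen(...)` block) drains the
# pipe, closes it, and reaps the child, returning descriptors to the OS.
for p in procs:
    p.communicate()
```

Under load, the leaky pattern compounds until the process hits EMFILE ([Errno 24]) on any operation that needs a new descriptor — including an otherwise harmless version check.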

grossmj commented 1 year ago

This looks like a possible explanation: https://stackoverflow.com/questions/24682167/too-many-open-files-error-with-popen-of-subprocess, especially since you mentioned that really fast server.

As a temporary workaround, can you try the solution provided in https://stackoverflow.com/questions/69767419/subprocess-too-many-open-files-in-subprocess?

spikefishjohn commented 1 year ago

I'm not sure. I feel like the open-files error could be misleading. Hard to say: did gns3server hit a memory leak because opening files stopped working, did a memory leak prevent it from opening files, or is it just legitimately out of memory?

@itspa1 I threw together a troubleshooting script that might shed some light. Feel free to modify it if you have a better idea.

I'm thinking run this and watch the screen as you start maybe 10 VMs at a time until you get the error.

Run this as root btw.

 watch 'echo -n "global open files: " ; lsof -n | wc -l ; echo "Mem report" ; free ; echo ; echo "top 10 most open files per process" ; lsof -n | awk "{print \$1}" | sort | uniq -c | sort -rn | head -10'

I was thinking you could pipe to tee to log to a file as well, but you just end up with ASCII junk.

spikefishjohn commented 1 year ago

It might not hurt to do a dmesg -C before starting. After the error do a dmesg -T and see whats new.

spikefishjohn commented 1 year ago

One other thing: maybe throw away the 7200. IOS has come a long way since 12.x. I'm not sure what you need the 7200 for, but I would expect FRR or a similar Docker image to have much lower overhead.

If this happens to be a legitimate case of too much memory in use, make sure you have swap enabled. I haven't played with it, but I think zram might help? There is also something called ksmd that can free up memory at the cost of CPU overhead. No idea how ksmd and zram/zswap might work together.

itspa1 commented 1 year ago

Hi @spikefishjohn, I'm currently running a simulation for something; I can wrap that up and run this tonight. I'll let you know what happens. Also, a quick question about the 7200 router: what is the alternative if I shouldn't use the 7200 IOS router? If you could point me at some links, that would be helpful.

spikefishjohn commented 1 year ago

FRR

I personally haven't used the Docker version before. I was just thinking it would have less overhead than emulating a 7200 (which is a MIPS CPU) on an Intel hypervisor.

I doubt that will address your issue above; just throwing it out there.

spikefishjohn commented 1 year ago

Well well, I just replicated this. I have a lab with around 50 devices, and I've been deleting everything and re-creating it multiple times.

gns3server hit around 300,000 open files and I started getting that error (no memory issues).

When I restarted, re-created and started everything, this is where things are now:

top 10 most open files per process:
  16018 qemu-syst
  15743 gns3serve
   8970 dynamips
   5170 ubridge
   2394 libvirtd
    552 zebra
    510 snapd
    483 sshd
    285 packageki
    259 multipath

spikefishjohn commented 1 year ago

I think all the open connections were dead network connections. I'll grab more info the next time it happens.

That being said, I've been killing and recreating this lab almost all day long for 3 or 4 days.

grossmj commented 1 year ago

Hi @spikefishjohn, thanks for debugging this :+1:

I made a PR to attempt to fix this issue: https://github.com/GNS3/gns3-server/pull/2223

Do you think you can run the GNS3 server with that PR to see if you can reproduce the problem?

Thanks for your help.

spikefishjohn commented 1 year ago

Sure thing, I'll see if I can patch it into my gns3server. I'm still running a hacked 2.2.331 until 3.0.0 gets into a working state.

grossmj commented 1 year ago

> Sure thing, i'll see if i can patch it into my gns3server. I'm still running a hacked 2.2.331 until 3.0.0 gets in a working state.

@spikefishjohn have you got a chance to try?

spikefishjohn commented 1 year ago

patching now

spikefishjohn commented 1 year ago

OK, patched. I have a script set up to count and dump all open files of the gns3server process every 5 minutes.

I also have a second script set up to create 50 nodes, start them one at a time, and then delete them, in a loop.

Good luck everybody!

spikefishjohn commented 1 year ago

Oh, it also links everything, in case that's related.

spikefishjohn commented 1 year ago

OK, I modified this to show all open files instead of just gns3server's.

spikefishjohn commented 1 year ago
root@compute01:~# lsof -n +c0 | egrep gns3server | egrep CLOSE_WAIT | wc -l
41664
root@compute01:~# lsof -n +c0 | egrep gns3server | egrep -v CLOSE_WAIT | wc -l
14026
root@compute01:~#
grossmj commented 1 year ago

Would CLOSE_WAIT be the normal behavior?

spikefishjohn commented 1 year ago

OK, looks like the too-many-open-files issue happened again. I'll need some time to do the day-job thing before I can look at this again, but I should have a listing of lsof from when the issue started.

spikefishjohn commented 1 year ago

I took a quick glance. It's all loopback connections that are causing the problem.

root@compute01:~# lsof -n +c0 | grep gns3server | egrep CLOSE_WAIT | egrep 127.0.0.1 | wc -l
72960
root@compute01:~# lsof -n +c0 | grep gns3server | egrep CLOSE_WAIT | egrep -v 127.0.0.1 | wc -l
0
root@compute01:~#

here is a small sample

gns3server                            3400362 1675388 gns3server                   gns3  359u     IPv4            6088039        0t0        TCP 127.0.0.1:52518->127.0.0.1:7041 (CLOSE_WAIT)
gns3server                            3400362 1675388 gns3server                   gns3  360u     IPv4            6308798        0t0        TCP 127.0.0.1:36244->127.0.0.1:7032 (CLOSE_WAIT)
gns3server                            3400362 1675388 gns3server                   gns3  361u     IPv4            5885963        0t0        TCP 127.0.0.1:49306->127.0.0.1:7061 (CLOSE_WAIT)
gns3server                            3400362 1675388 gns3server                   gns3  362u     IPv4            9631723        0t0        TCP 127.0.0.1:59148->127.0.0.1:7006 (CLOSE_WAIT)
gns3server                            3400362 1675388 gns3server                   gns3  363u     IPv4            6070158        0t0        TCP 127.0.0.1:44682->127.0.0.1:7043 (CLOSE_WAIT)
gns3server                            3400362 1675388 gns3server                   gns3  364u     IPv4            5882190        0t0        TCP 127.0.0.1:50136->127.0.0.1:7063 (CLOSE_WAIT)
gns3server                            3400362 1675388 gns3server                   gns3  365u     IPv4           10996008        0t0        TCP 127.0.0.1:55006->127.0.0.1:7004 (CLOSE_WAIT)

As far as whether that's normal? I'm not sure. I'm not about to claim I understand socket programming.

spikefishjohn commented 1 year ago

So I'm reading this:

https://web.archive.org/web/20170113135705/http://unix.derkeiler.com/Mailing-Lists/SunManagers/2006-01/msg00367.html

Explanation:
Crist Clark
------------

CLOSE_WAIT means that the local end of the connection has received
a FIN from the other end, but the OS is waiting for the program at the
local end to actually close its connection.

The problem is your program running on the local machine is not closing
the socket. It is not a TCP tuning issue. A connection can (and quite
correctly) stay in CLOSE_WAIT forever while the program holds the
connection open.

Once the local program closes the socket, the OS can send the FIN to
the remote end which transitions you to LAST_ACK while you wait for
the ACK of the FIN. Once that is received, the connection is finished
and drops from the connection table (if your end is in CLOSE_WAIT
you do _not_ end up in the TIME_WAIT state).

Eric Voisard
-------------

Afaik, there is no ndd parameter which affects the tcp CLOSE_WAIT duration.
There was "tcp_close_wait_interval" but it has been obsoleted and renamed to
"tcp_time_wait_interval" because in reality it affects the TIME_WAIT timeout
and not the CLOSE_WAIT. So, you can try to change it but I doubt it'll have
any effect since they're different things...

Otoh, from what I know, it's the responsibility of an application (i.e. not
to the OS) to close its socket once the remote computer closes its side of
the TCP communication.
RFC 793 says CLOSE_WAIT is the TCP/IP stack waiting for the local application
to release the socket. So, it hangs because it has received the information
that the remote host has initiated a disconnection and is closing its
socket, upon what the local application did not close its own side.

So maybe the solution consists in finding a bug fix for your application...

Or more dangerously because they still have right to send remaining data in
queue, to kill processes in CLOSE_WAIT state...

Which makes it sound like GNS3 isn't releasing the socket. Since it's a loopback connection, I'm assuming it should be closed in two places, right? Like the client side and the server side?
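The quoted explanation can be reproduced in a few lines: when the peer sends its FIN and the local program never calls close(), the kernel parks the local socket in CLOSE_WAIT indefinitely. A self-contained Linux sketch that reads the state back out of /proc/net/tcp:

```python
import socket
import time

# Server end that accepts a connection but never closes it.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

cli = socket.socket()
cli.connect(("127.0.0.1", port))
conn, _ = srv.accept()

cli.close()       # peer sends its FIN ...
time.sleep(0.2)   # ... and the never-closed `conn` now sits in CLOSE_WAIT

# In /proc/net/tcp, 0100007F is 127.0.0.1 and state 08 is CLOSE_WAIT.
local = f"0100007F:{port:04X}"
with open("/proc/net/tcp") as f:
    states = [line.split()[3] for line in f.readlines()[1:]
              if line.split()[1] == local]
print(states)     # the listener shows 0A (LISTEN), `conn` shows 08

conn.close()      # the fix: closing our side releases the descriptor
srv.close()
```

And yes: on a loopback connection each side is an independent socket, so each process (or each end within one process) must close its own descriptor.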

spikefishjohn commented 1 year ago

I'm seeing a pattern: all the ports seem to be in the console range (around 7000).

; First console port of the range allocated to devices
; stay away from 5000 range so 5060 doesn't get hit.
console_start_port_range = 7000
; Last console port of the range allocated to devices
console_end_port_range = 9000

spikefishjohn commented 1 year ago
root@compute01:~# lsof -n | egrep 127.0.0.1 | grep CLOSE_WAIT | wc -l
72960
root@compute01:~# lsof -n | egrep 127.0.0.1 | egrep '127.0.0.1:[789][0-9]+ \(CLOSE_WAIT\)' | wc -l
72960
spikefishjohn commented 1 year ago

I did a little reading on python debugging and put together this debug.

/usr/share/gns3/gns3-server/bin/python -X dev -X tracemalloc=100  -bb  /usr/bin/gns3server --log /var/log/gns3/gns3.log --record /var/log/gns3/curl.log -d >& gns3-output.txt 

gns3-output.txt contains the following. What I'm doing to trigger this is: create a node, start it, then delete it. I'll see if it still happens if I stop it before deleting next. The output below is triggered after I stop the gns3server process.

/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/web/route.py:167: DeprecationWarning: "@coroutine" decorator is deprecated since Python 3.8, use "async def" instead
  func = asyncio.coroutine(func)
/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/utils/file_watcher.py:63: DeprecationWarning: There is no current event loop
  asyncio.get_event_loop().call_later(self._delay, self._check_config_file_change)
/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/crash_report.py:96: DeprecationWarning: distro.linux_distribution() is deprecated. It should only be used as a compatibility shim with Python's platform.linux_distribution(). Please use distro.id(), distro.version() and distro.name() instead.
  "os:linux": " ".join(distro.linux_distribution()),
/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/web/web_server.py:276: DeprecationWarning: There is no current event loop
  self._loop = asyncio.get_event_loop()
/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/web/web_server.py:320: DeprecationWarning: Application.make_handler(...) is deprecated, use AppRunner API instead
  self._handler = self._app.make_handler()
/usr/share/gns3/gns3-server/lib/python3.10/site-packages/aiohttp/web_app.py:221: DeprecationWarning: There is no current event loop
  loop = asyncio.get_event_loop()
/usr/share/gns3/gns3-server/lib/python3.10/site-packages/aiohttp/web_app.py:387: DeprecationWarning: The object should be created within an async function
  return Server(
/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/web/web_server.py:86: DeprecationWarning: There is no current event loop
  self._server, startup_res = self._loop.run_until_complete(asyncio.gather(srv, self._app.startup()))
/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/controller/compute.py:490: DeprecationWarning: with timeout() is deprecated, use async with timeout() instead
  with async_timeout.timeout(timeout):
/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/web/route.py:43: DeprecationWarning: Setting custom Request.json attribute is discouraged
  request.json = {}
/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/web/route.py:48: DeprecationWarning: Setting custom Request.json attribute is discouraged
  request.json = json.loads(body.decode('utf-8'))
/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/web/web_server.py:225: DeprecationWarning: There is no current event loop
  loop = asyncio.get_event_loop()
/usr/lib/python3.10/asyncio/selector_events.py:702: ResourceWarning: unclosed transport <_SelectorSocketTransport fd=17>
  _warn(f"unclosed transport {self!r}", ResourceWarning, source=self)
Object allocated at (most recent call last):
  File "/usr/bin/gns3server", lineno 8
    sys.exit(main())
  File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/main.py", lineno 83
    run()
  File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/run.py", lineno 261
    server.run()
  File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/web/web_server.py", lineno 332
    self._loop.run_forever()
  File "/usr/lib/python3.10/asyncio/base_events.py", lineno 600
    self._run_once()
  File "/usr/lib/python3.10/asyncio/base_events.py", lineno 1888
    handle._run()
  File "/usr/lib/python3.10/asyncio/events.py", lineno 80
    self._context.run(self._callback, *self._args)
  File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/aiohttp/web_protocol.py", lineno 435
    resp = await request_handler(request)
  File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/aiohttp/web_app.py", lineno 504
    resp = await handler(request)
  File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/web/route.py", lineno 277
    response = await control_schema(request)
  File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/web/route.py", lineno 198
    await func(request, response)
  File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/handlers/api/compute/qemu_handler.py", lineno 214
    await vm.start()
  File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/compute/qemu/qemu_vm.py", lineno 1105
    await self.start_wrap_console()
  File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/compute/base_node.py", lineno 379
    (reader, writer) = await asyncio.open_connection(host="127.0.0.1", port=self._internal_console_port)
  File "/usr/lib/python3.10/asyncio/streams.py", lineno 47
    transport, _ = await loop.create_connection(
  File "/usr/lib/python3.10/asyncio/base_events.py", lineno 1089
    transport, protocol = await self._create_connection_transport(
  File "/usr/lib/python3.10/asyncio/base_events.py", lineno 1116
    transport = self._make_socket_transport(sock, protocol, waiter)
  File "/usr/lib/python3.10/asyncio/selector_events.py", lineno 61
    return _SelectorSocketTransport(self, sock, protocol, waiter,
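The trace ends at asyncio.open_connection(), and the ResourceWarning above ("unclosed transport") is the symptom you get when the returned StreamWriter is abandoned instead of closed. A hedged sketch of the cleanup pattern (probe_console is a hypothetical name, not the GNS3 function):

```python
import asyncio

async def probe_console(port: int) -> None:
    # Open a connection to a local console port and always release it.
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    try:
        pass  # ... exchange data with the console here ...
    finally:
        writer.close()              # schedules the transport for closing
        await writer.wait_closed()  # descriptor returned to the OS here

async def main() -> None:
    # Stand-in server so the sketch is runnable end to end.
    server = await asyncio.start_server(
        lambda r, w: w.close(), "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    await probe_console(port)
    server.close()
    await server.wait_closed()

asyncio.run(main())
```

Without the close()/wait_closed() pair, each connection leaves a transport (and its descriptor) alive until garbage collection, which matches the slow descriptor build-up seen in the lsof counts above.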
spikefishjohn commented 1 year ago

Stopping before deleting doesn't change anything.

spikefishjohn commented 1 year ago

I'm looking at my code (v2.33.1) and I'm noticing this was changed in 2.2.36. Looks like @itspa1 is running 2.2.37.

I'll spin up a new box and see if I can replicate with the most recent build.

spikefishjohn commented 1 year ago

I can't replicate the first error I listed above (the bit about the unclosed transport) while running 2.2.40. I'm going to have to upgrade my main rig to fully test this.

grossmj commented 1 year ago

@itspa1 do you still have this issue? have you had a chance to test with a recent version of GNS3? Thanks.

grossmj commented 1 year ago

I believe this has been fixed. Please comment back if you still have the problem.