MetPX / sarracenia

https://MetPX.github.io/sarracenia
GNU General Public License v2.0

AM socket forking connections despite remaining ESTABLISHED #908

Open andreleblanc11 opened 10 months ago

andreleblanc11 commented 10 months ago

It was noticed in the overnight logs that the AM socket will occasionally fork new connections for some IPs, even though nothing wrong shows up in the logs. netstat shows that the existing connections remain ESTABLISHED despite the new forks.

It's also worth noting that there isn't any data loss and the forked connections take up the workload from their predecessors.

Stats

Last log lines of an unexpected forked connection (No indications of errors)

2024-01-26 13:40:26,888 [DEBUG] sarracenia.flow run  not throttling: limit: 0
2024-01-26 13:40:26,888 [DEBUG] sarracenia.flow run sleeping for 0.08

At the time, there were 20 total connections forked from the parent. Looking at the IPs in netstat, they match with the forking order shown inside the parent log.

netstat -anp | grep :5005 | grep ESTABLISHED | wc -l
21

Initial analysis

My initial guess is that we aren't losing connections from the connected hosts. However, they are still trying to connect on the same port (for whatever reason?) and we need to find a way to deal with those connections.

If that's the case, a simple solution could be to introduce a list of alreadyAcceptedIPs and check it before forking, along the lines sketched below. Would need to test to confirm the theory.
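
Roughly something like this (illustrative only; alreadyAcceptedIPs, accept_loop and handle_connection are hypothetical names, not the existing amserver code):

    import os
    import socket

    alreadyAcceptedIPs = set()   # remote IPs that already have a forked child

    def accept_loop(listen_sock: socket.socket):
        while True:
            conn, (remote_ip, _port) = listen_sock.accept()
            if remote_ip in alreadyAcceptedIPs:
                # a child already serves this host: drop the duplicate instead of forking again
                conn.close()
                continue
            alreadyAcceptedIPs.add(remote_ip)
            if os.fork() == 0:
                listen_sock.close()          # child: the listening socket isn't needed here
                handle_connection(conn)      # hypothetical per-connection handler
                os._exit(0)
            conn.close()                     # parent: the child owns the accepted socket now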

petersilva commented 10 months ago

This could be correct behaviour.... are all of the connections live, or only some? Someone could configure multiple connections from the same host. It's not necessarily an error.

petersilva commented 10 months ago

"continuously" ... that means forking a lot... but you only report 20 over a night ... 12 hours? need a better handle on frequency... to me continuously would mean forking every minute or so... constantly would be every second... forking a couple of times a night? "occasionally" ... ?

andreleblanc11 commented 10 months ago

The connection on the forked orphan gets dropped once the new fork gets created.

Comparing with the current Sundew AM sockets, it looks like the connections get broken every so often and sundew auto-restarts them. This might correlate with what is happening with the sr3 AM sockets.

For the same IP,

On sundew, 

`[ERROR] connection broken` appears in the logs at 3:23, 6:26 and 9:13 Z last night

On sr3,

Connections got forked at 2:11, 3:11, 5:11, 8:11 and 10:11 Z last night

petersilva commented 10 months ago

*.11 ? that looks like a cron job... oh... and it happens on Sundew as well? hmm... OK, so that means it's not really a problem... we're just seeing how it has "always" worked... Do all the remotes get re-forked at the same time? or do different remotes have different patterns?

andreleblanc11 commented 10 months ago

No the timestamps of the forks are different for different hosts. Here is the pattern of the other host from last night,

On sundew, 

`[ERROR] connection broken` appears in the logs at 0:01, 4:54, 7:35, 10:17, 15:40 Z on this day

On sr3,

Connections got forked at 0:41, 2:53, 5:04, 7:15, 9:26, 11:38 and 13:49 Z on this day

andreleblanc11 commented 10 months ago

I need to look at how sundew handles broken connections

andreleblanc11 commented 10 months ago

So there is a cronjob that runs every minute and restarts the processes when they are hung/stopped. Looking at its logs, it seems to be restarting the AM feeds when the connection gets lost.

petersilva commented 10 months ago

OK... so far as I understand it:

So... on this basis, sr3 is doing better than Sundew (from an availability standpoint?) The problem is that we get higher and higher instance counts over time?

We need for the sr3 stuff to "notice" that the other end has died, and free up lower instance numbers for amserver to use, so it doesn't monotonically increase?

Is that the problem we currently see? or is there something else?

andreleblanc11 commented 10 months ago

Not quite. I don't think Sundew is noticing sr3 connections and then going down. Rather, it is handling socket errors worse than sr3 does.

Looking at the Sundew logs, when the connection gets interrupted,

2024-01-26 20:29:24,151 [INFO] Routing table used: /apps/px/etc/my-config.conf
Traceback (most recent call last):
  File "/usr/bin/pxReceiver", line 160, in <module>
    gateway.run()
  File "/usr/lib/px/gateway.py", line 141, in run
    data = self.read()
  File "/usr/lib/px/receiverAm.py", line 89, in read
    data = self.unSocketManagerAm.getNextBulletins()
  File "/usr/lib/px/socketManager.py", line 232, in getNextBulletins
    self.__syncInBuffer()
  File "/usr/lib/px/socketManager.py", line 287, in __syncInBuffer
    raise socketManagerException('connection lost')
socketManager.socketManagerException: connection lost

Looking at the Sundew code, https://github.com/MetPX/Sundew/blob/main/lib/socketManager.py#L280-L287 , it looks like the method used to determine if a connection is lost doesn't work well. Looking at the pxChecker cronjob on NCP, it restarts AM servers 40 to 45 times per day, on average.

Sr3 is just better at handling these socket failures than Sundew. Our code is about the same as Sundew's, except we don't display error messages when connections get lost; we should, though.

https://github.com/MetPX/sarracenia/blob/d13f5c7f8169fd95298af4550d567f6ef0c3f941/sarracenia/flowcb/amserver.py#L195-L197

And, instead of having a cron that restarts the process, sr3 constantly listens for new connections, which would be why the instance number keeps climbing.

petersilva commented 10 months ago

ok, so the question is... we see these errors are happening... we understand them... is there some other behaviour that we think we need to have? Do we need to

Is this something we want/need to do? if so, why... what problems are causing pain... how bad is it.

andreleblanc11 commented 10 months ago

For your first remark, we could add socket.shutdown before calling socket.close as the connection is deemed dead from that point on.

As per the socket python3 documentation

Note close() releases the resource associated with a connection but does not necessarily close the connection immediately. If you want to close the connection in a timely fashion, call shutdown() before close().

I don't think this would kill the process however. May need to add a SIGTERM/SIGKILL to try and remove the process.
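
A minimal sketch of that first idea, assuming we hold the accepted socket in conn (the exact call site in amserver may differ):

    import socket

    def drop_connection(conn: socket.socket) -> None:
        # shutdown() first so the peer sees the connection end promptly,
        # then close() to release the resource (per the docs quoted above)
        try:
            conn.shutdown(socket.SHUT_RDWR)
        except OSError:
            pass   # the remote end may already be gone
        conn.close()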

For your second remark, I think we would want to keep the number of log files low because, as an analyst, going through the logs when there are 50 versions of flow_amserver_XX.log can be annoying and can get confusing rather quickly.

To give an example, in about 24 hours, the current NCP has forked already existing connections 27 times, which resulted in having log files for 36 instances in total.

petersilva commented 10 months ago

after the shutdown and the close, you do:


self.please_stop()

that will get the flowcb to trigger things to shut down cleanly... and it should remove its pid file while it is shutting down. so when forking a new connection, you can check for the existence of pid files, and if there are some missing ones, then you can re-use that instance for the new connection.

If that doesn't work... can just have sys.exit(0) after the socket close()... but that might be messier (will not remove pid file...)
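
A rough sketch of that pid-file idea (the file naming pattern here is hypothetical, not necessarily what sarracenia actually writes on disk):

    import os

    def lowest_free_instance(pid_dir: str, prefix: str, max_instances: int) -> int:
        # scan for missing or stale pid files and return the lowest instance
        # number that can be recycled for the next forked connection
        for i in range(1, max_instances + 1):
            pid_file = os.path.join(pid_dir, f"{prefix}_{i:02d}.pid")
            if not os.path.exists(pid_file):
                return i                     # never used, or cleaned up on shutdown
            try:
                with open(pid_file) as f:
                    pid = int(f.read().strip())
                os.kill(pid, 0)              # signal 0: existence check only
            except (ValueError, ProcessLookupError):
                return i                     # stale or unreadable pid file: recycle it
        return max_instances + 1             # everything is live: grow as before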

petersilva commented 10 months ago

oh... and socket.shutdown sounds like a good thing to do... also.

petersilva commented 10 months ago

my last post is wrong... the please_stop thing won't do anything useful...

petersilva commented 10 months ago

might be self.on_stop() that you need.

andreleblanc11 commented 9 months ago

Adding the on_stop and the additional logging doesn't seem to have caught the problem. The orphaned instance shows nothing in its logs indicative of an error. Sometimes the instance will still be waiting for its first data reception when a new connection to the same IP overtakes it.

2024-02-01 05:45:46,894 [INFO] sarracenia.flowcb.gather.am __WaitForRemoteConnection__ Starting up service with host my.ip.address
2024-02-01 05:45:46,895 [INFO] sarracenia.flowcb.gather.am __WaitForRemoteConnection__ Connection accepted with IP my.ip.address on port 5005. Starting service.

This is all that is found inside of such a log file.

From this symptom, I think the socket connection isn't "failing", but rather we're getting new connection attempts from the remote host, and because we still permit new connections from the same IP (as it's allowed in self.o.AllowIPs), the new connection overtakes the first one. Will try to check for this.

petersilva commented 9 months ago

we had a meeting to chat about this, and found: https://docs.python.org/3/library/socket.html

Changed in version 3.5: If the system call is interrupted and the signal handler does not raise an exception, the method now retries the system call instead of raising an InterruptedError exception (see PEP 475 for the rationale).

From what we could see, the am server is waiting in a recv() call and when the connection drops it returns 0 bytes but gives no indication of a connection failure, so the program keeps running obliviously.

Still studying this problem.
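
For reference, a minimal standalone illustration of the recv() behaviour described above (not the amserver code itself):

    import socket

    def read_chunk(conn: socket.socket, limit: int = 32768) -> bytes:
        data = conn.recv(limit)              # blocks until data, FIN, or error
        if data == b'':
            # orderly shutdown from the peer: recv() returns an empty byte
            # string, with no exception raised
            raise ConnectionError("remote end closed the connection")
        # if the peer vanishes without sending a FIN/RST (crash, network cut),
        # recv() just keeps blocking: no error and no empty string, which is
        # why an instance can sit there waiting forever
        return data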

andreleblanc11 commented 7 months ago

Even with the new code added from #997 , which raises an exception and exits when an empty byte string is received on the socket, it still isn't able to close the interrupted socket line properly.

I added a logger.critical to see the output of the buffer after the socket gets interrupted, and it will just stop receiving data after the last bulletin arrives, with no indication of interruption from the other end.

andreleblanc11 commented 7 months ago

Last night, of the 15 interrupted sockets, only once was it able to close the connection by receiving an empty byte string. I have no idea why this is the case.

petersilva commented 7 months ago

things one could try... I read the socket documentation linked above... these are things that can be tried:

     recv(self.limit,socket.MSG_DONTWAIT)

to make it truly async. got this from here: https://manpages.debian.org/bookworm/manpages-dev/recv.2.en.html

also... there is a line "failed to launch child" ... after that happens nothing good can happen in that process... it's a child process running the waiting daemon code... probably want a sys.exit(1) after that message is printed.

I also think the code might be better off if the WaitForRemoteConnection got revamped a bit:

from typing import NoReturn

def __WaitForRemoteConnections__(self) -> NoReturn:

reading about NoReturn, it was introduced in 3.6.2 and the python on ubuntu 18.04 is 3.6.8 so we should be ok for that.

also:

andreleblanc11 commented 7 months ago

I've tried two things on the NCPs.

I've tried to remove the timeout logic and replace it with the MSG_DONTWAIT flag and the appropriate condition in the Exception handler.

        try:
            tmp = self.conn.recv(self.limit, socket.MSG_DONTWAIT)

...

        except Exception as e:
            err = e.args[0]
            # For non blocking connections, the errors below are raised if the blocking connection would normally block.
            if err in (errno.EAGAIN, errno.EWOULDBLOCK):
                logger.critical(f"No data available. Keep going. Error message {err=} , {errno.EAGAIN=} , {errno.EWOULDBLOCK=}")
                return
...

I've also tried keeping the timeout and having the TimeoutError handler close the socket whenever 'timed out' isn't found in the exception string. This was inspired by what I found in this Stack Exchange thread

        # We don't want to wait on a hanging connection. We use the timeout error to exit out of the reception if there is nothing.
        # This in turn makes the whole flow the same as any other sarracenia flow.
        except TimeoutError as e:
            err = e.args[0]
            if err == 'timed out':
                logger.critical("Timeout invoked!")
                return
            else:
                logger.critical(f"Wrong timeout error invoked. Error : {err}")
                self.conn.close()
                sys.exit(1)

...

Both of these methods were unfortunately unable to make any difference in closing the socket :(

andreleblanc11 commented 7 months ago

If we decide to stay with the timeout-based (non-blocking) sockets, having that extra guard in the exception handler would still be worthwhile.

petersilva commented 7 months ago

I don't have any other ideas... you're saying it just hangs there not producing any error messages... or that it just returns with an empty buffer just like a timeout, so we cannot tell the difference between connection being broken, and no message available?

You're sure the instance is still there hanging around? or did it just exit... and nothing is cleaning up the pid files.

Need to identify some way to differentiate those two cases.

fwiw... the call in on_start ... self.conn = self.__waitFor... the "self.conn =" assignment does not make any sense (any more.) I wonder if linting tools will eventually pick up assigning the return value of a NoReturn...

petersilva commented 7 months ago

I think you should try putting shutdown logic in the tmp==b'': thing... instead of raising an exception, do conn.close and exit right there.... I worry about exception handling being wonky.
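
Something along these lines, as a standalone sketch (self.conn / self.limit in the real code; plain arguments here):

    import sys
    import socket
    import logging

    logger = logging.getLogger(__name__)

    def recv_or_exit(conn: socket.socket, limit: int) -> bytes:
        tmp = conn.recv(limit)
        if tmp == b'':
            # peer closed the connection: tear down and exit on the spot,
            # rather than raising and relying on the surrounding except clauses
            logger.error("connection closed by remote host, stopping this instance")
            try:
                conn.shutdown(socket.SHUT_RDWR)
            except OSError:
                pass
            conn.close()
            sys.exit(0)
        return tmp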

petersilva commented 7 months ago

socktest.tgz.txt

If you're bored... I made small test-bed to isolate just the socket related stuff. You can experiment with behaviours a lot more easily/quickly with the small sample.

un-tar the thing, and you have:


socktest/receiver.py
socktest/listener.py
socktest/sender.py

cd socktest
python3 listener.py

will start up a listener... then open another shell and python3 sender.py and the listener will fork/exec a receiver, and the sender will send an ascii message to the listener every few seconds. there's no AM stuff in there... just straight ascii payloads.

It seems to work fine, using all the things we do, except it exits immediately on recv returning an empty string. Didn't have any other ideas immediately.

andreleblanc11 commented 7 months ago

What's strange is that when I perform tests, I will start up a socket writer (amtcpwriter) from the regions and send to the sr3 AM server. Once my test is completed, I'll 'Ctrl + C' the running process in the region and the socket connection will close fine from the sr3 AM receiver.

Out of the 8 IPs currently sending to NCP, 2 will get their socket connections reset (closed? shut down?) regularly, every couple of hours. I wonder if this may be isolated to the regional servers themselves.

Also, I added some logging in the empty byte string condition while testing the socket timeout, and I was able to find that the socket doesn't receive any data after a new connection from the same IP forks.

So,

  1. A new connection gets detected from a destination IP and gets forked by the AM server.
  2. The previous connection from that same IP doesn't receive any data after its last bulletin reception and just waits for new data indefinitely.

petersilva commented 7 months ago

did you try replacing the "raise Exception" code for the null recv return case with code to do an explicit immediate exit()? I'm wondering if surrounding try/except stuff is weird in a way that takes a while to blow up.

andreleblanc11 commented 7 months ago

Yes, I tried that and it didn't change anything in the outcome 😢 . I worked with @kurt2x on trying to find if the MetManager in the regionals could be the source of the problem, but we couldn't find anything obvious.

Temporary workaround for now:

After this, we can look at recycling instance numbers to avoid having exceedingly high numbers when the server runs for long periods of time.

andreleblanc11 commented 7 months ago

While testing my orphaned instance kill script, I discovered that child processes need the "parent's approval" to be properly removed.

What this means is that the parent process needs to collect the status of the zombified (already killed, but waiting) child process before it can be removed. Zombie child processes will just wait indefinitely for either:

  1. The parent process to die
  2. The parent process to acknowledge the zombie's presence so it can be disposed of.

I'm testing a fix for this now. I'm integrating os.waitpid after the socket forking to check for zombified children.
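
The reaping step would look something like this (a sketch; os.WNOHANG keeps the parent's accept loop from blocking):

    import os

    def reap_finished_children() -> None:
        # collect any children that have already exited so they don't linger
        # as zombies waiting for the parent to acknowledge them
        while True:
            try:
                pid, _status = os.waitpid(-1, os.WNOHANG)   # -1: any child, non-blocking
            except ChildProcessError:
                break        # no children left at all
            if pid == 0:
                break        # children exist, but none have exited yet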

andreleblanc11 commented 6 months ago

It turns out that there are more than 2 regional servers with this problem. There are two more that I was able to identify while running my kill script in simulation mode (do everything but kill the instances).

The difference with the 2 newly identified servers is that the orphaned children get spawned significantly less often than on the other 2.

andreleblanc11 commented 6 months ago

It looks like, for a server with 4 GB of memory and 4 GB of swap, the maximum number of AM server instances it can accumulate before maxing out capacity is somewhere between 550 and 600.

The kill script + most recent patch work. I'm working on some #980 patches right now before merging to the operational code.

andreleblanc11 commented 1 week ago

Temporary workaround for now:

  • Have a script that maps the pid # from the socket connections (netstat) to the sr3 amserver instance numbers (ps) for a given IP.
  • Sort the sr3 amserver instance numbers by lowest to highest and kill all the processes except the highest numbered one. The highest instance number will always be the active connection.

With the changes in https://github.com/MetPX/sarracenia/commit/9155d52e793ab21f1a7b556e6da168b81853fafc, this logic doesn't work anymore.

Would probably need to sort by start time now.

petersilva commented 1 week ago

This script does not sound like it ever made much sense... the amserver will often have >1 instance active... like ncp should have up to 8 servers connected at a time... It sounds like killing all but the last one... would kill 7 good ones (if they were using AM to uplink... which I think they all are?) ??

andreleblanc11 commented 1 week ago

I've just changed the script, and am testing now. The way it works is:

For each IP we want to check:

  1. If there is more than one socket connection found (with netstat), get the EPOCH start time of the associated PID for each socket (stat -c %Y /proc/$PID/status).
     i. This is the part that I've changed in the script. It used to get the AM server sr3 instance numbers.
  2. Sort the entries
  3. Kill all processes except the one with the most recent EPOCH start time, along the lines of the sketch below.
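
In Python terms, the per-IP logic amounts to something like this (a sketch: the netstat parsing that produces pids_for_ip is left out):

    import os
    import signal
    from typing import List

    def kill_stale_instances(pids_for_ip: List[int]) -> None:
        # pids_for_ip: amserver PIDs holding an ESTABLISHED socket to one remote IP
        if len(pids_for_ip) <= 1:
            return                                   # single connection: nothing to clean up

        def start_time(pid: int) -> float:
            # mtime of /proc/<pid>/status, mirroring `stat -c %Y /proc/$PID/status`
            return os.stat(f"/proc/{pid}/status").st_mtime

        for pid in sorted(pids_for_ip, key=start_time)[:-1]:
            os.kill(pid, signal.SIGTERM)             # keep only the most recently started one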