MDSplus / mdsplus

The MDSplus data management system
https://mdsplus.org/

Fix: reduce open files due to dispatcher #2740

Closed -- zack-vii closed this 7 months ago

zack-vii commented 7 months ago

This is related to issue #2731 and may fix some if not all of the related problems.

zack-vii commented 7 months ago

Bummer, I cannot access Jenkins, so it's unclear to me what went wrong. I will test locally tomorrow.

mwinkel-dev commented 7 months ago

Hi @zack-vii,

Thanks for submitting this proposed fix for GA's leaking sockets, Issue #2731.

In addition to the files you've changed, I've found it useful to modify a few others. My conjecture, perhaps wrong, is that the final solution will be a melding of our two proposed fixes.

I am now building your PR locally on my dev system and running it through my suite of tests. Will post results here within an hour or two.

mwinkel-dev commented 7 months ago

Hi @zack-vii,

Testing demonstrated that this PR fixes the primary case, but fails on an edge case.

Pro -- This PR is a server-side fix for the primary cause of leaked sockets (i.e., the situation GA encountered). It is thus a more elegant solution than the client-only fix that I created. I agree that this PR should be used. On Thursday or Friday, I will complete a review of this PR.

Con -- If mdstcl dispatches actions so quickly that it overloads the action server, then some actions will be marked as "failed" by the client. Failed actions remain connected to the action server but never receive a SrvJobFINISHED, and so become zombie actions. Thus they leak sockets. Even with this PR (#2740), my testing was able to cause ~100 sockets to leak. (I am working on a client-side fix for that edge case. However, you are more familiar with this code than I am. A-OK for us both to work on it and then compare our solutions.)

Other -- Looks like the Client_do_message() of Client.h needs a Client_remove(c, factive) call in the case SrvJobFINISHED section.

Next Steps -- I appreciate the assistance and collaboration. Here is my suggestion on how we should proceed:

Addendum -- Many of the following posts by @mwinkel-dev are conjectures and turned out to be wrong. For a summary of the investigation, refer to this post in Issue #2731: https://github.com/MDSplus/mdsplus/issues/2731#issuecomment-2067188950

zack-vii commented 7 months ago

Regarding the Con: I think the expected behavior is that the server holds one connection to each action server. That is, even after the phase is done, the shot is over, and the cycle begins anew, the connection may be reused. The issue on the client side was that it did not properly detect whether the socket was disconnected. As far as I can tell, the current implementation has only one detached thread that handles the listening socket for incoming action_server replies as well as the established connections.

My simple but effective TDI script for testing is:

_root=getenv('MDSPLUS_DIR')
setenv(CONCAT('test_path=',_root,'/test_path'))
_dispatch=BUILD_DISPATCH(2, 'DUMMY', 'INIT', 50, "")
_task=BUILD_PROGRAM(1, "echo test")
_action=BUILD_ACTION(`_dispatch, `_task)
treeopennew('test', 1)
treeaddnode("act", _n, 'ACTION')
treeputrecord("act", `_action)
FOR (_i=1;_i<=3000;_i++) EXECUTE("treeaddnode($1, _n, 'ACTION');treeputrecord(`$1, `$2);", EXECUTE("DECOMPILE(`$)", _i), _action)
treewrite()
treeclose()
setenv('DUMMY=localhost:30000')
tcl('set tree test/shot=1')
tcl('dispatch act')
tcl('dispatch/build')
SPAWN(CONCAT('mdsip -s -p 30000 -h ', _root, '/testing/mdsip.hosts&'))
WRITE(1, _out)
WAIT(1)
tcl('dispatch/phase INIT')
WAIT(1)
SPAWN('killall mdsip')
SPAWN(CONCAT('mdsip -s -p 30000 -h ', _root, '/testing/mdsip.hosts&'))
WAIT(1)
tcl('dispatch/phase INIT')
WAIT(1)
tcl('dispatch/close')
SPAWN('killall mdsip')

and can be invoked from the development environment:

# after checkout; usual setup
./bootstrap
./configure --debug  # . . .
# enter development environment with all env vars set
make tests-env
# ch dir to root of repo
cd $MDSPLUS_DIR
# create folder for test tree
mkdir test_path
# update bins
make
# run test
gdb --args tditest dispatch-test.tdi
# update source ; goto update bins

The WAIT(1) calls are good spots to break in (gdb: Ctrl+C) and inspect p *Clients and p *Clients->next; you should find that Clients->next becomes NULL soon after the mdsip service is terminated. You can comment out (#) or move around the SPAWN calls, or even fire up an external server (make sure the env vars are set), to observe the behavior when the dispatch server restarts or stays alive in between dispatches.

It would be interesting to see how this holds up if you add an active monitor server or more involved action servers.

zack-vii commented 7 months ago

I may have found the issue with the python dcl_dispatcher_test.py. When checking for an existing connection, we need to check if the conid is still valid.
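
For readers following along, here is a minimal, self-contained sketch of one common way to test whether a cached socket is still connected before reusing its conid. This is illustrative only -- it is not the code in dcl_dispatcher_test.py or in the PR, and the helper name is hypothetical.

/*
 * socket_still_connected(): poll() for pending events and peek one byte;
 * a zero-byte read means the peer has closed the connection.
 */
#include <poll.h>
#include <sys/socket.h>

static int socket_still_connected(int sock)
{
  struct pollfd pfd = {.fd = sock, .events = POLLIN};
  int rc = poll(&pfd, 1, 0);          /* non-blocking check */
  if (rc < 0)
    return 0;                         /* poll() failed: treat as dead  */
  if (rc == 0)
    return 1;                         /* nothing pending: assume alive */
  if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL))
    return 0;                         /* error or hang-up              */
  char byte;
  return recv(sock, &byte, 1, MSG_PEEK | MSG_DONTWAIT) != 0;
}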

mwinkel-dev commented 7 months ago

Hi @zack-vii,

Thanks for the additional detail.

My test harness is comparable, and I do see the one retained connection that you describe. That works well and is not a concern.

The edge cases I am investigating will likely not arise often during practical use of mdstcl. However, it is a big deal if leaked sockets force GA to reboot the entire physical server. Thus, I have been stress testing the dispatch feature of mdstcl to characterize its failure modes. (My goal is to confirm that if the edge cases do arise, it won't be necessary to reboot the server every day.)

The current edge case I have been investigating consists of slow actions (e.g., wait(2.0);1;) with mdstcl dispatching at 10 actions per second or faster. That is how I ended up with ~100 leaked sockets.

Thanks for mentioning the action monitor and different dispatch phases. I will also do experiments with those features.

mwinkel-dev commented 7 months ago

Hi @zack-vii,

I've studied every line of this PR -- it is a nice fix! And the code refactoring also adds clarity.

After my lunch break, I will submit a review and approval for this PR.

mwinkel-dev commented 7 months ago

Hi @zack-vii,

Regarding the ServerQAction.c file, I like the changes you made.

In the ServerDispatchPhase.c file, moving the actions[i].dispatched = 1 line eliminates a race condition.
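
To illustrate why the ordering matters, here is a sketch (not the actual ServerDispatchPhase.c diff; action_t and send_action are hypothetical stand-ins): the flag must be set before the send, because a fast SrvJobFINISHED reply can otherwise be handled by the receiver thread while dispatched is still 0.

/* Hypothetical stand-ins for illustration; not MDSplus types or functions. */
typedef struct
{
  int dispatched; /* set when the action has been handed to the action server */
  int done;       /* set by the receiver thread on SrvJobFINISHED             */
} action_t;

static void dispatch_one(action_t *actions, int i, int (*send_action)(int))
{
  actions[i].dispatched = 1;     /* set BEFORE sending: the reply path may   */
                                 /* run before send_action() even returns    */
  if (!(send_action(i) & 1))     /* MDSplus convention: odd status = success */
    actions[i].dispatched = 0;   /* roll back only if the send itself failed */
}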

Note though that on the client-side, I encountered a different race condition. After the client calls SendArg(), it then immediately calls GetAnswerInfoTS() to read the status handshake (i.e., the value 1 for success). However, the action was completed by the server before the client had finished processing the GetAnswerInfoTS(). Normally that would not occur. But the stress test of hundreds of actions dispatched at once caused that scenario to arise. With my client-only fix (and actions that evaluate very quickly), occasionally one of the receiver threads was tearing down the connection / socket at the same time that the main thread's GetAnswerInfoTS() was using the connection, which usually resulted in a crash.

We won't be using my client-side fix, thus the specific problem I created with that fix vanishes.

However, I do wonder what will happen if the server-side fix kills both sockets (i.e., cleans up the client) while the client's main thread is in the midst of the GetAnswerInfoTS(). I believe that the client will detect that as an error and will correctly handle that error. However, I will do additional testing to confirm that is true.

mwinkel-dev commented 7 months ago

I will approve this PR after Jenkins is able to build it on all platforms. It presently fails on RHEL7 and Windows.

RHEL7

/opt/jenkins/workspace/MDSplus_PR-2740/rhel7/servershr/ServerQAction.c: In function 'find_client':
/opt/jenkins/workspace/MDSplus_PR-2740/rhel7/servershr/ServerQAction.c:897:3: error: 'for' loop initial declarations are only allowed in C99 mode
   for (ClientList **p = &Clients; *p != NULL; p = &(*p)->next)
   ^
/opt/jenkins/workspace/MDSplus_PR-2740/rhel7/servershr/ServerQAction.c:897:3: note: use option -std=c99 or -std=gnu99 to compile your code
/opt/jenkins/workspace/MDSplus_PR-2740/rhel7/servershr/ServerQAction.c: In function 'remove_client':
/opt/jenkins/workspace/MDSplus_PR-2740/rhel7/servershr/ServerQAction.c:924:3: error: 'for' loop initial declarations are only allowed in C99 mode
   for (ClientList **p = &Clients; *p != NULL; p = &(*p)->next)
   ^

Windows

/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c: In function 'setup_client':
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:949:14: error: format '%d' expects argument of type 'int', but argument 3 has type 'SOCKET' {aka 'long long unsigned int'} [-Werror=format=]
  949 |       MDSMSG("setup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
      |              ^~~~~~~~~~~~~~~~~~~~~~             ~~~~
      |                                                 |
      |                                                 SOCKET {aka long long unsigned int}
/opt/jenkins/workspace/MDSplus_PR-2740/windows/_include/mdsmsg.h:79:25: note: in definition of macro '__MDSMSG'
   79 |     pos += sprintf(pos, __VA_ARGS__);              \
      |                         ^~~~~~~~~~~
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:949:7: note: in expansion of macro 'MDSMSG'
  949 |       MDSMSG("setup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
      |       ^~~~~~
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:949:33: note: format string is defined here
  949 |       MDSMSG("setup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
      |                                ~^
      |                                 |
      |                                 int
      |                                %I64d
In file included from /opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:46:
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:949:14: error: format '%d' expects argument of type 'int', but argument 3 has type 'SOCKET' {aka 'long long unsigned int'} [-Werror=format=]
  949 |       MDSMSG("setup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
      |              ^~~~~~~~~~~~~~~~~~~~~~             ~~~~
      |                                                 |
      |                                                 SOCKET {aka long long unsigned int}
/opt/jenkins/workspace/MDSplus_PR-2740/windows/_include/mdsmsg.h:79:25: note: in definition of macro '__MDSMSG'
   79 |     pos += sprintf(pos, __VA_ARGS__);              \
      |                         ^~~~~~~~~~~
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:949:7: note: in expansion of macro 'MDSMSG'
  949 |       MDSMSG("setup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
      |       ^~~~~~
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:949:33: note: format string is defined here
  949 |       MDSMSG("setup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
      |                                ~^
      |                                 |
      |                                 int
      |                                %I64d
In file included from /opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:46:
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c: In function 'cleanup_client':
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:962:12: error: format '%d' expects argument of type 'int', but argument 3 has type 'SOCKET' {aka 'long long unsigned int'} [-Werror=format=]
  962 |     MDSMSG("cleanup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
      |            ^~~~~~~~~~~~~~~~~~~~~~~~             ~~~~
      |                                                 |
      |                                                 SOCKET {aka long long unsigned int}
/opt/jenkins/workspace/MDSplus_PR-2740/windows/_include/mdsmsg.h:79:25: note: in definition of macro '__MDSMSG'
   79 |     pos += sprintf(pos, __VA_ARGS__);              \
      |                         ^~~~~~~~~~~
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:962:5: note: in expansion of macro 'MDSMSG'
  962 |     MDSMSG("cleanup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
      |     ^~~~~~
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:962:33: note: format string is defined here
  962 |     MDSMSG("cleanup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
      |                                ~^
      |                                 |
      |                                 int
      |                                %I64d
In file included from /opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:46:
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:962:12: error: format '%d' expects argument of type 'int', but argument 3 has type 'SOCKET' {aka 'long long unsigned int'} [-Werror=format=]
  962 |     MDSMSG("cleanup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
      |            ^~~~~~~~~~~~~~~~~~~~~~~~             ~~~~
      |                                                 |
      |                                                 SOCKET {aka long long unsigned int}
/opt/jenkins/workspace/MDSplus_PR-2740/windows/_include/mdsmsg.h:79:25: note: in definition of macro '__MDSMSG'
   79 |     pos += sprintf(pos, __VA_ARGS__);              \
      |                         ^~~~~~~~~~~
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:962:5: note: in expansion of macro 'MDSMSG'
  962 |     MDSMSG("cleanup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
      |     ^~~~~~
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:962:33: note: format string is defined here
  962 |     MDSMSG("cleanup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
      |                                ~^
      |                                 |
      |                                 int
      |                                %I64d
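
The two failures boil down to a pre-C99 default compiler on RHEL7 and a 64-bit SOCKET type on Windows. Below is a minimal sketch of the kind of change that clears both diagnostics; the shape is assumed, and the committed fix may differ.

#include <inttypes.h>
#include <stdio.h>

/* Stand-in for the real ClientList from ServerQAction.c. */
typedef struct ClientList { struct ClientList *next; } ClientList;

static void walk_and_log(ClientList **Clients, uint64_t sock)
{
  /* RHEL7 (C89/C90 default): declare the loop variable before the loop. */
  ClientList **p;
  for (p = Clients; *p != NULL; p = &(*p)->next)
    ; /* ... per-client work ... */

  /* Windows: SOCKET is a 64-bit unsigned type under MinGW-w64, so either
   * cast to a width-matched type or widen the format specifier. */
  printf("setup connection %" PRIu64 "\n", sock);
}
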
zack-vii commented 7 months ago

However, I do wonder what will happen if the server-side fix kills both sockets (i.e., cleans up the client) while the client's main thread is in the midst of the GetAnswerInfoTS(). I believe that the client will detect that as an error and will correctly handle that error. However, I will do additional testing to confirm that is true.

The server should never cancel out of a regular mdsip request unless it times out between the messages of the same request, or is terminated or interrupted by something out of the ordinary. Hence the result of a request should be independent of the state of the task (scheduled, executing, done). Moreover, if a job is scheduled, the request should return success. A race condition may arise only if the reply is lost. Due to TCP, we would probably run into a timeout and assume the dispatch was incomplete when it actually was not. That should be very rare and would require an underperforming network, considering the traffic.

zack-vii commented 7 months ago

@mwinkel-dev thanks for pointing out the issues with the Jenkins checks. Looks like they are known issues that I simply paid no attention to. I will see if I can sort them out over the weekend... the power of Docker ;)

mwinkel-dev commented 7 months ago

Hi @zack-vii,

Thanks for answering my questions. And also for coming up with a much better fix than I had on the client-side.

Monday is a holiday for us. So we'll resume work on this next week.

zack-vii commented 7 months ago

Fingers crossed; it went through on my machine, but if it fails please point me to the failing platforms.

smithsp commented 7 months ago

@zack-vii @mwinkel-dev It looks like it passed. :-)

mwinkel-dev commented 7 months ago

Hi @zack-vii,

I have just approved this PR. It successfully passed the Jenkins build. And also passed much (but not all) of my testing. It can now be merged to the "alpha" branch.

Hi @smithsp,

This PR is a partial fix of Issue #2731. In my opinion, it is robust enough to handle GA's normal workflows. However, there are some "edge cases" that still leak sockets (albeit at a much slower rate than when GA had to reboot a server on 25-Mar-2024). You should decide if you want this partial fix now, or instead wish to wait until the full fix is available.

If you want this partial fix now, it will take us a day or so to cherry-pick it into the GA branch, build the packages, do another round of testing, and distribute the build to GA.

zack-vii commented 7 months ago

@mwinkel-dev : can you give me some details about the edge cases that are still leaking?

mwinkel-dev commented 7 months ago

Hi @zack-vii,

I am testing the "action server" as I would any web server -- normal load (which is A-OK), spike load (fails), heavy continuous load (still to do) and so forth.

For the spike load test, I have mdstcl dispatch hundreds of actions faster than the action server can execute them. In that scenario, some of the dispatched actions are reported as failing because they've lost connection to the action server. Each of those failed actions leaks a socket. I have generated ~100 leaked sockets with that test. (See https://github.com/MDSplus/mdsplus/pull/2740#issuecomment-2048811164 and https://github.com/MDSplus/mdsplus/pull/2740#issuecomment-2049854703.)

I have also noticed in the code that the action server has a limited port range of 100 to 200 ports. It is possible (probable?) that the spike load test consumes all of those ports.

Although the spike load condition is unlikely to arise during normal workflows at GA, if it ever does then it might force GA to reboot the physical server. Which disrupts the work of many users and thus is a serious problem.

My hunch is that the fix for the spike load test will involve a client-side (mdstcl) fix. And instead of doing load balancing and dynamically adjusting the number of available receiver threads, a simpler approach would be to add a throttle to mdstcl so it can never dispatch more actions than there are available receiver threads, as sketched below. (I already wrote such a throttle as part of my client-side-only approach for solving Issue #2731.)
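
A counting semaphore is one simple way to express such a throttle. The sketch below is illustrative only; the constant, names, and hook points are hypothetical and are not mdstcl's actual code.

#include <semaphore.h>

#define MAX_IN_FLIGHT 8 /* e.g., the number of available receiver threads */

static sem_t dispatch_slots;

void throttle_init(void)
{
  sem_init(&dispatch_slots, 0, MAX_IN_FLIGHT);
}

void throttle_before_dispatch(void) /* called by the dispatch loop */
{
  sem_wait(&dispatch_slots);        /* blocks once MAX_IN_FLIGHT are pending */
}

void throttle_on_job_finished(void) /* called when SrvJobFINISHED arrives */
{
  sem_post(&dispatch_slots);        /* frees a slot for the next action */
}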

Now that your client-side fix (i.e., this PR) has been merged to alpha, I will continue my investigation / testing. When I find the root cause of the problem, I will report my findings via GitHub (i.e., in a new issue).

smithsp commented 7 months ago

@mwinkel-dev We would like to take you up on your offer to get a cherry-pick version of this partial fix to our GA version with RPM kits. Thank you for your efforts.

mwinkel-dev commented 7 months ago

Hi @smithsp,

OK, will do. Here are the steps.

Note -- Unless GA objects, we will also include PR #2735 in the cherry-pick to the GA branch. That is a simple change that is useful when troubleshooting multi-threaded code, and it would be good to have that feature in the GA branch too.

mwinkel-dev commented 7 months ago

Hi @smithsp and @sflanagan,

Before this PR was merged to the alpha branch, it passed all automated tests that Jenkins runs (on all platforms).

After it was merged to alpha, it also passed the following manual tests (performed on Ubuntu 20.04).

Caveats

Summary -- This PR worked well after being merged to alpha. It likely covers all normal workflows. If it fails, it will likely be when dealing with edge cases.

zack-vii commented 7 months ago

I have also noticed in the code that the action server has a limited port range of 100 to 200 ports. It is possible (probable?) that the spike load test consumes all of those ports.

The 100 ports are given by the default of the MDSIP_REPLY_PORT_RANGE (? or similar) env var, which I think ranges 8800-8899. It was used as one port per action server. I think this is not the case anymore, as there is one listening port that handles all replies.

I will try to replicate your setup. Do you dispatch the actions in a single thread or multi-thread?

mwinkel-dev commented 7 months ago

Hi @zack-vii,

This post has three topics: PR #2740 behavior (versus prior), client-side cleanup, and my test harness.

1) With PR #2740 versus Without -- Regarding last evening's testing of PR #2740 (in the alpha branch), I just realized that all 1,200 dispatched actions shared a single connection ID. My recollection (perhaps wrong) was that prior to the PR, the "action service" created a separate connection ID for each action. Does sharing the same connection ID across all actions pose any problems? (The log file of the test indicates that the 1,200 actions probably did all execute correctly.) I will repeat this test after breakfast today to make sure my recollection is correct.

2) Client-Side Cleanup -- The mdstcl client also maintains data structures, but they probably aren't being cleaned up. As per my earlier post -- "Looks like the Client_do_message() of Client.h needs a Client_remove(c, factive) call in the case SrvJobFINISHED section." https://github.com/MDSplus/mdsplus/pull/2740#issuecomment-2048811164

3) My Test Harness -- I use mdstcl to dispatch actions via the dispatch phase command. My test cases typically dispatch in multiples of 400 actions. Although mdstcl is a multi-threaded program, it only uses a single thread to dispatch actions.

However, my test harness is also based on the client-side only fix that I created for Issue #2731. I have made numerous changes to my test harness to eliminate features that are now handled by your PR #2740. It is thus possible (probable?) that I am observing a bug in my test harness and not in the dispatch phase feature of mdstcl. After the build / release of the GA branch has been completed, I will investigate and let you know whether it is a bug in my code or not. And when my test harness is stable, I will also make it available to you.

Summary -- For GA, the most important questions to answer are 1) and 2) above. Item 3) can be ignored for now.

Addendum -- Regarding item 1), the error log file for the "action service" shows that only a single connection was made. I was instead expecting 400 connections (even if the connection ID was reused).

For 2), I experimented with an approach that allowed the "receiver" threads to access the "thread-static information" in the "main" thread. My experimental code worked OK when it was configured to just read the linked list of connections. However, if it was configured to delete connections, then it caused deadlock when under heavy load. My hope is that with this PR #2740, we can ignore the client-side data structures -- however more testing should be done to confirm that doing so is OK.

Regarding 3), my test harness presently consists of a single instance of mdstcl. It is thus not a multi-threaded test (i.e., not running multiple instances of mdstcl by using multiple terminal windows).

mwinkel-dev commented 7 months ago

Hi @zack-vii -- The likely root cause of the edge case that leaks sockets is simply the difference between reusing an existing connection and creating new connections.

I will do a few more experiments this evening to confirm if this conjecture is indeed true.

mwinkel-dev commented 7 months ago

Hi @zack-vii,

Major mystery solved. And yes, my test harness was too extreme.

The crux of the matter is that the architectures of the mdsip and action services are not well documented by the comments in the source code. Thus as a maintenance programmer, one makes guesses based on patterns seen in small regions of the code without understanding the broader context. I guessed wrong. However, as those guesses fail, one learns the architecture of the services by trial and error.

Now that I have a better understanding of the services, I can answer my own questions.

1) With PR #2740 and Without -- Yes, the architecture is designed to re-use a single connection to dispatch all actions. Thus stable_7.96.9 (the version GA was running prior to November 2023), alpha of 14-Apr-2024 (prior to PR #2740), and alpha of 16-Apr-2024 (with PR #2740) are all working as designed.

2) Client-side Cleanup -- There definitely should not be a Client_remove() call in the SrvJobFINISHED case. The architecture is designed to re-use the connection, and thus the connection should be retained no matter how many actions have completed. The cases that do use Client_remove() are those associated with a defective connection. Although connections should not be removed when actions finish, perhaps there are other client-side cleanup tasks that should be done.

3) My Test Harness -- My test harness was closing sockets and creating a new connection after each action executed (i.e., whenever an action's "receiver" thread received the SrvJobFINISHED reply). It was thus creating hundreds of connections at the same time that hundreds of actions were being executed. That is an abnormal workflow for mdstcl, and thus a very extreme stress test. Yes, mdstcl and the "action service" should be able to handle that stress test without leaking sockets. But it is an improbable workflow, and thus the release of PR #2740 is low risk.

Addendum -- Apparently the "action service" has only one "receiver" thread, which is a standard design for a web server. (But when I glanced at the high-level routines, I assumed the service was starting a "receiver" thread for each action.)

mwinkel-dev commented 7 months ago

Prior to this PR, the sockets were being leaked in the server_connect() routine of ServerSendMessage.c. For each action, that function was always calling ConnectToMds() which in turn always creates a new connection.

This PR fixes that by changing server_connect() to reuse connections. It only calls ConnectToMds() in the occasional instance when it is unable to reuse a connection.
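
To make the reuse pattern concrete, here is a hedged sketch of the idea -- not the PR's actual diff. The cache type, size, and helper function are illustrative, the locking mentioned below is omitted for brevity, and the ConnectToMds() prototype is assumed.

#include <string.h>

extern int ConnectToMds(char *host); /* assumed prototype from MDSplus mdsip */

#define MAX_SERVERS 32

typedef struct
{
  char server[256]; /* "host:port" the action server listens on */
  int conid;        /* cached connection id, or -1 if none      */
} conn_entry;

static conn_entry cache[MAX_SERVERS];
static int cache_len;

/* Return an existing connection to `server` when possible; otherwise open
 * a new one and remember it for the next dispatch cycle. */
static int server_connect_sketch(const char *server)
{
  int i, conid;
  for (i = 0; i < cache_len; i++)
    if (strcmp(cache[i].server, server) == 0 && cache[i].conid >= 0)
      return cache[i].conid;                /* reuse the existing connection */

  conid = ConnectToMds((char *)server);     /* only when reuse is not possible */
  if (conid >= 0 && cache_len < MAX_SERVERS)
  {
    strncpy(cache[cache_len].server, server, sizeof cache[cache_len].server - 1);
    cache[cache_len].conid = conid;
    cache_len++;
  }
  return conid;
}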

The other changes made by this PR are also useful (refactoring for clarity, using mutex to protect data structures, and so forth).