bummer, i cannot access jenkins, so it's unclear to me what went wrong. I will test locally tomorrow.
Hi @zack-vii,
Thanks for submitting this proposed fix for GA's leaking sockets, Issue #2731.
In addition to the files you've changed, I've also found it useful to change some additional files. My conjecture, perhaps wrong, is that the final solution will be a melding of our two proposed fixes.
I am now building your PR locally on my dev system and running it through my suite of tests. Will post results here within an hour or two.
Hi @zack-vii,
Testing demonstrated that this PR fixes the primary case, but fails on an edge case.
Pro -- This PR is a server-side fix for the primary cause of leaked sockets (i.e., the situation GA encountered). It is thus a more elegant solution than the client-only fix that I created. I agree that this PR should be used. On Thursday or Friday, I will complete a review of this PR.
Con -- If mdstcl dispatches actions so quickly that it overloads the action server, then some actions will be marked as "failed" by the client. Failed actions remain connected to the action server, but never receive a SrvJobFINISHED, so they are zombie actions. Thus they leak sockets. Even with this PR 2740, my testing was able to cause ~100 sockets to leak. (I am working on a client-side fix for that edge case. However, you are more familiar with this code than I am. A-OK for us both to work on it and then compare our solutions.)
Other -- Looks like the Client_do_message() function of Client.h needs a Client_remove(c, factive) call in the case SrvJobFINISHED section.
Next Steps -- I appreciate the assistance and collaboration. Here is my suggestion on how we should proceed:
Addendum -- Many of the following posts by @mwinkel-dev are conjectures and thus wrong. For a summary of the investigation, refer to this post in Issue #2731: https://github.com/MDSplus/mdsplus/issues/2731#issuecomment-2067188950
Regarding the Con: I think the expected behavior is that the server holds one connection to each action server. That is, even after the phase, when the shot is over and the cycle begins anew, the connection may be reused. The issue on the client side was that it did not properly detect if the socket was disconnected. As far as I can tell from the current implementation, there is only one detached thread that handles the listening socket for incoming action_server replies as well as the established connections.
My simple but effective tdi script for testing is:
_root=getenv('MDSPLUS_DIR')
setenv(CONCAT('test_path=',_root,'/test_path'))
_dispatch=BUILD_DISPATCH(2, 'DUMMY', 'INIT', 50, "")
_task=BUILD_PROGRAM(1, "echo test")
_action=BUILD_ACTION(`_dispatch, `_task)
treeopennew('test', 1)
treeaddnode("act", _n, 'ACTION')
treeputrecord("act", `_action)
FOR (_i=1;_i<=3000;_i++) EXECUTE("treeaddnode($1, _n, 'ACTION');treeputrecord(`$1, `$2);", EXECUTE("DECOMPILE(`$)", _i), _action)
treewrite()
treeclose()
setenv('DUMMY=localhost:30000')
tcl('set tree test/shot=1')
tcl('dispatch act')
tcl('dispatch/build')
SPAWN(CONCAT('mdsip -s -p 30000 -h ', _root, '/testing/mdsip.hosts&'))
WRITE(1, _out)
WAIT(1)
tcl('dispatch/phase INIT')
WAIT(1)
SPAWN('killall mdsip')
SPAWN(CONCAT('mdsip -s -p 30000 -h ', _root, '/testing/mdsip.hosts&'))
WAIT(1)
tcl('dispatch/phase INIT')
WAIT(1)
tcl('dispatch/close')
SPAWN('killall mdsip')
and can be invoked from the development environment:
# after checkout; usual setup
./bootstrap
./configure --debug # . . .
# enter development environment with all env vars set
make tests-env
# ch dir to root of repo
cd $MDSPLUS_DIR
# create folder for test tree
mkdir test_path
# update bins
make
# run test
gdb --args tditest dispatch-test.tdi
# update source ; goto update bins
The WAIT(1) calls are good spots where you can interrupt (gdb: Ctrl+C) and inspect p *Clients and p *Clients->next, and will hopefully find that Clients->next becomes NULL soon after the mdsip service was terminated. You can comment out (#) or move around the SPAWN calls, or even fire up an external server (make sure the env vars are set), to observe its behavior when the dispatch server restarts or stays alive in-between dispatches.
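For example, a typical inspection at one of the WAIT(1) pauses looks roughly like this (the expected values follow from the description above):

# press Ctrl+C while tditest sits in one of the WAIT(1) calls
(gdb) p *Clients          # the retained connection to the action server
(gdb) p *Clients->next    # should become NULL soon after mdsip is killed
(gdb) continue            # resume the script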
It would be interesting to see how this holds up if you add an active monitor server or more involved action servers.
I may have found the issue with the python dcl_dispatcher_test.py. When checking for an existing connection, we need to check if the conid is still valid.
Hi @zack-vii,
Thanks for the additional detail.
My test harness is comparable, and I do see the one retained connection that you describe. That works well and is not a concern.
The edge cases I am investigating will likely not arise often during practical use of mdstcl. However, it is a big deal if leaked sockets force GA to reboot the entire physical server. Thus, I have been stress testing the dispatch feature of mdstcl to characterize its failure modes. (My goal is to confirm that if the edge cases do arise, it won't be necessary to reboot the server every day.)
The current edge case I have been investigating consists of slow actions (e.g., wait(2.0);1;) with mdstcl dispatching at 10 actions per second or faster. That is how I ended up with ~100 leaked sockets.
Thanks for mentioning the action monitor and different dispatch phases. I will also do experiments with those features.
Hi @zack-vii,
I've studied every line of this PR -- it is a nice fix! And the code refactoring also adds clarity.
After my lunch break, I will submit a review and approval for this PR.
Hi @zack-vii,
Regarding the ServerQAction.c file, I like the changes you made.
- The send_reply() function now always calls cleanup_client(), thereby preventing the leaking sockets. (The old code only cleaned up when an error occurred.)
- Splitting AttachPort() into two functions makes the code easier to understand.
- Splitting RemoveClient() into two functions also adds clarity.
- The Clients linked list is now accessed through these functions: add_client(), find_client() and remove_client(). (A minimal sketch of such helpers follows this list.)
In the ServerDispatchPhase.c file, moving the actions[i].dispatched = 1 line eliminates a race condition.
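For context, here is a minimal sketch of what such linked-list helpers can look like. The ClientList field names (sock, next) and the free() on removal are illustrative assumptions, not the PR's actual definitions; only the traversal pattern matches the code quoted in the Jenkins errors below.

#include <stdlib.h>

/* Illustrative sketch only -- not the PR's actual ClientList definition.
 * NOTE: in a multi-threaded server these helpers would also need locking;
 * that is omitted here for brevity. */
typedef struct ClientList
{
  int sock;                   /* reply socket of the dispatching client */
  struct ClientList *next;
} ClientList;

static ClientList *Clients = NULL;

static void add_client(ClientList *c)
{
  c->next = Clients;          /* push onto the head of the list */
  Clients = c;
}

static ClientList *find_client(int sock)
{
  for (ClientList *p = Clients; p != NULL; p = p->next)
    if (p->sock == sock)
      return p;
  return NULL;
}

static void remove_client(ClientList *c)
{
  for (ClientList **p = &Clients; *p != NULL; p = &(*p)->next)
  {
    if (*p == c)
    {
      *p = c->next;           /* unlink the node */
      free(c);
      break;
    }
  }
}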
Note though that on the client side, I encountered a different race condition. After the client calls SendArg(), it then immediately calls GetAnswerInfoTS() to read the status handshake (i.e., the value 1 for success). However, the action was completed by the server before the client had finished processing the GetAnswerInfoTS(). Normally that would not occur. But the stress test of hundreds of actions dispatched at once caused that scenario to arise. With my client-only fix (and actions that evaluate very quickly), occasionally one of the receiver threads was tearing down the connection / socket at the same time that the main thread's GetAnswerInfoTS() was using the connection, which usually resulted in a crash.
We won't be using my client-side fix, thus the specific problem I created with that fix vanishes.
However, I do wonder what will happen if the server-side fix kills both sockets (i.e., cleans up the client) while the client's main thread is in the midst of the GetAnswerInfoTS(). I believe that the client will detect that as an error and will correctly handle that error. However, I will do additional testing to confirm that is true.
I will approve this PR after Jenkins is able to build it on all platforms. It presently fails on RHEL7 and Windows.
RHEL7
/opt/jenkins/workspace/MDSplus_PR-2740/rhel7/servershr/ServerQAction.c: In function 'find_client':
/opt/jenkins/workspace/MDSplus_PR-2740/rhel7/servershr/ServerQAction.c:897:3: error: 'for' loop initial declarations are only allowed in C99 mode
for (ClientList **p = &Clients; *p != NULL; p = &(*p)->next)
^
/opt/jenkins/workspace/MDSplus_PR-2740/rhel7/servershr/ServerQAction.c:897:3: note: use option -std=c99 or -std=gnu99 to compile your code
/opt/jenkins/workspace/MDSplus_PR-2740/rhel7/servershr/ServerQAction.c: In function 'remove_client':
/opt/jenkins/workspace/MDSplus_PR-2740/rhel7/servershr/ServerQAction.c:924:3: error: 'for' loop initial declarations are only allowed in C99 mode
for (ClientList **p = &Clients; *p != NULL; p = &(*p)->next)
^
Windows
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c: In function 'setup_client':
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:949:14: error: format '%d' expects argument of type 'int', but argument 3 has type 'SOCKET' {aka 'long long unsigned int'} [-Werror=format=]
949 | MDSMSG("setup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
| ^~~~~~~~~~~~~~~~~~~~~~ ~~~~
| |
| SOCKET {aka long long unsigned int}
/opt/jenkins/workspace/MDSplus_PR-2740/windows/_include/mdsmsg.h:79:25: note: in definition of macro '__MDSMSG'
79 | pos += sprintf(pos, __VA_ARGS__); \
| ^~~~~~~~~~~
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:949:7: note: in expansion of macro 'MDSMSG'
949 | MDSMSG("setup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
| ^~~~~~
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:949:33: note: format string is defined here
949 | MDSMSG("setup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
| ~^
| |
| int
| %I64d
In file included from /opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:46:
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:949:14: error: format '%d' expects argument of type 'int', but argument 3 has type 'SOCKET' {aka 'long long unsigned int'} [-Werror=format=]
949 | MDSMSG("setup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
| ^~~~~~~~~~~~~~~~~~~~~~ ~~~~
| |
| SOCKET {aka long long unsigned int}
/opt/jenkins/workspace/MDSplus_PR-2740/windows/_include/mdsmsg.h:79:25: note: in definition of macro '__MDSMSG'
79 | pos += sprintf(pos, __VA_ARGS__); \
| ^~~~~~~~~~~
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:949:7: note: in expansion of macro 'MDSMSG'
949 | MDSMSG("setup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
| ^~~~~~
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:949:33: note: format string is defined here
949 | MDSMSG("setup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
| ~^
| |
| int
| %I64d
In file included from /opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:46:
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c: In function 'cleanup_client':
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:962:12: error: format '%d' expects argument of type 'int', but argument 3 has type 'SOCKET' {aka 'long long unsigned int'} [-Werror=format=]
962 | MDSMSG("cleanup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
| ^~~~~~~~~~~~~~~~~~~~~~~~ ~~~~
| |
| SOCKET {aka long long unsigned int}
/opt/jenkins/workspace/MDSplus_PR-2740/windows/_include/mdsmsg.h:79:25: note: in definition of macro '__MDSMSG'
79 | pos += sprintf(pos, __VA_ARGS__); \
| ^~~~~~~~~~~
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:962:5: note: in expansion of macro 'MDSMSG'
962 | MDSMSG("cleanup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
| ^~~~~~
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:962:33: note: format string is defined here
962 | MDSMSG("cleanup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
| ~^
| |
| int
| %I64d
In file included from /opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:46:
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:962:12: error: format '%d' expects argument of type 'int', but argument 3 has type 'SOCKET' {aka 'long long unsigned int'} [-Werror=format=]
962 | MDSMSG("cleanup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
| ^~~~~~~~~~~~~~~~~~~~~~~~ ~~~~
| |
| SOCKET {aka long long unsigned int}
/opt/jenkins/workspace/MDSplus_PR-2740/windows/_include/mdsmsg.h:79:25: note: in definition of macro '__MDSMSG'
79 | pos += sprintf(pos, __VA_ARGS__); \
| ^~~~~~~~~~~
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:962:5: note: in expansion of macro 'MDSMSG'
962 | MDSMSG("cleanup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
| ^~~~~~
/opt/jenkins/workspace/MDSplus_PR-2740/windows/servershr/ServerQAction.c:962:33: note: format string is defined here
962 | MDSMSG("cleanup connection %d " SVRJOB_PRI, sock, SVRJOB_VAR(job));
| ~^
| |
| int
| %I64d
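For reference, a small self-contained example of the two kinds of changes the compilers are asking for. Whether the PR ultimately hoists the declarations, builds with -std=gnu99, casts the SOCKET, or uses a different format macro is the author's call -- this only illustrates the options.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

struct node { struct node *next; };

int main(void)
{
  /* RHEL7: pre-C99 compilers reject declarations inside the for-statement,
   * so either build with -std=gnu99 (as gcc suggests) or hoist the
   * declaration, e.g. for the Clients walk: */
  struct node tail = { NULL }, head = { &tail }, *list = &head;
  struct node **p;
  for (p = &list; *p != NULL; p = &(*p)->next)
    ; /* body unchanged */

  /* Windows: SOCKET is a 64-bit unsigned type, so "%d" mismatches. One
   * portable option is a fixed-width cast, e.g. in ServerQAction.c:
   *   MDSMSG("setup connection %" PRIu64 " " SVRJOB_PRI,
   *          (uint64_t)sock, SVRJOB_VAR(job));
   */
  uint64_t sock = 42; /* stand-in for the SOCKET handle */
  printf("setup connection %" PRIu64 "\n", sock);
  return 0;
}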
However, I do wonder what will happen if the server-side fix kills both sockets (i.e., cleans up the client) while the client's main thread is in the midst of the GetAnswerInfoTS(). I believe that the client will detect that as an error and will correctly handle that error. However, I will do additional testing to confirm that is true.
the server should never cancel out of a regular mdsip request unless it times out between the messages of the same request, or is terminated or interrupted in some way out of the ordinary. hence the result of a request should be independent of the state of the task (scheduled, executing, done). moreover, if a job is scheduled it should return success. a race condition may arise only if the reply is lost. due to tcp we would probably run into a timeout, assuming the dispatch was incomplete when actually it was not. that should be very rare and would require an underperforming network considering the traffic.
@mwinkel-dev thanks for pointing out the issues with the jenkins checks.. looks like they are known issues that i simply paid no attention to. i will see if i can sort them out over the weekend... the power of docker ;)
Hi @zack-vii,
Thanks for answering my questions. And also for coming up with a much better fix than I had on the client-side.
Monday is a holiday for us. So we'll resume work on this next week.
fingers crossed, it went through on my machine but if it fails please hint me the failing platforms.
@zack-vii @mwinkel-dev It looks like it passed. :-)
Hi @zack-vii,
I have just approved this PR. It successfully passed the Jenkins build. And also passed much (but not all) of my testing. It can now be merged to the "alpha" branch.
Hi @smithsp,
This PR is a partial fix of Issue #2731. In my opinion, it is robust enough to handle GA's normal workflows. However, there are some "edge cases" that still leak sockets (albeit at a much slower rate than when GA had to reboot a server on 25-Mar-2024). You should decide if you want this partial fix now, or instead wish to wait until the full fix is available.
If you want this partial fix now, it will take us a day or so to cherry-pick it into the GA branch, build the packages, do another round of testing, and distribute the build to GA.
@mwinkel-dev: can you give me some details about the edge cases that are still leaking?
Hi @zack-vii,
I am testing the "action server" as I would any web server -- normal load (which is A-OK), spike load (fails), heavy continuous load (still to do) and so forth.
For the spike load test, I have mdstcl dispatch hundreds of actions faster than the action server can execute them. In that scenario, some of the dispatched actions are reported as failing because they've lost connection to the action server. Each of those failed actions leaks a socket. I have generated ~100 leaked sockets with that test.
https://github.com/MDSplus/mdsplus/pull/2740#issuecomment-2048811164
https://github.com/MDSplus/mdsplus/pull/2740#issuecomment-2049854703
I have also noticed in the code that the action server has a limited port range of 100 to 200 ports. It is possible (probable?) that the spike load test consumes all of those ports.
Although the spike load condition is unlikely to arise during normal workflows at GA, if it ever does then it might force GA to reboot the physical server. Which disrupts the work of many users and thus is a serious problem.
My hunch is that the fix for the spike load test will involve a client-side (mdstcl) fix. And instead of doing load balancing and dynamically adjusting the number of available receiver threads, a simpler approach would be to add a throttle to mdstcl so it can never dispatch more actions than there are available receiver threads. (I already wrote such a throttle as part of my client-side-only approach for solving Issue #2731.)
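To illustrate that throttle idea (this is not the code from my client-side fix; all names below are hypothetical), a counting semaphore sized to the number of receiver threads would behave roughly like this:

#include <semaphore.h>

/* Hypothetical sketch: never have more actions in flight than there are
 * receiver threads. None of these names exist in the MDSplus sources. */
#define NUM_RECEIVER_THREADS 8

static sem_t dispatch_slots;

void throttle_init(void)
{
  sem_init(&dispatch_slots, 0, NUM_RECEIVER_THREADS);
}

void throttled_dispatch(void (*dispatch_one)(int nid), int nid)
{
  sem_wait(&dispatch_slots);   /* block while all receiver threads are busy */
  dispatch_one(nid);
}

void on_job_finished(void)     /* call when a SrvJobFINISHED reply arrives */
{
  sem_post(&dispatch_slots);   /* release a slot for the next dispatch */
}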
Now that your client-side fix (i.e., this PR) has been merged to alpha, I will continue my investigation / testing. When I find the root cause of the problem, I will report my findings via GitHub (i.e., in a new issue).
@mwinkel-dev We would like to take you up on your offer to get a cherry-pick version of this partial fix to our GA version with RPM kits. Thank you for your efforts.
Hi @smithsp,
OK, will do. Here are the steps.
Note -- Unless GA objects, we will also include PR #2735 in the cherry-pick to the GA branch. That is a simple change that is helpful when troubleshooting multi-threaded code; it would be useful to have that feature in the GA branch too.
Hi @smithsp and @sflanagan,
Before this PR was merged to the alpha branch, it passed all automated tests that Jenkins runs (on all platforms).
After it was merged to alpha, it also passed the following manual tests (performed on Ubuntu 20.04).
Caveats
- Testing was done on a single computer running both the mdstcl dispatch and the "action service". Stated another way, it has not been tested with a pair of computers: one running mdstcl, the other running the "action service".
- Testing covered only the dispatch /build and dispatch /phase statements of mdstcl. Other dispatch related features have not been tested (e.g., abort server, dispatch /check, dispatch /close, and so forth).
Summary -- This PR worked well after being merged to alpha. It likely covers all normal workflows. If it fails, it will likely be when dealing with edge cases.
I have also noticed in the code that the action server has a limited port range of 100 to 200 ports. It is possible (probable?) that the spike load test consumes all of those ports.
the 100 ports are given by the default of the MDSIP_REPLY_PORT_RANGE (? or similar) env var, which i think ranges 8800-8899. it was used as one port per action server. I think this is not the case anymore, as there is one listening port that handles all replies.
I will try to replicate your setup. Do you dispatch the actions in a single thread or in multiple threads?
Hi @zack-vii,
This post has three topics: PR #2740 behavior (versus prior), client-side cleanup, and my test harness.
1) With PR 2740 versus Without
Regarding last evening's testing of PR #2740 (in the alpha branch), I just realized that all 1,200 dispatched actions shared a single connection ID. My recollection (perhaps wrong) was that prior to the PR, the "action service" created a separate connection ID for each action. Does sharing the same connection ID across all actions pose any problems? (The log file of the test indicates that the 1,200 actions probably did all execute correctly.) I will repeat this test after breakfast today to make sure my recollection is correct.
2) Client-Side Cleanup
The mdstcl client also maintains data structures. But they probably aren't being cleaned up. As per my earlier post -- "Looks like the Client_do_message() of Client.h needs a Client_remove(c, factive) call in the case SrvJobFINISHED section."
https://github.com/MDSplus/mdsplus/pull/2740#issuecomment-2048811164
3) My Test Harness
I use mdstcl to dispatch actions via the dispatch phase command. My test cases typically dispatch in multiples of 400 actions. Although mdstcl is a multi-threaded program, it only uses a single thread to dispatch actions.
However, my test harness is also based on the client-side only fix that I created for Issue #2731. I have made numerous changes to my test harness to eliminate features that are now handled by your PR #2740. It is thus possible (probable?) that I am observing a bug in my test harness and not in the dispatch phase feature of mdstcl. After the build / release of the GA branch has been completed, I will investigate and let you know whether it is a bug in my code or not. And when my test harness is stable, I will also make it available to you.
Summary For GA, the most important questions to answer are 1) and 2) above. Item 3) can be ignored for now.
Addendum
Regarding item 1), the errors log file for the "action service" only shows a single connection was made. I was instead expecting 400 connections made (even if the connection ID was reused).
For 2), I experimented with an approach that allowed the "receiver" threads to access the "thread-static information" in the "main" thread. My experimental code worked OK when it was configured to just read the linked list of connections. However, if it was configured to delete connections, then it caused deadlock when under heavy load. My hope is that with this PR #2740, we can ignore the client-side data structures -- however more testing should be done to confirm that doing so is OK.
Regarding 3), my test harness presently consists of a single instance of mdstcl. It is thus not a multi-threaded test (i.e., not running multiple instances of mdstcl by using multiple terminal windows).
Hi @zack-vii -- The likely root cause of the edge case that leaks sockets is simply the difference between re-using an existing connection and creating new connections.
- The normal workflow appears to use just one connection between mdstcl and the "action service" -- and that one connection is used for dispatching all actions to the "action service". This is based on observing stable_7.96.9, alpha of 14-Apr-2024 (immediately prior to this PR 2740), and alpha of 16-Apr-2024 (includes PR 2740).
I will do a few more experiments this evening to confirm if this conjecture is indeed true.
Hi @zack-vii,
Major mystery solved. And yes, my test harness was too extreme.
The crux of the matter is that the architectures of the mdsip and action services are not well documented by the comments in the source code. Thus, as a maintenance programmer, one makes guesses based on patterns seen in small regions of the code without understanding the broader context. I guessed wrong. However, as those guesses fail, one learns the architecture of the services by trial and error.
Now that I have a better understanding of the services, I can answer my own questions.
1) With PR 2740 and Without
Yes, the architecture is designed to re-use a single connection to dispatch all actions. Thus stable_7.96.9 (the version GA was running prior to November 2023), alpha of 14-Apr-2024 (prior to PR #2740), and alpha of 16-Apr-2024 (with PR #2740) are all working as designed.
2) Client-side Cleanup
Definitely should not have a Client_remove() call in the SrvJobFINISHED case. The architecture is designed to re-use the connection, and thus the connection should be retained no matter how many actions have completed. The cases that do use Client_remove() are those that are associated with a defective connection. Although connections should not be removed when actions finish, perhaps there are other client-side cleanup tasks that should be done.
3) My Test Harness
My test harness was closing sockets and creating a new connection after each action executed (i.e., whenever an action's "receiver" thread received the SrvJobFINISHED reply). And thus it was creating hundreds of connections at the same time that hundreds of actions were being executed. That is an abnormal workflow for mdstcl, and thus a very extreme stress test. Yes, mdstcl and the "action service" should be able to handle that stress test without leaking sockets. But it is an improbable workflow, and thus the release of PR #2740 is low risk.
Addendum Apparently the "action service" only has one "receiver" thread. Which is a standard design for a web server. (But when I glanced at the high-level routines, I assumed that the service was starting a "receiver" thread for each action.)
Prior to this PR, the sockets were being leaked in the server_connect() routine of ServerSendMessage.c. For each action, that function was always calling ConnectToMds(), which in turn always creates a new connection.
This PR fixes that by changing server_connect() to reuse connections. It only calls ConnectToMds() in the occasional instance when it is unable to reuse a connection.
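A minimal sketch of that reuse pattern, assuming a single cached connection per action server; ConnectToMds() is the real mdsip routine, while the cache and validity check here are illustrative stand-ins rather than the PR's actual code:

extern int ConnectToMds(char *host);     /* real mdsip routine */

static int cached_conid = -1;            /* -1 means "no connection yet" */

/* Placeholder for the validity check mentioned earlier (i.e., confirming the
 * conid still refers to a live connection before reusing it). */
static int connection_is_usable(int conid)
{
  return conid >= 0;
}

static int server_connect(char *server)
{
  if (connection_is_usable(cached_conid))
    return cached_conid;                 /* reuse the existing connection */
  cached_conid = ConnectToMds(server);   /* otherwise create a new one */
  return cached_conid;
}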
The other changes made by this PR are also useful (refactoring for clarity, using mutex to protect data structures, and so forth).
This is related to issue #2731 and may fix some if not all of the related problems.