Closed by lwo 5 years ago
@mamedin @lwo I noticed this too when I was debugging on Lucien's deploy. These duplicate MCPServer/Client processes may be related to or responsible for the issues described in https://github.com/archivematica/Issues/issues/141 and especially the "paths under sharedDirectory/ are sporadically disappearing" issue mentioned in https://projects.artefactual.com/issues/12452. @mamedin if you have time maybe you can look into this?
I could reproduce the issue on Ubuntu trusty but not on xenial.
After restarting the dashboard and mcp-client several times, I found several mcp-server processes:
root@mamedin-test-iish-issue-182:/var/log/archivematica/MCPServer# ps aux | grep mcp-server
archive+ 8715 0.8 0.7 3258644 54860 ? Sl 16:56 0:55 /usr/share/archivematica/virtualenvs/archivematica-mcp-server/bin/python /usr/lib/archivematica/MCPServer/archivematicaMCP.py
archive+ 12730 0.8 0.8 3258388 60736 ? Sl 17:12 0:47 /usr/share/archivematica/virtualenvs/archivematica-mcp-server/bin/python /usr/lib/archivematica/MCPServer/archivematicaMCP.py
archive+ 16416 0.8 0.8 3258380 56556 ? Sl 18:05 0:21 /usr/share/archivematica/virtualenvs/archivematica-mcp-server/bin/python /usr/lib/archivematica/MCPServer/archivematicaMCP.py
archive+ 19959 0.9 0.7 3259664 55240 ? Sl 18:18 0:18 /usr/share/archivematica/virtualenvs/archivematica-mcp-server/bin/python /usr/lib/archivematica/MCPServer/archivematicaMCP.py
The start times of the three newer processes match these TERM signals logged by the server:
root@mamedin-test-iish-issue-182:/var/log/archivematica/MCPServer# grep "Recieved signal 15" MCPServer.debug.log
INFO 2018-09-20 17:12:53 archivematica.mcp.server:archivematicaMCP:signal_handler:216: Recieved signal 15 in frame <frame object at 0x1457f30>
INFO 2018-09-20 18:05:01 archivematica.mcp.server:archivematicaMCP:signal_handler:216: Recieved signal 15 in frame <frame object at 0x1b149b0>
INFO 2018-09-20 18:18:02 archivematica.mcp.server:archivematicaMCP:signal_handler:216: Recieved signal 15 in frame <frame object at 0x21ac100>
On xenial, I ran a script that restarted the dashboard service every 10 seconds for an hour, and no additional mcp-server processes appeared.
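For reference, that duplicate-process check is easy to script. This is a hypothetical helper (not the script I actually used; the function name is mine) that counts MCPServer instances in ps-style output:

```python
import subprocess


def count_mcp_server_processes(ps_output):
    """Count ps lines whose command line runs archivematicaMCP.py."""
    return sum(
        1
        for line in ps_output.splitlines()
        if "MCPServer/archivematicaMCP.py" in line
    )


# On a live host (assumes a procps-style `ps` on the PATH):
# out = subprocess.check_output(["ps", "aux"], text=True)
# print(count_mcp_server_processes(out))
```

Run in a loop after each restart, anything greater than 1 indicates the leak.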
On trusty it happens with the qa/1.x branch of both the artefactual and the IISH repositories.
Just a reminder that qa/1.x is not supported on Ubuntu 14.04 (Trusty):
Support has already been removed in https://github.com/artefactual/archivematica-storage-service/pull/414 https://github.com/artefactual/archivematica/pull/1266
@scollazo: thanks. Indeed, on systemd with 16.04 LTS there is no such issue.
Btw, I remember you asked about Ubuntu 18.04 while in Amsterdam. I have to say that support for it was added to 1.7.2 as experimental, but it should work without issues with qa/1.x
@mamedin noticed that the MCPServer process can't be killed with TERM, as it could be in AM17 or older. This seems to be the underlying issue that somehow caused the behaviours we were seeing with upstart.
MCPServer now manages threads differently because we've started using thread pools. However, those pool threads are daemon threads, so they don't prevent the application from exiting. What I believe is defeating the signal handler is a new non-daemon thread in taskGroupRunner, which runs the Gearman client. If we terminate that thread, the problem seems to disappear. One simple way to achieve this is to convert it into a daemon thread; probably not the best solution, but it's the simplest. E.g.:
diff --git a/src/MCPServer/lib/taskGroupRunner.py b/src/MCPServer/lib/taskGroupRunner.py
index 6fd3f9bd..6089f383 100644
--- a/src/MCPServer/lib/taskGroupRunner.py
+++ b/src/MCPServer/lib/taskGroupRunner.py
@@ -145,6 +145,7 @@ class TaskGroupRunner():
                 time.sleep(5)
 
         self.poll_thread = threading.Thread(target=event_loop)
+        self.poll_thread.daemon = True
         self.poll_thread.start()
 
     def _finish_task_group_job(self, task_group_job):
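As an aside, this toy script (unrelated to the Archivematica codebase; all names here are mine) demonstrates why that flag matters: at exit the interpreter waits for non-daemon threads but not for daemon ones, which is exactly what keeps the process alive after the TERM handler runs.

```python
import subprocess
import sys
import textwrap
import time

# Child interpreter: starts one 3-second background thread, then the main
# thread falls off the end. Whether the process exits immediately depends
# entirely on the thread's daemon flag, passed as a command-line argument.
CHILD = textwrap.dedent("""
    import sys, threading, time
    t = threading.Thread(target=time.sleep, args=(3,))
    t.daemon = (sys.argv[1] == "daemon")
    t.start()
""")


def exit_seconds(mode):
    """Time how long the child interpreter takes to exit."""
    start = time.monotonic()
    subprocess.run([sys.executable, "-c", CHILD, mode], check=True)
    return time.monotonic() - start


print("daemon thread:     %.1fs to exit" % exit_seconds("daemon"))
print("non-daemon thread: %.1fs to exit" % exit_seconds("non-daemon"))
```

The daemon case exits almost immediately; the non-daemon case blocks for the full 3 seconds, just as MCPServer blocks on the Gearman polling thread.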
CC @marktriggs - I'm planning to make the event_loop a daemon thread. We can look into better ways to clean up before exiting later. Do you think it'd be okay?
@sevein Ah! Yep, that makes sense to me.
Thank you!
For QA - what are we testing here? We want to make sure that the MCPServer process exits when the TERM signal is sent by the user, e.g. you can do that with docker-compose kill -s TERM archivematica-mcp-server.
The original use case described in this issue does not need to be covered because: 1) the problem was introduced in the AM18 dev branch, and 2) it only happened on Ubuntu 14.04, which is not going to be supported by AM18.
I was able to kill the MCP server by using kill -TERM <pid>
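That manual check can also be automated without docker. The sketch below uses a stand-in child process (not MCPServer itself; the handler mirrors what archivematicaMCP.py's signal_handler is meant to do) and verifies that sending TERM produces a clean exit. POSIX only.

```python
import signal
import subprocess
import sys
import textwrap
import time

# Stand-in for a long-running server: installs a TERM handler that exits
# cleanly, then loops forever.
CHILD = textwrap.dedent("""
    import signal, sys, time

    def signal_handler(signum, frame):
        sys.exit(0)

    signal.signal(signal.SIGTERM, signal_handler)
    while True:
        time.sleep(0.1)
""")

proc = subprocess.Popen([sys.executable, "-c", CHILD])
time.sleep(0.5)                    # give the child time to install the handler
proc.send_signal(signal.SIGTERM)   # equivalent to `kill -TERM <pid>`
print("exit code:", proc.wait(timeout=5))  # 0 indicates a clean shutdown
```

If the child hung instead of exiting (the pre-fix MCPServer behaviour), the wait would time out.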
Expected behaviour
$ service archivematica-mcp-server restart should stop the current process and start a new instance of the MCP server.

Current behaviour
It creates an additional process, causing tasks to fire simultaneously.

Steps to reproduce
$ service archivematica-dashboard restart
$ service archivematica-dashboard restart
$ service archivematica-dashboard restart
Then look at the processes:
$ ps ax | grep MCPServer
29004 ?      Sl   19:02 /usr/share/archivematica/virtualenvs/archivematica-mcp-server/bin/python /usr/lib/archivematica/MCPServer/archivematicaMCP.py
42489 pts/1  S+    0:00 grep --color=auto MCPServer
61543 ?      Sl    9:01 /usr/share/archivematica/virtualenvs/archivematica-mcp-server/bin/python /usr/lib/archivematica/MCPServer/archivematicaMCP.py
61926 ?      Sl    9:07 /usr/share/archivematica/virtualenvs/archivematica-mcp-server/bin/python /usr/lib/archivematica/MCPServer/archivematicaMCP.py
64903 ?      Sl    9:57 /usr/share/archivematica/virtualenvs/archivematica-mcp-server/bin/python /usr/lib/archivematica/MCPServer/archivematicaMCP.py
Your environment (version of Archivematica, OS version, etc.)
Ubuntu 14.04 (trusty), 64-bit, using AM qa/1.x
For Artefactual use: Please make sure these steps are taken before moving this issue from Review to Verified in Waffle: