Closed by lwo 5 years ago
@mamedin @lwo I noticed this too when I was debugging on Lucien's deploy. These duplicate MCPServer/Client processes may be related to or responsible for the issues described in https://github.com/archivematica/Issues/issues/141 and especially the "paths under sharedDirectory/ are sporadically disappearing" issue mentioned in https://projects.artefactual.com/issues/12452. @mamedin if you have time maybe you can look into this?
I could reproduce the issue on Ubuntu trusty but not on xenial.
After restarting the dashboard and mcp-client several times, I found several mcp-server processes:
root@mamedin-test-iish-issue-182:/var/log/archivematica/MCPServer# ps aux | grep mcp-server
archive+ 8715 0.8 0.7 3258644 54860 ? Sl 16:56 0:55 /usr/share/archivematica/virtualenvs/archivematica-mcp-server/bin/python /usr/lib/archivematica/MCPServer/archivematicaMCP.py
archive+ 12730 0.8 0.8 3258388 60736 ? Sl 17:12 0:47 /usr/share/archivematica/virtualenvs/archivematica-mcp-server/bin/python /usr/lib/archivematica/MCPServer/archivematicaMCP.py
archive+ 16416 0.8 0.8 3258380 56556 ? Sl 18:05 0:21 /usr/share/archivematica/virtualenvs/archivematica-mcp-server/bin/python /usr/lib/archivematica/MCPServer/archivematicaMCP.py
archive+ 19959 0.9 0.7 3259664 55240 ? Sl 18:18 0:18 /usr/share/archivematica/virtualenvs/archivematica-mcp-server/bin/python /usr/lib/archivematica/MCPServer/archivematicaMCP.py
The start times of the three newer processes match these TERM signals logged by the server:
root@mamedin-test-iish-issue-182:/var/log/archivematica/MCPServer# grep "Recieved signal 15" MCPServer.debug.log
INFO 2018-09-20 17:12:53 archivematica.mcp.server:archivematicaMCP:signal_handler:216: Recieved signal 15 in frame <frame object at 0x1457f30>
INFO 2018-09-20 18:05:01 archivematica.mcp.server:archivematicaMCP:signal_handler:216: Recieved signal 15 in frame <frame object at 0x1b149b0>
INFO 2018-09-20 18:18:02 archivematica.mcp.server:archivematicaMCP:signal_handler:216: Recieved signal 15 in frame <frame object at 0x21ac100>
On xenial, I ran a script that restarted the dashboard service every 10 seconds for an hour, and no additional mcp-server processes appeared.
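For reference, that duplicate-process check is easy to script. This is a hypothetical helper (not the script I actually used; the function name is mine) that counts MCPServer instances in ps-style output:

```python
import subprocess


def count_mcp_server_processes(ps_output):
    """Count ps lines whose command line runs archivematicaMCP.py."""
    return sum(
        1
        for line in ps_output.splitlines()
        if "MCPServer/archivematicaMCP.py" in line
    )


# On a live host (assumes a procps-style `ps` on the PATH):
# out = subprocess.check_output(["ps", "aux"], text=True)
# print(count_mcp_server_processes(out))
```

Run in a loop after each restart, anything greater than 1 indicates the leak.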
On trusty it happens with the qa/1.x branch of both the artefactual and the IISH repositories.
Just a reminder that qa/1.x is not supported on Ubuntu 14.04 (Trusty):
Support has already been removed in https://github.com/artefactual/archivematica-storage-service/pull/414 https://github.com/artefactual/archivematica/pull/1266
@scollazo: thanks. Indeed, on systemd with 16.04 LTS there is no such issue.
Btw, I remember you asked about Ubuntu 18.04 while in Amsterdam. I have to say that support for it was added to 1.7.2 as experimental, but it should work without issues with qa/1.x
@mamedin noticed that the MCPServer process can't be killed with TERM, as it could be in AM17 or older. This seems to be the underlying issue that somehow caused the behaviours we were seeing with upstart.
MCPServer now manages threads differently because we've started using thread pools. However, those pool threads are daemon threads, so they don't prevent the application from exiting. What I believe is defeating the signal handler is a new non-daemon thread in taskGroupRunner, which runs the Gearman client. If we terminate that thread, the problem seems to disappear. One simple way to achieve this is to convert it into a daemon thread; probably not the best solution, but it's the simplest. E.g.:
diff --git a/src/MCPServer/lib/taskGroupRunner.py b/src/MCPServer/lib/taskGroupRunner.py
index 6fd3f9bd..6089f383 100644
--- a/src/MCPServer/lib/taskGroupRunner.py
+++ b/src/MCPServer/lib/taskGroupRunner.py
@@ -145,6 +145,7 @@ class TaskGroupRunner():
                 time.sleep(5)
 
         self.poll_thread = threading.Thread(target=event_loop)
+        self.poll_thread.daemon = True
         self.poll_thread.start()
 
     def _finish_task_group_job(self, task_group_job):
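As an aside, this toy script (unrelated to the Archivematica codebase; all names here are mine) demonstrates why that flag matters: at exit the interpreter waits for non-daemon threads but not for daemon ones, which is exactly what keeps the process alive after the TERM handler runs.

```python
import subprocess
import sys
import textwrap
import time

# Child interpreter: starts one 3-second background thread, then the main
# thread falls off the end. Whether the process exits immediately depends
# entirely on the thread's daemon flag, passed as a command-line argument.
CHILD = textwrap.dedent("""
    import sys, threading, time
    t = threading.Thread(target=time.sleep, args=(3,))
    t.daemon = (sys.argv[1] == "daemon")
    t.start()
""")


def exit_seconds(mode):
    """Time how long the child interpreter takes to exit."""
    start = time.monotonic()
    subprocess.run([sys.executable, "-c", CHILD, mode], check=True)
    return time.monotonic() - start


print("daemon thread:     %.1fs to exit" % exit_seconds("daemon"))
print("non-daemon thread: %.1fs to exit" % exit_seconds("non-daemon"))
```

The daemon case exits almost immediately; the non-daemon case blocks for the full 3 seconds, just as MCPServer blocks on the Gearman polling thread.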
CC @marktriggs - I'm planning to make the event_loop a daemon thread. We can look into better ways to clean up before exiting later. Do you think it'd be okay?
@sevein Ah! Yep, that makes sense to me.
Thank you!
For QA - what are we testing here? We want to make sure that the MCPServer process exits when the TERM signal is sent by the user, e.g. you can do that with docker-compose kill -s TERM archivematica-mcp-server.
The original use case described in this issue does not need to be covered because: 1) the problem was introduced in the AM18 dev branch, and 2) it only happened on Ubuntu 14.04, which is not going to be supported by AM18.
I was able to kill the MCP server by using kill -TERM <pid>
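That manual check can also be automated without docker. The sketch below uses a stand-in child process (not MCPServer itself; the handler mirrors what archivematicaMCP.py's signal_handler is meant to do) and verifies that sending TERM produces a clean exit. POSIX only.

```python
import signal
import subprocess
import sys
import textwrap
import time

# Stand-in for a long-running server: installs a TERM handler that exits
# cleanly, then loops forever.
CHILD = textwrap.dedent("""
    import signal, sys, time

    def signal_handler(signum, frame):
        sys.exit(0)

    signal.signal(signal.SIGTERM, signal_handler)
    while True:
        time.sleep(0.1)
""")

proc = subprocess.Popen([sys.executable, "-c", CHILD])
time.sleep(0.5)                    # give the child time to install the handler
proc.send_signal(signal.SIGTERM)   # equivalent to `kill -TERM <pid>`
print("exit code:", proc.wait(timeout=5))  # 0 indicates a clean shutdown
```

If the child hung instead of exiting (the pre-fix MCPServer behaviour), the wait would time out.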
Expected behaviour
$ service archivematica-mcp-server restart should stop the current process and start a new instance of the MCP server.

Current behaviour
It creates an additional process, causing tasks to fire simultaneously.

Steps to reproduce
$ service archivematica-dashboard restart
$ service archivematica-dashboard restart
$ service archivematica-dashboard restart
Then look at the processes:
$ ps ax | grep MCPServer
29004 ?      Sl   19:02 /usr/share/archivematica/virtualenvs/archivematica-mcp-server/bin/python /usr/lib/archivematica/MCPServer/archivematicaMCP.py
42489 pts/1  S+    0:00 grep --color=auto MCPServer
61543 ?      Sl    9:01 /usr/share/archivematica/virtualenvs/archivematica-mcp-server/bin/python /usr/lib/archivematica/MCPServer/archivematicaMCP.py
61926 ?      Sl    9:07 /usr/share/archivematica/virtualenvs/archivematica-mcp-server/bin/python /usr/lib/archivematica/MCPServer/archivematicaMCP.py
64903 ?      Sl    9:57 /usr/share/archivematica/virtualenvs/archivematica-mcp-server/bin/python /usr/lib/archivematica/MCPServer/archivematicaMCP.py
Your environment (version of Archivematica, OS version, etc.)
Ubuntu 14.04 (trusty), 64-bit, using AM qa/1.x
For Artefactual use: Please make sure these steps are taken before moving this issue from Review to Verified in Waffle: