Thanks @amykglen for bringing this issue up. So, @isbluis can you provide some context for the issue yesterday with picking up the new meta-KG for automat-ctd? Which ARAX endpoint was this? How was it eventually resolved? Was the endpoint restarted? Was the elog file saved before the Flask app was restarted?
I think I added more logging on what the BackgroundTasker was doing, so looking at the logs (when we determine which instance we're talking about) would be useful.
Based on minimal information, it appeared that the BackgroundTasker just stopped working with little information on what happened. The logs might show what its last communications were.
I think this problem is new since we switched from running the background tasker in a thread to running it in a child process.
Hmm, this is pure conjecture, but my gut feeling is "unhandled signal" in the background tasker.
Could be that we just need to install a signal handler on the Background Tasker process (or change the default action to be "log and ignore").
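For illustration, a minimal sketch of that idea (not the actual ARAX code), assuming we simply log and ignore the catchable signals in the Background Tasker process; note that SIGKILL and SIGSTOP cannot be caught:

import signal
import sys

def log_and_ignore(signum, frame):
    # Log the signal to stderr and carry on instead of dying
    print(f"BackgroundTasker received signal {signum}; ignoring", file=sys.stderr)

for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGHUP):
    signal.signal(sig, log_and_ignore)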
Hi @edeutsch can you remind me, do we need to run the Background Tasker on RTX-KG2 services? We don't use the KP info cache on those services, so maybe we don't need it?
I'm doing this work in branch issue-2170. Please see commit 86b6c2.
Yes, I think we should run it. There is also code that checks the currently running child processes against the system activity table and removes any that have died (i.e., if a child process is found to have gone away, it is marked as dead and removed from the activity table). It also does some other things, I think.
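For illustration, a liveness check like the one described could look roughly like this (hypothetical sketch, not the actual ARAX code); signal 0 tests for process existence without delivering a signal:

import os

def process_is_alive(pid: int) -> bool:
    try:
        os.kill(pid, 0)   # signal 0: existence check only, nothing is delivered
        return True
    except ProcessLookupError:
        return False      # process has gone away; mark it dead in the activity table
    except PermissionError:
        return True       # process exists but is owned by another user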
Understood. I will fix commit 86b6c2 so the BackgroundTasker is run in the RTX-KG2 service. Expect another commit for that tonight, to the issue-2170 branch.
In this branch, I'm also removing all use of logging code (i.e., the Python logging framework) inside the Flask application. Apparently it doesn't play nicely with forking, and we actually don't use it in very many places, so it's easy to replace with eprint (this is really for issue 2114, but I'm folding that work into this branch).
OK, commit 7d622a8 should run the BackgroundTasker in the KG2 main.py now.
Testing this on /test on arax.ncats.io (see the #deployment channel on Slack).
I think we should have __main__.py set the process title for the Background Tasker process to something like "ARAX_background_tasker.py", using the Python setproctitle package.
https://stackoverflow.com/questions/564695/is-there-a-way-to-change-effective-process-name-in-python
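For illustration, a minimal sketch of what that could look like (assumed names, not the actual __main__.py code):

import os
import setproctitle

pid = os.fork()
if pid == 0:
    # Child process: give it a recognizable name before starting the background tasks
    setproctitle.setproctitle("ARAX_background_tasker::run_tasks")
    # ... the background tasker loop would run here ...
else:
    print(f"Background tasker is running in child process {pid}")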
I will try to do that tonight
ah, intriguing!
Check it out, our beautiful new process table:
OK, I think the code is looking pretty stable. I'm going to merge to master.
very fancy!
Hopefully this will help figure out what is going on inside the container, especially if a child process gets orphaned and then subsequently appears like a parent process.
Not sure if any of those commits will stop the issue with the Background Tasker dying, but I've made some efforts to prevent it from terminating unnecessarily (or at least to log if it has an issue).
So, where things stand: the new code is at the moment only running on /test and /kg2test, but the code has been merged to master in GitHub. So the next time we roll out master to the other dev endpoints, it will go out more broadly.
Please SMS me if these edits cause havoc tonight.
Hope the new package dependency setproctitle doesn't cause issues for folks on Windows boxes. Can someone please test out pip3 install setproctitle on Windows?
It installs fine for me:
# pip install setproctitle
Collecting setproctitle
Downloading setproctitle-1.3.3-cp310-cp310-win_amd64.whl (11 kB)
Installing collected packages: setproctitle
Successfully installed setproctitle-1.3.3
OK, so @edeutsch was right: this issue is more frequent than I had thought. Out of the four background tasker processes that I left running on arax.ncats.io last night (/beta, /test, /kg2beta, /kg2test), one of them had died as of this morning. From inside the rtx1 container:
24881 ? S 0:17 python3 -u -m openapi_server
24913 ? S 0:14 \_ python3 ARAX_background_tasker::run_tasks
24927 ? S 0:08 python3 -u -m openapi_server
24962 ? S 0:14 \_ python3 ARAX_background_tasker::run_tasks
25146 ? S 1:50 python3 -u -m openapi_server
25180 ? S 0:13 \_ python3 ARAX_background_tasker::run_tasks
25194 ? S 0:38 python3 -u -m openapi_server
Let's see which service has PID 25194 inside the container. Inside the container, as user rt:
rt@d1fd345478a0:~$ netstat -pean | grep 25194
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 0.0.0.0:5003 0.0.0.0:* LISTEN 1025 799408281 25194/python3
So it's TCP port 5003. Looking at /etc/apache/sites-enabled/000-default.conf for the Location directive referencing port 5003 inside the container,
<Location "/beta/api/arax/v1.4">
ProxyPass "http://localhost:5003/beta/api/arax/v1.4"
#Header add Access-Control-Allow-Origin "*"
Header add Access-Control-Allow-Headers "Content-Type"
Header add Access-Control-Allow-Methods "GET, POST, OPTIONS"
</Location>
we see that this is the /beta service whose background tasker died.
I've copied the logfile RTX_OpenAPI_beta.elog to /home/stephenr on the arax.ncats.io host, and am inspecting it to see what happened. The child process PID (inside the container) for the background tasker was evidently 25229:
Background tasker is running in child process 25229
As of 10:04 PM PDT last night, the background tasker was running:
24881 ? S 0:07 python3 -u -m openapi_server
24913 ? S 0:02 \_ python3 ARAX_background_tasker::run_tasks
24927 ? S 0:06 python3 -u -m openapi_server
24962 ? S 0:02 \_ python3 ARAX_background_tasker::run_tasks
25146 ? S 0:08 python3 -u -m openapi_server
25180 ? S 0:02 \_ python3 ARAX_background_tasker::run_tasks
25194 ? S 0:07 python3 -u -m openapi_server
25229 ? S 0:02 \_ python3 ARAX_background_tasker::run_tasks
so somewhere in the last 9.5 hours, it died.
OK, here is the relevant excerpt of the RTX_OpenAPI_beta.elog file:
2023-10-19T11:54:38.049795: INFO: ARAXBackgroundTasker (PID 25229) status: waiting. Current load is (0.0, 0.0, 0.0), n_clients=0, n_ongoing_queries=0
2023-10-19T11:55:38.310679: INFO: ARAXBackgroundTasker (PID 25229) status: waiting. Current load is (0.0, 0.0, 0.0), n_clients=0, n_ongoing_queries=0
2023-10-19T11:56:38.571498: INFO: ARAXBackgroundTasker (PID 25229) status: waiting. Current load is (0.0, 0.0, 0.0), n_clients=0, n_ongoing_queries=0
2023-10-19T11:57:38.832400: INFO: ARAXBackgroundTasker (PID 25229) status: waiting. Current load is (0.0, 0.0, 0.0), n_clients=0, n_ongoing_queries=0
127.0.0.1 - - [19/Oct/2023 11:58:23] "POST /beta/api/arax/v1.4/entity HTTP/1.1" 200 -
127.0.0.1 - - [19/Oct/2023 11:58:23] "POST /beta/api/arax/v1.4/entity HTTP/1.1" 200 -
127.0.0.1 - - [19/Oct/2023 11:58:23] "GET /beta/api/arax/v1.4/meta_knowledge_graph?format=simple HTTP/1.1" 200 -
127.0.0.1 - - [19/Oct/2023 11:58:28] "GET /beta/api/arax/v1.4/response/a81193d0-cfab-4049-82e3-b38d1c89e2f5 HTTP/1.1" 200 -
WARNING: tried to convert the response to JSON and it did not work
It was: {'level': 'INFO', 'message': 'ARS info: <!DOCTYPE html><html><head><meta charSet="utf-8"/><meta name="viewport" content="width=device-width"/><title>404: This page could not be found</title><meta name="next-head-count" content="3"/><noscript data-n-css=""></noscript><script defer="" nomodule="" src="/_next/static/chunks/polyfills-c67a75d1b6f99dc8.js"></script><script src="/_next/static/chunks/webpack-3864d442a525cf32.js" defer=""></script><script src="/_next/static/chunks/framework-8883d1e9be70c3da.js" defer=""></script><script src="/_next/static/chunks/main-d1cb916d1fed4c3b.js" defer=""></script><script src="/_next/static/chunks/pages/_app-b555d5e1eab47959.js" defer=""></script><script src="/_next/static/chunks/pages/_error-d79168f986538ac0.js" defer=""></script><script src="/_next/static/jUzYuJn7pMobhm-zdAIud/_buildManifest.js" defer=""></script><script src="/_next/static/jUzYuJn7pMobhm-zdAIud/_ssgManifest.js" defer=""></script></head><body><div id="__next"><div style="font-family:system-ui,&quot;Segoe UI&quot;,Roboto,Helvetica,Arial,sans-serif,&quot;Apple Color Emoji&quot;,&quot;Segoe UI Emoji&quot;;height:100vh;text-align:center;display:flex;flex-direction:column;align-items:center;justify-content:center"><div><style>body{color:#000;background:#fff;margin:0}.next-error-h1{border-right:1px solid rgba(0,0,0,.3)}@media (prefers-color-scheme:dark){body{color:#fff;background:#000}.next-error-h1{border-right:1px solid rgba(255,255,255,.3)}}</style><h1 class="next-error-h1" style="display:inline-block;margin:0 20px 0 0;padding-right:23px;font-size:24px;font-weight:500;vertical-align:top;line-height:49px">404</h1><div style="display:inline-block;text-align:left"><h2 style="font-size:14px;font-weight:400;line-height:49px;margin:0">This page could not be found<!-- -->.</h2></div></div></div></div><script id="__NEXT_DATA__" type="application/json">{"props":{"pageProps":{"statusCode":404}},"page":"/_error","query":{},"buildId":"jUzYuJn7pMobhm-zdAIud","nextExport":true,"isFallback":false,"gip":true,"scriptLoader":[]}</script></body></html>'}
127.0.0.1 - - [19/Oct/2023 11:58:30] "GET /beta/api/arax/v1.4/response/30212f09-5adf-4853-b9f3-1d48fb5c8c3b HTTP/1.1" 200 -
INFO: Launching validator via multiprocessing
PID returned from call to os.waitpid: 25998PID returned from call to os.waitpid: 25997
PID returned from call to os.waitpid: 26003
PID returned from call to os.waitpid: 26006
PID returned from call to os.waitpid: 26007
PID returned from call to os.waitpid: 0
PID returned from call to os.waitpid: 0
127.0.0.1 - - [19/Oct/2023 11:58:38] "GET /beta/api/arax/v1.4/response/b8f76104-ecea-445f-8e34-002deac91dff HTTP/1.1" 200 -
2023-10-19T11:58:39.093807: INFO: ARAXBackgroundTasker (PID 25229) status: waiting. Current load is (0.15, 0.03, 0.01), n_clients=0, n_ongoing_queries=0
INFO: Launching validator via multiprocessing
INFO: Launching validator via multiprocessing
INFO: Launching validator via multiprocessing
INFO: Launching validator via multiprocessing
127.0.0.1 - - [19/Oct/2023 11:58:40] "HEAD /beta/api/arax/v1.4/entity?q=ibuprofen HTTP/1.1" 200 -
INFO: Launching validator via multiprocessing
PID returned from call to os.waitpid: 25229PID returned from call to os.waitpid: 26043
PID returned from call to os.waitpid: 0
PID returned from call to os.waitpid: 26036
PID returned from call to os.waitpid: 0
PID returned from call to os.waitpid: 26072
PID returned from call to os.waitpid: 0
PID returned from call to os.waitpid: 0
PID returned from call to os.waitpid: 0
127.0.0.1 - - [19/Oct/2023 11:58:42] "GET /beta/api/arax/v1.4/response/57c65997-eda8-4fce-ba6d-036eac8f2182 HTTP/1.1" 200 -
127.0.0.1 - - [19/Oct/2023 11:58:42] "GET /beta/api/arax/v1.4/response/d52147f0-98f5-44bf-8cff-24b3729a4699 HTTP/1.1" 200 -
PID returned from call to os.waitpid: 0
INFO: Launching validator via multiprocessing
child process 25229 is already gone; exiting now
127.0.0.1 - - [19/Oct/2023 11:58:43] "GET /beta/api/arax/v1.4/response/0215373f-2f36-4b94-a59f-834b4363fb90 HTTP/1.1" 200 -
PID returned from call to os.waitpid: 0
WARNING: tried to convert the response to JSON and it did not work
It was: {'level': 'INFO', 'message': 'ARS info: <html>\r\n<head><title>504 Gateway Time-out</title></head>\r\n<body>\r\n<center><h1>504 Gateway Time-out</h1></center>\r\n</body>\r\n</html>\r\n'}
127.0.0.1 - - [19/Oct/2023 11:58:44] "GET /beta/api/arax/v1.4/response/78c1afac-0a8d-4647-83c0-d352ba33a4b0 HTTP/1.1" 200 -
127.0.0.1 - - [19/Oct/2023 11:58:44] "GET /beta/api/arax/v1.4/response/327f5197-af3f-465f-a7a6-bcbeaf078d68 HTTP/1.1" 200 -
INFO: Launching validator via multiprocessing
127.0.0.1 - - [19/Oct/2023 11:58:46] "GET /beta/api/arax/v1.4/response/208c8007-a470-432e-a71e-c8405331ef3d HTTP/1.1" 200 -
INFO: Launching validator via multiprocessing
PID returned from call to os.waitpid: 0
PID returned from call to os.waitpid: 0
127.0.0.1 - - [19/Oct/2023 11:58:48] "GET /beta/api/arax/v1.4/response/43da21a3-e0e0-4df7-adda-448f8722d539 HTTP/1.1" 200 -
WARNING:bmt.toolkit:no biolink class found for the given curie: InChI=1S/C17H25N3O2/c18-9-14-2-1-3-20(14)15(21)10-19-16-5-12-4-13(6-16)8-17(22,7-12)11-16/h12-14,19,22H,1-8,10-11H2/t12-,13+,14-,16+,17-/m0/s1, try get_element_by_mapping?
PID returned from call to os.waitpid: 0
127.0.0.1 - - [19/Oct/2023 11:58:50] "GET /beta/api/arax/v1.4/response/6bed66f8-ff5f-4cb6-9cde-53b7cb41dbdd HTTP/1.1" 200 -
PID returned from call to os.waitpid: 0
127.0.0.1 - - [19/Oct/2023 11:59:00] "GET /beta/api/arax/v1.4/response/4065fae1-3652-446e-9bf2-e5367dce2a9b HTTP/1.1" 200 -
WARNING:bmt.toolkit:no biolink class found for the given curie: dct:description, try get_element_by_mapping?
WARNING:bmt.toolkit:no biolink class found for the given curie: bts:sentence, try get_element_by_mapping?
ChildProcessError(10, 'No child processes'); this is expected if there are no more child processes to reap
127.0.0.1 - - [19/Oct/2023 11:59:40] "GET /beta/api/arax/v1.4/response/d11d1331-de06-49d0-ad88-0b29a9b72b80 HTTP/1.1" 200 -
127.0.0.1 - - [19/Oct/2023 11:59:51] "GET /beta/api/arax/v1.4/response/a81193d0-cfab-4049-82e3-b38d1c89e2f5 HTTP/1.1" 200 -
INFO: Launching validator via multiprocessing
WARNING:bmt.toolkit:no biolink class found for the given curie: metatype:Datetime, try get_element_by_mapping?
WARNING:bmt.toolkit:no biolink class found for the given curie: bts:sentence, try get_element_by_mapping?
ChildProcessError(10, 'No child processes'); this is expected if there are no more child processes to reap
127.0.0.1 - - [19/Oct/2023 12:01:00] "GET /beta/api/arax/v1.4/response/a53b2218-e370-4a17-a24a-e049d296b698 HTTP/1.1" 200 -
127.0.0.1 - - [19/Oct/2023 12:03:40] "HEAD /beta/api/arax/v1.4/entity?q=ibuprofen HTTP/1.1" 200 -
127.0.0.1 - - [19/Oct/2023 12:08:40] "HEAD /beta/api/arax/v1.4/entity?q=ibuprofen HTTP/1.1" 200 -
Note that the validator runs and then the background tasker dies right away:
INFO: Launching validator via multiprocessing
child process 25229 is already gone; exiting now
That suggests the problem should be repeatable. Launch an instance, point a browser to https://arax.ncats.io/beta/?r=5b476698-5613-4040-a497-34fd2c10f160, and see the problem?
On it
Restarting /beta now on arax.ncats.io.
OK, in the restarted /beta, the background tasker's within-container PID is 26778:
26744 ? S 0:06 python3 -u -m openapi_server
26778 ? S 0:02 \_ python3 ARAX_background_tasker::run_tasks
OK, the problem has recurred. The response cache validator seemed to kill the background tasker with immediate effect:
26744 ? Sl 0:31 python3 -u -m openapi_server
26895 ? S 0:00 \_ python3 -u -m openapi_server
26999 ? S 0:00 \_ python3 -u -m openapi_server
I'm going to turn off multiprocessing in response_cache.py, restart /beta, and test again.
Here's the code change I made to response_cache.py:
Here's the state of /beta in the process table inside the container before making the above-referenced HTTPS request:
27084 ? S 0:07 python3 -u -m openapi_server
27118 ? S 0:02 \_ python3 ARAX_background_tasker::run_tasks
And.... everything is still fine with the background tasker:
24881 ? S 0:18 python3 -u -m openapi_server
24913 ? S 0:14 \_ python3 ARAX_background_tasker::run_tasks
24927 ? S 0:08 python3 -u -m openapi_server
24962 ? S 0:14 \_ python3 ARAX_background_tasker::run_tasks
25146 ? S 1:58 python3 -u -m openapi_server
25180 ? S 0:14 \_ python3 ARAX_background_tasker::run_tasks
27084 ? S 0:26 python3 -u -m openapi_server
27118 ? S 0:02 \_ python3 ARAX_background_tasker::run_tasks
This might help: https://pythonspeed.com/articles/python-multiprocessing/ (or at least be amusing to read).
Fixed. Merged to master. Deployed to /beta and /test. Closing.
So, here is my explanation of what happened. When we switched to having the Flask application fork and run the background tasker in a child process, we needed a mechanism to ensure that the background tasker doesn't persist as an orphan process if the Flask application is shut down via SIGTERM (which is how the System V init script will attempt to stop it, if you do service RTX_OpenAPI_beta stop or whatever). So in the main module of the Flask application, I installed (in the parent process) a handler for SIGTERM, which sends a SIGKILL to the background-tasker child process. Seems simple, right? Well, here's the thing. response_controller.py (which of course runs in the Flask application "parent" process) calls response_cache.py, where there is multiprocessing code that does a bunch of fork calls under the hood to create child processes that do the work of TRAPI response validation. In Linux, when a fork is done, all signal handlers are copied to the child process. You can probably see where this is going... So each of the child processes for the multiprocessing validator is armed with a trigger to kill the background tasker! Now, by itself, that's not enough for this bug to occur. There needs to be a SIGTERM. But of course, how does the multiprocessing module get rid of its child processes when they are no longer needed? You guessed it... SIGTERM. The solution: the SIGTERM signal handler that is installed in the parent process now checks whether it is running in the real parent process, and it only SIGKILLs the background tasker if it is the real parent process.
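For illustration, a simplified sketch of that fix (assumed variable names, not the exact ARAX code); because fork() copies signal handlers, the multiprocessing workers inherit this handler, so it must verify that it is running in the real parent before SIGKILLing the background tasker:

import os
import signal

parent_pid = os.getpid()          # recorded when the handler is installed
background_tasker_pid = None      # set after forking the background tasker

def handle_sigterm(signum, frame):
    if os.getpid() != parent_pid:
        # A forked multiprocessing worker inherited this handler;
        # do not touch the background tasker, just exit.
        os._exit(0)
    if background_tasker_pid is not None:
        os.kill(background_tasker_pid, signal.SIGKILL)
    os._exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)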
Of course, there appears to be a cleaner solution (from the pythonspeed article Eric kindly linked above):
from multiprocessing import set_start_method
set_start_method("spawn")
but this approach might be a bit slower, so I opted for the presumably faster (but surely less elegant) solution described above. Anyhow, it's fixed.
@amykglen @isbluis @edeutsch please let me know (or reopen this issue) if you happen to see any future cases of the background tasker dying. Such cases should be easier to spot now, with the updated process names in the process table.
Since it has been merged to master, I have deleted the issue-2170 branch.
I have rolled out master to /devED and /devLM. Both seem to be working.
Great, thank you for sleuthing this out and fixing it!
this was first reported in #2106 (here), when Expand didn't pick up on a KP's updated meta KG as quickly as we expected (eventually the backup mechanism that refreshes the cache if it hasn't been touched for more than 24 hours kicked in).
I think to resolve that we stopped and restarted /beta? (@saramsey and @edeutsch were involved in that discussion over in #2106.) I think we were hoping that maybe that would be a one-off problem where the BackgroundTasker randomly died.
but @isbluis reported yesterday that automat-ctd's recently-fixed meta KG (in dev) had not been picked up by Expand (after more than an hour had passed, I believe).
so it seems maybe something is up with the BackgroundTasker?