RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License
33 stars 21 forks source link

KP info cache not refreshing hourly - BackgroundTasker issue? #2170

Closed amykglen closed 1 year ago

amykglen commented 1 year ago

this was first reported in #2106 (here), when Expand didn't pick up on a KP's updated meta KG as quickly as we expected (eventually the backup mechanism that refreshes the cache if it hasn't been touched for more than 24 hours kicked in).

I think to resolve that we stopped and restarted /beta? (@saramsey and @edeutsch were involved in that discussion over in #2106)

I think we were hoping that maybe that would be a one-off problem where the BackgroundTasker randomly died.

but @isbluis reported yesterday that automat-ctd's recently-fixed meta KG (in dev) had not been picked up by Expand (after more than an hour had passed, I believe).

so it seems maybe something is up with the BackgroundTasker?

saramsey commented 1 year ago

Thanks @amykglen for bringing this issue up. So, @isbluis can you provide some context for the issue yesterday with picking up the new meta-KG for automat-ctd? Which ARAX endpoint was this? How was it eventually resolved? Was the endpoint resetarted? Was the elog file saved before the Flask app was restarted?

edeutsch commented 1 year ago

I think I added more logging on what the BackgroundTasker was doing, so looking at the logs (when we determine which instance we're talking about) would be useful.

Based on minimal information, it appeared that the BackgroundTasker just stopped working with little information on what happened. The logs might show what its last communications were.

I'm thinking this problem appears to be new since we switched from a thread to a child process.

saramsey commented 1 year ago

Hmm, this is pure conjecture, but my gut feeling is "unhandled signal" in the background tasker.

Could be that we just need to install a signal handler on the Background Tasker process (or change the default action to be "log and ignore").

saramsey commented 1 year ago

Hi @edeutsch can you remind me, do we need to run the Background Tasker on RTX-KG2 services? We don't use the KP info cache on those services, so maybe we don't need it?

saramsey commented 1 year ago

I'm doing this work in branch issue-2170. Please see commit 86b6c2.

edeutsch commented 1 year ago

Yes, I think we should run it. There is also code that assesses the currently running child processes according to the system activity table and removes any that have died (i.e. if the child process is found to have gone away, it is marked as died and removed from the activity table. It also does some other things I think.

saramsey commented 1 year ago

Understood. I will fix commit 86b6c2 so BackgroundTasker is run in the RTX-KG2 service. Expect another commit for that tonight, to the issue-2170 branch.

saramsey commented 1 year ago

In this branch, I'm also removing all use of logging code (i.e., the python logging framework) inside the flask application. Apparently it doesn't play nice with forking and we actually don't use it in very many places, so easy to replace with eprint (which is really for issue 2114 but I'm folding that work into this branch).

saramsey commented 1 year ago

OK, commit 7d622a8 should run the BackgroundTasker in the KG2 main.py now.

saramsey commented 1 year ago

Testing this on /test on arax.ncats.io (see #deployment channel on Slack).

saramsey commented 1 year ago

I think we should have __main__.py set the process title for the Background Tasker process to something like "ARAX_background_tasker.py", using the python setproctitle package.

https://stackoverflow.com/questions/564695/is-there-a-way-to-change-effective-process-name-in-python

I will try to do that tonight

edeutsch commented 1 year ago

ah, intriguing!

saramsey commented 1 year ago

Check it out, our beautiful new process table: Screenshot 2023-10-18 at 9 25 21 PM

saramsey commented 1 year ago

OK, I think the code is looking pretty stable. I'm going to merge to master.

edeutsch commented 1 year ago

very fancy!

saramsey commented 1 year ago

Hopefully this will help figure out what is going on inside the container, especially if a child process gets orphaned and then subsequently appears like a parent process.

saramsey commented 1 year ago

Not sure if any of those commits will stop the issue with Background Tasker dying, but I've made some efforts to try and prevent it terminating unnecessarily (or logging if it has an issue).

saramsey commented 1 year ago

So, where things stand-- the new code is at the moment only running on /test and on /kg2test, but the code has been merged to master in GitHub. So the next time we roll out master to the other dev endpoints, it will go out more broadly.

saramsey commented 1 year ago

Please SMS me if these edits cause havoc tonight.

saramsey commented 1 year ago

Hope the new package dependency setproctitle doesn't cause issues for folks on Windows boxes. Can someone please test out pip3 install setproctitle on windows please?

edeutsch commented 1 year ago

it installs fine for me.

# pip install setproctitle
Collecting setproctitle
  Downloading setproctitle-1.3.3-cp310-cp310-win_amd64.whl (11 kB)
Installing collected packages: setproctitle
Successfully installed setproctitle-1.3.3
saramsey commented 1 year ago

OK, so @edeutsch was right, this issue is more frequent that I had thought. Out of four background tasker processes that I left running on arax.ncats.io last night (/beta, /test, /kg2beta, /kg2test), one of them had died as of this morning. From inside the rtx1 container:

24881 ?        S      0:17 python3 -u -m openapi_server
24913 ?        S      0:14  \_ python3 ARAX_background_tasker::run_tasks
24927 ?        S      0:08 python3 -u -m openapi_server
24962 ?        S      0:14  \_ python3 ARAX_background_tasker::run_tasks
25146 ?        S      1:50 python3 -u -m openapi_server
25180 ?        S      0:13  \_ python3 ARAX_background_tasker::run_tasks
25194 ?        S      0:38 python3 -u -m openapi_server

Let's see which service has PID 25194 inside the container. Inside the container, as user rt:

rt@d1fd345478a0:~$ netstat -pean | grep 25194
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 0.0.0.0:5003            0.0.0.0:*               LISTEN      1025       799408281   25194/python3

So it's TCP port 5003. Looking at /etc/apache/sites-enabled/000-default.conf for the Location directive referencing port 5003 inside the container,

        <Location "/beta/api/arax/v1.4">
                ProxyPass "http://localhost:5003/beta/api/arax/v1.4"
                #Header add Access-Control-Allow-Origin "*"
                Header add Access-Control-Allow-Headers "Content-Type"
                Header add Access-Control-Allow-Methods "GET, POST, OPTIONS"
        </Location>

we see that this is the /beta service whose background tasker died.

saramsey commented 1 year ago

I've copied the logfile RTX_OpenAPI_beta.elog to /home/stephenr on the arax.ncats.io host, and am inspecting it to see what happened. The child process PID (inside the container) for the background tasker was evidently 25229,

Background tasker is running in child process 25229

As of 10:04 PM PDT last night, the background tasker was running:

24881 ?        S      0:07 python3 -u -m openapi_server
24913 ?        S      0:02  \_ python3 ARAX_background_tasker::run_tasks
24927 ?        S      0:06 python3 -u -m openapi_server
24962 ?        S      0:02  \_ python3 ARAX_background_tasker::run_tasks
25146 ?        S      0:08 python3 -u -m openapi_server
25180 ?        S      0:02  \_ python3 ARAX_background_tasker::run_tasks
25194 ?        S      0:07 python3 -u -m openapi_server
25229 ?        S      0:02  \_ python3 ARAX_background_tasker::run_tasks

so somewhere in the last 9.5 hours, it died.

saramsey commented 1 year ago

OK, here is the relevant excerpt of the RTX_OpenAPI_beta.elog file:

2023-10-19T11:54:38.049795: INFO: ARAXBackgroundTasker (PID 25229) status: waiting. Current load is (0.0, 0.0, 0.0), n_clients=0, n_ongoing_queries=0
2023-10-19T11:55:38.310679: INFO: ARAXBackgroundTasker (PID 25229) status: waiting. Current load is (0.0, 0.0, 0.0), n_clients=0, n_ongoing_queries=0
2023-10-19T11:56:38.571498: INFO: ARAXBackgroundTasker (PID 25229) status: waiting. Current load is (0.0, 0.0, 0.0), n_clients=0, n_ongoing_queries=0
2023-10-19T11:57:38.832400: INFO: ARAXBackgroundTasker (PID 25229) status: waiting. Current load is (0.0, 0.0, 0.0), n_clients=0, n_ongoing_queries=0
127.0.0.1 - - [19/Oct/2023 11:58:23] "POST /beta/api/arax/v1.4/entity HTTP/1.1" 200 -
127.0.0.1 - - [19/Oct/2023 11:58:23] "POST /beta/api/arax/v1.4/entity HTTP/1.1" 200 -
127.0.0.1 - - [19/Oct/2023 11:58:23] "GET /beta/api/arax/v1.4/meta_knowledge_graph?format=simple HTTP/1.1" 200 -
127.0.0.1 - - [19/Oct/2023 11:58:28] "GET /beta/api/arax/v1.4/response/a81193d0-cfab-4049-82e3-b38d1c89e2f5 HTTP/1.1" 200 -
WARNING: tried to convert the response to JSON and it did not work
It was: {'level': 'INFO', 'message': 'ARS info: &lt;!DOCTYPE html&gt;&lt;html&gt;&lt;head&gt;&lt;meta charSet=&quot;utf-8&quot;/&gt;&lt;meta name=&quot;viewport&quot; content=&quot;width=device-width&quot;/&gt;&lt;title&gt;404: This page could not be found&lt;/title&gt;&lt;meta name=&quot;next-head-count&quot; content=&quot;3&quot;/&gt;&lt;noscript data-n-css=&quot;&quot;&gt;&lt;/noscript&gt;&lt;script defer=&quot;&quot; nomodule=&quot;&quot; src=&quot;/_next/static/chunks/polyfills-c67a75d1b6f99dc8.js&quot;&gt;&lt;/script&gt;&lt;script src=&quot;/_next/static/chunks/webpack-3864d442a525cf32.js&quot; defer=&quot;&quot;&gt;&lt;/script&gt;&lt;script src=&quot;/_next/static/chunks/framework-8883d1e9be70c3da.js&quot; defer=&quot;&quot;&gt;&lt;/script&gt;&lt;script src=&quot;/_next/static/chunks/main-d1cb916d1fed4c3b.js&quot; defer=&quot;&quot;&gt;&lt;/script&gt;&lt;script src=&quot;/_next/static/chunks/pages/_app-b555d5e1eab47959.js&quot; defer=&quot;&quot;&gt;&lt;/script&gt;&lt;script src=&quot;/_next/static/chunks/pages/_error-d79168f986538ac0.js&quot; defer=&quot;&quot;&gt;&lt;/script&gt;&lt;script src=&quot;/_next/static/jUzYuJn7pMobhm-zdAIud/_buildManifest.js&quot; defer=&quot;&quot;&gt;&lt;/script&gt;&lt;script src=&quot;/_next/static/jUzYuJn7pMobhm-zdAIud/_ssgManifest.js&quot; defer=&quot;&quot;&gt;&lt;/script&gt;&lt;/head&gt;&lt;body&gt;&lt;div id=&quot;__next&quot;&gt;&lt;div style=&quot;font-family:system-ui,&amp;quot;Segoe UI&amp;quot;,Roboto,Helvetica,Arial,sans-serif,&amp;quot;Apple Color Emoji&amp;quot;,&amp;quot;Segoe UI Emoji&amp;quot;;height:100vh;text-align:center;display:flex;flex-direction:column;align-items:center;justify-content:center&quot;&gt;&lt;div&gt;&lt;style&gt;body{color:#000;background:#fff;margin:0}.next-error-h1{border-right:1px solid rgba(0,0,0,.3)}@media (prefers-color-scheme:dark){body{color:#fff;background:#000}.next-error-h1{border-right:1px solid rgba(255,255,255,.3)}}&lt;/style&gt;&lt;h1 class=&quot;next-error-h1&quot; style=&quot;display:inline-block;margin:0 20px 0 0;padding-right:23px;font-size:24px;font-weight:500;vertical-align:top;line-height:49px&quot;&gt;404&lt;/h1&gt;&lt;div style=&quot;display:inline-block;text-align:left&quot;&gt;&lt;h2 style=&quot;font-size:14px;font-weight:400;line-height:49px;margin:0&quot;&gt;This page could not be found&lt;!-- --&gt;.&lt;/h2&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;script id=&quot;__NEXT_DATA__&quot; type=&quot;application/json&quot;&gt;{&quot;props&quot;:{&quot;pageProps&quot;:{&quot;statusCode&quot;:404}},&quot;page&quot;:&quot;/_error&quot;,&quot;query&quot;:{},&quot;buildId&quot;:&quot;jUzYuJn7pMobhm-zdAIud&quot;,&quot;nextExport&quot;:true,&quot;isFallback&quot;:false,&quot;gip&quot;:true,&quot;scriptLoader&quot;:[]}&lt;/script&gt;&lt;/body&gt;&lt;/html&gt;'}
127.0.0.1 - - [19/Oct/2023 11:58:30] "GET /beta/api/arax/v1.4/response/30212f09-5adf-4853-b9f3-1d48fb5c8c3b HTTP/1.1" 200 -
INFO: Launching validator via multiprocessing
PID returned from call to os.waitpid: 25998PID returned from call to os.waitpid: 25997
PID returned from call to os.waitpid: 26003
PID returned from call to os.waitpid: 26006
PID returned from call to os.waitpid: 26007
PID returned from call to os.waitpid: 0

PID returned from call to os.waitpid: 0
127.0.0.1 - - [19/Oct/2023 11:58:38] "GET /beta/api/arax/v1.4/response/b8f76104-ecea-445f-8e34-002deac91dff HTTP/1.1" 200 -
2023-10-19T11:58:39.093807: INFO: ARAXBackgroundTasker (PID 25229) status: waiting. Current load is (0.15, 0.03, 0.01), n_clients=0, n_ongoing_queries=0
INFO: Launching validator via multiprocessing
INFO: Launching validator via multiprocessing
INFO: Launching validator via multiprocessing
INFO: Launching validator via multiprocessing
127.0.0.1 - - [19/Oct/2023 11:58:40] "HEAD /beta/api/arax/v1.4/entity?q=ibuprofen HTTP/1.1" 200 -
INFO: Launching validator via multiprocessing
PID returned from call to os.waitpid: 25229PID returned from call to os.waitpid: 26043
PID returned from call to os.waitpid: 0
PID returned from call to os.waitpid: 26036
PID returned from call to os.waitpid: 0
PID returned from call to os.waitpid: 26072
PID returned from call to os.waitpid: 0
PID returned from call to os.waitpid: 0

PID returned from call to os.waitpid: 0
127.0.0.1 - - [19/Oct/2023 11:58:42] "GET /beta/api/arax/v1.4/response/57c65997-eda8-4fce-ba6d-036eac8f2182 HTTP/1.1" 200 -
127.0.0.1 - - [19/Oct/2023 11:58:42] "GET /beta/api/arax/v1.4/response/d52147f0-98f5-44bf-8cff-24b3729a4699 HTTP/1.1" 200 -
PID returned from call to os.waitpid: 0
INFO: Launching validator via multiprocessing
child process 25229 is already gone; exiting now
127.0.0.1 - - [19/Oct/2023 11:58:43] "GET /beta/api/arax/v1.4/response/0215373f-2f36-4b94-a59f-834b4363fb90 HTTP/1.1" 200 -
PID returned from call to os.waitpid: 0
WARNING: tried to convert the response to JSON and it did not work
It was: {'level': 'INFO', 'message': 'ARS info: &lt;html&gt;\r\n&lt;head&gt;&lt;title&gt;504 Gateway Time-out&lt;/title&gt;&lt;/head&gt;\r\n&lt;body&gt;\r\n&lt;center&gt;&lt;h1&gt;504 Gateway Time-out&lt;/h1&gt;&lt;/center&gt;\r\n&lt;/body&gt;\r\n&lt;/html&gt;\r\n'}
127.0.0.1 - - [19/Oct/2023 11:58:44] "GET /beta/api/arax/v1.4/response/78c1afac-0a8d-4647-83c0-d352ba33a4b0 HTTP/1.1" 200 -
127.0.0.1 - - [19/Oct/2023 11:58:44] "GET /beta/api/arax/v1.4/response/327f5197-af3f-465f-a7a6-bcbeaf078d68 HTTP/1.1" 200 -
INFO: Launching validator via multiprocessing
127.0.0.1 - - [19/Oct/2023 11:58:46] "GET /beta/api/arax/v1.4/response/208c8007-a470-432e-a71e-c8405331ef3d HTTP/1.1" 200 -
INFO: Launching validator via multiprocessing
PID returned from call to os.waitpid: 0
PID returned from call to os.waitpid: 0
127.0.0.1 - - [19/Oct/2023 11:58:48] "GET /beta/api/arax/v1.4/response/43da21a3-e0e0-4df7-adda-448f8722d539 HTTP/1.1" 200 -
WARNING:bmt.toolkit:no biolink class found for the given curie: InChI=1S/C17H25N3O2/c18-9-14-2-1-3-20(14)15(21)10-19-16-5-12-4-13(6-16)8-17(22,7-12)11-16/h12-14,19,22H,1-8,10-11H2/t12-,13+,14-,16+,17-/m0/s1, try get_element_by_mapping?
PID returned from call to os.waitpid: 0
127.0.0.1 - - [19/Oct/2023 11:58:50] "GET /beta/api/arax/v1.4/response/6bed66f8-ff5f-4cb6-9cde-53b7cb41dbdd HTTP/1.1" 200 -
PID returned from call to os.waitpid: 0
127.0.0.1 - - [19/Oct/2023 11:59:00] "GET /beta/api/arax/v1.4/response/4065fae1-3652-446e-9bf2-e5367dce2a9b HTTP/1.1" 200 -
WARNING:bmt.toolkit:no biolink class found for the given curie: dct:description, try get_element_by_mapping?
WARNING:bmt.toolkit:no biolink class found for the given curie: bts:sentence, try get_element_by_mapping?
ChildProcessError(10, 'No child processes'); this is expected if there are no more child processes to reap
127.0.0.1 - - [19/Oct/2023 11:59:40] "GET /beta/api/arax/v1.4/response/d11d1331-de06-49d0-ad88-0b29a9b72b80 HTTP/1.1" 200 -
127.0.0.1 - - [19/Oct/2023 11:59:51] "GET /beta/api/arax/v1.4/response/a81193d0-cfab-4049-82e3-b38d1c89e2f5 HTTP/1.1" 200 -
INFO: Launching validator via multiprocessing
WARNING:bmt.toolkit:no biolink class found for the given curie: metatype:Datetime, try get_element_by_mapping?
WARNING:bmt.toolkit:no biolink class found for the given curie: bts:sentence, try get_element_by_mapping?
ChildProcessError(10, 'No child processes'); this is expected if there are no more child processes to reap
127.0.0.1 - - [19/Oct/2023 12:01:00] "GET /beta/api/arax/v1.4/response/a53b2218-e370-4a17-a24a-e049d296b698 HTTP/1.1" 200 -
127.0.0.1 - - [19/Oct/2023 12:03:40] "HEAD /beta/api/arax/v1.4/entity?q=ibuprofen HTTP/1.1" 200 -
127.0.0.1 - - [19/Oct/2023 12:08:40] "HEAD /beta/api/arax/v1.4/entity?q=ibuprofen HTTP/1.1" 200 -

Note that the validator runs and then the background tasker dies right away:

INFO: Launching validator via multiprocessing
child process 25229 is already gone; exiting now
edeutsch commented 1 year ago

that suggests that the problem should be repeatable. Launch an instance and point a browser to: https://arax.ncats.io/beta/?r=5b476698-5613-4040-a497-34fd2c10f160 and see the problem?

saramsey commented 1 year ago

On it

saramsey commented 1 year ago

Restarting /beta now on arax.ncats.io.

OK, in the restarted /beta, the background tasker's within-container PID is 26778:

26744 ?        S      0:06 python3 -u -m openapi_server
26778 ?        S      0:02  \_ python3 ARAX_background_tasker::run_tasks
saramsey commented 1 year ago

OK, the problem has reoccurred. The response cache validator seemed to kill the background tasker with immediate effect:

26744 ?        Sl     0:31 python3 -u -m openapi_server
26895 ?        S      0:00  \_ python3 -u -m openapi_server
26999 ?        S      0:00  \_ python3 -u -m openapi_server
saramsey commented 1 year ago

I'm going to turn off multiprocessing in response_cache.py, restart /beta, and test again.

saramsey commented 1 year ago

Here's the code change I made to response_cache.py:

Screenshot 2023-10-19 at 8 40 07 AM

Here's the state of /beta in the process table inside the container before making the above-referenced HTTPS request:

27084 ?        S      0:07 python3 -u -m openapi_server
27118 ?        S      0:02  \_ python3 ARAX_background_tasker::run_tasks
saramsey commented 1 year ago

And.... everything is still fine with the background tasker:

24881 ?        S      0:18 python3 -u -m openapi_server
24913 ?        S      0:14  \_ python3 ARAX_background_tasker::run_tasks
24927 ?        S      0:08 python3 -u -m openapi_server
24962 ?        S      0:14  \_ python3 ARAX_background_tasker::run_tasks
25146 ?        S      1:58 python3 -u -m openapi_server
25180 ?        S      0:14  \_ python3 ARAX_background_tasker::run_tasks
27084 ?        S      0:26 python3 -u -m openapi_server
27118 ?        S      0:02  \_ python3 ARAX_background_tasker::run_tasks
edeutsch commented 1 year ago

This might help: https://pythonspeed.com/articles/python-multiprocessing/

or at least be amusing to read

saramsey commented 1 year ago

Fixed. Merged to master. Deployed to /beta and /test Closing.

saramsey commented 1 year ago

So, here is my explanation of what happened. When we switched to having the Flask application fork and run the background tasker in the child process, we needed a mechanism to be able to ensure that the background tasker doesn't persist as an orphan process if the Flask application is shut down via SIGTERM (which is how the System V init script will attempt to stop it, if you do service RTX_OpenAPI_beta stop or whatever). So in the main module of the Flask application, I installed (in the parent process) a handler for SIGTERM, which sends a SIGKILL to the background-tasker child process. Seems simple, right? Well, here's the thing. In response_controller.py (which of course runs in the Flask application "parent" process), it calls response_cache.py where there is multiprocessing code that does a bunch of fork calls under the hood, to create child processes to do the work of TRAPI response validation. In Linux, when a fork is done, all signal handlers are copied to the child process. You can probably see where this is going... So each of the child processes for the multiprocessing validator is armed with a trigger to kill the background tasker! Now, by itself, that's not enough for this bug to occur. There needs to be a SIGTERM. But of course, how does the multiprocessing module get rid of its child processes when they are no longer needed? You guessed it... SIGTERM. The solution: the SIGTERM signal handler that is installed in the parent process, now checks to see if it is the real parent process. And it only SIGKILLs the background tasker if it is the real parent process.

Of course, there appears to be a cleaner solution: (from the pythonspeed article Eric kindly linked above):

from multiprocessing import set_start_method
set_start_method("spawn")

but this approach might be a bit slower, so I opted to the presumably faster (but surely less elegant) solution described above. Anyhow, it's fixed.

saramsey commented 1 year ago

@amykglen @isbluis @edeutsch please let me know (or reopen this issue) if you happen to see any future cases of the background tasker dying. Such cases should be easier to spot now, with the updated process names in the process table.

saramsey commented 1 year ago

Since it has been merged to master, I have deleted the issue-2170 branch

saramsey commented 1 year ago

I havae rolled out master to /devED and /devLM. Both seem to be working.

edeutsch commented 1 year ago

great, thank you from sleuthing this out and fixing!