LCOGT / mop

Microlensing Observation Portal
GNU General Public License v3.0

Webservice performance issues #82

Closed rachel3834 closed 7 months ago

rachel3834 commented 11 months ago

Both mop-dev and mop-prod exhibit severe, intermittent slow-downs of the browser-based UI.
These seem to occur after a redeployment, and/or when there is a heavy load of parallelized fitting processes running in cronjobs.

Possibly related log output from stern:

mop-54f64c58cb-rblr8 nginx 10.100.22.6 - - [27/Oct/2023:00:09:03 +0000] "GET / HTTP/1.1" 499 0 "https://mop.lco.global/targets/35484/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/118.0" "134.4.64.64"
mop-54f64c58cb-9vz2m mop [2023-10-27 00:09:33 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:8)
mop-54f64c58cb-9vz2m mop [2023-10-27 00:09:34 +0000] [20] [INFO] Booting worker with pid: 20
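
For anyone wanting to quantify how often this happens, here is a minimal sketch that scans captured log text for the two signatures seen above: gunicorn's WORKER TIMEOUT message and nginx status 499 (nginx's non-standard code for a client closing the connection before a response was sent). The function name `summarize` and the regular expressions are my own assumptions, written only against the log format in this comment, so they may need adjusting for other log layouts.

```python
import re
from collections import Counter

# Gunicorn message, e.g.
# "[2023-10-27 00:09:33 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:8)"
TIMEOUT_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) [+-]\d{4}\] "
    r"\[\d+\] \[CRITICAL\] WORKER TIMEOUT \(pid:(?P<pid>\d+)\)"
)

# nginx request line followed by the status code, e.g. '"GET / HTTP/1.1" 499 0'
STATUS_RE = re.compile(
    r'"(?:GET|POST|HEAD|PUT|DELETE|PATCH|OPTIONS) \S+ HTTP/[\d.]+" (?P<status>\d{3}) '
)


def summarize(log_text):
    """Return (list of (timestamp, pid) worker timeouts, Counter of HTTP statuses)."""
    timeouts = [
        (m.group("ts"), int(m.group("pid"))) for m in TIMEOUT_RE.finditer(log_text)
    ]
    statuses = Counter(int(m.group("status")) for m in STATUS_RE.finditer(log_text))
    return timeouts, statuses
```

Because it uses `finditer` over the whole text, it does not matter whether the captured output is split into one entry per line or run together.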

rachel3834 commented 11 months ago

From monitoring the log output, the performance slow-downs seem to occur whenever the fitneedevent and fitaliveevent cronjobs are running, and they self-resolve once these processes complete. It's not clear to me whether correlation = causation in this case, since these two processes have been running this way for some time, and the only recent change was a minor version change to PyLIMA. It may be the result of how the databases are configured to handle clashing queries.

rachel3834 commented 11 months ago

I switched off our modeling processes over the weekend and used an exec command to run a single fitting process manually as a test. This was killed by the OOM killer. Even after this process died, MOP remained painfully slow to load event pages. No other pods are running, so this transient performance issue does not seem to be related to our software.

rachel3834 commented 7 months ago

This has been the subject of several issues under Milestone 1. The DB query optimization work has significantly reduced the load on the DB and somewhat improved the performance of the webservice, but this is now limited by two factors. One is the structure of the Target extra parameters as key-value pairs, which makes queries rather cumbersome. This will be addressed in an upcoming new release of the TOM Toolkit.
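
To illustrate why the key-value layout is cumbersome, here is a small self-contained sketch using an in-memory SQLite database. The table names (`target`, `target_extra`, `target_flat`), the parameter names (`alive`, `t_e`) and the sample rows are all hypothetical and are not MOP's or the TOM Toolkit's actual schema; the point is only that each key-value parameter used in a filter needs its own join and a cast from text, whereas a flat table filters on columns directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT);
    -- Key-value layout: one row per (target, parameter), values stored as text.
    CREATE TABLE target_extra (target_id INTEGER, key TEXT, value TEXT);
    -- Flat layout: one typed column per parameter.
    CREATE TABLE target_flat (id INTEGER PRIMARY KEY, name TEXT, alive INTEGER, t_e REAL);
    """
)

# Hypothetical sample events: (id, name, alive flag, timescale).
rows = [(1, "event-A", 1, 25.0), (2, "event-B", 0, 40.0), (3, "event-C", 1, 80.0)]
for tid, name, alive, t_e in rows:
    conn.execute("INSERT INTO target VALUES (?, ?)", (tid, name))
    conn.executemany(
        "INSERT INTO target_extra VALUES (?, ?, ?)",
        [(tid, "alive", str(alive)), (tid, "t_e", str(t_e))],
    )
    conn.execute("INSERT INTO target_flat VALUES (?, ?, ?, ?)", (tid, name, alive, t_e))

# Key-value query: one self-join per parameter, plus a cast for numeric comparison.
kv_query = """
    SELECT t.name FROM target t
    JOIN target_extra a ON a.target_id = t.id AND a.key = 'alive'
    JOIN target_extra e ON e.target_id = t.id AND e.key = 't_e'
    WHERE a.value = '1' AND CAST(e.value AS REAL) > 30
    ORDER BY t.name
"""

# Flat query: the same filter expressed directly on typed columns.
flat_query = "SELECT name FROM target_flat WHERE alive = 1 AND t_e > 30 ORDER BY name"

kv_names = [r[0] for r in conn.execute(kv_query)]
flat_names = [r[0] for r in conn.execute(flat_query)]
```

Both queries select the same targets, but the key-value version grows by one join for every additional parameter in the filter.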

The second issue is that we are receiving a high volume of traffic from external webcrawler bots, which have apparently got hold of an old (and extensive) list of alert broker queries that MOP used to store. Although the bots can't get to the output of these queries, they are still attempting to run them. This will be fixed in an upcoming new release of the Toolkit as well.
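
As a rough way to gauge how much of the traffic comes from crawlers, the sketch below tallies user agents from nginx access-log entries in the format shown earlier in this issue. The substring hints in `BOT_HINTS` and the helper names are assumptions of mine; this is a common heuristic rather than a reliable bot detector, since crawlers can send any user agent they like.

```python
import re
from collections import Counter

# Matches: "METHOD path HTTP/x" status bytes "referer" "user agent"
ACCESS_RE = re.compile(
    r'"[A-Z]+ \S+ HTTP/[\d.]+" \d{3} \d+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Hypothetical hints: substrings that self-identifying crawlers often include.
BOT_HINTS = ("bot", "crawler", "spider")


def is_probable_bot(user_agent):
    """Heuristic only: True if the user agent contains a crawler-like token."""
    ua = user_agent.lower()
    return any(hint in ua for hint in BOT_HINTS)


def bot_share(log_text):
    """Return (requests from probable bots, total requests parsed)."""
    agents = Counter(m.group("ua") for m in ACCESS_RE.finditer(log_text))
    total = sum(agents.values())
    bots = sum(n for ua, n in agents.items() if is_probable_bot(ua))
    return bots, total
```

Running `bot_share` over a day's worth of access logs would give a first estimate of the crawler fraction, which could then be checked by hand against the most frequent user agents.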

I'm closing this issue for now as we have done all we can.