Investigate heavy DB write traffic from MOP's cronjobs

rachel3834 commented 5 months ago

From the LCO Slack discussion focusing around fit_all_events_PSPL.py (actually disabled but similar in structure to fit_need_events_PSPL.py):

The initial query for targets is quite broad, narrowed down only later, but because it is done under a transaction.atomic, all of the DB rows are locked until it completes.
Reducing the parallelization of the fitting process would also reduce the number of simultaneous operations
The loop over the targets once selected subsequently queries for different extra_fields; each of these queries is a separate DB transaction.

Proposed solutions after discussions with the TOM Team:

Restructure the initial queries to search for TargetExtras which have Alive=True, then downselect to minimize subsequent queries.
Combine queries more by AND statements rather than doing separate operations
Try preloading data

rachel3834 commented 5 months ago

To evaluate fit_need_events_PSPL's current code, I added checkpoints throughout the management command which measured the time taken and the number of DB connections open at any time.

The code starts by making a query for all Alive targets which have a Last_fit date earlier than a given threshold. Checkpoints before (1) and after (2) this query record: N DB connectioned 1: 1 N DB connectioned 2: 1 Time for targets query: 0:00:00.004373 Target(s) remaining in processing queue: 4118

This executes a multi-criteria query on the TargetExtra table but it seems to be efficient in both DB connections and time.

Next, the code selects just the first target, and queries the DB for associated ReducedDatums. This query is actually run twice - once for photometry and once for the model lightcurve. After which the checkpoint reports: N DB connectioned 3: 6 Time for ReducedDatums query for 1 target: 0:00:00.190659

So still efficient in time, but now we have multiple DB connections open.

[x] This could be improved by using the output from the first query to extract the model lightcurve as well. The target in question (Gaia23aiy) has both Gaia ReducedDatums and those for uploaded photometry and the model lightcurve.

The code then updates the Last_fit timestamp, generating more connections:

N DB connectioned 4: 12 Time for storing Last_fit timestamp: 0:00:00.010926

At this point, the code calls the run_fit function for the selected single target. Step 1 is to check whether it is a Gaia target, in which case it pulls the ReducedDatums from the DB again, updates the photometric uncertainties and saves them again.
Then it checks whether the target is classified as microlensing; if it isn't the Alive status is set to False and stored as an extra_field. After all this, checkpoint output is:

N DB connections 1: 20

[x] It's not clear to me why the Gaia uncertainties are evaluated here rather than during ingest at the Gaia broker - so this call can be moved from the fitting code. I suspect that this may once have been a necessary update but is legacy code at this point.
[x] Checking for microlensing as the classification should be a filter in the original query for targets, rather than a second query.

As the code loops through the list of targets, it seems to accumulate DB connections, even if no further processing of the target is done. In the case of Gaia23aiy, it appears to go into an infinite loop, because (as it happens), it had a non-microlensing classification, so the run_fit function returned to main without further output and returning no result.
The main function does not check the output, or remove the target from the list of targets to consider. This appears to cause it to remain in the target list, because the next set of output refers to the same object at the top of the list:

Target(s) remaining in processing queue: 4118 N DB connectioned 3: 25 Time for targets query: 0:00:00.166363 Target post save hook: Gaia23aiy created: False N DB connectioned 4: 31

With more DB connections accumulating.

[x] run_fit should return a definitive result in all cases.

[x] fit_need_events_PSPL should evaluate the result of run_fit, and remove a target from the list of events to process.

KKruszynska commented 5 months ago

Why are there so many opened DB connections at the end? They seem to multiply almost exponentially after each operation requiring getting data from the DB...

rachel3834 commented 5 months ago

Yes exactly! I think it's partly an unintended result of attempts to parallelize the fitting of events. Possibly some misunderstanding between the original authors. But there are a lot of inefficiencies here that we can address.

rachel3834 commented 5 months ago

We are not the only DB to have run into this issue - here's a helpful discussion from StackOverflow for reference. Also link to Django's own documentation for this.

rachel3834 commented 5 months ago

Interestingly, Django provides a CONN_MAX_AGE parameter which sets an integer number of seconds that a DB connection can be open. Historically, this defaulted to 0, meaning the connection would be closed automatically after the query was complete. It can be set to None, meaning all connections are persistent. This can be added as an open to the TOM's settings.py file in the DATABASES dictionary, but by default it isn't specified. So theoretically, by default the connections should close once complete.

However, QuerySets have a 'lazy' connection to the DB, in that the query doesn't execute until the result is needed. The implication is that once you perform a query operation, you hold open a DB connection, whereas a save operation closes the DB connection once it is complete. Hence it makes sense to minimize queries as much as possible.

rachel3834 commented 5 months ago

OK, I've restructured the fit_need_events_PSPL and related functions to implement the changes above and rationalize the DB calls. Testing this with a single event (using fit_event_PSPL, which uses the same functions underneath) leads to some interesting results:

Checkpoints after loading the Target and ReducedDatums, prior to running the model fitting code:

Fitting event: Gaia23aiy
N DB connections run_fit 1: 3
N DB connections run_fit 2: 3

Checkpoint after running the modeling code:

FIT: completed modeling process for Gaia23aiy
N DB connections run_fit 3: 3
Time taken: 0:00:19.134225

Checkpoint after storing a single ReducedDatum for the resulting model lightcurve:

FIT: Stored model lightcurve for event Gaia23aiy
N DB connections run_fit 4: 12
Time taken: 0:00:02.125479

Checkpoint after storing a set of 22 extra_fields for the target, representing the parameters of the fitted model:

FIT: Stored model parameters for event Gaia23aiy
N DB connections run_fit 5: 63
Time taken: 0:00:00.149701

So it would seem that storing these parameters like this dramatically increases the number of DB connections. Possible related to the fact that extra_fields have a one-to-many relation to a target?

rachel3834 commented 5 months ago

I've done some experiments to figure out the optimum way to retrieve a set of targets and associated extra_fields from MOP (for full information see this issue).

I've restructured fit_need_events_PSPL accordingly to minimize DB queries and optimize for time.

Modeling a single event, it would seem that the majority of the DB connections stem from writing extra_field parameters to MOP:

<snip>
FIT: completed modeling process for Gaia23aiy
N DB connections run_fit 3: 3
Time taken: 0:00:21.013689
<snip>
FIT: Stored model lightcurve for event Gaia23aiy
N DB connections run_fit 4: 12
Time taken: 0:00:02.102672
<snip>
FIT: Stored model parameters for event Gaia23aiy
N DB connections run_fit 6: 63

rachel3834 commented 5 months ago

To make this more efficient, I've implemented a superclass of Target -> MicrolensingEvent, a structure to store the querysets of both the Target and all the related database objects (Extra_Fields and ReducedDatums). The advantage to this is that we are not then repeating the same queries later on when we want to update those same objects, or duplicating DB entries. The fitting code has been re-written to take advantage of this class. The results of the checkpoints for the same single-target model fit are now:

Fitting event: Gaia23aiy
N DB connections run_fit 1: 3
N DB connections run_fit 2: 3
Time taken: 0:00:00.000168
FIT: Found 2 datasets and a total of 201 datapoints to model for event Gaia23aiy
N DB connections run_fit 3: 3
Time taken: 0:00:00.000102
FITTOOLS: established event
FITTOOLS: appended 2 telescopes
check_event  : Everything looks fine...
FITTOOLS: Set model 1, static PSPL
<snip>
FIT: completed modeling process for Gaia23aiy
N DB connections run_fit 3: 3
Time taken: 0:00:20.575143
Updating existing model  ReducedDatum object (1196923)
FIT: Stored model lightcurve for event Gaia23aiy
N DB connections run_fit 4: 4
Time taken: 0:00:00.504319
N DB connections run_fit 5: 4
Time taken: 0:00:00.001380
FIT: Stored model parameters for event Gaia23aiy
N DB connections run_fit 6: 28
Time taken: 0:00:00.098651
Total time for single target model fit: 0:00:21.820080

The overall time taken is now dominated by the actual model fitting process (20s) with the rest of the code taking about 2s.

rachel3834 commented 5 months ago

The refactored code definitely improves the number of DB connections and processing time.
However, the selection of the subset of events to be modeled can also be improved. Currently we are selecting (and preloading data for) all events with a Last_fit date prior to a cutoff - this selects almost all events. Better to query just for events with data recently added, though I need to check where this parameter is getting updated. Alternatively, could also introduce a look-back time maximum.

rachel3834 commented 5 months ago

New reviewing the run_TAP.py code, which was the other management command found to be heavy on DB traffic.

Results of checkpointing the original code for reference, running for a single object in no-go mode:

runTAP: Started with options {'verbosity': 1, 'settings': None, 'pythonpath': None, 'traceback': False, 'no_color': False, 'force_color': False, 'skip_checks': False, 'target_name': 'Gaia23aiy', 'observe': 'no-go'}
runTAP: Starting with N DB connections: 0
runTAP: Loaded event Gaia23aiy
runTAP: N DB connections 2: 3
runTAP: Took 0:00:00.027823
runTAP: identified pending observations for 5 targets: {'Gaia24amo': ['1M0-SCICAM-SINISTRO'], 'Gaia23dta': ['1M0-SCICAM-SINISTRO'], 'OGLE-2023-GD-0011': ['1M0-SCICAM-SINISTRO'], 'OGLE-2023-BLG-1322': ['1M0-SCICAM-SINISTRO'], 'Gaia23cnu': ['1M0-SCICAM-SINISTRO']}
runTAP: N DB connections 3: 3
runTAP: Took 0:00:00.354291
runTAP: analyzing event Gaia23aiy, 0 out of 1
TAP model parameters sanity check returned True
runTAP: N DB connections 4: 11
runTAP: Took 0:00:00.039250
Target post save hook: Gaia23aiy created: False
Baseline data prior to an event found.
runTAP: Gaia23aiy baseline exists - True.
runTAP: Planet priority: 2.240952885431235, 1.019003602380409
runTAP: Last datapoint: 2460233.46738
Target post save hook: Gaia23aiy created: False
runTAP: Mag now = 19.325743481824293
runTAP: Long tE priority: 75.326272 27.426145649702484
Target post save hook: Gaia23aiy created: False
runTAP: N DB connections 5: 47
runTAP: Took 0:00:01.587930
runTAP: N DB connections 6: 57
runTAP: Took 0:00:00.035505
runTAP: Event in HCZ: False
runTAP: Event alive? True
runTAP: N DB connections 7: 58
runTAP: Took 0:00:00.008956
runTAP: Event should be observed
runTAP: Event visible?  True
runTAP: mag_baseline: 19.349
Good for planet observations: False
Good for long tE observations: None
runTAP: Observing mode: Gaia23aiy None
Target post save hook: Gaia23aiy created: False
INTERFERO: Evaluating event Gaia23aiy
INTERFERO: Identified 1 stars in the neighbourhood of Gaia23aiy
INTERFERO: Calculated uncertainties Gmag=14.787789+/-0.564684442064341mag
INTERFERO: Computed JHK photometry for neighbouring stars
INTERFERO: Evaluation for interferometry for Gaia23aiy: No guide=0
Target post save hook: Gaia23aiy created: False
INTERFERO: Stored neighbouring star data in MOP
INTERFERO: Querying GSC catalog around event Gaia23aiy
INTERFERO: Compose matrix of AO and FT stars for Gaia23aiy
INTERFERO: Stored GSC search results in MOP
INTERFERO: Stored AOFT matrix in MOP
Target post save hook: Gaia23aiy created: False
runTAP: Evaluated Gaia23aiy for interferometry
runTAP: N DB connections 8: 187
runTAP: Took 0:00:04.300097
runTAP: Completed run
runTAP: N DB connections 9: 187
runTAP: Took in total0:00:06.354133

rachel3834 commented 5 months ago

So once again it appears that accessing the TargetExtras is what boosts the number of database connections, although not as dramatically as for the fitting process.

rachel3834 commented 5 months ago

After refactoring run_TAP to use the new querytools.py functions for more optimized data retrieval and storage, the output for run_TAP.py for the same target is:

runTAP: Started with options {'verbosity': 1, 'settings': None, 'pythonpath': None, 'traceback': False, 'no_color': False, 'force_color': False, 'skip_checks': False, 'target_name': 'Gaia23aiy', 'observe': 'no-go'}
runTAP: Starting with N DB connections: 0
runTAP: N DB connections post tap_list query: 1
runTAP: Time taken 0:00:00.028174
queryTools: Retrieved associated data for 1 Targets
queryTools: N DB connections: 3
queryTools: Time taken: 0:00:00.000651
queryTools: collating data on microlensing event set
queryTools: evaluating target Gaia23aiy, 0 out of 1
queryTools: Need to fit: False, reason: Outdated model; new data available
queryTools: Collated data for 1 targets in 0:00:00.596030
queryTools: N DB connections: 5
runTAP: N DB connections post creation of event objects: 5
runTAP: Time taken 0:00:00.601328
runTAP: identified pending observations for 5 targets: {'Gaia24amo': ['1M0-SCICAM-SINISTRO'], 'Gaia23dta': ['1M0-SCICAM-SINISTRO'], 'OGLE-2023-GD-0011': ['1M0-SCICAM-SINISTRO'], 'OGLE-2023-BLG-1322': ['1M0-SCICAM-SINISTRO'], 'Gaia23cnu': ['1M0-SCICAM-SINISTRO']}
runTAP: N DB connections post fetch of pending obs: 5
runTAP: Time taken 0:00:00.345212
runTAP: analyzing event Gaia23aiy, 0 out of 1
2459969.03547 <class 'float'>
0.48566 <class 'float'>
0.01705 <class 'float'>
564.94704 <class 'float'>
205.69609 <class 'float'>
12.326 <class 'float'>
TAP model parameters sanity check returned True
runTAP: N DB connections post sanity check: 5
runTAP: Time taken 0:00:00.001860
Baseline data prior to an event found.
runTAP: Gaia23aiy baseline exists - True.
runTAP: Planet priority: 2.2406688204257907, 1.0188854912570031
runTAP: Last datapoint: 2460233.46738
runTAP: Mag now = 19.325747774157772
runTAP: Long tE priority: 75.326272 27.426145649702484
runTAP: N DB connections post priority calculations: 7
runTAP: Time taken 0:00:00.329340
runTAP: Event in HCZ: False
runTAP: Event alive? 'True'
runTAP: Event should be observed
runTAP: Event visible?  True
runTAP: mag_baseline: 19.349
Good for planet observations: False
Good for long tE observations: None
runTAP: Observing mode: Gaia23aiy None
INTERFERO: Evaluating event Gaia23aiy
INTERFERO: Identified 1 stars in the neighbourhood of Gaia23aiy
INTERFERO: Calculated uncertainties Gmag=14.787789+/-0.5761444853351154mag
INTERFERO: Computed JHK photometry for neighbouring stars
INTERFERO: Evaluation for interferometry for Gaia23aiy: No guide=0
Target post save hook: Gaia23aiy created: False
INTERFERO: Stored neighbouring star data in MOP
INTERFERO: Querying GSC catalog around event Gaia23aiy
INTERFERO: Compose matrix of AO and FT stars for Gaia23aiy
INTERFERO: Stored GSC search results in MOP
INTERFERO: Stored AOFT matrix in MOP
Target post save hook: Gaia23aiy created: False
runTAP: Evaluated Gaia23aiy for interferometry
runTAP: N DB connections post obscontrol block: 107
runTAP: Time taken 0:00:02.953656
TAP_priority TAP_priority: 2.24067
TAP_priority_error TAP_priority_error: 1.01889
TAP_priority_longtE TAP_priority_longtE: 75.32627
TAP_priority_longtE_error TAP_priority_longtE_error: 27.42615
Category Category: Microlensing long-tE
Mag_now Mag_now: 19.326
Observing_mode Observing_mode: None
Sky_location Sky_location: Outside HCZ
runTAP: N DB connections post extra parameters store: 115
runTAP: Time taken 0:00:00.034056
runTAP: N DB connections post reduceddatums store: 125
runTAP: Time taken 0:00:00.023471
runTAP: Completed run
runTAP: N DB connections at end: 125
runTAP: Time taken to complete 0:00:04.317411

rachel3834 commented 5 months ago

So the number of connections is somewhat reduced, as is the time taken.

rachel3834 commented 5 months ago

Updating mop-prod and doing test runs the revised run_TAP.py code seems to run smoothly until the process is halted by the OOM. It reports 2565 events classified as microlensing and Alive, which seems credible. To minimize DB connections and optimize run time we are now prefetching the data for all these events...which puts us at risk of running out of memory, something we tend to do anyway.

Three options now that I can see:

[ ] Increase the memory allowance for the pod. It is currently set to ~1 GB.
[ ] Run run_TAP over batches of ~200 events at a time. This would impact our response time to changes on the targets.
[ ] Restructure how we decide when observations should happen. For example, we only need to re-model and evaluate a target for observations after new data come in. We could combine the functions of the data harvesters, model fitting and runTAP codes, and perform the later steps if new data are available. This is obviously a substantial re-write.

At the least, we should restructure how we time these cronjobs so that they don't run concurrently - there is no point runTAP running at the same time as fit_need_events_PSPL.

rachel3834 commented 5 months ago

Increasing the resources allocated to the pod produced the same intervention by the OOM killer, so that is not the answer.

rachel3834 commented 5 months ago

I've added more code profiling tools to MOP, including django-silk to give insights into the workload generated by the UI, and psutils throughout the key cronjobs in order to track memory usage.

Example output from fit_need_events_PSPL:

FIT_NEED_EVENTS: Starting checkpoint: 
CHECKPOINT: N DB connections: 0, memory: 652.73MiB
FIT_NEED_EVENTS: Initial queries selected 8611 events classified as microlensing, 3030 events currently Alive, 10222 events last modeled before 2460361.294979452
FIT_NEED_EVENTS: Initial target list has 2576 entries
CHECKPOINT: N DB connections: 0, memory: 655.75MiB
FIT_NEED_EVENTS: Retrieved associated data for 2576 Targets
CHECKPOINT: N DB connections: 0, memory: 655.75MiB
FIT_NEED_EVENTS: Time taken chk 2: 0:00:01.267591
<snip>
Fitting event: Gaia24anb
FIT: Found 1 datasets and a total of 2 datapoints to model for event Gaia24anb
Insufficient lightcurve data available to model event Gaia24anb
FIT_NEED_EVENTS: Completed modeling of Gaia24anb in 0:00:00.001515
CHECKPOINT: N DB connections: 0, memory: 672.94MiB
FIT_NEED_EVENTS: modeling target Gaia24ana, 7 out of 8
Fitting event: Gaia24ana
FIT: Found 1 datasets and a total of 5 datapoints to model for event Gaia24ana
Insufficient lightcurve data available to model event Gaia24ana
FIT_NEED_EVENTS: Completed modeling of Gaia24ana in 0:00:00.001498
CHECKPOINT: N DB connections: 0, memory: 672.94MiB
FIT_NEED_EVENTS: Finished modeling set of targets in 0:03:14.376116
CHECKPOINT: N DB connections: 0, memory: 672.94MiB

rachel3834 commented 5 months ago

Subsequently running the same code produces:

FIT_NEED_EVENTS: Starting checkpoint: 
CHECKPOINT: N DB connections: 0, memory: 651.62MiB
FIT_NEED_EVENTS: Initial queries selected 8611 events classified as microlensing, 598 events currently Alive, 10222 events last modeled before 2460361.2996815094
<snip>
FIT_NEED_EVENTS: evaluated target Gaia24aiz, 143 out of 144
CHECKPOINT: N DB connections: 0, memory: 716.54MiB
FIT_NEED_EVENTS: Collated data for 0 targets in 0:00:59.637903
CHECKPOINT: N DB connections: 0, memory: 716.54MiB
FIT_NEED_EVENTS: Finished modeling set of targets in 0:01:00.798403
CHECKPOINT: N DB connections: 0, memory: 716.54MiB

This implies that the workload of fitting events is reduced because of a change to the function which switches the Alive flag of each event to False when the event has expired. We were seeing a large number of "zombie" events, where a few datapoints were received some time ago but not enough to model properly. This lead to the t0, tE never being set, and so MOP didn't know whether the event was alive or not. I've added code to expire these events after 6 months if no further data is provided.

Interestingly though, this doesn't reduce the total memory allocation, since the process is prefetching the data for all of the selected targets, even if they are not required.

rachel3834 commented 5 months ago

Now profiling run_TAP:

runTAP: Started with options {'verbosity': 1, 'settings': None, 'pythonpath': None, 'traceback': False, 'no_color': False, 'force_color': False, 'skip_checks': False, 'target_name': 'all', 'observe': 'no-go'}
CHECKPOINT: N DB connections: 0, memory: 633.51MiB
runTAP: Time taken for tap_list query0:00:00.020399
CHECKPOINT: N DB connections: 0, memory: 633.51MiB
queryTools: checkpoint 1
CHECKPOINT: N DB connections: 0, memory: 633.51MiB
queryTools: Initial queries selected 8611 events classified as microlensing, 598 events currently Alive
CHECKPOINT: N DB connections: 0, memory: 633.51MiB
queryTools: Initial target list has 83 entries
queryTools: Retrieved associated data for 83 Targets
CHECKPOINT: N DB connections: 0, memory: 633.51MiB
queryTools: Time taken: 0:00:00.001113
queryTools: collating data on microlensing event set
queryTools: collated data for target Gaia23dcj, 0 out of 83
CHECKPOINT: N DB connections: 0, memory: 633.51MiB
queryTools: collated data for target Gaia24ail, 1 out of 83
CHECKPOINT: N DB connections: 0, memory: 633.51MiB
queryTools: collated data for target Gaia23dce, 2 out of 83
CHECKPOINT: N DB connections: 0, memory: 669.25MiB
<snip>
queryTools: collated data for target OGLE-2023-BLG-1308, 39 out of 83
CHECKPOINT: N DB connections: 0, memory: 931.96MiB
queryTools: collated data for target OGLE-2023-BLG-1269, 40 out of 83
CHECKPOINT: N DB connections: 0, memory: 935.57MiB
queryTools: collated data for target OGLE-2023-BLG-1285, 41 out of 83
CHECKPOINT: N DB connections: 0, memory: 939.96MiB
queryTools: collated data for target OGLE-2023-BLG-1286, 42 out of 83
CHECKPOINT: N DB connections: 0, memory: 944.08MiB
Killed

Although the number of events to consider is substantially reduced, run_TAP still falls foul of the OS' Out Of Memory killer. This may be because I am running it manually in an existing pod rather than it running in its own pod, as it usually would.

rachel3834 commented 5 months ago

Running run_TAP under the cron in its own pod suggests that its memory limits are now within spec for a single pod:

mop-runtap-28474560-f7tg2 mop CHECKPOINT: N DB connections: 0, memory: 1137.9MiB
mop-runtap-28474560-f7tg2 mop runTAP: analyzing event Gaia24aiz, 82 out of 83
mop-runtap-28474560-f7tg2 mop TAP model parameters sanity check returned False
mop-runtap-28474560-f7tg2 mop runTAP: Time taken post sanity check: 0:01:52.187927
mop-runtap-28474560-f7tg2 mop CHECKPOINT: N DB connections: 0, memory: 1137.9MiB
mop-runtap-28474560-f7tg2 mop runTAP: Completed run
mop-runtap-28474560-f7tg2 mop runTAP: Time taken to complete 0:02:11.267635
mop-runtap-28474560-f7tg2 mop CHECKPOINT: N DB connections: 0, memory: 1137.9MiB

The number of DB queries is strangely returning zero whereas this works on localhost.

rachel3834 commented 5 months ago

runTAP is now running in operation. Subsequently, during a period of MOP slow-down Jashan recognised that an external bot was performing a large number of queries of our server, causing restarts; this may have been a contributing factor.
Documented in this issue on the TOM Toolkit, since it would be a general TOM issue.

LCOGT / mop

Investigate heavy DB write traffic from MOP's cronjobs #136