LCOGT / mop

Microlensing Observation Portal

Compare retrieval times for separate ReducedDatums versus timeseries ReducedDatum #133

Closed rachel3834 closed 5 months ago

rachel3834 commented 5 months ago

Since switching all of MOP's timeseries photometry from separate ReducedDatums to timeseries arrays will be quite an invasive change, it would be good to first test what difference this would make to the speed of retrieval.

rachel3834 commented 5 months ago

Added a management command for testing purposes, called test_lc_retrieval.py.
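
For context, a minimal sketch of what such a timing harness looks like (the model names are from the TOM Toolkit; the test target name and the stand-in for the repackaging step are assumptions, not the actual contents of test_lc_retrieval.py):

```python
from datetime import datetime

from django.core.management.base import BaseCommand
from tom_dataproducts.models import ReducedDatum
from tom_targets.models import Target


class Command(BaseCommand):
    help = 'Time retrieval of a test target lightcurve (sketch)'

    def handle(self, *args, **options):
        t1 = datetime.utcnow()
        target = Target.objects.get(name='TEST1')   # test target name assumed
        t2 = datetime.utcnow()
        self.stdout.write(f'Time to retrieve single target: {t2 - t1}')

        qs = ReducedDatum.objects.filter(target=target, data_type='photometry')
        t3 = datetime.utcnow()
        self.stdout.write(f'Time to retrieve filtered ReducedDatums {t3 - t2}')

        # The real command repackages qs into a numpy array here (e.g. via
        # fittools.repackage_lightcurve); list(qs) stands in to force the
        # lazy queryset to actually run its SQL query
        datums = list(qs)
        t4 = datetime.utcnow()
        self.stdout.write(f'Time to repackage the lightcurve {t4 - t3}')
```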

Results of timing tests, made on a localhost test copy of the MOP DB (which has only a fraction of the full MOP DB's data):

First run:

```
Time to retrieve single target: 0:00:00.027339
Time to retrieve filtered ReducedDatums 0:00:00.000615
Time to repackage the lightcurve 0:00:00.019424
Found 0 stored Datums for TEST1
/data/software/mop_venv/lib/python3.10/site-packages/django/db/models/fields/__init__.py:1595: RuntimeWarning:
DateTimeField ReducedDatum.timestamp received a naive datetime (2024-02-13 20:02:03.532130) while time zone support is active.
Time to store timeseries in array format 0:00:00.192346
Time to retrieve timeseries in array format 0:00:00.000310
Time to repackage timeseries in array format 0:00:00.344075
```

Second run:

```
Time to retrieve single target: 0:00:00.027522
Time to retrieve filtered ReducedDatums 0:00:00.000573
Time to repackage the lightcurve 0:00:00.019359
Found 1 stored Datums for TEST1
Time to store timeseries in array format 0:00:00.268346
Time to retrieve timeseries in array format 0:00:00.000463
Time to repackage timeseries in array format 0:00:00.346806
```
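
As an aside, the RuntimeWarning in the first run is Django flagging a naive datetime assigned to ReducedDatum.timestamp. Assuming the test command builds timestamps with datetime.utcnow(), the usual fix is Django's timezone-aware helper:

```python
from django.utils import timezone

# Timezone-aware, unlike datetime.utcnow(), so a DateTimeField accepts it
# without warning when USE_TZ is active
timestamp = timezone.now()
```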

rachel3834 commented 5 months ago

So the total time to retrieve and repackage individually-stored ReducedDatums is ~0.02 s, versus ~0.344 s for the timeseries array; in the array case, however, the majority of the time lies in repackaging the data into numpy array format rather than in the retrieval itself.

The ratio of (time to retrieve the array) / (time to retrieve the separate datapoints) is 0.5 and 0.8 in the two runs respectively, so on raw retrieval alone the array is the faster method. In addition to speed, it may also have the benefit of reducing the total load of DB query traffic, since we retrieve fewer records overall.

But the major speed gain is to be had in improving the repackaging of the data. Currently, ReducedDatum.value is stored as a JSON blob, so the input must be JSON serializable. Numpy arrays are not, so we normally output everything to a dictionary of 3 lists (time, mag, mag_error) before ingest. This then has to be rebuilt into a 3-column array before processing.
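
For concreteness, a sketch of that ingest-side constraint (the column names follow the three lists mentioned above; the data here are placeholders):

```python
import json

import numpy as np

# Hypothetical (N, 3) lightcurve: columns are time, mag, mag_error
lightcurve = np.random.default_rng(0).random((1000, 3))

# json.dumps(lightcurve) raises TypeError, since ndarrays are not JSON
# serializable, so each column is dumped to a plain Python list before ingest:
value = {
    'time': lightcurve[:, 0].tolist(),
    'mag': lightcurve[:, 1].tolist(),
    'mag_error': lightcurve[:, 2].tolist(),
}
payload = json.dumps(value)  # now serializes cleanly for ReducedDatum.value
```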

rachel3834 commented 5 months ago

Rather than separate out the columns of the timeseries array, what happens if we just store the 2D array whole, in list format?

```
Time to retrieve single target: 0:00:00.027485
Time to retrieve filtered ReducedDatums 0:00:00.000614
Time to repackage the lightcurve 0:00:00.019270
Found 0 stored Datums for TEST1
Time to store timeseries in array format 0:00:00.187859
Time to retrieve timeseries in array format 0:00:00.000308
Time to repackage timeseries in array format 0:00:00.087014
```

This new repackaging approach...

```python
new_array = np.array(qs[0].value['timeseries'])
```

...is considerably faster than the previous one...

```python
ndp = len(qs[0].value['mag'])
new_array = np.zeros((ndp, 3))
new_array[:, 0] = qs[0].value['time']
new_array[:, 1] = qs[0].value['mag']
new_array[:, 2] = qs[0].value['mag_error']
```

But strangely it is not competitive with the repackaging of individual datapoints, as performed by fittools.repackage_lightcurve. There, the code performs a for loop over all datapoints in a queryset and accumulates a 2D list, which is then converted into an array. It's not clear to me why, since the approach above effectively skips the accumulation phase and just does the conversion back to an array, which has to happen regardless.
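
For reference, a paraphrased sketch of that accumulate-then-convert pattern (the real fittools.repackage_lightcurve, and the key names in each datum's value, may differ):

```python
import numpy as np


def repackage_lightcurve(qs):
    """Accumulate a 2D list from a queryset of per-point ReducedDatums,
    then convert it to an array in one step (sketch; key names assumed)."""
    data = []
    for rd in qs:
        data.append([rd.value['time'], rd.value['magnitude'], rd.value['error']])
    return np.array(data)
```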

Interestingly, if I first extract the first entry in the queryset and time that separately...

```python
rd = qs[0]
```

```
Time to extract first queryset entry: 0:00:00.101647
```

...this seems to take the majority of the retrieval time.

If I then unpack the timeseries array, it takes:

```
Time to repackage timeseries in array format 0:00:00.000119
```
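
Spelled out, the split timing looks roughly like this (a sketch; qs is the queryset of timeseries datums built earlier, and the comments reflect Django's lazy queryset evaluation):

```python
from datetime import datetime

import numpy as np


def time_array_retrieval(qs):
    """Split timing for the array-format case (sketch)."""
    t1 = datetime.utcnow()
    rd = qs[0]  # lazy queryset evaluates here: SQL query + JSON deserialization
    t2 = datetime.utcnow()
    print(f'Time to extract first queryset entry: {t2 - t1}')

    new_array = np.array(rd.value['timeseries'])  # pure in-memory conversion
    t3 = datetime.utcnow()
    print(f'Time to repackage timeseries in array format {t3 - t2}')
    return new_array
```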

rachel3834 commented 5 months ago

This suggests that iterating over a queryset actually takes quite a lot of time (Django querysets are lazy, so the underlying DB query and deserialization of each row's JSON value only happen when the queryset is first evaluated). So I conclude that it would be more efficient for us to store timeseries data as a single 2D array, if we continue to use the TOM's default format of JSON dictionaries.
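
A sketch of what that storage pattern could look like (field usage follows the TOM Toolkit's ReducedDatum model; the 'timeseries' key matches the test above, while the source_name label is an assumption):

```python
import numpy as np
from django.utils import timezone
from tom_dataproducts.models import ReducedDatum


def store_timeseries(target, lightcurve: np.ndarray) -> ReducedDatum:
    """Store a whole (N, 3) lightcurve as a single ReducedDatum (sketch)."""
    return ReducedDatum.objects.create(
        target=target,
        data_type='photometry',
        source_name='MOP',          # assumed label
        timestamp=timezone.now(),   # aware datetime avoids the warning above
        value={'timeseries': lightcurve.tolist()},
    )
```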

rachel3834 commented 5 months ago

Closing this as investigation is complete.