Fix/add event filters to search subquery

nhoening commented 8 months ago

Querying timely beliefs data from the database takes way longer than it should.

Digging in, we found that a subquery for selecting only the latest belief was missing some event_start filtering. This meant that the subquery was applied to all data, probably multiple times over, when it suffices to only check the same data subset that the main query is interested in.
We also found a problem which POstgres had with data types. We gave it a datetime row plus a Python timedelta on one side of an inequality. It seems this lead to problems in executing the query efficiently. Postgres or SQLAlchemy were basically struggling to execute it at all (not sure who exactly struggled). When we re-ordered the inequalities to move interpretation of Python datetime/timedelta calculations before SQLAlchemy's query interpretation, we saw speed improve again.

Next to improving the where criteria for the subquery, we also looked into adding another index, see #167.

The event start filtering can improve the query time by a factor of around 4 (if the sensor has sufficient data, that's the time-scaling factor). Solving the data type issue led to another huge improvement.

This is the query as it is happening before this PR:

SELECT timed_belief.event_start, timed_belief.belief_horizon, timed_belief.source_id, timed_belief.cumulative_probability, timed_belief.event_value 
FROM timed_belief
JOIN data_source ON data_source.id = timed_belief.source_id
JOIN (SELECT timed_belief.event_start AS event_start, timed_belief.source_id AS source_id,  
           min(timed_belief.belief_horizon) AS most_recent_belief_horizon 
           FROM timed_belief
           JOIN data_source ON data_source.id = timed_belief.source_id 
           WHERE timed_belief.sensor_id = :sensor_id_1 GROUP BY timed_belief.event_start, timed_belief.source_id
          ) AS anon_1
ON timed_belief.event_start = anon_1.event_start AND timed_belief.source_id = anon_1.source_id
AND timed_belief.belief_horizon = anon_1.most_recent_belief_horizon 
WHERE timed_belief.sensor_id = :sensor_id_2 AND timed_belief.event_start + :event_start_1 > :param_1
AND timed_belief.event_start < :event_start_2

(the naming of the parameters is confusing - event_start_1 is an horizon, I believe, and param_1 a datetime)

The subquery anon_1 is not applying the event_start time window so we'll add timed_belief.event_start + :event_start_1 > :param_1 AND timed_belief.event_start < :event_start_2 to its where clause, as well.

Here is the code I used for timing (could be rewritten to only work in timely-beliefs) within flexmeasures shell:

from flexmeasures.data.models.time_series import TimedBelief

from datetime import datetime, timedelta
import pytz
import timeit
sensor=5
from_time = datetime(2023, 8, 10, tzinfo=pytz.UTC)
to_time = from_time + timedelta(days=2)
start_time = datetime.now()
TimedBelief.search(sensors=[sensor], event_starts_after=from_time, event_ends_before=to_time)
time=datetime.now()-start_time; print(f"This took: {time.seconds}.{time.microseconds} seconds.")
print("----")

nhoening commented 8 months ago

Thanks! Can you do some light testing in the FlexMeasures UI, and maybe a schedule?

Flix6x commented 8 months ago

I had. It looks like this is the improvement we have been looking for for quite some time.

SeitaBV / timely-beliefs

Fix/add event filters to search subquery #166