AbdullahO / tspdb

tspdb: Time Series Predict DB
Apache License 2.0

Poor forecast results #19

Closed dpage closed 2 years ago

dpage commented 2 years ago

Hi

I've been experimenting with tspdb with PostgreSQL 13 & Python 3.7 (both from EDB) on macOS. I have a sample dataset that I've previously experimented with quite successfully using WaveNet models in Tensorflow. It represents a simulated number of connections to a Postgres database with multiple seasonalities, such as you might find when users log on and off throughout the day, with quiet and peak times, and fewer users at weekends - all with some noise mixed in to better emulate a real-world scenario.

My aim is to be able to predict workload into the near future based on past monitoring data. Unfortunately, I'm not getting useful predictions from tspdb.

Here's my data set: data.csv, and a rough graph of it:

[Graph: raw_data]

I've experimented with various values for the k parameter:

tspdb=# SELECT * FROM list_pindices();
       index_name       | value_columns | relation | time_column |  initial_timestamp  |   last_timestamp    | agg_interval | uncertainty_quantification 
------------------------+---------------+----------+-------------+---------------------+---------------------+--------------+----------------------------
 backends_tsp_k1        | {processes}   | backends | ts          | 2020-10-06 13:25:01 | 2020-11-02 13:35:01 | 00:05:00     | t
 backends_tsp_k10       | {processes}   | backends | ts          | 2020-10-06 13:25:01 | 2020-11-02 13:35:01 | 00:05:00     | t
 backends_tsp_k2        | {processes}   | backends | ts          | 2020-10-06 13:25:01 | 2020-11-02 13:35:01 | 00:05:00     | t
 backends_tsp_k20       | {processes}   | backends | ts          | 2020-10-06 13:25:01 | 2020-11-02 13:35:01 | 00:05:00     | t
 backends_tsp_k3        | {processes}   | backends | ts          | 2020-10-06 13:25:01 | 2020-11-02 13:35:01 | 00:05:00     | t
 backends_tsp_k4        | {processes}   | backends | ts          | 2020-10-06 13:25:01 | 2020-11-02 13:35:01 | 00:05:00     | t
 backends_tsp_k5        | {processes}   | backends | ts          | 2020-10-06 13:25:01 | 2020-11-02 13:35:01 | 00:05:00     | t
 backends_tsp_k_default | {processes}   | backends | ts          | 2020-10-06 13:25:01 | 2020-11-02 13:35:01 | 00:05:00     | t
(8 rows)

tspdb=# SELECT * FROM pindices_stat();
       index_name       | column_name | number_of_observations | number_of_trained_models |  imputation_score  |   forecast_score   | test_forecast_score 
------------------------+-------------+------------------------+--------------------------+--------------------+--------------------+---------------------
 backends_tsp_k1        | processes   |                   7779 |                        1 | 0.9358105618035297 | 0.8986683760778478 |                    
 backends_tsp_k2        | processes   |                   7779 |                        1 | 0.9543878543235768 | 0.9493582039740182 |                    
 backends_tsp_k3        | processes   |                   7779 |                        1 | 0.9587100784133898 | 0.9573324136174277 |                    
 backends_tsp_k5        | processes   |                   7779 |                        1 | 0.9658566939198344 | 0.9779691559430637 |                    
 backends_tsp_k10       | processes   |                   7779 |                        1 |  0.978661078459557 | 0.9851500580756463 |                    
 backends_tsp_k4        | processes   |                   7779 |                        1 | 0.9624307902466809 | 0.9603959478873095 |                    
 backends_tsp_k20       | processes   |                   7779 |                        1 | 0.9942970665081167 | 0.9934041036452561 |                    
 backends_tsp_k_default | processes   |                   7779 |                        1 | 0.9543878543235768 | 0.9493582039740182 |                    
(8 rows)

The results I get look as follows. Each graph shows the actual data and the predicted data, but note that on some of them the scale changes radically so the actual data may look like a straight line:

[Graphs: k_default, k_1, k_3, k_4, k_5, k_10, k_20]

Do you have any suggestions as to why this may be happening, and how I might get a reasonable prediction as I do with Tensorflow?

Thanks!

AbdullahO commented 2 years ago

Thanks @dpage for reporting this.

One thing we thought about when designing tspDB is whether we can come up with good default parameters that work across many datasets, and I think here the default choice for the "col-to-row-ratio" is failing us.

For some context, tspDB uses mSSA (multivariate singular spectrum analysis) as its core algorithm. The algorithm's main subroutine constructs a "stacked Page matrix", performs an SVD, and then maintains the top k singular components of that SVD. There are two parameters that affect the performance:

  1. k (the number of singular components to keep)
  2. col-to-row-ratio: which controls the shape of the Page matrix, e.g., when this is set to 1, you get a square matrix.

Changing the second parameter to "2" drastically improves the performance. The default is 5, which is good when you have more data (something like ~100k total observations across all time series, e.g., 100 time series each with 1k observations). I have experimented a bit with your data here: https://colab.research.google.com/drive/1eTzVZfSoh_VHVjPQ_xD8jn_xFwEfrR3e?usp=sharing
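For anyone following along, here is a minimal sketch of that core step (a plain NumPy illustration, not tspDB's actual implementation; in particular, the way I map col-to-row-ratio onto the matrix shape is my reading of the description above):

```python
import numpy as np

def page_matrix(series, col_to_row_ratio=5):
    """Stack a 1-D series into a Page matrix of non-overlapping segments.

    With ratio r the matrix has roughly L rows and r*L columns, so r=1
    gives a (nearly) square matrix. Any trailing remainder is dropped.
    """
    series = np.asarray(series, dtype=float)
    T = len(series)
    L = int(np.sqrt(T / col_to_row_ratio))      # number of rows
    n_cols = T // L                             # number of full columns
    return series[:L * n_cols].reshape(n_cols, L).T

def rank_k_approx(series, k=3, col_to_row_ratio=5):
    """SVD the Page matrix, keep the top k singular components, flatten back."""
    P = page_matrix(series, col_to_row_ratio)
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    s[k:] = 0                                   # truncate the spectrum to rank k
    return ((U * s) @ Vt).T.reshape(-1)         # back to a 1-D series

# e.g. denoise a noisy daily sine wave (288 samples per period)
t = np.arange(20_000)
noisy = np.sin(2 * np.pi * t / 288) + 0.1 * np.random.randn(t.size)
smoothed = rank_k_approx(noisy, k=2, col_to_row_ratio=1)
```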

dpage commented 2 years ago

Thanks for the assistance @AbdullahO.

I've experimented a little myself, and the best results seem to come from k=5, row_to_column_ratio=2 or k=3, row_to_column_ratio=2. This is what I got:

[Graph: k=5, row_to_column_ratio=2]

[Graph: k=3, row_to_column_ratio=2]

k=10, row_to_column_ratio=2 also gave results that followed the basic pattern, but at half the original frequency:

[Graph: k=10, row_to_column_ratio=2]

Unfortunately this leaves me with a couple of problems:

1) With k=5, row_to_column_ratio=2, there's clearly an amplitude problem. My aim is to detect unexpected load on the system in terms of number of users (or lack thereof) or other metrics, so it's critical for me that the amplitude is roughly correct.

2) If the tuning parameters are so data-dependent, it becomes much more difficult to build a system that can self-tune/manage. I can envisage a system that continually monitors its own performance as new data is received and makes appropriate adjustments, but with the current API (and, I'm assuming, the way pindices are stored internally), that would require completely rebuilding indices if it's determined that better results would be achieved with a parameter change.
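To illustrate the kind of self-monitoring I have in mind, here is a rough sketch. The connection string and the candidate-selection logic are made up; only pindices_stat() and its columns are taken from the output above, and actually switching parameters would still mean rebuilding the pindex, which is exactly the expensive part:

```python
import psycopg2

# placeholder connection details
conn = psycopg2.connect("dbname=tspdb")

with conn.cursor() as cur:
    # pick the candidate pindex with the best forecast score so far
    cur.execute(
        "SELECT index_name, forecast_score FROM pindices_stat() "
        "WHERE column_name = 'processes' "
        "ORDER BY forecast_score DESC LIMIT 1"
    )
    best_name, best_score = cur.fetchone()

    if best_name != 'backends_tsp_k_default':
        # a real self-tuning system would now drop and rebuild the "live"
        # pindex with the winning parameters - the expensive step noted above
        print(f"better parameters found: {best_name} ({best_score:.4f})")
```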

For reference, this is what I'm using as my target benchmark, though I'm not expecting to get this level of accuracy in real time. This was generated using a WaveNet model in Tensorflow, trained on 75% of the same dataset, with the remaining 25% used for validation. Of course, the reason I'm interested in tspDB is the ability to avoid having to re-train the model whenever new metrics arrive (which could be every few seconds), which is horribly expensive.

[Graph: WaveNet benchmark forecast]

Thanks!

AbdullahO commented 2 years ago

For the second problem, I have noted from experience that as you get more data, the sensitivity to the parameters decreases.

The WaveNet results look impressive. One question about that: are these results generated using

  1. a one-step-ahead forecast? i.e., when forecasting the load at time t, do you use actual observations up to t-1?
  2. a multi-step-ahead forecast, where only the information up to the end of the training period is used, regardless of what time you are predicting?

The latter is much harder, and I was wondering if we're comparing apples to apples here.
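To make the distinction concrete, here is a small illustrative sketch of the two evaluation schemes with a generic one-step model (nothing here is specific to WaveNet or mSSA):

```python
import numpy as np

def one_step_eval(model_step, series, t_train):
    """One-step ahead: every forecast sees the actual history up to t-1."""
    return np.array([model_step(series[:t]) for t in range(t_train, len(series))])

def multi_step_eval(model_step, series, t_train):
    """Multi-step ahead: nothing after the training cut-off is ever observed."""
    history = list(series[:t_train])
    preds = []
    for _ in range(len(series) - t_train):
        y_hat = model_step(np.array(history))
        preds.append(y_hat)
        history.append(y_hat)          # feed the prediction back in
    return np.array(preds)

# model_step is any callable mapping a history array to the next value,
# e.g. a trivial "repeat the last point" baseline:
naive = lambda hist: hist[-1]
```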

dpage commented 2 years ago

> For the second problem, I have noted from experience that as you get more data, the sensitivity to the parameters decreases.

OK; I don't have any real-world data that spans a longer time period right now, but I can probably tweak my load generator code (https://github.com/EnterpriseDB/pgworkload) to generate raw numbers rather than actual PostgreSQL connections, in order to produce a larger version of this dataset with the same seasonalities. I'll look into that.

> The WaveNet results look impressive. One question about that: are these results generated using

Disclaimer: I'm a relative newbie to machine learning (PostgreSQL is my core area of work), so take my answer below with a largish pinch of salt:

>   1. a one-step-ahead forecast? i.e., when forecasting the load at time t, do you use actual observations up to t-1?
>   2. a multi-step-ahead forecast, where only the information up to the end of the training period is used, regardless of what time you are predicting?

In a nutshell, WaveNet is a deep learning neural network that has multiple hidden 1D convolutional layers with an increasing dilation rate. Practically speaking, this means that each layer can learn a different seasonality frequency in the data; one might pick up monthly, one weekly, one daily, etc. Data is learned by walking through the values in each layer, examining X values at once (where X is the kernel size), and then moving forward Y values (where Y is the stride). On each iteration, the next value is calculated.

So when we perform a prediction, I think what is actually happening under the hood is a one-step ahead forecast (following the pattern learnt), that's then repeated using the previously predicted values as we move forwards.
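For reference, a minimal sketch of the kind of dilated Conv1D stack being described (a simplified stand-in, not the actual benchmark model; the window size, filter counts and dilation schedule here are illustrative):

```python
import tensorflow as tf

def simple_wavenet(window=2016, filters=32, kernel_size=2,
                   dilations=(1, 2, 4, 8, 16, 32)):
    """Causal Conv1D layers with doubling dilation rates.

    Each doubling lets a layer see patterns at a coarser time scale, which
    is how the stack can pick up several seasonalities at once.
    """
    inputs = tf.keras.Input(shape=(window, 1))
    x = inputs
    for d in dilations:
        x = tf.keras.layers.Conv1D(filters, kernel_size, padding="causal",
                                   dilation_rate=d, activation="relu")(x)
    outputs = tf.keras.layers.Conv1D(1, 1)(x)   # next-value prediction per step
    return tf.keras.Model(inputs, outputs)

model = simple_wavenet()
model.compile(optimizer="adam", loss="mse")
# Forecasting further ahead then works as described above: predict one step,
# append the prediction to the input window, and repeat.
```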

> The latter is much harder, and I was wondering if we're comparing apples to apples here.

Oh, I'm pretty sure we're not! I'm fully expecting WaveNet to be far more accurate than a more algorithmic approach, but the need to train (and retrain) makes it impractical for what I'm trying to do. I have no problem trading off accuracy for the ability to work in realtime, but as you can imagine there is a point at which the prediction becomes inaccurate enough that it's not useful.

dpage commented 2 years ago

Following on from my previous comment, I generated some test data similar to the previous set: one sample every 5 minutes over a one-year period, resulting in 104832 data points. I used a sine wave to emulate daily activity, with Saturday reduced to 70% and Sunday to 50% of the original value to add weekly seasonality, and then added random noise. It was easiest to just whip together a script, included here if you want to play with it:

import numpy as np
import matplotlib.pyplot as plt

resolution = 7 * 24 * 12 # 5 minutes, over a week

sat_start = 1441
sun_start = 1729

length = np.pi * 2 * 7 # 7 days in a week
week = np.sin(np.arange(0, length, length / resolution))

# Adjust for weekends
with np.nditer(week, flags=['f_index'], op_flags=['readwrite']) as it:
    for d in it:
        if it.index >= sun_start:
            d[...] = d * 0.5
        elif it.index >= sat_start:
            d[...] = d * 0.7

# Stretch this out to a year
year = week

for w in range(1, 52):
    year = np.append(year, week)

# Add some noise
noise = np.random.normal(0, .1, year.shape)
year = year + noise

# Scale to ~100 users
year = year * 50
year = year + 70 # 50 to lift the sine above zero, plus 20 headroom so noise doesn't take us negative

# Integers, as these are user counts
year = year.astype(int)

x = 0
for y in year:
    print("{},{}".format(x, y))
    x = x + 1

This gave better results in tspDB, but I still found I needed to tweak the parameters. Unfortunately, it wasn't able to reproduce the seasonality either.

[Graph: default parameters]

[Graph: k=3, row_to_column_ratio=2]

[Graph: k=5, row_to_column_ratio=2]

AbdullahO commented 2 years ago

@dpage this is interesting, I had expected it to work in this case. Can you share with me a notebook where I can reproduce these plots? I want to play around with it a little bit and see if it is something that can be fixed.

dpage commented 2 years ago

Sure: https://colab.research.google.com/drive/1d2ACPJ_un0bXHHlEMPxg31NUR9DOCArw

AbdullahO commented 2 years ago

Thanks @dpage for the interesting example!

It seems that indeed the mSSA algorithm does not perform well for this specific signal. It decides to fit one of the harmonics and ignore the others (e.g., it only repeats the last seen sine wave). While in theory this should work, as our model supports periodic signals, I was not quite sure why it does not in practice.

After investigating further, the problem is that the period (1 week ~ 2000 observations) is too large compared to the length of the signal (52 weeks), so in order to build a good predictor with mSSA you would need a (ridiculous) amount of data, something like ~4000 weeks. This is due to a design decision we made in the algorithm, but it is fixable at the expense of computational performance. Basically, it comes down to using what is called the Page matrix rather than the Hankel matrix traditionally used in mSSA.
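For anyone curious about that last distinction, a small sketch: for a series of length T and window length L, the Page matrix uses non-overlapping segments (about T/L columns) while the Hankel matrix uses every overlapping window (T - L + 1 columns), so the Hankel form extracts far more columns from the same amount of data when L is as large as a full week:

```python
import numpy as np

def page_matrix(series, L):
    """Non-overlapping L-length segments as columns: about T/L of them."""
    series = np.asarray(series)
    n_cols = len(series) // L
    return series[:L * n_cols].reshape(n_cols, L).T

def hankel_matrix(series, L):
    """Every overlapping L-length window as a column: T - L + 1 of them."""
    series = np.asarray(series)
    return np.column_stack([series[i:i + L] for i in range(len(series) - L + 1)])

x = np.arange(10_000)                 # ~5 weeks of 5-minute samples
print(page_matrix(x, 2016).shape)     # (2016, 4)    -> only 4 columns
print(hankel_matrix(x, 2016).shape)   # (2016, 7985) -> thousands of columns
```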

I'm thinking of incorporating the Hankel mSSA, but I don't think this will happen anytime soon given my current schedule. However, I do have some suggestions as to what to do if you want to use tspDB as it currently stands. What I suspect would work is the following:

  1. Divide your data into multiple time series, where each one represents one day of the week (7 columns, 24*12 rows per week).
  2. Then build the prediction index on all the columns, and see if that would improve the predictions. I suspect that it will.

Of course this is suboptimal, as you would need to record your data in this unusual format, where rows are not inserted all at once. But if in your operation it is okay to insert and predict the data weekly instead of in real time, this could work; e.g., every Monday you forecast the next 2 weeks.
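For what it's worth, a hedged sketch of that reshaping in pandas (the column names and the lack of a CSV header are assumptions about data.csv, purely for illustration):

```python
import pandas as pd

# assumed layout: a 5-minute timestamp column and a connection count
df = pd.read_csv("data.csv", names=["ts", "processes"], parse_dates=["ts"])

df["week"] = df["ts"].dt.isocalendar().week.astype(int)
df["day"] = df["ts"].dt.day_name()
df["slot"] = df["ts"].dt.hour * 12 + df["ts"].dt.minute // 5   # 0..287 within a day

# one column per day of the week, 24*12 rows per week
wide = df.pivot_table(index=["week", "slot"], columns="day", values="processes")
print(wide.head())
```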

dpage commented 2 years ago

Thanks for the suggestion @AbdullahO. I've been playing with this a little as suggested, and the first thing I notice is that it seems to be orders of magnitude slower with a multi-column index - to the extent that using the previous test data, split into columns for each day, seems to time out when creating the pindex on Colab. On my laptop (a pretty high-end MacBook Pro) it took a good 30 minutes or more, as opposed to a couple of seconds for the single column used previously.

Predictions seem to be similarly much slower, though that is partly because I now need to predict each value individually rather than over a range (that could probably be improved with a PL/Python function, but it would be difficult in plain SQL).

I've shared a colab here if you want to take a look: https://colab.research.google.com/drive/1hX4LR7LgqiS_sIFg_k-AbUZCSSon_h0Z?usp=sharing, though I've yet to get it to successfully create the pindex (the same code works on my laptop).

AbdullahO commented 2 years ago

Hmm, it seems that this slowdown is due to using timestamps. There is an aggregation process in place to overcome cases where time intervals are not regular. Two problems here:

  1. I should disable this process if time intervals are regular.
  2. I should investigate why the aggregation is this slow; it should not be. When I replace the timestamps with integer indices, the index builds very fast (a couple of seconds, as before; see the sketch below). I'll mark this down as a bug and investigate it further.
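In the meantime, the workaround is easy to apply before loading the data; a small sketch (the file and column names are the illustrative ones used earlier, and it assumes observations are already regularly spaced):

```python
import pandas as pd

df = pd.read_csv("data.csv", names=["ts", "processes"], parse_dates=["ts"])
df = df.sort_values("ts").reset_index(drop=True)

# replace the timestamp with its ordinal position; with a strictly regular
# 5-minute interval this loses no information and avoids the aggregation step
df["t"] = range(len(df))
df[["t", "processes"]].to_csv("data_indexed.csv", index=False)
```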

Another potential problem in the current setup is that you have far too many missing values (6/7), which will affect the final predictions. One way to overcome this is to use the rather unusual schema below, where the time index only represents the time of day:

| timeindex | Sat | Sun | Mon | Tue | Wed | Thu | Fri |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | value | value | value | value | value | value | value |
| 2 | value | value | value | value | value | value | value |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 12*24 + 1 | value | value | value | value | value | value | value |

That way there are no nulls. Hopefully this data format is not considered an abomination; I guess it could be set up as a view over another, more appropriate schema.

dpage commented 2 years ago

Ah, yes - changing to a numeric time index does solve the speed issue.

Unfortunately the suggested data format isn't something I think is practical to use (and yes, as a database guy I do think it's an abomination 🤣). My use case is database monitoring, and monitoring databases often contain many GBs of data that's already in single-column form, with observations ranging in frequency from a few seconds to a day or so. Splitting that into individual columns for each day is a push at best, but is feasible with a relatively simple query. Adding a row-wise coalesce in batches of daily observations is possible with a more complex (and more expensive) query, but reconstructing the original timeline from that data is even more complex, and I think pushes this idea well beyond the realms of what's feasible for me.

FWIW, after looking at numerous different options, at the moment Prophet is looking like the best option for me. It's extremely fast to train and to predict, and I get good results with my test datasets. Whether it's fast enough to allow model updates in real time without excessively affecting overall system performance (which is where I think tspDB would excel) remains to be seen.
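For anyone comparing, a minimal sketch of the Prophet usage in question (the ds/y column names are Prophet's required convention; treating the test data as a CSV with a 5-minute timestamp column is an assumption for illustration):

```python
import pandas as pd
from prophet import Prophet   # packaged as "fbprophet" in older releases

# Prophet expects a datetime column "ds" and a value column "y"
df = pd.read_csv("data.csv", names=["ds", "y"], parse_dates=["ds"])

m = Prophet()                 # weekly/daily seasonality handled automatically
m.fit(df)

# forecast the next day at the same 5-minute resolution
future = m.make_future_dataframe(periods=288, freq="5min")
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```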

Ultimately it does seem like the issue for me is that my use case is simply not suited to tspDB, due to the frequency of the observations as you've suggested.

Thank you for your help @AbdullahO.