SuperDuperDB / superduperdb

🔮 SuperDuperDB: Bring AI to your database! Build, deploy and manage any AI application directly with your existing data infrastructure, without moving your data. Including streaming inference, scalable model training and vector search.
https://superduperdb.com
Apache License 2.0

[FEATURE REQUEST]: #1769

Closed: makkarss929 closed this issue 1 month ago

makkarss929 commented 4 months ago

Contact Details

makkarss929@gmail.com

Feature Description

Darts is a time series forecasting framework. We could integrate it with SuperDuperDB, since there is currently no time series forecasting framework supported.

Why Darts?

  1. The forecasting models can all be used in the same way, using fit() and predict() functions, similar to scikit-learn (see the sketch after this list).
  2. Low learning curve.
  3. The library also makes it easy to backtest models, combine the predictions of several models, and take external data into account.
  4. Darts supports both univariate and multivariate time series and models.
  5. Darts also offers extensive anomaly detection capabilities. For instance, it is trivial to apply PyOD models on time series to obtain anomaly scores, or to wrap any of Darts forecasting or filtering models to obtain fully-fledged anomaly detection models.
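
To illustrate point 1, here is a minimal sketch of the Darts fit()/predict() workflow, adapted from the Darts quickstart (not SuperDuperDB-specific):

# Minimal sketch of the Darts fit()/predict() workflow, adapted from the Darts
# quickstart; not SuperDuperDB-specific.
from darts.datasets import AirPassengersDataset
from darts.models import ExponentialSmoothing

series = AirPassengersDataset().load()      # monthly air-passenger counts as a TimeSeries
train, val = series[:-36], series[-36:]     # hold out the last 36 months

model = ExponentialSmoothing()
model.fit(train)                            # same pattern as scikit-learn
forecast = model.predict(len(val))          # forecast the held-out horizon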

Use Case Description

It will help many companies that consume time series data and build applications on top of it.

Organization

Weather forecasting, e-commerce, social media, supply and demand planning. Time series forecasting is used by companies of all sizes.

Who are the stake-holders?

No response

anitaokoh commented 4 months ago

Hey @makkarss929 ,

Thank you for reaching out.

Yes, Darts does look promising based on our quick scan.

An integration idea makes sense as well.

However, at the moment we do not support time dimensions in our data layer.

Theoretically, a function to "initialize" a table or a collection as a time-series object would need to be defined. Some customizations to the .predict API call would also be required.

How do you plan on integrating with our framework?

makkarss929 commented 4 months ago

It’s simple. We can add a time dimension to our data layer.

In Darts, we need two things, time and value, in a table or collection to specify a series. Darts will handle parsing and sorting by time for us.
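
For example, (time, value) records from a table or collection could be converted into a Darts TimeSeries via pandas. A rough sketch, where the docs list and its field names ("time", "value") are assumptions for illustration, not an existing SuperDuperDB API:

# Sketch: convert (time, value) records from a table/collection into a darts.TimeSeries.
# The `docs` list and its field names ("time", "value") are assumptions for illustration.
import pandas as pd
from darts import TimeSeries

docs = [
    {"time": "2024-01-03", "value": 12.0},   # records may arrive out of order
    {"time": "2024-01-01", "value": 10.0},
    {"time": "2024-01-02", "value": 11.0},
]

df = pd.DataFrame(docs)
df["time"] = pd.to_datetime(df["time"])
df = df.sort_values("time")                  # ensure chronological order

series = TimeSeries.from_dataframe(df, time_col="time", value_cols="value")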

There will be different parameters for different models.

  1. Generally, deep learning models need input_chunk_length and output_chunk_length when the model is initialized.
  2. .fit() needs epochs, a series, and covariates.
  3. .predict() needs n (the number of forecast steps) and covariates.

NOTE: covariates are variables that are not themselves being forecast but that help with forecasting, such as day, week, month, temperature, or sales. They can be anything that helps the forecast.
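
For example, calendar covariates such as year and month can be generated from the target series' time index, roughly as the Darts RNN tutorial builds its future covariates (here `series` is assumed to be the target TimeSeries, e.g. the monthly air-passenger series):

# Rough sketch of building calendar covariates, as in the Darts RNN tutorial.
# `series` is assumed to be the target darts.TimeSeries (e.g. monthly air passengers).
from darts.utils.timeseries_generation import datetime_attribute_timeseries

covariates = datetime_attribute_timeseries(series, attribute="year", one_hot=False)
covariates = covariates.stack(
    datetime_attribute_timeseries(series, attribute="month", one_hot=False)
)
covariates = covariates.astype("float32")  # match the model's expected dtype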


See the LSTM example below:

from darts.models import RNNModel  # Darts wrapper around PyTorch RNN/LSTM/GRU models

my_model = RNNModel(
    model="LSTM",
    hidden_dim=20,
    dropout=0,
    batch_size=16,
    n_epochs=300,
    optimizer_kwargs={"lr": 1e-3},
    model_name="Air_RNN",
    log_tensorboard=True,
    random_state=42,
    training_length=20,
    input_chunk_length=14,
    force_reset=True,
    save_checkpoints=True,
)

# train_transformed / val_transformed are (scaled) darts TimeSeries splits of the
# target series; `covariates` is the calendar-covariates series built above.
my_model.fit(
    train_transformed,
    future_covariates=covariates,
    val_series=val_transformed,
    val_future_covariates=covariates,
    verbose=True,
)

pred_series = my_model.predict(n=26, future_covariates=covariates)  # n = number of steps to forecast

There are many models, and Darts has very neat and clear documentation for all of them, similar to scikit-learn.

  1. We can start with simple models that have fewer parameters.
  2. Later we can add deep learning models.

What are your thoughts? @anitaokoh

blythed commented 4 months ago

Hi @makkarss929, it's a great idea to potentially add a time dimension to the Datalayer, but how would you do this concretely?

Currently, when we do predictions, we use single data points. So the documents look like this:

{
    "input_data": [0, 1, 3, 6],
    "_outputs": {"input_data": {"my_model": {"0": <output-of-model>}}}
}

However with time-series, I would think you have multiple inputs relevant to a prediction. How would you handle that?

makkarss929 commented 4 months ago

Hi @blythed, we can do something like this; see the example below:

{
    "input_data": [
        {"time": "2024-01-01", "values": [0, 1, 3, 6]},
        {"time": "2024-01-02", "values": [1, 2, 4, 7]},
        {"time": "2024-01-03", "values": [2, 3, 5, 8]}
    ],
    "_outputs": {
        "my_model": {
            "2024-0-01": <output-of-model-at-2024-01-01>,
            "2024-01-02": <output-of-model-at-2024-01-02>,
            "2024-01-03": <output-of-model-at-2024-01-03>
        }
    }
}

blythed commented 4 months ago

Ok @makkarss929, that's fine, but it doesn't really reflect the real-world scenario, in which new time series data is typically inserted as new records.

makkarss929 commented 4 months ago

We can do something like this: the user will provide input_chunk_length and output_chunk_length, and we shape the documents accordingly.

{
    "data": [
        {
            "time": "2024-01-01",
            "input_data": [0, 1, 3, 6],
            "_outputs": {
                "my_model":   [4, 5] # <output-of-model-at-2024-01-01>
            }
        },
        {
            "time": "2024-01-02",
            "input_data": [1, 2, 4, 7],
            "_outputs": {
                "my_model":  [10, 17]  # <output-of-model-at-2024-01-02>
            }
        },
        {
            "time": "2024-01-03",
            "input_data": [2, 3, 5, 8],
            "_outputs": {
                "my_model": [1, 5] # <output-of-model-at-2024-01-03>
            }
        }
    ]
}

blythed commented 4 months ago

That won't solve the problem. Imagine you have new data coming in. What do you do with it? Do you add it to an existing document or put it in a new one? And what would happen if you kept adding everything to one document?

makkarss929 commented 4 months ago

Hi @blythed,

We can split the data into separate documents based on a time-range criterion.
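
For example (a hypothetical sketch; the record and document layout below is illustrative, not an existing SuperDuperDB schema), incoming records could be bucketed into one document per calendar week, and at prediction time we read back the most recent input_chunk_length points:

# Hypothetical sketch: bucket incoming (time, value) records into one document per
# calendar week, then assemble the most recent `input_chunk_length` points as the
# model's input window. The layout is illustrative, not a SuperDuperDB schema.
from collections import defaultdict
from datetime import date

input_chunk_length = 4

records = [
    {"time": date(2024, 1, 1), "value": 0.0},
    {"time": date(2024, 1, 2), "value": 1.0},
    {"time": date(2024, 1, 8), "value": 3.0},
    {"time": date(2024, 1, 9), "value": 6.0},
]

# 1. Split into separate documents based on a time-range criterion (ISO week here).
documents = defaultdict(list)
for rec in records:
    year, week, _ = rec["time"].isocalendar()
    documents[(year, week)].append(rec)

# 2. To predict, gather the records, sort by time, and keep the last
#    `input_chunk_length` values as the model's input window.
all_records = sorted(
    (rec for recs in documents.values() for rec in recs), key=lambda r: r["time"]
)
input_window = [rec["value"] for rec in all_records[-input_chunk_length:]]
# `input_window` would then be wrapped as a darts.TimeSeries and passed to .predict().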