andrewdalpino opened 3 years ago
Hey @andrewdalpino! I have reserved the first two weeks of June to work on this at least part-time. I'm studying the docs/videos in the meantime.
Word
@andrewdalpino I am unassigning myself from this ticket in case someone wants to jump into it. Unfortunately, I wasn't able to make much progress during the summer due to lack of time.
All good @torchello
Sequence learning is a type of machine learning in which information is gleaned from the order in which samples are presented to the learning algorithm. Sequence prediction is a type of inference that predicts the next value in a sequence, or the next sequence of values. Time-series analysis is a subset of sequence learning in which the samples are ordered by time. For the purposes of this research, let's assume that the training dataset is a sequence of samples that may be ordered by time, but may also be ordered by space (such as a sequence of paragraphs, a.k.a. a book) or by some other scheme, and that the prediction will be the next single continuous value given the input sequence. An example of this would be a regressor that is trained on a series of stock prices and is then asked to "look into the future" by predicting what the price will be tomorrow given the data for today.
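To make that concrete, here's a minimal sketch in plain PHP of how a price series could be turned into sample/label pairs for next-value prediction. The numbers and the window size are made up for illustration.

```php
<?php

// Turn a univariate price series into (window, next value) pairs so that a
// regressor can learn to predict "tomorrow" from the last $p days.

$prices = [101.2, 102.8, 101.9, 103.4, 104.1, 103.7, 105.0];

$p = 3; // number of past values (lags) used as features

$samples = [];
$labels = [];

for ($t = $p; $t < count($prices); $t++) {
    $samples[] = array_slice($prices, $t - $p, $p); // the last $p observations
    $labels[] = $prices[$t]; // the value to predict
}

// $samples and $labels could then be wrapped in a Labeled dataset object,
// as long as the rows are never shuffled and the ordering is preserved.
```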
Autoregressive models such as ARIMA (Autoregressive Integrated Moving Average) and VAR (Vector Autoregression) are well-known statistical models of sequence data. They are very similar but differ in that ARIMA only handles a single feature, while VAR can handle a vector of features. Since Rubix ML is primarily designed for tabular datasets with many features, it makes sense to focus on implementing VAR models (VAR, VARMA, VARMAX) in the library, but I recommend understanding ARIMA as a baseline since it's more common in the literature and in some ways simpler to understand. The general form of a VAR model is written out after the links below.
https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average
https://en.wikipedia.org/wiki/Vector_autoregression
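For orientation, a VAR model of order p has the form

y_t = c + A_1 y_{t-1} + A_2 y_{t-2} + ... + A_p y_{t-p} + e_t

where y_t is the vector of k feature values at time step t, c is a vector of k intercepts, each A_i is a k x k matrix of learned coefficients, and e_t is an error (white noise) term. ARIMA has the same autoregressive structure but with scalars in place of the vectors and matrices, plus differencing and a moving-average component.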
Since the formulas for VAR are often given in matrix notation, I recommend implementing the algorithm using the Tensor library, which we built for these types of problems. In short, Tensor lets you perform both basic and specialized operations on vectors and matrices of numbers. The API is not currently documented; however, the features are very similar to the older NumPy matrix API. A short example using the Matrix class follows the links below.
https://github.com/RubixML/Tensor/blob/master/src/Matrix.php
https://numpy.org/doc/stable/reference/generated/numpy.matrix.html
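As a rough sketch of the kind of matrix work a VAR fit involves, here's what an ordinary least-squares estimate of the coefficients could look like with the Tensor Matrix class. The data and the exact layout of the design matrix are made up for illustration; only Matrix::quick(), transpose(), matmul(), inverse(), and asArray() are assumed from the library.

```php
<?php

use Tensor\Matrix;

// Least squares for the coefficient matrix B given a design matrix Z of lagged
// observations and a target matrix Y of next-step values: B = (Z^T Z)^-1 Z^T Y.
// The numbers below are made up purely to show the shapes involved.

$z = Matrix::quick([
    [1.0, 0.5, 1.2],
    [1.0, 0.7, 1.1],
    [1.0, 0.6, 1.4],
    [1.0, 0.9, 1.3],
]); // one row per time step: an intercept column plus the lagged features

$y = Matrix::quick([
    [0.7, 1.1],
    [0.6, 1.4],
    [0.9, 1.3],
    [1.0, 1.5],
]); // the next-step values of the 2 series

$zT = $z->transpose();

$b = $zT->matmul($z)->inverse()->matmul($zT)->matmul($y); // 3 x 2 coefficient matrix

print_r($b->asArray());
```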
I think the current abstractions (Dataset objects, Estimator and Learner interfaces) should be able to handle sequence learning on tabular datasets. A sequence dataset is only different from a regular dataset in that it cannot be randomized without losing information. For this reason we'll need to take care in how we do cross-validation, but that is more of an ancillary concern. Ideally, we'd be able to pass a sequence dataset as a Labeled or Unlabeled dataset object to a regressor that implements the train() method. Any dataset passed to a sequence learner will be assumed to already be in sequence. So basically, we need a regressor that implements Learner; a sketch of the intended usage follows the link below.
https://github.com/RubixML/ML/blob/master/src/Learner.php
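Here's what the intended usage could look like. The VARRegressor class name and its lag-order constructor argument are placeholders; they don't exist in the library yet.

```php
<?php

use Rubix\ML\Datasets\Labeled;

// Hypothetical end-to-end usage. The rows of the dataset are already in time
// order and must never be shuffled before training.

$samples = [
    [101.2, 55.3],
    [102.8, 54.9],
    [101.9, 56.1],
    [103.4, 55.7],
];

$labels = [102.8, 101.9, 103.4, 104.1]; // the next value of the target series

$dataset = new Labeled($samples, $labels);

$estimator = new VARRegressor(2); // hypothetical learner with a lag order of 2

$estimator->train($dataset);

$predictions = $estimator->predict($dataset);
```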
Here is an example of a regressor implementation that uses the Tensor library under the hood for reference.
https://github.com/RubixML/ML/blob/master/src/Regressors/Ridge.php
The task is to understand the math behind a VAR model (e.g. VARMA) and come up with a working prototype. We should be able to explain how VAR models work. If successful, the next step is to refactor the prototype into production code.
That said, don't get discouraged if you run into a hard time. Feel free to reach out for help!
Here's a video to get you started!
https://www.youtube.com/watch?v=CCinpWc2nXA