[design discussion] Handling non-iid data pt1: time series

As was discussed on Slack, there may be design decisions to make so that MLJ can support non-iid tabular data (time series or other). To get the ball rolling, here are some thoughts on time series and what would need to be done to support such data effectively in MLJ. Please add comments in line with this (let's discuss use cases other than time series in another issue).

Time series

We'd need:

- an interface to specific models that are adapted to TS (say ARIMA or whatever)
- adapted tuning/resampling strategies (e.g. Holdout could be done differently to take chronology into account)

fit-predict-evaluate

On temporal data, evaluation on a test set is less meaningful (it doesn't offer meaningful guarantees) but may still be a way to get an idea of how a model performs, so a workflow that could be expected is something like:

- slice time into the first 80% (train) and the last 20% (test),
- re-slice train again, say first 90% and last 10%,
- train a bunch of models on the 90%, evaluate on the last 10%, pick the best or aggregate,
- report how things work on the held-out set.

As far as I'm aware this requires little work to get working (assuming a static dataset): we'd just need a "temporal-holdout" which respects ordering.
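
To make the workflow above concrete, here is a minimal language-neutral sketch in Python. It is not MLJ code: `temporal_holdout`, the two toy forecasters and the toy series are all hypothetical names introduced only for illustration, and the rows are assumed to be already sorted by time.

```python
from statistics import mean


def temporal_holdout(rows, fraction_train=0.8):
    """Split time-ordered rows into (earlier, later) parts without shuffling."""
    if not 0.0 < fraction_train < 1.0:
        raise ValueError("fraction_train must be strictly between 0 and 1")
    cut = int(len(rows) * fraction_train)
    return rows[:cut], rows[cut:]


# Two deliberately naive candidate "models" (hypothetical, for illustration).
def mean_forecaster(history, horizon):
    return [mean(history)] * horizon


def last_value_forecaster(history, horizon):
    return [history[-1]] * horizon


def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return mean(abs(a - b) for a, b in zip(y_true, y_pred))


y = [float(t) for t in range(100)]  # toy upward-trending series

# Outer split: first 80% to develop models, last 20% held out.
develop, held_out = temporal_holdout(y, 0.8)
# Inner split of the development part: first 90% to fit, last 10% to select.
fit_part, valid_part = temporal_holdout(develop, 0.9)

candidates = {"mean": mean_forecaster, "last_value": last_value_forecaster}
scores = {
    name: mae(valid_part, forecaster(fit_part, len(valid_part)))
    for name, forecaster in candidates.items()
}
best = min(scores, key=scores.get)

# Report the selected model on the held-out set, using all development data.
held_out_error = mae(held_out, candidates[best](develop, len(held_out)))
```

The only difference from an ordinary holdout is that no shuffling happens, so every validation or test row is strictly later than every row used for fitting.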

There is the question, though, that predict would be semantically different (there's no input data per se); maybe we could introduce a forecast instead.
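
As a rough illustration of that difference, here is a sketch in Python of what a forecast-style operation might look like. The `DriftForecaster` class and its method names are hypothetical and not part of MLJ or any existing package; the point is only that `forecast` takes future times rather than a feature table.

```python
class DriftForecaster:
    """Hypothetical model: extrapolates the line through the first and last observations."""

    def fit(self, times, values):
        if len(times) < 2 or len(times) != len(values):
            raise ValueError("need at least two (time, value) pairs")
        self.t_last = times[-1]
        self.y_last = values[-1]
        self.slope = (values[-1] - values[0]) / (times[-1] - times[0])
        return self

    def forecast(self, future_times):
        # No input features here: the only argument is the set of future times.
        return [self.y_last + self.slope * (t - self.t_last) for t in future_times]


model = DriftForecaster().fit(times=[0, 1, 2, 3], values=[10.0, 12.0, 14.0, 16.0])
forecasts = model.forecast([4, 5])
```

A conventional predict would instead receive new rows of features, which is why reusing the same name for both could be confusing.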

classification / transformation

A separate task would be to identify similarity between time series, e.g. to cluster time series or classify them as a whole. As far as I know this does not require anything specific other than appropriate packages that would allow the representation of a TS in a numerical space (e.g. could be RNN-based).

other tasks

There are probably other tasks than forecasting / clustering / classification with time series. One that I can think of is change point detection: probably an unsupervised task that would learn from training data how to pick change points, with a sensitivity // penalty over how many change points it finds; the trained model could then be used on new data.

Things to do // comments

- Add a temporal holdout
- Consider existing packages in the Julia ecosystem that do some temporal stuff and try interfacing with simple things (e.g. TimeSeries.jl)
- Consider how the predict would happen (no input data per se, rather just a set of future times)

Comments

- I'm not sure we'd need a specific scientific type, or maybe just one for DateTime; but then, assuming there's something like ARIMA.jl, a user would just feed fit with data adapted to ARIMA, and ARIMA would internally consider the data as temporally ordered.
- There should be a choice as to how the column representing time is passed; one way would be to have an MLJBase function that does this (like, say, MLJBase.time_matrix) and tries to detect a column of the feature matrix that has a date type and use it as a guide.
- We should take inspiration from the TimeArray type and possibly generalise it to something like a TimeTable type // see also the TimeArray <> Tables.jl integration.
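
To illustrate the "detect a date-typed column and use it as a guide" idea, here is a small sketch in Python. The function `time_order` and the column-dictionary table layout are assumptions made for the example; this is not how the suggested MLJBase.time_matrix would necessarily work.

```python
from datetime import date, datetime


def time_order(table):
    """Find the first column holding only dates and return (its name, row order by time).

    `table` is assumed to map column names to equally long lists of values.
    """
    for name, values in table.items():
        if values and all(isinstance(v, (date, datetime)) for v in values):
            # Row indices sorted chronologically according to the detected column.
            order = sorted(range(len(values)), key=values.__getitem__)
            return name, order
    raise ValueError("no date-typed column found")


table = {
    "temperature": [3.1, 2.4, 5.0],
    "day": [date(2019, 10, 30), date(2019, 10, 28), date(2019, 10, 29)],
}
time_column, order = time_order(table)
# Reorder the remaining feature chronologically.
temperature_by_time = [table["temperature"][i] for i in order]
```

Automatic detection is only a fallback; letting the user name the time column explicitly would remain the less ambiguous option when several date-typed columns are present.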

Thanks for getting the ball rolling. I believe that these efforts on time series, spatial, etc. deserve separate packages and shouldn't be implemented as if they were inside the MLJ.jl umbrella. What was discussed on Slack was a common set of Base packages to interface general operations like resampling, etc., that other projects developed by people who actually research time series can build upon. Putting time series inside MLJ.jl is not optimal.


I don't really understand why you're saying that and it seems non-constructive.

There can be small modifications in the MLJ environment to make it possible to deal with Time Series (as highlighted above); there may be limitations as to how far this can go, fine (I was hoping for people mentioning this here). People who don't see this working for them may want to develop their own packages independently of MLJ (or use other existing packages like OnlineStats), that's fine, I don't see why that precludes us from trying to make it easier to handle TS data if we can.

I think it should be clear that we're not "putting things inside MLJ"; rather, in light of MLJ being just a way to interface with things, people may choose to use it or not... So let's please focus on what's currently not there and could easily be added, and if in some future there is scope for more modularity and package separation, fine; for now we're already spending a lot of energy trying to manage the multiple repos we have and we won't just start new repos unless we have a clear idea of the advantages.

To be honest it looks like you don't like how we're doing things atm, I understand this and we welcome criticism, however please understand that while you have made specific suggestions in the past which I believe have been addressed, it's not super useful to us to just get feedback like "you're doing this wrong".

So I'd suggest you open a separate issue where you discuss a full design plan which would improve over the current status quo and addresses past comments that were made to your past suggestions or work with us to try to make modifications like the ones suggested here.

In previous issues I discussed how exporting the name @load from MLJBase would be beneficial to me. For some reason this trivial change was rejected without clear reasons. That is why I lost interest in spending too much time writing long comments here. In the task design discussion I spent a great amount of text and received good feedback from users (see the likes, hearts in the comments). However the MLJ devs decided to leave this discussion aside and continue with a limiting workflow that builds in a lot of assumptions (machines, tabular data, etc.) that don't serve my research.

I will try to be more constructive next time but I confess that I'm not feeling that my feedback is being incorporated in any way. The issues are still open without any action to remediate the design problems I raised.


In case you're not aware of it, we've been working on a scikit-learn compatible Python package for machine learning with time series data, take a look at our repo here, with @fkiraly as one of the core developers. We also have a short paper out in which we describe different data formats and learning tasks that arise in a temporal/sequential data context. Hope this helps!

If you decide to implement time series functionality, we'd be more than happy to collaborate further, perhaps in the form of another development sprint or so.

Thanks @mloning, I follow sktime and did intend to take a look, the pointer to the paper is very useful.

@juliohm I think we feel a similar frustration on our side; I apologise for this, as we do care about feedback and integrating comments. With respect to @load, it was addressed clearly: effectively MLJBase is to be seen primarily as a door to MLJ and moving @load is not conducive to this (please consider that there is a registry in the mix and it's not trivial to decouple the two). I understand that you'd like this to not be the case (i.e. MLJBase effectively being a modern and maintained MLBase); it may be that one day we actually do this, but at the moment this seems to us to be a distraction from what we're trying to do well (i.e. serve "standard" ML use cases).

A criticism could be that we need to get the design right early on to avoid things biting us later, which is the reason for such things as the discussion; deciding which part of the code goes where is not really what we'd like to focus on now, even though it may perfectly well be that in the medium term we end up with something that resembles what you had in mind all along.

In short, we want to consolidate MLJ first (considering MLJ+MLJBase more or less as a unit), and when users can actually do standard things and compose, as we said was the main goal of MLJ, then we can potentially consider moving mature and fixed things to more abstract packages. At least that's my opinion.

Thank you @tlienart, I disagree with this approach as it goes against the usual design of package ecosystems in Julia. Having strong Base packages is much more important than the actual umbrella that puts functionality together. We saw this in many successful projects including the DifferentialEquations.jl umbrella, the Makie.jl umbrella, and other umbrellas that just load sub-packages and reexport.

If the plan is to write code in the MLJ.jl umbrella, that is unfortunate. It will certainly limit our collaboration opportunities.

FYI we've started working on MLJTime.jl, a time series extension package for MLJ, together with @sjvollmer and @aa25desh.
