Readme / docs unclear about using ppscore on time series data

lastmeta commented 2 years ago

I would love to use this on time series data, and out of the box it seems to do pretty well, but I don't know if I'm interpreting the score correctly. I read over the readme at this link: https://github.com/8080labs/ppscore#calculation-of-the-pps

''' The score is calculated on the test sets of a 4-fold cross-validation (number is adjustable via cross_validation). For classification, stratifiedKFold is used. For regression, normal KFold. Please note that this sampling might not be valid for time series data sets '''

I don't understand: for time series data, should I set sample=None, or should I modify the cross_validation kwarg?

fwetdb commented 2 years ago

I would love to help out here but I am not sure how I can do that. The answer heavily depends on your exact data and what you try to achieve with the ppscore. Maybe you can provide more detail?

The short answer is: don't use ppscore on time series data at all, unless you have a deep technical understanding of how ppscore works and how to interpret the results.

The long answer is:

The cross-validation used within ppscore might leak information between folds that the time series structure should keep apart. For example, in 1990 you did not yet know about the financial industry crashes of the 2000s, but a model trained on a fold containing later data effectively does.
If this is a problem in your dataset, then ppscore might not be applicable at all, or might at least lead to wrong or confusing conclusions. Adjusting the sample size or the number of cross_validation folds won't help with this.
The solution might involve passing a suitable new CV object, as proposed in https://github.com/8080labs/ppscore/pull/10. However, this is currently not a priority for us and, if I remember correctly, the proposed solution would require too much of our attention to make sure it is done correctly. Maybe work on this will continue in the future.
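
To make the leakage point concrete, here is a minimal sketch in plain Python. It does not call ppscore or reproduce its internals; the helper names (kfold_splits, forward_chaining_splits, trains_on_future) are made up for this illustration, and the rows are assumed to be sorted by timestamp so that a larger index means a later observation.

```python
def kfold_splits(n_rows, n_folds):
    """Plain contiguous K-fold: each block is the test set once, the rest is training."""
    fold_size = n_rows // n_folds
    for k in range(n_folds):
        start = k * fold_size
        stop = n_rows if k == n_folds - 1 else start + fold_size
        test = list(range(start, stop))
        train = [i for i in range(n_rows) if i < start or i >= stop]
        yield train, test


def forward_chaining_splits(n_rows, n_folds):
    """Expanding window: train only on rows that come before the test block."""
    block = n_rows // (n_folds + 1)
    for k in range(1, n_folds + 1):
        start = k * block
        stop = n_rows if k == n_folds else start + block
        yield list(range(0, start)), list(range(start, stop))


def trains_on_future(train, test):
    """True if any training row is later than the earliest test row."""
    return max(train) > min(test)


n_rows = 20  # assumption: 20 observations, already in time order

# With 4 folds, every fold except the last one trains on the future.
kfold_leaks = [trains_on_future(tr, te) for tr, te in kfold_splits(n_rows, 4)]
forward_leaks = [trains_on_future(tr, te) for tr, te in forward_chaining_splits(n_rows, 4)]

# Changing the number of folds does not remove the problem for plain K-fold.
leaks_for_any_fold_count = all(
    any(trains_on_future(tr, te) for tr, te in kfold_splits(n_rows, n))
    for n in (2, 4, 10)
)

print(kfold_leaks)               # [True, True, True, False]
print(forward_leaks)             # [False, False, False, False]
print(leaks_for_any_fold_count)  # True
```

scikit-learn ships a comparable forward-chaining splitter, TimeSeriesSplit. As far as this thread indicates, ppscore does not currently accept such an object, which is what the linked pull request is about.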

lastmeta commented 2 years ago

I'm not in a typical data modeler situation here. I'm not using ppscore to build a model, I'm building a system that builds models. This means I can't answer your question about what kind of data I have, because I don't know ahead of time.

I have hundreds of thousands of datastreams of literally all kinds of data. The only thing the datastreams have in common is that every observation comes with a timestamp. The observations can be float, integer, binary, string, you name it.

I am looking for a metric (like ppscore) that I can use as a preliminary filter for the models built to predict these datastreams. I want something better than linear correlation; specifically, I want something that detects non-linear relationships. Non-spurious, non-linear correlations are hard to detect between two datasets, unless perhaps the intermittent correlation is cyclical. Therefore, I believe the ideal solution is to take multiple datastreams into account at a time, but I don't know how to even approach that naively, let alone efficiently.

So if the system wants to build a model targeting one datastream, it needs to find variables to use as input to the model. I have been scoring random datastreams against the target variable using ppscore and then making automated models with the streams that score in the top 10%. Those models then compete till I have just one.
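
The filtering step described above can be sketched roughly as follows. This is only an illustration: select_top_streams and toy_score are hypothetical names, toy_score is a placeholder and not ppscore, and the candidate streams are made-up data. In the real system the scoring function would be whatever metric is being evaluated.

```python
import math


def select_top_streams(target, candidates, score_fn, fraction=0.10):
    """Score every candidate stream against the target and keep the top `fraction`."""
    scores = {name: score_fn(values, target) for name, values in candidates.items()}
    n_keep = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_keep]


def toy_score(values, target):
    """Placeholder score (NOT ppscore): share of positions where both streams agree."""
    return sum(v == t for v, t in zip(values, target)) / len(target)


# Made-up binary target stream with 20 observations.
target = [i % 2 for i in range(20)]

# 20 made-up candidates: stream_k matches the target on its first k values
# and is flipped afterwards, so its toy score is k / 20.
candidates = {
    f"stream_{k}": target[:k] + [1 - t for t in target[k:]]
    for k in range(20)
}

selected = select_top_streams(target, candidates, toy_score)
print(selected)  # ['stream_19', 'stream_18']
```

Only the selected streams would then be passed on to the automated model building and the competition between models.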

So, given that my use case is not intricately intertwined with ppscore and that I'm only using it as a preliminary filter, would you say it's a good enough fit? From my vantage point, it's my only option aside from calculating some linear correlation score. I'm hoping that ppscore will, in some cases, notice predictive power between two datasets that a linear correlation would miss.
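
As a small illustration of what a linear correlation can miss, consider a made-up pair of series where y is completely determined by x, yet the Pearson coefficient is essentially zero. The pearson helper below is written out by hand for this example and is not part of ppscore.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up data: y = x^2 on a range that is symmetric around zero.
xs = list(range(-10, 11))
ys = [x * x for x in xs]

r_nonlinear = pearson(xs, ys)             # close to 0: no linear relationship
r_linear = pearson(xs, [2 * x + 1 for x in xs])  # close to 1: purely linear

print(abs(r_nonlinear) < 1e-9)
print(abs(r_linear - 1.0) < 1e-9)
```

A filter based only on this coefficient would discard x as a predictor of y here, even though knowing x gives y exactly.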

fwetdb commented 2 years ago

Thanks for sharing your input! Since this is not a typical situation, I would have to think about it more thoroughly. However, I am currently busy with other work. I might revisit this in the coming weeks, but I don't want to guarantee anything at this point.

Readme / docs unclear about using ppscore on time series data #58