MLBazaar / MLPrimitives

Primitives for machine learning and data science.
https://mlbazaar.github.io/MLPrimitives
MIT License

Add new primitive: Butterworth filter #167

Open AlexanderGeiger opened 5 years ago

AlexanderGeiger commented 5 years ago

Description

A low-pass filter (in this example, a Butterworth filter) for preprocessing time series data.

The output of such a filter should be similar to that of the moving aggregations, but the number of samples is not reduced, which might improve the performance of the pipeline.

It takes an array of the data to be filtered as input and returns a filtered array of the same length.
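To illustrate the difference in sample count, here is a minimal sketch using `scipy.signal.butter` and `scipy.signal.filtfilt`. The synthetic signal, the filter order, the cutoff and the block size of 10 are assumptions chosen for illustration, and the block-wise mean is only a stand-in for a moving aggregation, not the existing primitive.

```python
import numpy as np
from scipy.signal import butter, filtfilt

# Illustrative data: a slow 1 Hz sine sampled at 500 Hz, plus white noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 500, endpoint=False)
clean = np.sin(2 * np.pi * 1 * t)
noisy = clean + 0.5 * rng.standard_normal(t.size)

# 4th-order low-pass Butterworth filter. Wn is the cutoff expressed as a
# fraction of the Nyquist frequency (assumed value, not from the proposal).
b, a = butter(N=4, Wn=0.05, btype="low", analog=False, output="ba")

# filtfilt applies the filter forward and backward, so there is no phase shift
# and the output has exactly as many samples as the input.
filtered = filtfilt(b, a, noisy)

# Stand-in for an aggregation: the mean of each block of 10 samples.
# This reduces the number of samples by a factor of 10.
aggregated = noisy.reshape(-1, 10).mean(axis=1)

print(noisy.shape, filtered.shape, aggregated.shape)
```

The filtered array keeps all 500 samples, while the aggregated one keeps only 50.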

What I Did

I started implementing this primitive for testing purposes in the butterworth branch on my fork, which you can check out.

Concretely, I added a Primitive JSON file and a custom function in timeseries_preprocessing.py.

Any feedback on the primitive itself and the implementation would be highly appreciated.

csala commented 5 years ago

Thanks for the proposal @AlexanderGeiger

Some thoughts and comments:

  1. Perhaps it would be better to split this into two parts and make these First Type (JSON only) function primitives. We could have:
    • scipy.signal.butter.ba.json, which points at scipy.signal.butter, has N, Wn and btype as tunable hyperparameters and output='ba' and analog=False as fixed hyperparameters, and which takes no inputs but returns b and a, which will be set as context variables.
    • scipy.signal.filtfilt.json, which points at scipy.signal.filtfilt, has axis as a fixed hyperparameter and padtype and padlen as tunable hyperparameters, and which takes b, a and X as inputs and returns X.

Optionally, we could add scipy.signal.butter.zpk in the future if needed.

This way, no Python code is needed and both primitives can be freely combined with other options.
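As a rough sketch of how the two proposed primitives would chain together, the snippet below mimics the data flow in plain Python. The `context` dict, the `design` and `apply_filter` names and all hyperparameter values are assumptions for illustration; this is not the MLBlocks pipeline machinery itself, only the two scipy calls it would wrap.

```python
from functools import partial

import numpy as np
from scipy.signal import butter, filtfilt

# First primitive: scipy.signal.butter with output and analog held fixed.
# N, Wn and btype are left open, standing in for tunable hyperparameters.
design = partial(butter, output="ba", analog=False)

# Second primitive: scipy.signal.filtfilt with axis held fixed.
# padtype and padlen are left open, standing in for tunable hyperparameters.
apply_filter = partial(filtfilt, axis=0)

# Hypothetical context holding the pipeline variables (a random walk as X).
rng = np.random.default_rng(0)
context = {"X": np.cumsum(rng.standard_normal((200, 1)), axis=0)}

# Step 1 takes no inputs from the context and stores b and a in it.
b, a = design(N=3, Wn=0.1, btype="low")
context.update(b=b, a=a)

# Step 2 reads b, a and X from the context and overwrites X.
context["X"] = apply_filter(
    context["b"], context["a"], context["X"], padtype="odd", padlen=None
)
```

Because the coefficients travel through the context, the design step could later be swapped for a different filter design without touching the `filtfilt` step.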

  2. If we do not make them JSON only and build a custom Python function instead, consider:
    • not sorting the time series and not even requiring a time index: a single sequence without a time index can also be processed. Assume that it has already been sorted. This would enable, for example, using this primitive right after a downsampling done by timeseries_preprocessing.time_series_aggregation, which outputs X and the time index as two different variables.
    • supporting both numpy arrays and pandas DataFrames. This is not mandatory, but if possible, it's better if primitives support both types of inputs.
    • making the output match the input format: if you are given a DataFrame with a time index column, return a DataFrame with a time index column. If you are given a 1d numpy array, return a 1d numpy array.
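
A minimal sketch of what a custom function following these three points could look like. The function name `butterworth_filter`, its parameters and their defaults are hypothetical and are not taken from the branch mentioned above.

```python
import numpy as np
import pandas as pd
from scipy.signal import butter, filtfilt


def butterworth_filter(X, order=3, cutoff=0.1, time_column=None):
    """Low-pass filter X along axis 0 and return the same container type.

    Hypothetical helper: the input is assumed to be already sorted, and
    ``time_column`` (if given) names a DataFrame column left untouched.
    """
    b, a = butter(order, cutoff, btype="low", analog=False, output="ba")

    if isinstance(X, pd.DataFrame):
        out = X.copy()
        # Filter every column except the optional time index column.
        value_columns = [c for c in X.columns if c != time_column]
        values = X[value_columns].to_numpy(dtype=float)
        out[value_columns] = filtfilt(b, a, values, axis=0)
        return out

    # Plain sequences and numpy arrays come back as numpy arrays.
    return filtfilt(b, a, np.asarray(X, dtype=float), axis=0)


# Usage: the same signal as a 1d array and as a DataFrame with a time column.
signal = np.sin(np.linspace(0, 10, 100))
filtered_array = butterworth_filter(signal)

df = pd.DataFrame({"timestamp": np.arange(100), "value": signal})
filtered_df = butterworth_filter(df, time_column="timestamp")
```

No sorting is done and no time index is required; the time column is only passed through so that the output has the same layout as the input.
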

AlexanderGeiger commented 5 years ago

Thanks for the feedback, @csala. I like the first approach and will try to implement the primitive this way.