We currently use PPCA to infer nans. This works when, for the affected timepoints (or observations), at least some of the features are observed. However, when none of the features at a given timepoint are observed, PPCA can't fill in those missing features.
To deal with the scenario where all features are nans, we could use a Kalman filter (smooth + predict) to fill in the missing data using surrounding data. We could also use the Kalman filter to predict future observations, which would allow hypertools to function as a nice wrapper for a multi-dimensional Kalman filter.
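To make the fill-in idea concrete, here is a minimal sketch of a Kalman smoother that fills fully missing timepoints from the surrounding data. It is not the proposed implementation. It assumes the simplest possible state model (an independent random walk per feature, with noise variances `q` and `r` picked arbitrarily) and uses plain Python lists so it runs without dependencies. The names `kalman_fill` and `fill_rows` are hypothetical.

```python
import math

def kalman_fill(series, q=1.0, r=0.1, n_ahead=0):
    """Fill NaNs in a 1-D series with a local-level (random walk) Kalman
    smoother, and optionally append n_ahead forecast values.

    q and r are assumed process/observation noise variances.
    """
    observed = [y for y in series if not math.isnan(y)]
    if not observed:
        raise ValueError("need at least one observed value")

    # Forward (filter) pass; keep predicted and filtered moments for the smoother.
    m_pred, p_pred, m_filt, p_filt = [], [], [], []
    m, p = observed[0], 1e6  # vague prior centered on the first observed value
    for t, y in enumerate(series):
        if t > 0:
            p = p + q  # random walk: mean carries over, variance grows
        m_pred.append(m)
        p_pred.append(p)
        if not math.isnan(y):  # the update step is skipped for missing observations
            k = p / (p + r)
            m = m + k * (y - m)
            p = (1 - k) * p
        m_filt.append(m)
        p_filt.append(p)

    # Backward (Rauch-Tung-Striebel) pass, so gaps also use later observations.
    m_smooth = m_filt[:]
    for t in range(len(series) - 2, -1, -1):
        g = p_filt[t] / p_pred[t + 1]
        m_smooth[t] = m_filt[t] + g * (m_smooth[t + 1] - m_pred[t + 1])

    # Keep observed values as they are; only the gaps take smoothed estimates.
    filled = [m_smooth[t] if math.isnan(y) else y for t, y in enumerate(series)]
    # Under a random walk, every forecast equals the last filtered mean.
    return filled + [m_filt[-1]] * n_ahead

def fill_rows(rows, **kwargs):
    """Apply kalman_fill to each feature (column) of a timepoints-by-features list."""
    columns = [kalman_fill(list(col), **kwargs) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]

nan = float("nan")
data = [[1.0, 10.0], [nan, nan], [3.0, 30.0]]  # middle timepoint has no features
filled = fill_rows(data)
```

With this model the forecasts are flat, so useful forward prediction would need a richer state model (for example, one with a velocity term) and parameters learned from the data.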
The setup would be something like:
- Add a `hyp.predict` function that fits a Kalman filter to the given data.
  - This would fit a Kalman filter to each numpy array/dataframe and store the results in a list of length `len(data)`.
- With no arguments (other than passing in the data), `hyp.predict` should fill in missing observations. (If the user passes a time-indexed DataFrame, the indices should be considered.)
- With an optional `t` keyword argument, the function should forward predict `t` timesteps beyond the last observation (or up to `time=t`).
  - If `t` is an integer, treat it as "predict `t` timesteps into the future". The timestep duration is equal to the minimum non-zero difference between any pair of observation times.
  - If `t` is a DateTime object and the data are time-indexed, determine the number of timesteps needed to predict up to the specified date (again automatically inferring the timestep duration). Tricky thing: if the specified time is in the past, the data should be truncated (without using a Kalman filter).
- What gets returned is a new dataset (in the same format as the original data), but with missing data filled in, and truncation and/or forward prediction applied as needed.
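The handling of `t` could look roughly like the sketch below. It is only an illustration. `infer_timestep` and `resolve_horizon` are hypothetical helper names, the observation times are assumed to be standard-library `datetime` objects, and rounding down when the target date falls between two timesteps is an assumption.

```python
from datetime import datetime, timedelta

def infer_timestep(times):
    """Return the smallest non-zero gap between observation times."""
    ordered = sorted(times)
    # After sorting, the smallest non-zero pairwise gap is between neighbors.
    gaps = [b - a for a, b in zip(ordered, ordered[1:]) if b != a]
    if not gaps:
        raise ValueError("need at least two distinct observation times")
    return min(gaps)

def resolve_horizon(times, t):
    """Translate t into (n_keep, n_ahead).

    n_keep:  how many of the time-sorted observations to keep
    n_ahead: how many timesteps to predict beyond the last observation
    """
    ordered = sorted(times)
    if isinstance(t, int):
        # Integer t: keep everything, predict t steps ahead.
        return len(ordered), t
    last = ordered[-1]
    if t >= last:
        # Future date: count whole timesteps up to t (rounding down).
        return len(ordered), (t - last) // infer_timestep(ordered)
    # Past date: truncate only, no Kalman filter needed.
    return sum(1 for x in ordered if x <= t), 0

times = [datetime(2020, 1, 1), datetime(2020, 1, 2), datetime(2020, 1, 4)]
step = infer_timestep(times)                              # one day
future = resolve_horizon(times, datetime(2020, 1, 7))     # predict 3 steps
past = resolve_horizon(times, datetime(2020, 1, 3))       # keep 2 rows
```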
Another thought: we shouldn't apply Kalman filters by default. If the user passes in non-timeseries data, it wouldn't make sense to use this approach. But we could provide access to `predict` via a keyword argument (to `plot`, `reduce`, and `align`).
This implementation looks nice: https://pykalman.github.io/