Closed jeremymanning closed 7 years ago
(This will help us determine if PPCA is correctly interpolating)
The reconstruction looks good, but the values are still nans. Is this correct? I can't tell from this implementation whether they should be.
The plotting works fine with the nans
I implemented a simple interpolation function:
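import numpy as np

def nan_helper(y):
    """Return a boolean NaN mask for y and a function mapping a mask to integer indices."""
    return np.isnan(y), lambda z: z.nonzero()[0]

def interp_col_nans(data):
    """Linearly interpolate over the NaNs in each column, leaving the input unchanged."""
    data_interp = np.zeros(data.shape)
    for col in range(data.shape[1]):
        # copy the column: a slice is a view, so assigning into it would overwrite data
        y = data[:, col].copy()
        nans, x = nan_helper(y)
        y[nans] = np.interp(x(nans), x(~nans), y[~nans])
        data_interp[:, col] = y
    return data_interp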
here is how it performs (red is orig, blue is interp):
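As a small numeric check of what this kind of interpolation does, here is a self-contained sketch on a single toy column. The name `interp_nans_1d` is hypothetical and only for illustration. It shows one behavior worth knowing: `np.interp` fills interior gaps linearly, but for leading or trailing NaNs it simply repeats the nearest observed value.

```python
import numpy as np

def interp_nans_1d(y):
    """Return a copy of y with NaNs filled by linear interpolation (hypothetical helper)."""
    y = np.array(y, dtype=float)  # np.array copies, so the caller's data is untouched
    nans = np.isnan(y)
    if nans.all() or not nans.any():
        return y  # nothing to anchor the interpolation on, or nothing to fill
    idx = np.arange(y.size)
    y[nans] = np.interp(idx[nans], idx[~nans], y[~nans])
    return y

col = np.array([np.nan, 1.0, np.nan, 3.0, np.nan])
filled = interp_nans_1d(col)
print(filled)  # interior gap becomes 2.0; the edge gaps copy 1.0 and 3.0
```

Filling edge gaps this way is itself an assumption about the data, which matters if the rows are not a timecourse.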
It looks like there's still some interpolation going on in reduce.py
...
desired behavior:
1.) if no nans, use PCA to reduce to the specified number of dimensions
2.) if nans, use PPCA (instead of PCA) to reduce to the specified number of dimensions. some observations may still be nans after using PPCA. those should show up as breaks in the line (i.e. don't explicitly remove them from the plot, but they just won't be visible). not removing nans is important because the user may want the rows to match up across matrices, and we don't want to mess with that.
in the matlab version the nans are removed before doing PCA, and then they are added back in prior to plotting. what i'm proposing for the python version is to be a little fancier by using PPCA when possible to reconstruct missing data. since we're already making an assumption that the data covariance matters in applying PCA to the data, we can leverage the same assumption to fill in parts of missing observations. but for skipped observations (i.e. where no feature is observed for that row of the data matrix) we shouldn't add in any additional assumptions about the timecourse (we can't even assume that the user is giving us a timecourse).
in other words, we want the reduced data to have the same number of rows as the original data.
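A rough sketch of that behavior, using only NumPy. This is not the code in reduce.py: the function names are hypothetical, and `lowrank_fill` is an iterated low-rank SVD fill used as a stand-in for PPCA (it leans on the same covariance assumption but is not the probabilistic model). It also assumes every column has at least one observed value. Partially observed rows get reconstructed, fully skipped rows stay nan, and the output keeps the input's row count.

```python
import numpy as np

def pca_svd(X, ndims):
    """Project the rows of a complete matrix onto its top ndims principal axes."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:ndims].T

def lowrank_fill(X, rank, n_iter=50):
    """Fill NaNs by iterating a rank-limited SVD reconstruction (stand-in for PPCA)."""
    missing = np.isnan(X)
    filled = np.where(missing, np.nanmean(X, axis=0), X)  # start from column means
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank] + mu
        filled[missing] = approx[missing]  # only the missing entries are updated
    return filled

def reduce_keep_rows(data, ndims=3):
    """Reduce to ndims columns while keeping one output row per input row."""
    data = np.asarray(data, dtype=float)
    out = np.full((data.shape[0], ndims), np.nan)
    skipped = np.isnan(data).all(axis=1)  # rows where no feature was observed
    kept = data[~skipped]
    if np.isnan(kept).any():
        kept = lowrank_fill(kept, rank=ndims)
    out[~skipped] = pca_svd(kept, ndims)
    return out

rng = np.random.default_rng(0)
data = np.cumsum(rng.standard_normal((50, 10)), axis=0)
data[5, 2] = np.nan   # partially observed row: gets reconstructed
data[20, :] = np.nan  # skipped observation: stays nan in the output
reduced = reduce_keep_rows(data, ndims=3)
print(reduced.shape)  # same number of rows as data
```

Because row 20 stays nan, it would show up as a break in the plotted line rather than being dropped.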
hyp.plot([data1_r, data2_r, missing_data], ['r','b--','b*'])
@jeremymanning i think this is all set now. if you agree, want to close the issue?
yep, this seems complete.
For the PPCA demo, I recommend generating two datasets:
1.) First generate a well-structured covariance matrix:
from scipy.linalg import toeplitz
import numpy as np
K = 10 - toeplitz(np.arange(10))
2.) Now generate a first dataset (a random walk with the given covariance matrix)
data1 = np.cumsum(np.random.multivariate_normal(np.zeros(10), K, 250), axis=0)
3.) Now copy the first dataset
from copy import copy
data2 = copy(data1)
4.) Set random entries of data2 to nan (choose some level of sparsity for this, e.g. 10% of the entries)
5.) Now plot data1 (solid line) and data2 (dashed line) and make sure they line up with each other
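Step 4 has no code above, so here is one possible way to do it. The seed, the `sparsity` variable, and the simplified stand-in for data1 (an uncorrelated random walk, used only so the block runs on its own) are assumptions for illustration. In the real demo, data1 would come from step 2.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Stand-in for data1 from step 2: 250 observations x 10 features
data1 = np.cumsum(rng.standard_normal((250, 10)), axis=0)
data2 = data1.copy()

sparsity = 0.10  # fraction of entries to set to nan
n_missing = int(round(sparsity * data2.size))

# Draw distinct flat positions, then convert them to (row, column) pairs
flat_idx = rng.choice(data2.size, size=n_missing, replace=False)
rows, cols = np.unravel_index(flat_idx, data2.shape)
data2[rows, cols] = np.nan

print(np.isnan(data2).mean())  # fraction of missing entries
```

Sampling without replacement gives exactly the requested number of missing entries. Every entry that remains in data2 is identical to data1, which is why the two lines should coincide wherever data2 is observed in step 5.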