ContextLab / hypertools

A Python toolbox for gaining geometric insights into high-dimensional data

http://hypertools.readthedocs.io/en/latest/

MIT License

dealing with nans #33

Closed jeremymanning closed 7 years ago

jeremymanning commented 7 years ago

For the PPCA demo, I recommend generating two datasets:

1.) First generate a well-structured covariance matrix:

from scipy.linalg import toeplitz
import numpy as np
K = 10 - toeplitz(np.arange(10))

2.) Now generate a first dataset (a random walk with the given covariance matrix)

data1 = np.cumsum(np.random.multivariate_normal(np.zeros(10), K, 250), axis=0)

3.) Now copy the first dataset

from copy import copy
data2 = copy(data1)

4.) Set random entries of data2 to nan (choose some level of sparsity for this, e.g. 10% of the entries)

5.) Now plot data1 (solid line) and data2 (dashed line) and make sure they line up with each other
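
Steps 1 through 4 could be sketched roughly as follows. This is only one way to do it, not necessarily what ended up in the demo: the `sparsity` name, the fixed seed, and the use of `np.random.default_rng` are assumptions made for illustration, and the covariance matrix is built with plain NumPy (it has the same entries as `10 - toeplitz(np.arange(10))`).

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only so the sketch is reproducible

# Step 1: K[i, j] = 10 - |i - j|, i.e. the same matrix as 10 - toeplitz(np.arange(10)).
idx = np.arange(10)
K = 10 - np.abs(idx[:, None] - idx[None, :])

# Step 2: random walk whose increments have covariance K.
data1 = np.cumsum(rng.multivariate_normal(np.zeros(10), K, 250), axis=0)

# Step 3: independent copy, so data1 is never touched.
data2 = data1.copy()

# Step 4: set roughly 10% of the individual entries to NaN.
sparsity = 0.10  # hypothetical name for the fraction of missing entries
n_missing = int(round(sparsity * data2.size))
flat_idx = rng.choice(data2.size, size=n_missing, replace=False)
data2.flat[flat_idx] = np.nan

# Before plotting (step 5): every observed entry of data2 should still equal data1.
observed = ~np.isnan(data2)
assert np.array_equal(data1[observed], data2[observed])
```

Because the NaNs are scattered over individual entries rather than whole rows, most rows stay partially observed, which is the case PPCA is supposed to fill in.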

jeremymanning commented 7 years ago

(This will help us determine if PPCA is correctly interpolating)

andrewheusser commented 7 years ago

The reconstruction looks good, but the values are still nans. Is this correct? I can't tell from this implementation whether or not they should be.

The plotting works fine with the nans

andrewheusser commented 7 years ago

I implemented a simple interpolation function:

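import numpy as np

def nan_helper(y):
    """Return a boolean NaN mask for y and a function that converts
    a boolean mask into integer indices.
    """
    return np.isnan(y), lambda z: z.nonzero()[0]

def interp_col_nans(data):
    """Linearly interpolate NaNs in each column, leaving `data` unchanged."""
    data_interp = np.zeros(data.shape)
    for col in range(data.shape[1]):
        y = data[:, col].copy()  # copy so the caller's array is not modified
        nans, x = nan_helper(y)
        y[nans] = np.interp(x(nans), x(~nans), y[~nans])
        data_interp[:, col] = y
    return data_interp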

here is how it performs (red is orig, blue is interp):

jeremymanning commented 7 years ago

It looks like there's still some interpolation going on in reduce.py...

desired behavior:

1.) if no nans, use PCA to reduce to the specified number of dimensions

2.) if nans, use PPCA (instead of PCA) to reduce to the specified number of dimensions. some observations may still be nans after using PPCA. those should show up as breaks in the line (i.e. don't explicitly remove them from the plot, but they just won't be visible). not removing nans is important because the user may want the rows to match up across matrices, and we don't want to mess with that.

in the matlab version the nans are removed before doing PCA, and then they are added back in prior to plotting. what i'm proposing for the python version is to be a little fancier by using PPCA when possible to reconstruct missing data. since we're already making an assumption that the data covariance matters in applying PCA to the data, we can leverage the same assumption to fill in parts of missing observations. but for skipped observations (i.e. where no feature is observed for that row of the data matrix) we shouldn't add in any additional assumptions about the timecourse (we can't even assume that the user is giving us a timecourse).

in other words, we want the reduced data to have the same number of rows as the original data.
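
A minimal sketch of that row-preserving behavior, under stated assumptions: `reduce_keep_rows`, `ndims`, and `n_iter` are hypothetical names, not the hypertools API, and the missing-entry step is a simple iterative low-rank (SVD) imputation standing in for PPCA rather than PPCA itself. It also assumes every column has at least one observed value.

```python
import numpy as np

def reduce_keep_rows(data, ndims=3, n_iter=50):
    """Reduce `data` to `ndims` columns, returning one output row per input row."""
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    empty_rows = missing.all(axis=1)   # skipped observations: no feature observed
    rows = data[~empty_rows].copy()
    mask = missing[~empty_rows]

    # Initialise missing entries with the mean of their column's observed values.
    col_means = np.nanmean(rows, axis=0)
    rows[mask] = np.take(col_means, np.nonzero(mask)[1])

    # Stand-in for PPCA: alternate a rank-`ndims` fit with re-imputation.
    # Skipped entirely when there are no NaNs, which leaves plain PCA.
    for _ in range(n_iter if mask.any() else 0):
        mu = rows.mean(axis=0)
        u, s, vt = np.linalg.svd(rows - mu, full_matrices=False)
        low_rank = (u[:, :ndims] * s[:ndims]) @ vt[:ndims] + mu
        rows[mask] = low_rank[mask]

    # PCA (via SVD) on the completed rows.
    mu = rows.mean(axis=0)
    u, s, vt = np.linalg.svd(rows - mu, full_matrices=False)

    # Fully missing rows stay NaN so the row count matches the input.
    reduced = np.full((data.shape[0], ndims), np.nan)
    reduced[~empty_rows] = u[:, :ndims] * s[:ndims]
    return reduced

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 5))
x[3, 1] = np.nan   # partially observed row: gets filled in from the covariance
x[7, :] = np.nan   # skipped observation: stays NaN in the output
out = reduce_keep_rows(x, ndims=2)
```

Leaving the all-NaN rows in place means row `i` of the reduced matrix always corresponds to row `i` of the input, so several matrices can still be compared row by row.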

andrewheusser commented 7 years ago

[x] if no nans, use PCA to reduce to the specified number of dimensions
[x] if nans, use PPCA
[x] if the whole row is nans, don't plot the data (for that row)

hyp.plot([data1_r, data2_r, missing_data], ['r','b--','b*'])

andrewheusser commented 7 years ago

@jeremymanning i think this is all set now. if you agree, want to close the issue?

jeremymanning commented 7 years ago

yep, this seems complete.