ContextLab / hypertools

A Python toolbox for gaining geometric insights into high-dimensional data
http://hypertools.readthedocs.io/en/latest/
MIT License

tools.reduce: matrix rank affecting the dimensions of the result #121

Closed TomHaoChang closed 7 years ago

TomHaoChang commented 7 years ago

import hypertools as hyp
import numpy as np

print(hyp.tools.reduce(np.random.normal(0, 1, [5, 100]), ndims=10).shape)

This is the code I tried, which was supposed to give me a 5x10 matrix. However, because the data has only 5 rows, its rank is at most 5, so PCA cannot generate 10 dimensions from it. The resulting matrix I got was therefore 5x5.
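For readers who want to see the shape limit without installing hypertools, here is a minimal sketch that uses a plain SVD-based PCA in NumPy. The helper `pca_svd` is a hypothetical stand-in written for illustration; it is not how `hyp.tools.reduce` is implemented.

```python
import numpy as np


def pca_svd(x, ndims):
    """Project x onto its leading principal components (illustrative helper)."""
    centered = x - x.mean(axis=0)
    # With full_matrices=False, vt has at most min(n_samples, n_features) rows,
    # so no more than that many components are available.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:ndims].T


rng = np.random.default_rng(0)
data = rng.normal(0, 1, (5, 100))

reduced = pca_svd(data, ndims=10)
print(reduced.shape)  # (5, 5): asking for 10 dimensions yields only 5 columns
```

Slicing `vt[:ndims]` silently returns fewer rows than requested, which mirrors the behaviour reported here: the caller asks for 10 columns and gets 5.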

I talked about this issue with Professor Manning today, and he suggested a fix: if the number of dimensions to reduce to is greater than the number of rows in the matrix, pad the matrix with rows of 0s until it has enough rows, run PCA, and then remove the 0 rows.
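The input-padding idea described above could look roughly like the following sketch. It again uses a NumPy SVD in place of the library's PCA, and `reduce_with_padded_input` is a hypothetical name used only for this example.

```python
import numpy as np


def reduce_with_padded_input(x, ndims):
    """Pad x with zero rows so that PCA can return ndims columns (sketch)."""
    n_obs = x.shape[0]
    n_extra = max(ndims - n_obs, 0)
    # Append rows of zeros until there are at least ndims rows.
    padded = np.vstack([x, np.zeros((n_extra, x.shape[1]))])
    centered = padded - padded.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:ndims].T
    # Drop the projections of the padding rows.
    return projected[:n_obs]


rng = np.random.default_rng(0)
data = rng.normal(0, 1, (5, 100))

out = reduce_with_padded_input(data, ndims=10)
print(out.shape)  # (5, 10)
```

One caveat with this variant: the zero rows shift the column means and add no new information, so the extra columns come out numerically close to zero and the leading columns are not identical to those from PCA on the unpadded data.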

Could you look into this issue and let me know what you think? Thanks!

jeremymanning commented 7 years ago

To expand on what @TomHaoChang said:

If ndims > data.shape[0], then the reduced data will have data.shape[0] columns rather than the expected ndims columns.

To maintain the expected shape of the data, we could pad the returned matrix with zeros so that it has the expected shape.

andrewheusser commented 7 years ago

hmm, interesting... so pad the input data to the PCA function, or pad the PCA-reduced data?

jeremymanning commented 7 years ago

In this rare case, we'd pad the output so that it has the correct number of dimensions.
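A minimal sketch of padding the output, assuming the reduced data is a 2-D NumPy array with observations in rows. The function name `pad_to_ndims` is hypothetical and is not part of the hypertools API.

```python
import numpy as np


def pad_to_ndims(reduced, ndims):
    """Append zero-filled columns so that reduced has ndims columns (sketch)."""
    n_missing = ndims - reduced.shape[1]
    if n_missing <= 0:
        # Already wide enough; return the data unchanged.
        return reduced
    zeros = np.zeros((reduced.shape[0], n_missing))
    return np.hstack([reduced, zeros])


# Stand-in for a 5x5 result returned when 10 dimensions were requested.
reduced = np.arange(25, dtype=float).reshape(5, 5)

padded = pad_to_ndims(reduced, ndims=10)
print(padded.shape)  # (5, 10)
```

Padding the output leaves the existing components untouched, which is the main practical difference from padding the input before PCA.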

jeremymanning commented 7 years ago

To give some more background for this issue, @TomHaoChang and I are using hyp.tools.reduce for another project, where the data reduction API for hypertools is a really convenient way to apply PCA to either a list of numpy arrays (using the group-level reduction model that we use for plotting multiple arrays) or a single numpy array, using the same syntax.
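To illustrate the "same syntax for one array or a list of arrays" idea, here is a rough sketch of one way such an interface can be built: stack the arrays, fit a single PCA, and split the result back. This is an assumption about the general approach, written with a NumPy SVD; `reduce_sketch` is a hypothetical function and not the hypertools implementation.

```python
import numpy as np


def reduce_sketch(data, ndims):
    """Reduce one array, or a list of arrays with a shared PCA (sketch)."""
    is_list = isinstance(data, list)
    arrays = data if is_list else [data]

    # Fit one PCA on all observations stacked together.
    stacked = np.vstack(arrays)
    centered = stacked - stacked.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:ndims].T

    # Split the projected rows back into the original groups.
    split_points = np.cumsum([a.shape[0] for a in arrays])[:-1]
    pieces = np.split(projected, split_points)
    return pieces if is_list else pieces[0]


rng = np.random.default_rng(0)
single = rng.normal(size=(20, 50))
group = [rng.normal(size=(20, 50)), rng.normal(size=(30, 50))]

print(reduce_sketch(single, ndims=3).shape)              # (20, 3)
print([a.shape for a in reduce_sketch(group, ndims=3)])  # [(20, 3), (30, 3)]
```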

But for that project, unlike with the typical plotting setup that hypertools was primarily designed for, we frequently encounter the situation where the number of observations is less than the number of desired PCA dimensions (since we need to get to a pre-specified number of dimensions for our math to work out nicely).

This issue doesn't often show up when we're using hypertools for plotting, since the number of observations is nearly always greater than 3. But in this somewhat off-the-beaten-path use case we're not getting the correct number of dimensions from reduce despite specifying ndims.

andrewheusser commented 7 years ago

👍 I'll write a check to see if the number of columns returned by the PCA model is less than ndims, and if so, fill with zeros

andrewheusser commented 7 years ago

implemented on 477a548