PCA implementation doesn't appear to normalize features before reducing

ContextLab / hypertools

A Python toolbox for gaining geometric insights into high-dimensional data

http://hypertools.readthedocs.io/en/latest/

MIT License


PCA implementation doesn't appear to normalize features before reducing #50

Closed andrewheusser closed 7 years ago

andrewheusser commented 7 years ago

this is important if there are mean/variance differences between the feature columns. One possibility is to z-score the columns automatically before PCA (with a flag to turn it off), or conversely to leave it off by default with a flag to turn it on. Another option would be to print a warning when there are large differences in mean/variance between columns.
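A minimal sketch of the two ideas above, using plain NumPy rather than anything from hypertools. The names `zscore_columns` and `warn_if_unscaled` and the `ratio` threshold are hypothetical, and the warning only checks variances, not means.

```python
import warnings

import numpy as np


def zscore_columns(x):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # constant columns end up as all zeros instead of NaN
    return (x - mean) / std


def warn_if_unscaled(x, ratio=10.0):
    """Warn (and return True) if column variances differ by more than `ratio`."""
    x = np.asarray(x, dtype=float)
    var = x.var(axis=0)
    positive = var[var > 0]  # ignore constant columns
    if positive.size > 1 and positive.max() / positive.min() > ratio:
        warnings.warn(
            "column variances differ by more than %gx; "
            "consider z-scoring before PCA" % ratio
        )
        return True
    return False
```

Without a step like `zscore_columns`, the first principal component tends to line up with whichever column has the largest variance, regardless of how informative that column is.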

andrewheusser commented 7 years ago

also, do we want to scale the columns within each list separately, or across all lists together? my intuition would be within each list (for each column) separately, but not sure

jeremymanning commented 7 years ago

I think we should scale within column but across matrices. That way we'll be able to relate different matrices correctly.

jeremymanning commented 7 years ago

note: i don't think this is a "bug" in the traditional sense. the PCA itself works correctly; it's just a question of whether and how we want to normalize different columns. and whatever decision we make, i think we should let the user control how we normalize. e.g. we could have a normalize flag that can be set to 'across' (default; z-score within column, across matrices), 'within' (z-score within column, separately for each matrix), or 'none' (no normalization)

jeremymanning commented 7 years ago

we probably also want the user to have access to these functions under util

andrewheusser commented 7 years ago

ah, good point, not really a bug. Follow-up question: do we want to support different kinds of scaling/normalizing? A user might want to z-score each column, or each row. Or alternatively, they may want to normalize each column/row (make the vector unit length). Do we want to support these options?

jeremymanning commented 7 years ago

yeah, see my comment above. i think we could have a flag with these options:

across (default): z-score within column, across matrices
within: z-score within column, within matrix
none: no normalization
row (your new suggestion): z-score within row
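The four options listed above can be sketched roughly as follows. This is an illustration under assumptions, not the hypertools implementation: `normalize_sketch` and `_safe_std` are hypothetical names, and the input is assumed to be a list of 2D arrays that all have the same number of columns.

```python
import numpy as np


def _safe_std(x, axis):
    """Standard deviation along `axis`, with zeros replaced by 1 to avoid division by zero."""
    std = x.std(axis=axis, keepdims=True)
    return np.where(std == 0, 1.0, std)


def normalize_sketch(data, mode="across"):
    """Normalize a list of 2D arrays according to `mode`."""
    mats = [np.asarray(m, dtype=float) for m in data]
    if mode == "none":
        return mats
    if mode == "across":
        # One mean/std per column, computed over all matrices stacked together.
        stacked = np.vstack(mats)
        mean = stacked.mean(axis=0)
        std = _safe_std(stacked, axis=0)
        return [(m - mean) / std for m in mats]
    if mode == "within":
        # One mean/std per column, computed separately for each matrix.
        return [(m - m.mean(axis=0)) / _safe_std(m, axis=0) for m in mats]
    if mode == "row":
        # One mean/std per row of each matrix.
        return [
            (m - m.mean(axis=1, keepdims=True)) / _safe_std(m, axis=1)
            for m in mats
        ]
    raise ValueError("mode must be 'across', 'within', 'row' or 'none'")
```

The practical difference between `across` and `within` is that `across` keeps offsets between matrices (a matrix whose values are larger overall stays larger), while `within` removes them, so each matrix ends up centered on its own mean.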

andrewheusser commented 7 years ago

cool. is it ok to z-score binary vectors?

jeremymanning commented 7 years ago

yeah, it'll just re-center and rescale each column (e.g. the 1s and 0s won't all have the same values across columns anymore). but if we use a flag, the user can turn this behavior off.
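A small NumPy example of what z-scoring does to binary columns. The data here is made up for illustration: two columns with different proportions of 1s.

```python
import numpy as np

# Column 0 is mostly 1s, column 1 is mostly 0s.
x = np.array([[1, 1],
              [1, 0],
              [1, 0],
              [0, 0]], dtype=float)

# Z-score each column independently.
z = (x - x.mean(axis=0)) / x.std(axis=0)

# Each column still holds exactly two distinct values, but the value that
# stands for "1" (and for "0") now differs from column to column.
distinct_per_column = [np.unique(z[:, j]).size for j in range(z.shape[1])]
```

So the columns stay binary in the sense of taking two values each, but a 1 in a column where 1s are rare ends up further from zero than a 1 in a column where 1s are common.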

andrewheusser commented 7 years ago

sounds good. how about a 'scale_type' flag to plot and reduce, and then exposing the function as 'hyp.util.scale'

jeremymanning commented 7 years ago

what about a normalize flag to plot and reduce, exposed via hyp.util.normalize?

note: i also added (this morning) a normalize function in helpers.py that we should probably rename. the "helpers" normalize function adjusts the list of matrices to have a minimum value (across all matrices) of -1 and a maximum value of 1. (this was needed to get the animated plots to work.)
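For contrast with z-scoring, the min/max adjustment described above could look something like this. The actual code in helpers.py isn't shown in this thread, so this is only a guess at one way to do it (a linear rescale using the global min and max); `scale_sketch` is a hypothetical name.

```python
import numpy as np


def scale_sketch(data):
    """Linearly rescale a list of arrays so the global min is -1 and the global max is 1."""
    mats = [np.asarray(m, dtype=float) for m in data]
    lo = min(m.min() for m in mats)  # minimum across all matrices
    hi = max(m.max() for m in mats)  # maximum across all matrices
    if hi == lo:
        # All values identical: nothing to spread out, so return zeros.
        return [np.zeros_like(m) for m in mats]
    return [2 * (m - lo) / (hi - lo) - 1 for m in mats]
```

Because a single min and max are shared by every matrix, relative positions between matrices are preserved, which matters when they're drawn in the same plot.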

andrewheusser commented 7 years ago

👍

andrewheusser commented 7 years ago

hyp.plot(x) and hyp.util.reduce(x) now normalize by default (using 'across'). Added a hyp.util.normalize function (see the readme for API details). Also renamed 'normalize' in helpers.py to 'scale'.