also, do we want to scale the columns within each list separately, or across all lists together? my intuition would be within each list (for each column) separately, but not sure
I think we should scale within column but across matrices. That way we'll be able to relate different matrices correctly.
note: i don't think this is a "bug" in the traditional sense. the PCAing works correctly, it's just a question of whether or how we want to normalize different columns. and whatever decision we make, i think we should allow the user to control how we normalize. e.g. we could have a `normalize` flag that can be set to `across` (default; z-score within column, across matrices), `within` (z-score within column, separately for each matrix), or `none` (no normalization).
we probably also want the user to have access to these functions under util
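A minimal sketch of what a helper like that could look like, assuming the data is a list of numpy arrays with matching columns; the function name and signature here are illustrative, not the final hypertools API:

```python
import numpy as np

def normalize(matrices, how='across'):
    """Illustrative z-scoring helper for a list of 2D numpy arrays.

    how='across': z-score each column using mean/std pooled across all matrices
    how='within': z-score each column separately within each matrix
    how='none'  : return the data unchanged
    """
    if how == 'none':
        return matrices
    if how == 'within':
        return [(m - m.mean(axis=0)) / m.std(axis=0) for m in matrices]
    if how == 'across':
        stacked = np.vstack(matrices)
        mu, sigma = stacked.mean(axis=0), stacked.std(axis=0)
        return [(m - mu) / sigma for m in matrices]
    raise ValueError('unknown normalization option: %s' % how)
```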
ah, good point - not really a bug.. Follow up question, do we want to support different scaling/normalizing? A user might want to z-score each column, or each row. Or alternatively, they may want to normalize the column/row (make the vector unit length). Do we want to support these options?
yeah-- see above comment. i think we could have a flag with these options:

- `across` (default): z-score within column, across matrices
- `within`: z-score within column, within matrix
- `none`: no normalization
- `row` (your new suggestion): z-score within row

cool. is it ok to z-score binary vectors?
yeah, it'll just re-center each column (e.g. the 1s and 0s won't all have the same values across columns anymore). but if we use a flag, the user can turn this behavior off.
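For example (a quick numpy check, not from the thread): z-scoring a 0/1 column just shifts and rescales it, so the two levels stay distinct but are no longer 0 and 1:

```python
import numpy as np

col = np.array([0., 0., 1., 1., 1.])    # binary feature column
z = (col - col.mean()) / col.std()      # z-score: mean 0, unit variance
print(z)  # the 0s and 1s become two re-centered values, roughly -1.22 and 0.82
```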
sounds good. how about a 'scale_type' flag to plot and reduce, and then exposing the function as 'hyp.util.scale'
what about a `normalize` flag to plot and reduce, exposed via `hyp.util.normalize`?
note: i also added (this morning) a `normalize` function in helpers.py that we should probably rename. the "helpers" normalize function adjusts the list of matrices to have a minimum value (across all matrices) of -1 and a maximum value of 1. (this was needed to get the animated plots to work.)
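For reference, a rough sketch of that min/max rescaling behavior (assuming a list of numpy arrays; not the exact helpers.py implementation):

```python
import numpy as np

def scale(matrices):
    """Rescale a list of arrays so the pooled values span [-1, 1].

    The same affine transform is applied to every matrix, so the relative
    geometry across matrices is preserved (useful for animated plots).
    """
    stacked = np.vstack(matrices)
    lo, hi = stacked.min(), stacked.max()
    return [2 * (m - lo) / (hi - lo) - 1 for m in matrices]
```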
👍
hyp.plot(x) and hyp.util.reduce(x) will now normalize (across) by default. added hyp.util.normalize function (see readme for API details). Also, renamed 'normalize' in helpers.py to 'scale'
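A usage sketch based on the API described in this thread; the exact keyword name and accepted values are assumptions, so check the readme for the actual signatures:

```python
import hypertools as hyp
import numpy as np

# two matrices with very different column scales
data = [np.random.randn(100, 5), np.random.randn(100, 5) * 10 + 3]

# per the final comment, plot and reduce now normalize 'across' by default
hyp.plot(data)
reduced = hyp.util.reduce(data)

# the standalone helper; keyword name assumed for illustration
normalized = hyp.util.normalize(data, normalize='within')
```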
this is important if there are mean/variance differences between the columns of features. One possibility is that we could z-score the columns before PCA automatically (with a flag to turn it off), or conversely leave it off by default (with an option to turn it on). Another option would be to print a warning when there are large differences in mean/var between cols.
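To illustrate the point (a standalone sketch using scikit-learn, not hypertools internals): without z-scoring, a high-variance column dominates the principal components; standardizing the columns first puts them on comparable scales:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# three columns with very different means/variances
X = np.column_stack([rng.randn(200),
                     100 * rng.randn(200),
                     5 + 0.1 * rng.randn(200)])

raw_pca = PCA(n_components=2).fit(X)
print(raw_pca.components_[0])     # dominated by the high-variance column

Xz = StandardScaler().fit_transform(X)   # z-score each column
scaled_pca = PCA(n_components=2).fit(Xz)
print(scaled_pca.components_[0])  # columns now contribute on comparable scales
```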