This is essentially determined by how the data is preprocessed. The script must know which variables are categorical, since these are handled differently. One single script; it also handles continuous-only and categorical-only data, and subsumes proc1.py and proc2.py.
Pass 1:
[x] Removes non-numeric values from continuous variables and calculates means, mins, maxes, and any other one-pass stats of the remaining values
[x] Calculates tables of values for discrete variables. Rare categories whose counts fall under a user-specified threshold are set to "other" (or "other_other" if "other" is already taken, etc.). User may specify a maximum number M of categories per variable; categories beyond the M most numerous are also set to "other". Note that NA values effectively don't exist; they are treated as a separate category. Extension: user-specified NA value(s)
[x] Separate csv file with C/D marking each variable as continuous/discrete. Structured as the first row of the input data set, followed by comma separated Cs/Ds
[x] Returns logfile with the number of missing values per column, and counts of the number of missing columns per row
[x] Returns logfile with tables for each categorical variable, both before and after thresholding counts
[x] Returns data set with missing values replaced (can't replace levels with "other" until a later pass).
[x] Return data set is all continuous variables followed by all categorical variables
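A minimal sketch of the pass 1 level thresholding described above, not the actual script. The names `pick_other_label`, `threshold_levels`, `min_count` and `max_levels` are hypothetical. It assumes NA arrives as an ordinary string level and that a count strictly below the threshold counts as rare.

```python
from collections import Counter

def pick_other_label(levels, base="other"):
    """Return a catch-all label not already in use: "other", "other_other", ..."""
    label = base
    while label in levels:
        label = label + "_" + base
    return label

def threshold_levels(values, min_count=1, max_levels=None):
    """Map every observed level to itself or to the catch-all label.

    values:     iterable of raw level strings (NA is just another level here)
    min_count:  levels seen fewer than min_count times are collapsed (assumed rule)
    max_levels: if given, only the max_levels most numerous levels are kept
    """
    counts = Counter(values)
    other = pick_other_label(set(counts))
    # most_common() orders by descending count; ties keep first-seen order
    ranked = [level for level, _ in counts.most_common()]
    kept = set(ranked if max_levels is None else ranked[:max_levels])
    mapping = {
        level: level if (level in kept and n >= min_count) else other
        for level, n in counts.items()
    }
    return mapping, other
```

The mapping is only computed here. Applying it to the data is left to a later pass, in line with the note that levels can't be replaced with "other" during pass 1.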
Pass 2:
[x] Calculate SD for continuous variables
[x] Output stat tables for continuous
[x] Output stat tables for categorical
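A sketch of why SD sits in pass 2: it reuses the mean found in pass 1. The function names are hypothetical. Two choices are assumptions, not taken from the spec above: sample SD (n - 1 denominator), and "non-numeric" meaning anything `float()` cannot parse or that is not finite.

```python
import math

def to_number(raw):
    """Parse raw as a finite float, or return None if it is not numeric."""
    try:
        x = float(raw)
    except (TypeError, ValueError):
        return None
    return x if math.isfinite(x) else None

def pass1_stats(raw_values):
    """One pass: count, mean, min and max of the numeric entries."""
    n, total = 0, 0.0
    lo, hi = math.inf, -math.inf
    for raw in raw_values:
        x = to_number(raw)
        if x is None:
            continue  # non-numeric entries are dropped, as in pass 1
        n += 1
        total += x
        lo, hi = min(lo, x), max(hi, x)
    if n == 0:
        return {"n": 0, "mean": None, "min": None, "max": None}
    return {"n": n, "mean": total / n, "min": lo, "max": hi}

def pass2_sd(raw_values, mean, n):
    """Second pass: sample SD around the pass 1 mean (None if n < 2)."""
    if n < 2:
        return None
    ss = 0.0
    for raw in raw_values:
        x = to_number(raw)
        if x is not None:
            ss += (x - mean) ** 2
    return math.sqrt(ss / (n - 1))
```

Usage: `s = pass1_stats(col)` followed by `pass2_sd(col, s["mean"], s["n"])`.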
Pass 3:
[x] Z-normalize continuous variables
[x] Dummy code categorical vars with Hennig-Liao weights: new variable names based on amalgam of variable name and level name (use thresholded levels from pass 1)
[x] Dummy coded values are NOT z-normalized
[x] Returns final data set
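A sketch of the pass 3 transforms under stated assumptions. The Hennig-Liao weight formula is not given in this note, so `weight` is a caller-supplied placeholder rather than the real value. The `"_"` separator in the new names, one dummy column per level (no reference level dropped), and zeros for a constant column are also assumptions. `z_normalize` and `dummy_code` are hypothetical names.

```python
def z_normalize(column, mean, sd):
    """Return (x - mean) / sd for each value; a constant column maps to zeros."""
    if not sd:
        return [0.0 for _ in column]
    return [(x - mean) / sd for x in column]

def dummy_code(var_name, column, mapping, weight=1.0):
    """Dummy code one categorical column using thresholded levels.

    mapping: raw level -> thresholded level (the pass 1 result)
    weight:  value written where the level matches; stands in for the
             Hennig-Liao weight, which is not defined in this sketch
    Returns (new_variable_names, rows). Dummies are not z-normalized.
    """
    levels = sorted(set(mapping.values()))
    names = [f"{var_name}_{level}" for level in levels]
    rows = [
        [weight if mapping[value] == level else 0.0 for level in levels]
        for value in column
    ]
    return names, rows
```

For example, `dummy_code("color", ["red", "teal"], {"red": "red", "teal": "other"})` yields the columns `color_other` and `color_red`.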
Make sure output logfiles interact naturally with existing kmeans and summary scripts.
How to summarize categorical vars in Rnw doc is a separate issue dependent on this one.