ahfoss / kamilaStreamingHadoop

k-means and KAMILA algorithms written for MyHadoop on a SLURM batch scheduler
GNU General Public License v3.0

Implement Hennig-Liao coding in preprocessing step #22

Closed by ahfoss 8 years ago

ahfoss commented 8 years ago

This is essentially determined by the way the data is preprocessed. The preprocessing step must know which variables are categorical, since these are handled differently from the continuous ones. Everything should live in one single script that can also handle continuous-only and categorical-only data, so that it subsumes proc1.py and proc2.py.

Pass 1:

Pass 2:

Pass 3:

Make sure output logfiles interact naturally with existing kmeans and summary scripts.

How to summarize categorical variables in the Rnw doc is a separate issue that depends on this one.
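
A minimal sketch of the coding step, assuming Hennig-Liao coding here means z-scoring the continuous variables and dummy-coding the categorical ones with a reduced weight. The function name `hennig_liao_code`, its parameters, and the default weight of 1/sqrt(2) are illustrative assumptions, not taken from the repository. The two sweeps in the code do not correspond to the Pass 1-3 outline above, and the sketch works in memory rather than as a Hadoop streaming job.

```python
import math


def hennig_liao_code(rows, categorical_cols, dummy_weight=1 / math.sqrt(2)):
    """Z-score continuous columns and expand categorical columns into
    weighted dummy columns (0 or dummy_weight).

    Hypothetical helper: the name, signature, and default weight are
    assumptions made for illustration only.
    """
    rows = [list(r) for r in rows]
    if not rows:
        return []

    n = len(rows)
    n_cols = len(rows[0])
    cat = set(categorical_cols)
    cont = [j for j in range(n_cols) if j not in cat]

    # First sweep: running sums for continuous columns, observed levels
    # for categorical columns.
    sums = {j: 0.0 for j in cont}
    sq_sums = {j: 0.0 for j in cont}
    levels = {j: set() for j in cat}
    for r in rows:
        for j in cont:
            x = float(r[j])
            sums[j] += x
            sq_sums[j] += x * x
        for j in cat:
            levels[j].add(r[j])

    means = {j: sums[j] / n for j in cont}
    sds = {}
    for j in cont:
        var = max(sq_sums[j] / n - means[j] ** 2, 0.0)  # population variance
        sds[j] = math.sqrt(var) or 1.0  # constant column: avoid dividing by 0
    ordered_levels = {j: sorted(levels[j]) for j in cat}

    # Second sweep: emit the coded rows, keeping the original column order.
    coded = []
    for r in rows:
        out = []
        for j in range(n_cols):
            if j in cat:
                out.extend(
                    dummy_weight if r[j] == lev else 0.0
                    for lev in ordered_levels[j]
                )
            else:
                out.append((float(r[j]) - means[j]) / sds[j])
        coded.append(out)
    return coded
```

Passing an empty `categorical_cols` list gives continuous-only behaviour, and listing every column gives categorical-only behaviour, which is the sense in which one script could cover both existing cases.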