Closed — ahfoss closed this issue 8 years ago
Instead of using cluster numbers alone as keys, split each cluster by randomly allocating a minor number, so that keys take the form 1.1, 1.2, 1.3, etc. for cluster 1; 2.1, 2.2, 2.3, etc. for cluster 2.
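The split-key idea above might be sketched as follows. This is an illustrative Python sketch, not the project's R code; the key format, the `split_key` name, and a fixed number of sub-clusters per cluster are all assumptions:

```python
import random

def split_key(cluster_id, n_subclusters, rng=random):
    """Assign a record a key of the form '<major>.<minor>', where the minor
    number is drawn uniformly at random. This spreads one large cluster
    across several smaller reduce keys."""
    minor = rng.randint(1, n_subclusters)
    return f"{cluster_id}.{minor}"

# Example: records in cluster 1 get keys 1.1, 1.2, or 1.3 at random.
rng = random.Random(0)  # seeded for reproducibility
keys = {split_key(1, 3, rng) for _ in range(100)}
```

Because the minor number is random, sub-clusters are balanced in expectation without any extra coordination between map tasks.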
This will be a fairly deep change to the current structure. It will involve:
First tackle (4) and (5) and debug on real data; then tackle (2) and (3) and debug on real data; finally tackle (6) and debug on real data.
Currently working on R/km_summary_intermediary.R; it needs to be extended to compute the max vectors and the mean vector. Then modify kmeans.slurm to save the initial stats.tsv as "tmp.tsv" or something, and then call km_summary_intermediary.R to generate the collapsed stats.tsv as originally defined.
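The intermediate summary step could compute per-cluster counts, max vectors, and mean vectors along these lines. A Python sketch of the idea only — the actual script is in R, and the input layout (an iterable of `(cluster_key, values)` pairs) is an assumption:

```python
from collections import defaultdict

def summarize(rows):
    """rows: iterable of (cluster_key, values) pairs, where values is a list
    of continuous variables. Returns per-cluster counts, max vectors, and
    mean vectors."""
    counts = defaultdict(int)
    sums, maxima = {}, {}
    for key, vals in rows:
        counts[key] += 1
        if key not in sums:
            sums[key] = list(vals)
            maxima[key] = list(vals)
        else:
            sums[key] = [s + v for s, v in zip(sums[key], vals)]
            maxima[key] = [max(m, v) for m, v in zip(maxima[key], vals)]
    means = {k: [s / counts[k] for s in sums[k]] for k in sums}
    return counts, maxima, means
```

Running this over the raw per-record table would yield the collapsed per-cluster stats.tsv described above.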
In the reduce step, if too few clusters are specified, the reduce jobs sent out to the nodes are huge. A workaround should be devised in which big clusters are split up, the continuous variables are summed separately (tallying counts), and the partial results are then merged appropriately before dividing the total sums by the total counts.
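The merge at the end of that workaround could collapse the split keys back to their major cluster, combining partial sums and counts before dividing. A Python sketch under assumed data structures (a dict mapping split keys like "1.2" to `(count, sum_vector)` pairs):

```python
def merge_partials(partials):
    """partials: dict mapping split keys like '1.2' to (count, sum_vector).
    Collapses sub-clusters back to their major cluster and returns the
    per-cluster mean vectors."""
    totals = {}
    for key, (n, sums) in partials.items():
        major = key.split(".")[0]
        if major not in totals:
            totals[major] = [n, list(sums)]
        else:
            totals[major][0] += n
            totals[major][1] = [a + b for a, b in zip(totals[major][1], sums)]
    # Divide total sums by total counts only after all partials are merged.
    return {c: [s / n for s in sums] for c, (n, sums) in totals.items()}
```

Summing first and dividing last keeps the per-sub-cluster reduce jobs small while still producing exact cluster means.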