ahfoss / kamilaStreamingHadoop

k-means and KAMILA algorithms written for MyHadoop on a SLURM batch scheduler
GNU General Public License v3.0

Premature conversion of counts to means in reducer step #1

Closed: ahfoss closed this issue 8 years ago

ahfoss commented 9 years ago

In the reduce step, if too few clusters are specified, the reduce jobs sent out to the nodes are huge. A workaround is needed in which large clusters are split up, the continuous variables are summed separately (with counts tallied), and the partial results are then merged before the total sums are divided by the total counts.
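A minimal sketch of the merge logic in R (the function name and data shapes here are hypothetical, not from the repo): each partial result carries a vector of sums and a count, and the mean is computed only once, after all partials for a cluster are merged.

```r
# Merge per-split partial tallies for one cluster, then divide once.
mergePartials <- function(partials) {
  totalSum   <- Reduce(`+`, lapply(partials, function(p) p$sums))
  totalCount <- sum(sapply(partials, function(p) p$count))
  totalSum / totalCount  # cluster mean, computed at the very end
}

# Example: two partial tallies for the same cluster
p1 <- list(sums = c(4.0, 6.0), count = 2)
p2 <- list(sums = c(9.0, 3.0), count = 3)
mergePartials(list(p1, p2))  # c(2.6, 1.8)
```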

ahfoss commented 8 years ago

Instead of using cluster numbers as keys, split each cluster by randomly appending a minor number, so that keys take the form 1.1, 1.2, 1.3, etc. for cluster 1; 2.1, 2.2, 2.3, etc. for cluster 2.
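Key generation could look like the following sketch (the number of splits per cluster is an assumption for illustration):

```r
# Append a random minor number so one big cluster fans out
# across several reduce tasks.
nSplits   <- 3  # subkeys per cluster (an assumed tuning parameter)
clusterId <- 1
key <- paste(clusterId, sample.int(nSplits, 1), sep = ".")
key  # e.g. "1.2"
```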

ahfoss commented 8 years ago

This will be a fairly deep change to the current structure. It will involve:

  1. Initializing means as usual
  2. Calculating the number of observations to assign to each node (# obs / # nodes)
  3. Creating keys based on the cluster and the count from (2), using a counter in the mapping step
  4. Having the reduce step calculate sums and counts rather than means
  5. Aggregating the different keys within the same cluster in an intermediary step (see the sketch after this list)
  6. Changing the summary map-reduce step accordingly (a separate issue)
ahfoss commented 8 years ago

First tackle (4) and (5), then debug on real data. Next tackle (2) and (3), then debug on real data. Finally, tackle (6) and debug on real data.

ahfoss commented 8 years ago

Currently working on R/km_summary_intermediary.R. It needs to be extended to handle max vectors and the mean vector calculation. Then, modify kmeans.slurm to save the initial stats.tsv as "tmp.tsv" (or similar), and call km_summary_intermediary.R to generate the collapsed stats.tsv as originally defined.
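A hypothetical sketch of the collapsing step km_summary_intermediary.R would perform (the column names and file layout are assumptions): maxima are taken over per-subkey maxima, while means come from summing sums and counts.

```r
# Read per-subkey stats, recover the cluster id from keys like "1.2",
# and collapse each cluster's subkey rows into one summary row.
tmp <- read.delim("tmp.tsv", header = TRUE, colClasses = "character")
tmp[-1] <- lapply(tmp[-1], as.numeric)
tmp$cluster <- sub("\\..*$", "", tmp$key)
collapsed <- do.call(rbind, lapply(split(tmp, tmp$cluster), function(d) {
  data.frame(cluster = d$cluster[1],
             maxX1   = max(d$maxX1),                  # max of maxima
             meanX1  = sum(d$sumX1) / sum(d$count))   # divide at the end
}))
write.table(collapsed, "stats.tsv", sep = "\t",
            row.names = FALSE, quote = FALSE)
```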