ahfoss / kamilaStreamingHadoop

k-means and KAMILA algorithms written for MyHadoop on a SLURM batch scheduler
GNU General Public License v3.0

Change initialization/reseeding strategy from uniform to sampling data points #8

Closed ahfoss closed 8 years ago

ahfoss commented 8 years ago

Rather than sampling from the entire distributed data set at once, create a subsampled data set using the Python script. Check whether the subsample file already exists before running the script, and re-subsample only when necessary. Then sample/resample initialization points from this subsampled file.
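A minimal sketch of the proposed flow, assuming the data is a line-per-point text file reachable on the local filesystem (the real job reads from Hadoop, so the I/O would differ). The names `ensure_subsample` and `sample_points`, the `rate` and `seed` parameters, and the Bernoulli line sampling are all hypothetical illustrations, not the repository's actual script:

```python
import os
import random


def ensure_subsample(source_path, subsample_path, rate=0.01, seed=0):
    """Write a subsample file unless one already exists.

    Returns True if a new file was written, False if the existing
    file was reused. (Hypothetical helper, not the repo's script.)
    """
    if os.path.exists(subsample_path):
        return False  # reuse the earlier subsample; do not re-subsample
    rng = random.Random(seed)
    with open(source_path) as src, open(subsample_path, "w") as dst:
        for line in src:
            # Keep each data point independently with probability `rate`.
            if rng.random() < rate:
                dst.write(line)
    return True


def sample_points(subsample_path, k, rng):
    """Draw k distinct data points (lines) from the subsample file."""
    with open(subsample_path) as f:
        rows = [line.rstrip("\n") for line in f if line.strip()]
    if k > len(rows):
        raise ValueError("subsample holds fewer than k points")
    return rng.sample(rows, k)
```

Initial centroids and any later reseeds can then both call `sample_points` on the same small file, so the full distributed data set is scanned at most once.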