genbattle / dkm

A generic C++11 k-means clustering implementation
MIT License

[Feature Request] Please add a dkm::kmeans_lloyd overload that takes std::vector<std::vector<..>> #8

Closed: firesurfer closed this issue 5 years ago

firesurfer commented 5 years ago

Hi, thanks for your great library! Would it be possible for you to implement an overloaded version of the dkm::kmeans_lloyd method that accepts a std::vector<std::vector<T>> for the data instead of a std::vector<std::array<T, N>>? I think that would make this implementation much easier to use for code bases that need a dynamic number of dimensions (or at least you wouldn't have to pass that number as a template parameter, which is error-prone because it has to be changed in multiple places in the code).
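For illustration, the requested overload might look like the following declaration. This is a hypothetical sketch, not an existing dkm interface; the tuple of (means, labels) return shape is assumed to mirror the existing array-based dkm::kmeans_lloyd.

#include <cstdint>
#include <tuple>
#include <vector>

namespace dkm {
// Hypothetical overload (not part of dkm): the dimensionality is taken
// from the data at runtime instead of being a template parameter.
template <typename T>
std::tuple<std::vector<std::vector<T>>, std::vector<uint32_t>>
kmeans_lloyd(const std::vector<std::vector<T>>& data, uint32_t k);
}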

genbattle commented 5 years ago

The reason I chose std::vector<std::array<T, N>> is performance. std::vector<std::vector<T>> not only incurs more pointer indirection and heap allocation, but also means you can't be sure, without an explicit runtime check, that all of the data rows have the same dimensionality. Both of these would incur a significant performance hit.
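To make that runtime check concrete: with std::vector<std::vector<T>>, uniform row length has to be validated explicitly before clustering. A minimal sketch (rows_have_uniform_dimension is a hypothetical helper, not part of dkm):

#include <cstddef>
#include <vector>

// Hypothetical check that std::array<T, N> makes unnecessary: the type
// system already guarantees every row has exactly N elements.
template <typename T>
bool rows_have_uniform_dimension(const std::vector<std::vector<T>>& data) {
    if (data.empty()) {
        return true;
    }
    const std::size_t dim = data.front().size();
    for (const auto& row : data) {
        if (row.size() != dim) {
            return false;
        }
    }
    return true;
}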

This would be better as a helper function, implemented in dkm_utils.hpp or in your own codebase, that converts from one representation to the other (see the sketch below). If the performance hit doesn't pose a significant issue, you may instead want to fork your own version of this library and replace all instances of std::vector<std::array<T, N>> with std::vector<std::vector<T>>. Doing so will require changing or duplicating all functions in the dkm.hpp header.
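A minimal sketch of such a conversion helper, assuming the dimensionality N is known at compile time (to_fixed_rows is a hypothetical name; no such helper is confirmed to exist in dkm_utils.hpp):

#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copy dynamic rows into the fixed-size layout dkm expects.
template <typename T, std::size_t N>
std::vector<std::array<T, N>> to_fixed_rows(const std::vector<std::vector<T>>& data) {
    std::vector<std::array<T, N>> out;
    out.reserve(data.size());
    for (const auto& row : data) {
        assert(row.size() == N); // reject rows with the wrong dimensionality
        std::array<T, N> fixed;
        std::copy(row.begin(), row.end(), fixed.begin());
        out.push_back(fixed);
    }
    return out;
}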

Sorry, I'm not willing to implement the suggested interface in this library, as it will have too great an impact on performance.

firesurfer commented 5 years ago

Thanks for your answer. I made a fork in which I replaced all std::array instances with std::vector. See https://github.com/firesurfer/dkm

As you said, there is a performance impact (see the benchmarks below). Unfortunately, there is no generic way of converting a std::vector to a std::array without knowing the size at compile time.

I think most of the performance hit comes from these two lines I had to add in calculate_means, probably because allocating the std::vectors costs more:

size_t size = data[0].size(); // dimensionality has to be read from the data at runtime
std::vector<std::vector<T>> means(k, std::vector<T>(size, 0)); // one heap allocation for the outer vector plus one per mean
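For comparison, the fixed-size version can size each row at compile time, so the means need only a single contiguous allocation. A sketch of the corresponding line (not necessarily the exact dkm code; like the two lines above, it is a fragment with T, N, and k coming from the surrounding function):

std::vector<std::array<T, N>> means(k); // one heap allocation; each row lives inline in the vector's storage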

Benchmarks

std::vector version
# BEGINNING PROFILING #

## Dataset iris.data.csv ##
..............................
DKM: 0.0326307ms
DKM parallel: 4.31197ms
OpenCV: 0.451947ms

## Dataset s1.data.csv ##
..............................
DKM: 4.30267ms
DKM parallel: 1.83744ms
OpenCV: 2.24517ms

## Dataset birch3.data.csv ##
..............................
DKM: 2999.58ms
DKM parallel: 902.924ms
OpenCV: 543.945ms

## Dataset dim128.data.csv ##
....................
DKM: 17.2075ms
DKM parallel: 3.79244ms
OpenCV: ---

std::array version
# BEGINNING PROFILING #

## Dataset iris.data.csv ##
..............................
DKM: 0.021068ms
DKM parallel: 4.36142ms
OpenCV: 0.473351ms

## Dataset s1.data.csv ##
..............................
DKM: 2.47607ms
DKM parallel: 1.524ms
OpenCV: 2.21526ms

## Dataset birch3.data.csv ##
..............................
DKM: 2190.07ms
DKM parallel: 525.933ms
OpenCV: 549.029ms

## Dataset dim128.data.csv ##
....................
DKM: 18.0008ms
DKM parallel: 3.65933ms
OpenCV: ---