genbattle / dkm

A generic C++11 k-means clustering implementation
MIT License
203 stars 46 forks source link

change random generator so it works on 32bit armhf #18

Closed thorstink closed 4 years ago

thorstink commented 4 years ago

fixes https://github.com/genbattle/dkm/issues/17

genbattle commented 4 years ago

Travis CI doesn't seem to be working anymore for this repo, but have you tried running the tests locally, this seems like a change that could affect some of the tests.

thorstink commented 4 years ago

I thought I did.. I’ll take a look.

Thomas

On 15 Jul 2020, at 01:29, Nick Sarten notifications@github.com wrote:

 Travis CI doesn't seem to be working anymore for this repo, but have you tried running the tests locally, this seems like a change that could affect some of the tests.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

thorstink commented 4 years ago

ok so I dug a bit deeper but I guess the hard-coded numbers in the test are from an experiment generated with the old (current) generator. Using different generators also shuffles cluster IDs.

Using std::mt19937 gives good clustering but different results for 2d dataset, but the Initial means picked correctly and K-means calculated correctly via Lloyds method tests fail. Also the Iris dataset tests have slightly varying means (but correct within approximation).

The different generator I proposed in this pull request succeeds the Iris tests, but fails clustering the 2d-dataset..

std::mt19937 (mindmap 0=2, 1=1 and 2=0...):

{ 2, 0, 0, 0, 1, 1, 1, 2, 2, 0, 0, 0, 0, 1, 2, 2, 1, } == 
{ 0, 2, 2, 2, 1, 1, 1, 0, 0, 2, 2, 2, 2, 1, 0, 0, 1, }

vs std::linear_congruential_engine<std::uint_fast32_t, 48271, 0, 2147483647>:

{ 0, 2, 2, 2, 1, 0, 1, 0, 0, 2, 2, 2, 2, 1, 0, 0, 1, } == 
{ 0, 2, 2, 2, 1, 1, 1, 0, 0, 2, 2, 2, 2, 1, 0, 0, 1, }

I guess the tests are a bit brittle. I hoped with this little fix I could contribute fix a tiny issue, but I don't have the time to investigate this further :( Good luck!

genbattle commented 4 years ago

Thanks for looking into it. The tests aren't supposed to be brittle if a consistent random seed and reproducible entropy engine are used. There may have been an existing test failure that crept into due to CI no longer running. At some point I'll convert the CI to GH Actions and fix the tests.

Thanks for your time anyway. Once I get some of my own I'll try and integrate this fix.