I'm not quite sure what could be causing this. I'm attempting to cluster data with 2348 rows and 21 columns. It might be a bug in KMeans, or it might be an issue with my data, but it's not clear from this backtrace where the error is coming from. It doesn't fail all the time; with a smaller dataset it works just fine.
Thanks for reporting this - I can't immediately see why this would be happening. It is likely an issue with the way the data is being fed in (the models are currently fairly strict about the format - something which needs improving).
It could be one of the following:

- Setting `k` (the number of clusters) to 0 (seems unlikely as you said it works sometimes).
- The input data has zero rows (you rule this out in the description above).
- A mismatch between `k` and the clusters present in the training data.

If none of those are true it seems likely it is a bug (or at least a data-dependent issue I'd like to try and detect at run time). I'll try to figure it out - if you're happy to share the data that may help too!
EDIT: If you're not happy to share the data but don't mind digging into the source code I'd start here. If the sum of `dist` is less than `0f64` then this would cause the error you're seeing. Though again, that shouldn't be possible unless all of the centroids end up being on one of the data points (I believe).
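To illustrate the failure mode, here's a rough sketch (not the actual source, just the shape of the KMeans++ weighted draw) of why a zero, negative, or NaN distance sum would mean no candidate point ever gets picked:

```rust
/// Sketch of the KMeans++ weighted draw: pick the next centroid with
/// probability proportional to its distance weight. `u` is a uniform
/// sample in [0, 1).
fn weighted_pick(dist: &[f64], u: f64) -> Option<usize> {
    let total: f64 = dist.iter().sum();
    // If the inputs contain NaN or -inf, `total` can be NaN or <= 0,
    // in which case the cumulative threshold below is never crossed.
    if !(total > 0.0) {
        return None;
    }
    let mut acc = 0.0;
    for (i, d) in dist.iter().enumerate() {
        acc += d / total;
        if u < acc {
            return Some(i);
        }
    }
    None
}

fn main() {
    assert_eq!(weighted_pick(&[0.0, 2.0, 2.0], 0.6), Some(2));
    // All-zero distances (every centroid sits on a data point):
    assert_eq!(weighted_pick(&[0.0, 0.0], 0.5), None);
}
```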
No matter the cause - this definitely highlights the need for some better error handling.
This error should only occur within the initialization phase. I'm assuming that you're using KMeans++ (the default scheme). You could try using forgy or random partition instead (see the sketch below) to see if it fixes the panic.
I'll keep trying to find the bug in the meantime.
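For reference, switching initialization schemes should look something like this - the exact constructor may differ between versions, so check the docs for the one you're on:

```rust
extern crate rusty_machine;

use rusty_machine::learning::k_means::{Forgy, KMeansClassifier};
use rusty_machine::learning::UnSupModel;
use rusty_machine::linalg::Matrix;

fn main() {
    // Toy data: 4 samples, 2 features.
    let data = Matrix::new(4, 2, vec![1.0, 2.0, 1.1, 2.1, 8.0, 8.0, 8.1, 8.2]);

    // 2 clusters, at most 100 iterations, Forgy initialization
    // instead of the default KMeans++.
    let mut model = KMeansClassifier::new_specified(2, 100, Forgy);

    // Depending on the crate version, `train` may return a
    // `LearningResult` worth inspecting rather than `()`.
    let _ = model.train(&data);
}
```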
Here's the kmeans feature vector that's crashing. (Take off the .txt extension. That's just to trick github into accepting it.)
Looks like there are quite a few `-inf` and `NaN` values in there. Not sure how that happened, but that's obviously my problem to fix. So it seems like it's not a bug, just bad values... and it seems like a waste to check every `f64` value, so perhaps just add this to the list of things that might go wrong.
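In case anyone else hits this: a one-off pre-flight scan of the input is cheaper than checking every value inside the model. Just a sketch - the helper name and the flat row-major layout are my own assumptions:

```rust
/// Report the (row, col) positions of non-finite entries (NaN, +inf,
/// -inf) in a flat row-major buffer before handing it to KMeans.
fn non_finite_entries(data: &[f64], cols: usize) -> Vec<(usize, usize)> {
    data.iter()
        .enumerate()
        .filter(|&(_, v)| !v.is_finite())
        .map(|(i, _)| (i / cols, i % cols))
        .collect()
}

fn main() {
    let data = vec![1.0, f64::NAN, 3.0, f64::NEG_INFINITY];
    for (row, col) in non_finite_entries(&data, 2) {
        println!("non-finite value at row {}, col {}", row, col);
    }
}
```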
That does seem to be the cause of the error. As you say there is not much we can do about this at the model level (at least without some significant changes). I think a fairly easy way to improve the errors would be to have the initialization return `Result`s - that should give us more detailed error reports.
We can also test for this particular error by checking that the distance sum mentioned above `is_normal`. If not, we can assume something went wrong with the data.
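Something like this hypothetical check (not the current API, just the shape of it) - note that `f64::is_normal` already rejects zero, subnormal, infinite, and NaN values, so the extra sign test only rules out negative sums:

```rust
/// Hypothetical validation step for the initialization phase: reject a
/// distance sum that is zero, subnormal, negative, infinite, or NaN.
fn validate_dist_sum(dist_sum: f64) -> Result<f64, String> {
    if dist_sum.is_normal() && dist_sum > 0.0 {
        Ok(dist_sum)
    } else {
        Err(format!(
            "KMeans initialization failed: invalid distance sum ({})",
            dist_sum
        ))
    }
}

fn main() {
    assert!(validate_dist_sum(4.2).is_ok());
    assert!(validate_dist_sum(f64::NAN).is_err());
    assert!(validate_dist_sum(0.0).is_err());
}
```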
I'll keep this issue open to track how we handle this.
@andrewcsmith after a long delay I finally got round to this. I have a PR open (#83).
There is some other stuff included but it addresses the issues discussed here (partly). If you have the time to check it out that would be awesome. I know it's been a while so don't feel compelled!