AtheMathmo / rusty-machine

Machine Learning library for Rust
https://crates.io/crates/rusty-machine/
MIT License
1.25k stars 152 forks source link

Panic on KMeansClassifier::train #60

Closed andrewcsmith closed 8 years ago

andrewcsmith commented 8 years ago

I'm not sure quite what could be causing this. I'm attempting to cluster data with 2348 rows and 21 columns. Might be a bug in KMeans, or it might be an issue with my data, but it's just not clear where the error is from this backtrace. It doesn't fail all the time. With a smaller dataset it works just fine.

thread '<main>' panicked at 'Rng.gen_range called with low >= high', /Users/acsmith/.multirust/toolchains/nightly/cargo/registry/src/github.com-88ac128001ac3a9a/rand-0.3.14/src/lib.rs:466
stack backtrace:
   1:        0x10cdffa58 - std::sys::backtrace::tracing::imp::write::h9fb600083204ae7f
   2:        0x10ce02b25 - std::panicking::default_hook::_$u7b$$u7b$closure$u7d$$u7d$::hca543c34f11229ac
   3:        0x10ce0275e - std::panicking::default_hook::hc2c969e7453d080c
   4:        0x10cdf48b6 - std::sys_common::unwind::begin_unwind_inner::h30e12d15ce2b2e25
   5:        0x10cdd64f4 - std::sys_common::unwind::begin_unwind::h65390c63b501d6ed
   6:        0x10cdd7ea0 - _<learning..k_means..KMeansClassifier as learning..UnSupModel<linalg..matrix..Matrix<f64>, linalg..vector..Vector<usize>>>::train::hda9f4aa6ef437a10
   7:        0x10cd9ec4a - phoneme_morpher::run::hfaae9e912c91a493
   8:        0x10cd9c998 - phoneme_morpher::main::h44c66430a6737b02
   9:        0x10ce02352 - std::sys_common::unwind::try::try_fn::h04c0c89e4add6cc4
  10:        0x10cdfee3b - __rust_try
  11:        0x10ce02194 - std::rt::lang_start::h61f4934e780b4dfc
e
AtheMathmo commented 8 years ago

Thanks for reporting this - I can't see why this would be happening immediately. It is likely an issue with the way the data is being fed in (the models are currently fairly strict about the format - something which needs improving).

It could be one of the following:

If none of those are true it seems likely it is a bug (or at least a data-dependent issue I'd like to try and detect at run time). I'll try to figure it out - if you're happy to share the data that may help too!

EDIT: If you're not happy to share the data but don't mind digging into the source code I'd start here. If the sum of dist is less than 0f64 then this would cause the error you're seeing. Though again, that shouldn't be possible unless all of the centroids end up being on one of the data points (I believe).

No matter the cause - this definitely highlights the need for some better error handling.

AtheMathmo commented 8 years ago

This error should only occur within the initialization phase. I'm assuming that you're using KMeans++ (the default scheme). You could try using forgy or random partition instead to see if it fixes the panic.

I'll keep trying to find the bug in the mean time.

andrewcsmith commented 8 years ago

Here's the kmeans feature vector that's crashing. (Take off the .txt extension. That's just to trick github into accepting it.)

Looks like there quite a bit of -inf and NaN values in there. Not sure how that happened, but that's obviously my problem to fix. So it seems like it's not a bug, but just bad values...and it seems like a waste to check every f64 value, so perhaps just add this to the list of things that might go wrong.

kmeans_features.csv.txt

AtheMathmo commented 8 years ago

That does seem to be the cause of the error. As you say there is not much we can do about this at the model level (at least without some significant changes). I think a fairly easy way to improve the errors would be to have the initialization return Results - that should give us more detailed error reports.

We can also test for this particular error by checking that the distance sum mentioned above is_normal. If not we can assume something went wrong with the data.

I'll keep this issue open to track how he handle this.

AtheMathmo commented 8 years ago

@andrewcsmith after a long delay I finally got round to this. I have a PR open ( #83 ).

There is some other stuff included but it addresses the issues discussed here (partly). If you have the time to check it out that would be awesome. I know it's been a while so don't feel compelled!