BIDData / BIDMach

CPU and GPU-accelerated Machine Learning Library
BSD 3-Clause "New" or "Revised" License

Random forest with low depth raises java.lang.IndexOutOfBoundsException #76


alexice commented 8 years ago

This is with version 1.0.3 (Mac OS X). I'm not sure whether RF is supposed to work at all yet. Do you have plans to release a new BIDMach version?

// data preparation
…

// wrap the feature matrix and the label vector in a matrix data source
val trainDopts = new MatDS.Options
val trainDS = new MatDS(Array(dataMatrix, y), trainDopts)

// build a random forest learner over the data source
val (model, opts) = RandomForest.learner(trainDS)

opts.depth = 1        // maximum tree depth
opts.ntrees = 10      // number of trees in the forest
opts.nnodes = 1000    // maximum nodes per tree
opts.nbits = 1        // bits of feature precision
opts.trace = 1        // trace output level
opts.batchSize = 1000 // minibatch size (columns)

model.train

Outputs:

pass= 0
purity gain 0.0000, fraction impure 1.000, nnew 0.0, nnodes 1.0
Time=0.2000 secs, gflops=1.04
java.lang.IndexOutOfBoundsException: 0
    scala.collection.mutable.ListBuffer.apply(ListBuffer.scala:126)
    BIDMach.Learner$.scores2FMat(Learner.scala:775)
    BIDMach.Learner.retrain(Learner.scala:122)
    BIDMach.Learner.train(Learner.scala:53)
jcanny commented 8 years ago

Hi. It doesn't look like the problem is related to Random Forests at all. BIDMach does on-the-fly cross-validation by withholding every k'th minibatch of data for evaluation; k is 11 by default (opts.evalStep). From the Learner output, I can see that it never performed an eval step (it prints the results of the eval steps at the nearest percentage point of data consumed). That causes an index-out-of-bounds exception on the empty list of eval results when the learner tries to save them on the first pass.

It looks like your dataset has fewer than 11 minibatches (fewer than 11,000 columns at batchSize = 1000). Either shrink the minibatch size, try a larger dataset, or reduce opts.evalStep to avoid this error.
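For example (a sketch only; the ~5000-column count below is hypothetical, but the option names are the ones described above):

// Hypothetical: with ~5000 training columns and batchSize = 1000 there are
// only 5 minibatches, so none is ever withheld when evalStep = 11.
opts.batchSize = 100 // now 5000 / 100 = 50 minibatches, so eval steps fire
// or keep the batch size and lower the eval interval instead:
opts.evalStep = 3    // withhold every 3rd minibatch for evaluation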

There are likely to be other problems though. You've specified nbits = 1, i.e. that your data has one bit of precision. For integer data, that's the low-order bit. For floating point, it's the leading bit (i.e. the sign bit). That's probably not what you want. Normally you specify the full precision for integer data, or say 16 for floating point, which gives you the exponent plus 8 bits of mantissa (that's the default).
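That is, something like (values illustrative; 32 for integer data is my reading of "full precision" for Int features):

opts.nbits = 16   // floating point data: exponent plus mantissa bits, per the note above (the default)
// opts.nbits = 32   // integer features at full precision (e.g. Int data)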

It's also a good idea to specify ncats, which is the number of labels (and they should be consecutive integers). If you leave it at the default (0), the code will guess the number by looking at the first minibatch. But if not all labels are present in that minibatch, it will get it wrong.
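e.g., assuming (hypothetically) ten classes labeled 0..9:

opts.ncats = 10   // number of distinct labels, so the guess from the first minibatch isn't needed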

Before throwing the error, the algorithm was unable to split any nodes with a purity gain (so nnew, the number of new nodes added per tree, was 0). That's not a good sign. By default the code will only split nodes when the children exhibit a purity gain > 0.01f (opts.gain). Normally this is a good heuristic to avoid (inefficient) random splits. But if your data is truly nasty (e.g. parity-like functions), you will need to do random splits to some depth before the nodes start favoring a single label. To deal with such data, set opts.gain = 0.
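i.e. (a one-line sketch):

opts.gain = 0f   // accept zero-purity-gain (random) splits; the default threshold is 0.01f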

I added an RF evaluation script called "testrf.ssc" in BIDMach/scripts, and put some similar code in the main test script (BIDMach/scripts/workout.ssc). It trains to 99% x-validated accuracy on the first 10 chunks of MNIST8M data in about a minute on a K40 GPU machine. You can look at the values it uses as typical settings for RFs.