Shark-ML / Shark

The Shark Machine Learning Library. See more:
http://shark-ml.github.io/Shark/
GNU Lesser General Public License v3.0

How much data does CART need? #149

Closed Mike-Feng closed 7 years ago

Mike-Feng commented 7 years ago

CART training always fails; the train function throws an exception: CARTTrainer trainer; CARTClassifier<RealVector> model; trainer.train(model, dataTrain);

I found that it succeeds when there is a lot of data, but fails on very little data. I could not find the critical point. Where is it?

Ulfgard commented 7 years ago

what is the exception? maybe you found a bug

Best, Oswin

Mike-Feng commented 7 years ago

The exception is very general. I took several screenshots: exception.png shows the exception.

data.png shows how many data points are in the training data object.

The test data is in the attachment cart.zip: failedcsv.csv leads to the exception, successcsv.csv passes.

Thank you Oswin.

elehcim commented 7 years ago

I remember Shark having problems with datasets with one single entry. Is this the case?

Mike-Feng commented 7 years ago

If the data were really just one single entry, I would not be confused. Actually, it takes almost 10 entries to make it work. Maybe the problem is not the CART algorithm but the importCSV function; that is just a guess. If you are interested in this question, please try the attachment data from the reply above with the sample code here: http://image.diku.dk/shark/sphinx_pages/build/html/rest_sources/tutorials/algorithms/cart.html Thank you elehcim.

Ulfgard commented 7 years ago

Hi, the answer is simple: by default, the trainer internally uses 10-fold cross-validation to validate the optimal pruning of the tree. As you only have 9 data points, this will fail. The number of folds is governed by CARTTrainer::setNumberOfFolds (the default is 10). Note that using trees with so few points does not make sense.

I think we should add checks like that somewhere :)
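
For illustration, here is a minimal standalone sketch of the kind of check meant above. It does not use Shark and is not Shark's actual implementation; the helper names foldSizes, foldsAreUsable and clampFolds are hypothetical. It only shows why splitting 9 points into 10 folds leaves one fold empty, and how a caller might guard against that.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sizes of the folds when n points are split
// as evenly as possible into k folds (the first n % k folds get one extra).
std::vector<std::size_t> foldSizes(std::size_t n, std::size_t k) {
    assert(k > 0);
    std::vector<std::size_t> sizes(k, n / k);
    for (std::size_t i = 0; i < n % k; ++i) {
        ++sizes[i];
    }
    return sizes;
}

// Hypothetical check: every fold receives at least one point
// only if there are no more folds than points.
bool foldsAreUsable(std::size_t n, std::size_t k) {
    return k > 0 && k <= n;
}

// Hypothetical fallback: never request more folds than there are points.
std::size_t clampFolds(std::size_t n, std::size_t requested) {
    return std::min(n, requested);
}
```

With 9 points and 10 folds, foldSizes returns nine folds of size 1 and one empty fold, so foldsAreUsable(9, 10) is false. A caller could pass clampFolds(9, 10) to the CARTTrainer::setNumberOfFolds function mentioned above, although a tree trained on so few points is still unlikely to be meaningful.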