IBM / taxinomitis

Source code for Machine Learning for Kids site
https://machinelearningforkids.co.uk
Apache License 2.0

"balanced" decision trees? #227

Open · kevinrobinson opened this issue 5 years ago

kevinrobinson commented 5 years ago

Should the decision trees default to `class_weight='balanced'`? I don't have much experience with scikit-learn, but reading the docs a bit, it seems like this might be good for use cases common to ML for Kids (e.g., small numbers of training examples, classes possibly not evenly represented).
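
For context, here's a minimal sketch of what I mean (assuming the models are scikit-learn `DecisionTreeClassifier`s; illustrative only, not code from this repo):

```python
from sklearn.tree import DecisionTreeClassifier

# Default: every training example carries the same weight, so frequent
# classes dominate the splits.
clf_default = DecisionTreeClassifier()  # class_weight=None

# 'balanced': each class is re-weighted inversely to its frequency in y,
# so rare classes count as much in total as common ones during training.
clf_balanced = DecisionTreeClassifier(class_weight='balanced')
```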

dalelane commented 5 years ago

I'm not sure.... I need to look into the implications of the setting, as it's not one I'm super familiar with.

But my initial reaction is that I'm wary of adding things that would fix problems auto-magically.

A common pattern in classes taught using ML for Kids is to start by creating biased and unbalanced training sets, letting students see the impact this has on ML models, then letting them experiment with fixing/improving the training data and see the impact this has on the model's predictions.

So if a setting like that would try to minimize the impact of an unbalanced training set, that would probably be unhelpful. I guess I have a slightly odd use case/requirement here, in that I'm not necessarily looking for the best possible model from the training data :-)
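
To make the trade-off concrete, here's a toy sketch (made-up `cat`/`dog` labels, not code from this project):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
# A deliberately unbalanced training set: 90 'cat' examples and
# 10 'dog' examples, with overlapping feature values.
X = np.concatenate([rng.normal(0.0, 1.0, 90),
                    rng.normal(1.0, 1.0, 10)]).reshape(-1, 1)
y = np.array(['cat'] * 90 + ['dog'] * 10)

# A depth-1 tree forces a single split, so the weighting visibly matters.
plain = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
weighted = DecisionTreeClassifier(max_depth=1, class_weight='balanced',
                                  random_state=0).fit(X, y)

probe = np.array([[1.0]])
print(plain.predict(probe))     # the majority 'cat' class tends to win here
print(weighted.predict(probe))  # up-weighting 'dog' tends to flip this
```

With the plain tree, students can see the skew that the unbalanced data causes; with `'balanced'`, much of that effect is hidden, which is exactly what the lesson relies on exposing.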

But like I say, that's just a superficial knee-jerk reaction to the description of the setting - I'll have a proper look, as it may well be that I'm misunderstanding its implications.

kevinrobinson commented 5 years ago

Thanks for sharing your thinking, this is super helpful! 👍

Yeah, this seems like it's sort of an inherent trade-off in making small models with only a few examples anyway. I don't have a strong understanding of this and would have to experiment more too, but I also think there's something to be said for just using the default scikit-learn behavior.

For this particular configuration, my limited understanding is that it's more of an "anyone using this should probably turn it on, and perhaps it should be the scikit-learn default" kind of setting. That's why I suggested it: it seems like ML for Kids use cases would expose the downsides of unbalanced training sets pretty frequently in normal use (aside from the pedagogy of "do the naive thing, see how it doesn't quite work, iterate and improve").
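
For reference, the scikit-learn docs describe `'balanced'` as weighting classes by `n_samples / (n_classes * count(class))`. A quick illustrative check (toy labels, not repo code):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array(['cat'] * 9 + ['dog'])  # 9 cats, 1 dog
weights = compute_class_weight('balanced', classes=np.unique(y), y=y)

# cat: 10 / (2 * 9) ~= 0.56, dog: 10 / (2 * 1) = 5.0
print(dict(zip(np.unique(y), weights)))
```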

In other work, the way I've thought about this is to try to represent the state-of-the-art approach as faithfully as possible, with the only constraints being small training sets, reduced training time, or other things that are genuinely necessary pedagogically. So if there are knobs to the training process that an experienced practitioner would turn (e.g., adaptive learning rates, augmentation on images, pre-training or transfer learning), then we should try to keep incorporating those over time. That way, the power of the examples or educational scenarios grows as those practices and tools become more available.

I'm not sure, those are just my thoughts. Thanks, as always, for sharing your thinking on all this complicated work! 😄