Closed alanocallaghan closed 5 months ago
I've found this lesson to work well with just two features, but I do play around with some of the parameters to demonstrate what is happening. These should be captured in the materials, so I'll try to make some updates to explain things more clearly.
What I mean is that if we're fitting a random forest to two variables, then I'd expect the feature subsampling to produce trees with one feature, otherwise it's just a regular tree ensemble
One of the nice things about dealing with only two variables is that we can demonstrate that this expectation is not true for random forests (at least for this particular implementation).
If it were true that passing `max_features=1` as an argument led to trees built on a single variable, we would not see the following trees (all of which make decisions based on both variables).
The explanation is that features are being limited at each split, not at the model level:
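A quick sketch of how to check this, using a hypothetical two-feature dataset (not the lesson's data): fit a forest with `max_features=1` and count how many trees end up splitting on both features. Because the single candidate feature is redrawn at every split, most trees still use both variables.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical two-feature dataset standing in for the lesson's data
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# max_features=1 limits candidates per *split*, not per tree
rf = RandomForestClassifier(n_estimators=50, max_features=1, random_state=0)
rf.fit(X, y)

# tree_.feature holds the split feature per node (negative values are leaves);
# count trees whose internal nodes use both of the two features
both = sum(
    len(set(t.tree_.feature[t.tree_.feature >= 0])) == 2
    for t in rf.estimators_
)
print(f"{both}/{len(rf.estimators_)} trees split on both features")
```

If feature subsampling happened at the model level instead, this count would be zero.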
Ah. In that case it'd be good to explain that in the lesson
@alanocallaghan Please could you take a look at https://github.com/carpentries-incubator/machine-learning-trees-python/pull/27 and let me know if this resolves the issue?
Note: `base_estimator` was renamed to `estimator` in recent scikit-learn (deprecated from version 1.2).
In the random forest page, we specify `max_features=1`, yet the decision boundaries are all bivariate. This makes for a very confusing introduction to random forests: https://carpentries-incubator.github.io/machine-learning-trees-python/06-random-forest/index.html