ddbourgin / numpy-ml

Machine learning, in numpy
https://numpy-ml.readthedocs.io/
GNU General Public License v3.0
15.35k stars 3.72k forks source link

Question about dt scirpt #41

Closed 505555998 closed 4 years ago

505555998 commented 4 years ago

Hi ddbourgin, I learned a lot from your repo, thanks. When I read tree / dt.py script, I have some questions for you.

Code:

if self.depth >= self.max_depth:
    v = np.mean(Y, axis=0)
    if self.classifier:
        v = np.bincount(Y, minlength=self.n_classes) / len(Y)
    return Leaf(v)

N, M = X.shape
self.depth += 1
feat_idxs = np.random.choice(M, self.n_feats, replace=False)

# greedily select the best split according to `criterion`
feat, thresh = self._segment(X, Y, feat_idxs)
l = np.argwhere(X[:, feat] <= thresh).flatten()
r = np.argwhere(X[:, feat] > thresh).flatten()

# grow the children that result from the split
left = self._grow(X[l, :], Y[l])
right = self._grow(X[r, :], Y[r])
return Node(left, right, (feat, thresh))

When left meets the termination condition and becomes Leaf, self.depth will reach self.max_depth, so the right will always be Leaf. In this tree, the right node of each layer is a Leaf ,is correct ? In this case, there are still many samples on the right node that will no longer be split, I think this is a problem and will lead to less precise results.

Looking forward to your reply

ddbourgin commented 4 years ago

Hi @505555998 -- apologies for the delay! You're absolutely right, this is a bug. Thanks a ton for catching it! I think the proper thing to do would be to pass a depth argument to the _grow method rather than rely on the self.depth attribute. I'll push a fix momentarily!