Rambatino / CHAID

A python implementation of the common CHAID algorithm
Apache License 2.0

Add a new max_splits parameter; make a couple of fix updates #136

Closed: jihaekor closed this 10 months ago

jihaekor commented 10 months ago

This adds a new max_splits parameter, which forces the best split computation to keep merging categories (similar to the min_child_node_size parameter) until the number of split groups left is at most max_splits.

The basic use case for this is to limit the number of groups produced at each split decision, keeping the size of the trees manageable by merging the "most similar" categories together.
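To illustrate the idea, here is a minimal, self-contained sketch of a merge loop capped by max_splits. It is not the code from this PR: the function names, the (positives, negatives) count format, and the fixed chi-square threshold (a stand-in for alpha_merge) are all assumptions made for the example.

```python
from itertools import combinations


def chi2_2x2(a, b):
    """Pearson chi-square statistic for two groups of (positives, negatives) counts."""
    (a_pos, a_neg), (b_pos, b_neg) = a, b
    n = a_pos + a_neg + b_pos + b_neg
    row_a, row_b = a_pos + a_neg, b_pos + b_neg
    col_pos, col_neg = a_pos + b_pos, a_neg + b_neg
    if 0 in (row_a, row_b, col_pos, col_neg):
        return 0.0  # degenerate table: treat the groups as indistinguishable
    return n * (a_pos * b_neg - a_neg * b_pos) ** 2 / (row_a * row_b * col_pos * col_neg)


def merge_until_max_splits(counts, max_splits, merge_threshold=3.84):
    """Greedily merge the most similar groups (hypothetical helper).

    counts: dict mapping category label -> (positives, negatives)
    merge_threshold: assumed stand-in for alpha_merge; 3.84 is roughly the
        5% critical value of chi-square with one degree of freedom.
    """
    groups = {(label,): tuple(c) for label, c in counts.items()}
    while len(groups) > 1:
        # Find the pair of groups with the smallest chi-square statistic.
        (g1, g2), stat = min(
            ((pair, chi2_2x2(groups[pair[0]], groups[pair[1]]))
             for pair in combinations(groups, 2)),
            key=lambda item: item[1],
        )
        # Usual stop: the closest pair already differs significantly.
        # max_splits overrides that stop while too many groups remain.
        if stat >= merge_threshold and len(groups) <= max_splits:
            break
        merged = tuple(x + y for x, y in zip(groups.pop(g1), groups.pop(g2)))
        groups[g1 + g2] = merged
    return groups


counts = {"a": (10, 90), "b": (12, 88), "c": (50, 50), "d": (90, 10)}
print(merge_until_max_splits(counts, max_splits=4))  # stops on significance alone
print(merge_until_max_splits(counts, max_splits=2))  # forced down to two groups
```

In this toy data, "a" and "b" merge because they are statistically similar, while "c" and "d" merge only when max_splits=2 forces the loop to continue past the usual significance stop.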

As part of the change, I also added the ability to specify minimum node sizes as fractions, which must lie in the open interval (0, 1), i.e. both 0 and 1 are excluded.
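As a rough sketch of how a fractional minimum node size could be resolved to a row count: the helper name below is hypothetical, and rounding up is an assumption of this example, not necessarily what the PR does.

```python
import math


def resolve_min_node_size(value, n_rows):
    """Hypothetical helper: convert a minimum node size to an absolute row count.

    Integers are taken as row counts; floats strictly between 0 and 1 are
    taken as a fraction of n_rows. Rounding up is an assumption of this sketch.
    """
    if isinstance(value, int) and value >= 0:
        return value
    if isinstance(value, float) and 0.0 < value < 1.0:
        return math.ceil(value * n_rows)
    raise ValueError(f"expected a non-negative int or a float in (0, 1), got {value!r}")


print(resolve_min_node_size(30, 1000))   # an integer count is passed through unchanged
print(resolve_min_node_size(0.25, 200))  # a quarter of 200 rows
```

Because the interval is open, 0.0 and 1.0 are rejected here rather than being read as "no minimum" or "all rows".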

I also made a couple of fixes that I noticed:

I hope you find the new parameter useful enough to merge this PR.

On versioning: since this introduces a new parameter, I bumped the minor version rather than the patch version. Let me know if you would prefer a patch bump instead, and I can make that adjustment.

codecov[bot] commented 10 months ago

Codecov Report

Patch coverage: 96.15% and project coverage change: +0.17% :tada:

Comparison is base (b7c13b2) 93.11% compared to head (3700196) 93.29%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #136      +/-   ##
==========================================
+ Coverage   93.11%   93.29%   +0.17%
==========================================
  Files           8        8
  Lines         654      671      +17
==========================================
+ Hits          609      626      +17
  Misses         45       45
```

| [Files Changed](https://app.codecov.io/gh/Rambatino/CHAID/pull/136?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Mark+Ramotowski) | Coverage Δ | |
|---|---|---|
| [CHAID/stats.py](https://app.codecov.io/gh/Rambatino/CHAID/pull/136?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Mark+Ramotowski#diff-Q0hBSUQvc3RhdHMucHk=) | `96.66% <93.33%> (+0.15%)` | :arrow_up: |
| [CHAID/invalid\_split\_reason.py](https://app.codecov.io/gh/Rambatino/CHAID/pull/136?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Mark+Ramotowski#diff-Q0hBSUQvaW52YWxpZF9zcGxpdF9yZWFzb24ucHk=) | `100.00% <100.00%> (ø)` | |
| [CHAID/tree.py](https://app.codecov.io/gh/Rambatino/CHAID/pull/136?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Mark+Ramotowski#diff-Q0hBSUQvdHJlZS5weQ==) | `96.99% <100.00%> (+0.19%)` | :arrow_up: |


jihaekor commented 10 months ago

Hi @Rambatino - I wanted to ping you to see if you had interest in merging this PR. Thanks!

Rambatino commented 10 months ago

Hi @jihaekor, very sorry for being slow! Will review later today :)

Rambatino commented 10 months ago

Thank you, looks great!

jihaekor commented 10 months ago

Hi @Rambatino - Thank you! Was this pushed out to pypi as well?

Also, do you mind taking a quick look at the issue that I opened as well? I will be happy to submit a PR for it based on your thoughts.

Thanks!

Rambatino commented 10 months ago

Thanks @jihaekor, I think I commented. But feel free to push a PR to fix it, as to be honest I don't know the answer. It should be pushed to pypi; let me know if there are any issues :)