forestry-labs / Rforestry

https://forestry-labs.github.io/Rforestry/
34 stars 11 forks source link

sklearn compatibility #101

Closed petrovicboban closed 11 months ago

petrovicboban commented 1 year ago

@edwardwliu can you figure out why test is failing here? I've tried to do it, but it goes deep into logic (categorical feature mapping).

edwardwliu commented 1 year ago

@petrovicboban The issue was find_match() wasn't identifying matching values for different np.int and np.float64 types. I added a cast, but it's possible this could be handled in a cleaner manner. The specific test appears case to be passing now.

petrovicboban commented 1 year ago

@edwardwliu thanks! I thought the issue was in calling function, because the test was failing when arr_b was numpy array. In all other tests, it was list.

petrovicboban commented 1 year ago

@edwardwliu it seems that fit() should not change parameters which are set during __init__(), and that's exactly what we do. Here is test failure: https://github.com/forestry-labs/Rforestry/actions/runs/4774656551/jobs/8488479086

I guess we should skip setting those parameters in __init__() and set them in fit() only, but I'm not sure about implications of that.

edwardwliu commented 1 year ago

@petrovicboban I see, let's only use the following parameters in fit() and remove them from __init__() if overlapping:

petrovicboban commented 1 year ago

@edwardwliu it looks like we need to move these too:

        if self.max_depth is None:
            self.max_depth = round(nrow / 2) + 1

        if self.interaction_depth is None:
            self.interaction_depth = self.max_depth

        if self.max_obs is None:
            self.max_obs = y.size

because they depend on argument of fit()

edwardwliu commented 1 year ago

I see, yes let's move those parameters to fit() for now as well. Conceptually, the params max_depth and interaction_depth could be agnostic to the data provided in fit(), but the implementation in this package requires this change.

petrovicboban commented 1 year ago

@edwardwliu can you check current test failures? One is related to you recent change (groups) and one is from check_estimator()

edwardwliu commented 1 year ago

@petrovicboban Both of these test cases should be passing be now. The estimator test was because we do not currently support a data type scipy.sparse.csc_matrix. In addition, it may be useful to parameterize checks while I debug.

petrovicboban commented 1 year ago

@edwardwliu this looks good for now. It fails only on estimator's pickle check. Probably because saved_forest_ attribute is expected during the process but is added only after call to translate_tree. I've tried several solutions, but failed.

After we solve that, I need to improve __init__ parameters validation.

github-actions[bot] commented 1 year ago

☂️ Python Coverage

current status: ✅

Overall Coverage

Statements Covered Coverage Threshold Status
814 646 79% 60% 🟢

New Files

No new covered files...

Modified Files

File Coverage Status
Python/random_forestry/forestry.py 79% 🟢
Python/random_forestry/preprocessing.py 72% 🟢
Python/random_forestry/validators/init.py 100% 🟢
Python/random_forestry/validators/base_validator.py 91% 🟢
Python/random_forestry/validators/fit_validator.py 87% 🟢
Python/random_forestry/validators/predict_validator.py 89% 🟢
TOTAL 86% 🟢

**updated for commit: dc2031a

petrovicboban commented 1 year ago

@petrovicboban Both of these test cases should be passing be now. The estimator test was because we do not currently support a data type scipy.sparse.csc_matrix. In addition, it may be useful to parameterize checks while I debug.

It's passing with:

    def _more_tags(self):
        return {
            "_xfail_checks": {
                "check_estimators_pickle": "To be fixed later",
                "check_n_features_in": "To be fixed later",
                "check_estimators_nan_inf": "To be fixed later",
                "check_dtype_object": "To be fixed later",
            },
        }
Ilia-Shutov commented 11 months ago

New PR based on this one was created https://github.com/forestry-labs/Rforestry/pull/146