Feature Requests - Githubissues

xuyxu commented 3 years ago

This issue collects all features requests. Any one is welcomed to work on issues listed below, and do not forget to include your contributions and name in the CHANGELOG.rst.

If you want to work on a requested feature, please re-open the linked issue, and leave a comment below to let us know that you want to work on it.

New features

[x] CascadeForestRegressor class for regression problem (#4)
[ ] export_graphviz method on visualizing decision trees in deep forest (#12)
[ ] Export decision tree in DF21 to SHAP (#12)
[x] Internal label encoder on the class labels (#13)
[x] Better support on the input data (#19, #20)
[ ] GPU Support (#27)
[x] Support customized estimator in the layers (#29)
[ ] CascadeForestSurvAnalyzer class for survival analysis (#71)
[ ] Customized evaluation metrics in cascade forest (#98)

Python package

[ ] Build wheels for Python 2.7
[x] Set up CI on building wheels for Mac-OS (#6, #32)

New language wrappers:

[ ] C/C++ Interface (#9)

Fix

[ ] Model fix on extremely small datasets (#103)

tczhao commented 3 years ago

I will work on the #4 regressor task

xuyxu commented 3 years ago

I will work on the #4 regressor task

That would be really nice @tczhao! Adding the regressor requires many efforts, can you open a draft pull request and upload what you have done there? I am willing to take part in the development on this feature request and have some deeper discussions there.

In addition, here are some things that may be helpful to you:

For regression, the augmented features are out-of-bag predicted values from the cascade layer, which is unbounded features (In contrast, the augmented features for classification are bounded, i.e., the class vectors). This poses some problems if we want to use binning for acceleration, because the unbounded feature values after binning will be very sensitive to the boundary values.
Use the RandomForestRegressor and ExtraTreeRegressor from Scikit-Learn first. This version only includes the reduced version on classification trees. I am willing to optimize that for regression trees after we have a quick verification on the effectiveness on regression.

NiMaZi commented 3 years ago

I'm working on #13

tczhao commented 3 years ago

I will work on the #4 regressor task

That would be really nice @tczhao! Adding the regressor requires many efforts, can you open a draft pull request and upload what you have done there? I am willing to take part in the development on this feature request and have some deeper discussions there.

In addition, here are some things that may be helpful to you:

For regression, the augmented features are out-of-bag predicted values from the cascade layer, which is unbounded features (In contrast, the augmented features for classification are bounded, i.e., the class vectors). This poses some problems if we want to use binning for acceleration, because the unbounded feature values after binning will be very sensitive to the boundary values.

Use the RandomForestRegressor and ExtraTreeRegressor from Scikit-Learn first. This version only includes the reduced version on classification trees. I am willing to optimize that for regression trees after we have a quick verification on the effectiveness on regression.

Thanks, will have a draft ready in 2 days

tczhao commented 3 years ago

maybe we can skip the Build wheels for Python 2.7 since python 2.7 is no longer maintained since 2020-01

xuyxu commented 3 years ago

maybe we can skip the Build wheels for Python 2.7 since python 2.7 is no longer maintained since 2020-01

Wheels for Python 2.7 is not included in the CI on build wheels, I have created an individual branch for people of interests ;-)

EDIT: This is actually a feature request from several users in the industrial community, who told me that ver2.7 is still the most frequently used python version in their environment.

davidlkl commented 3 years ago

Hi,

Thanks @tczhao for the hard work!

Just would like to understand that if it would be sufficient to supply a custom loss by predictor_kwargs, (in other words, is there any other part in the CascadeForestRegressor using MSE as default?).

Thanks David

xuyxu commented 3 years ago

Hi,

Thanks @tczhao for the hard work!

Just would like to understand that if it would be sufficient to supply a custom loss by predictor_kwargs, (in other words, is there any other part in the CascadeForestRegressor using MSE as default?).

Thanks David

I think it is relatively easy to add the Mean Absolute Error (MAE), which is also available in Scikit-Learn. For custom loss functions, a new splitting criterion should be implemented for decision trees.

Maybe we can add another parameter to CascadeForestClassifier and CascadeForestRegression (e.g., criterion), which specifies the splitting criterion for decision trees in the model.

T-Allen-sudo commented 3 years ago

I will work on the package for Mac-OS (#6, #32)

xuyxu commented 3 years ago

I will work on the package for Mac-OS (#6, #32)

Thanks ;-). You may find the documentation on cibuildwheel helpful when working on the CI: build-wheels.

chendingyan commented 3 years ago

Hi @xuyxu , I found that in the current master branch, input y value will be checked by "deepforest.cascade._check_target_values". But when I input a sequence of integers as y value, it will be defined as "multiclass" instead of "continuous". In my point of view, y value in regression problem can be float number or integer number. It may cause big error in the future. The images is the example from sklearn.utils.multiclass function type_of_target.

xuyxu commented 3 years ago

Hi @chendingyan, I agree with you on this point, the current check may be too strict. Any idea on how to improve this?

chendingyan commented 3 years ago

Hi @xuyxu ,if you use "type_of_target" to check for input y values, I might add multiclass and multiclass-multioutput for univariate and multivariate regression, and also check the value in numpy array is numeric.

xuyxu commented 3 years ago

Hi @xuyxu ,if you use "type_of_target" to check for input y values, I might add multiclass and multiclass-multioutput for univariate and multivariate regression, and also check the value in numpy array is numeric.

That's a nice idea, and this should be easy to implement. I will appreciate it very much if you could contribute a PR for this enhancement ;-)

chendingyan commented 3 years ago

Hi @xuyxu ,if you use "type_of_target" to check for input y values, I might add multiclass and multiclass-multioutput for univariate and multivariate regression, and also check the value in numpy array is numeric.

That's a nice idea, and this should be easy to implement. I will appreciate it very much if you could contribute a PR for this enhancement ;-)

Submit a PR~

chendingyan commented 3 years ago

Hi @xuyxu , can you help me check my pr? How can I pass the code quality check?

xuyxu commented 3 years ago

Thanks for the PR @chendingyan, I will fix the code quality problem.

LAMDA-NJU / Deep-Forest

Feature Requests #14

New features

Python package

New language wrappers:

Fix