forestry-labs / Rforestry

https://forestry-labs.github.io/Rforestry/
34 stars 10 forks source link

Compatibility with sklearn estimators #98

Open edwardwliu opened 1 year ago

edwardwliu commented 1 year ago

We're interested in compatibility with sklearn estimators (see instructions) for integration with other libraries like Ray. The goal would be to pass the check_estimator() check. A few specific items to respect requirements include: class __init__ parameters should be explicit and changing the newdata arg to x , adhered by subsequent methods.

petrovicboban commented 1 year ago

I have no Data Science skills and experience, so I need more details here. It's very likely I would need serious DS training to get into it, and I can't be productive in this domain yet.

edwardwliu commented 1 year ago

I see your concerns. I shared some additional context via Slack, but IIUC you don't have access or to email for the next two weeks, is that correct? This request stems from an exploration of sklearn compatibility by others that I've uploaded here. I think running check_estimator() in the example then addressing pure syntax changes could be an appropriate starting point. Let me know if you run into specific issues or continue to have concerns.

petrovicboban commented 1 year ago

I still have no idea of what's going on in that code and what the goal is. No idea what sklearn/ray is too, I have no exposure to ds frameworks at all. DS is not something I can jump in just like that, without any theoretical knowledge. I'm sorry but you'll have to reassign this task to someone else.

edwardwliu commented 1 year ago

Thanks for sharing the context. I shared some details earlier over Slack, but realize you may not have access. Let's chat tomorrow about the best path forward!

edwardwliu commented 1 year ago

Chatted about this, appreciate you taking the time to sync. Documenting some of the additional context discussed here.

The Python package was originally developed with the intention of natively following an interface called "scikit-learn", but deltas have been introduced over time. The interface defines some requirements for class initialization, function naming, and parameter. Adhering to this interface allows easy integration with many other Python libraries that can parallelize model training or improve performance. The fixes should be syntactical and be limited to appropriately extending class requirements. However, if any errors touch model logic, requires data science context, or if any questions arise at all, please tag me and I'll resolve those myself as soon as I can.

The starting step is to run and eventually pass the following check:

from random_forestry import RandomForest
from sklearn.utils.estimator_checks import check_estimator

check_estimator(RandomForest())

The checks can be found in estimator_checks.py() with details outlined in the documentation starting from this section and below. The bulk of the change should be the following:

Other changes may be class specific so as mentioned above, let's work closely on this if questions arise!

Ilia-Shutov commented 1 year ago

Created PR https://github.com/forestry-labs/Rforestry/pull/146