dragnet-org / dragnet

Just the facts -- web page content extraction
MIT License

Compatibility with scikit-learn > 0.21.3 #109

Open igorkurinnyi opened 2 years ago

igorkurinnyi commented 2 years ago

Hello,

Currently, dragnet is not compatible with scikit-learn > 0.21.3. I retrained the models with newer sklearn versions and tested which other sklearn versions can load the resulting pickles. The results are in the table below.

| Trained with version | Compatible with versions | Not compatible from version | Error |
| --- | --- | --- | --- |
| 1.0.1 | 1.0.1; 0.24.2 | 0.23.2 | `AttributeError: 'ExtraTreeClassifier' object has no attribute 'n_features_'` |
| 0.23.2 | 0.23.2; 0.22.1 | 0.21.3 | `ModuleNotFoundError: No module named 'sklearn.ensemble._forest'` |

Models were trained with: python 3.9, Cython==0.29.24, numpy==1.21.4, scipy==1.7.2
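
To compare a local setup against the environment listed above, a small sketch like the following may help. It only reports what is installed. The helper name `installed_versions` and the package list are illustrative assumptions, not part of dragnet.

```python
from importlib import metadata
import platform

# PyPI distribution names; the list mirrors the environment described above.
PACKAGES = ("scikit-learn", "Cython", "numpy", "scipy")


def installed_versions(names=PACKAGES):
    """Return a dict mapping each package name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("python", platform.python_version())
    for name, version in installed_versions().items():
        print(name, version or "not installed")
```

Posting this output alongside a failing load makes it easier to tell which row of the table applies.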

Is it possible to update the library with new models? I could help with a pull request.

ericluugg commented 2 days ago

Not sure it's possible to train the models on newer versions, but to load the pretrained models on higher scikit-learn versions you can redirect the imports so they don't look for nonexistent module paths. From there, re-pickle the model so you don't have to redirect the imports every time you want to load it.


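import sys

import joblib
from sklearn.ensemble import _forest
from sklearn.tree import _classes

# Point the old module paths stored in the pickle at their current locations.
sys.modules['sklearn.tree.tree'] = _classes
sys.modules['sklearn.ensemble.forest'] = _forest


def dump_310():
    # Load the old model through the aliases, then dump it again so the
    # new file references the current module paths.
    model = joblib.load("/old_model.pkl.gz")
    joblib.dump(model, "updated_model.pkl.gz", compress=3)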
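
For anyone wondering why the redirect works: a pickle stores the module path of each class, and unpickling looks that path up through `sys.modules`. Here is a minimal standard-library sketch of the same idea, with no sklearn involved. The module names `demo_pkg_old` and `demo_pkg_new` and the `Model` class are made up for illustration.

```python
import pickle
import sys
import types

# Hypothetical module names, used only for this demonstration.
OLD_NAME = "demo_pkg_old"
NEW_NAME = "demo_pkg_new"


class Model:
    def __init__(self, weight):
        self.weight = weight


# Pretend Model lived in the old module when it was pickled.
Model.__qualname__ = "Model"
Model.__module__ = OLD_NAME
old_module = types.ModuleType(OLD_NAME)
old_module.Model = Model
sys.modules[OLD_NAME] = old_module
payload = pickle.dumps(Model(3))

# Simulate a library upgrade: the old module is gone and the class has moved.
del sys.modules[OLD_NAME]
Model.__module__ = NEW_NAME
new_module = types.ModuleType(NEW_NAME)
new_module.Model = Model
sys.modules[NEW_NAME] = new_module

# Without an alias, the old path cannot be resolved.
load_failed = False
try:
    pickle.loads(payload)
except ModuleNotFoundError:
    load_failed = True

# Alias the old path to the new module, load, and pickle again.
sys.modules[OLD_NAME] = new_module
restored = pickle.loads(payload)
repickled = pickle.dumps(restored)

# The new pickle no longer needs the alias.
del sys.modules[OLD_NAME]
again = pickle.loads(repickled)
print(load_failed, restored.weight, again.weight)
```

The re-pickled bytes reference only the new module path, which is why the redirect is needed just once. This does not address the `n_features_` attribute error from the table, which comes from an attribute missing on the loaded object rather than from a moved module.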