TeamHG-Memex / soft404

A classifier for detecting soft 404 pages

python3, sklearn >0.18 #15

Closed codinguncut closed 5 years ago

codinguncut commented 7 years ago

I much appreciate this model. I tried to use it in a project (python 3.5, scikit-learn 0.19.0), and got multiple warnings/errors:

/home/johannes/.virtualenvs/broadcrawl/lib/python3.5/site-packages/sklearn/base.py:312: UserWarning: Trying to unpickle estimator FunctionTransformer from version 0.18.1 when using version 0.19.0. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)

WARNING:py.warnings:/home/johannes/.virtualenvs/broadcrawl/lib/python3.5/site-packages/sklearn/preprocessing/_function_transformer.py:156: DeprecationWarning: The parameter pass_y is deprecated since 0.19 and will be removed in 0.21
  "will be removed in 0.21", DeprecationWarning)

Of course I could try to downgrade my sklearn, but it would be great if "at least" the unpickling would survive minor sklearn updates (see #13 ;)
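For anyone who wants a mismatch like this to fail loudly instead of scrolling past in the logs, here is a minimal sketch using only the standard library. The helper name `load_strict` is made up for illustration, and it assumes the version mismatch is reported through the `warnings` module, as in the traceback above.

```python
import pickle
import warnings


def load_strict(fileobj):
    """Unpickle from a binary file object, raising on any warning.

    Any warning emitted while unpickling (such as scikit-learn's
    version-mismatch UserWarning) becomes an exception instead of
    being printed and ignored.
    """
    with warnings.catch_warnings():
        # Escalate every warning raised inside this block to an error;
        # the previous filters are restored when the block exits.
        warnings.simplefilter("error")
        return pickle.load(fileobj)
```

Calling `load_strict(open(path, "rb"))` on a model saved with another scikit-learn version would then raise rather than return an estimator that may misbehave. It only detects the problem; it does not make the pickle portable.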

lucywang000 commented 6 years ago

The direct cause of the warning is that the pass_y parameter of FunctionTransformer is deprecated in sklearn 0.19+, while we saved the pipeline (which has two FunctionTransformer steps) with sklearn 0.18 (see #3).

So I think fixing #13 would also fix this one. (It seems this one and #13 are actually different.)

The current pipeline is:

- html_to_item
- item_to_text
- CountVectorizer
- TfidfTransformer
- SGDClassifier

So I think we should remove the first two FunctionTransformer steps from the pipeline to fix the deprecation warnings for this ticket.

@lopuhin Am I right?

lopuhin commented 6 years ago

> So I think we should remove the first two FunctionTransformer steps from the pipeline to fix the deprecation warnings for this ticket.

That's one way to solve this, but in that case we'll need to call these steps manually when using the model, right? Another option would be to fix just the actual deprecation warnings while keeping the first two steps in the pipeline. Do you think that's possible?
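To make "call these steps manually" concrete, here is a rough sketch of a thin wrapper that runs the preprocessing as ordinary functions and hands the result to the unpickled pipeline. The class name `PreprocessingWrapper` is hypothetical, and it assumes the remaining pipeline accepts a list of strings and exposes `predict_proba`; the real signatures of html_to_item and item_to_text may differ.

```python
class PreprocessingWrapper:
    """Run plain-Python preprocessing, then delegate to the pickled steps."""

    def __init__(self, pipeline, preprocessors):
        # `pipeline` is whatever was unpickled; here it is assumed to hold
        # only CountVectorizer -> TfidfTransformer -> SGDClassifier.
        self.pipeline = pipeline
        # Ordinary functions applied in order. They live in code,
        # so they are never pickled.
        self.preprocessors = list(preprocessors)

    def _prepare(self, html):
        value = html
        for step in self.preprocessors:
            value = step(value)
        return value

    def predict_proba(self, pages):
        # Preprocess each page by hand, then let the pipeline do the rest.
        texts = [self._prepare(html) for html in pages]
        return self.pipeline.predict_proba(texts)
```

Usage would be something like `PreprocessingWrapper(pipeline, [html_to_item, item_to_text])`, so callers still see a single object and no FunctionTransformer ends up in the pickle.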

lopuhin commented 6 years ago

We discussed another option with @lucywang000: storing the model in a way that does not depend on the scikit-learn version, by explicitly storing the model's coefficients, tf-idf weights, vectorizer dict and hyperparameters. @kmike, do you know by chance of any useful library or code snippet that can help with that?
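As a very rough sketch of what that could look like, the snippet below stores the parameters as JSON and re-implements scoring in plain Python. The function names are invented for illustration. It assumes default CountVectorizer tokenization (lowercasing, words of two or more characters), plain tf * idf with L2 normalization, and a binary linear classifier; the actual soft404 settings may differ, so treat this as an illustration and not a drop-in replacement.

```python
import json
import math
import re

# Default CountVectorizer token pattern: words of two or more characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def export_params(vocabulary, idf, coef, intercept):
    """Serialise the learned parameters as JSON (no pickled objects)."""
    return json.dumps({
        # token -> column index; cast so NumPy integers serialise cleanly
        "vocabulary": {tok: int(i) for tok, i in vocabulary.items()},
        "idf": [float(x) for x in idf],    # one weight per column
        "coef": [float(x) for x in coef],  # one weight per column
        "intercept": float(intercept),
    })


def score(params_json, text):
    """Linear decision value for `text`; positive favours the second class."""
    params = json.loads(params_json)
    vocab, idf, coef = params["vocabulary"], params["idf"], params["coef"]

    # Term counts over the known vocabulary; unknown tokens are ignored.
    counts = {}
    for token in TOKEN_RE.findall(text.lower()):
        index = vocab.get(token)
        if index is not None:
            counts[index] = counts.get(index, 0) + 1

    # tf * idf, then L2 normalisation of the row (empty rows stay at zero).
    weights = {i: n * idf[i] for i, n in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0

    return sum(coef[i] * w / norm for i, w in weights.items()) + params["intercept"]
```

With a fitted pipeline the inputs would come from `vocabulary_` on the vectorizer, `idf_` on the tf-idf step, and `coef_[0]` / `intercept_[0]` on the classifier. Hyperparameters that change tokenization or weighting would need to be stored and honoured too.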

kmike commented 6 years ago

> do you know by chance any useful library/code snippet which can help with that?

There are limited solutions like https://github.com/jpmml/sklearn2pmml, though I'd prefer not to use them - they likely won't work out of the box, and this particular package is AGPL-licensed.

Regarding saving only weights and hyperparameters - I think it may help, though it'll be less useful than with pytorch, tf, etc., because scikit-learn objects are more complex / less granular. There will still be work involved on each sklearn release to make sure the saved parameters work with the new version and that they're sufficient. An advantage is that re-training the model shouldn't be necessary in many cases, unless some pre-processing code changes.

The basic solution (re-train the model and use the latest sklearn release) doesn't look too bad to me either.

codinguncut commented 6 years ago

@lopuhin you could use some of the tools used for exporting models for deployment/production:

- https://github.com/jpmml/sklearn2pmml
- https://github.com/nok/sklearn-porter
- https://github.com/scikit-learn/scikit-learn/issues/10319#issuecomment-351902767
- http://nbviewer.jupyter.org/github/rasbt/python-machine-learning-book/blob/master/code/bonus/scikit-model-to-json.ipynb
- https://cmry.github.io/notes/serialize

lucywang000 commented 6 years ago

Based on the discussion so far, it looks like we have the three options below, listed in order of increasing effort:

  1. re-train the model with sklearn 0.19+: simplest, though new problems may occur with sklearn 0.21, 0.22, etc. in the future
  2. remove the preprocessing steps (html_to_item & item_to_text) from the pipeline: the warning "The parameter pass_y is deprecated since 0.19 and will be removed in 0.21" is printed precisely because html_to_item/item_to_text use FunctionTransformer, whose pass_y parameter is deprecated in sklearn 0.19
  3. pickle only the hyperparams/vocab/weights, with custom code or with external tools like those mentioned by @codinguncut above. This may require considerable effort.

My vote is option 2 to fix this quickly. What would you suggest? @lopuhin @kmike

lopuhin commented 6 years ago

@lucywang000 I think option 2 is good!

lopuhin commented 6 years ago

Btw, http://scikit-learn.org/stable/whats_new.html#version-0-20 scikit-learn 0.20 was just released :)

kmike commented 6 years ago

I'd not go with (3) for now; 1 and 2 sound fine.

lucywang000 commented 5 years ago

@lopuhin Let's close it since #16 is merged?

lopuhin commented 5 years ago

@lucywang000, right, closing it. Thanks for the fix! 👍