codinguncut closed this issue 5 years ago
The direct cause of the warning message is that the pass_y param to FunctionTransformer is deprecated in sklearn 0.19+, while we saved the pipeline (which has two function transformer steps) with sklearn 0.18 (see #3).
So I think fixing #13 would also fix this one. (It seems this one and #13 are actually different.)
The current pipeline is:
- html_to_item
- item_to_text
- CountVectorizer
- TfidfTransformer
- SGDClassifier
So I think we should remove the first two functional steps from the pipeline to fix the deprecation warnings for this ticket.
@lopuhin Am I right?
> So I think we should remove the first two functional steps from the pipeline to fix the deprecation warnings for this ticket.
That's one way to solve this, but in that case we'll need to call these steps manually when using the model, right? Another way would be to fix just the actual deprecation warnings, but still keep the first two steps in the pipeline. Do you think that's possible?
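If the two functional steps were dropped from the pickled pipeline, the manual calls could be hidden behind a thin wrapper, so callers would still pass raw HTML. A minimal sketch, assuming each preprocessing step is a plain function of a single document; `PreprocessedModel`, the `fake_*` functions and `KeywordModel` are hypothetical stand-ins for illustration, not code from this repo:

```python
from typing import Any, Callable, Iterable, List, Sequence


class PreprocessedModel:
    """Apply plain preprocessing functions, then delegate to a fitted model."""

    def __init__(self, steps: Sequence[Callable[[Any], Any]], model: Any) -> None:
        self.steps = list(steps)
        self.model = model  # anything exposing predict(list_of_texts)

    def _prepare(self, docs: Iterable[Any]) -> List[Any]:
        prepared = []
        for doc in docs:
            # Run every preprocessing step in order on each document.
            for step in self.steps:
                doc = step(doc)
            prepared.append(doc)
        return prepared

    def predict(self, docs: Iterable[Any]) -> Any:
        return self.model.predict(self._prepare(docs))


# Hypothetical stand-ins for the real html_to_item / item_to_text steps.
def fake_html_to_item(html: str) -> dict:
    return {"text": html.replace("<p>", " ").replace("</p>", " ")}


def fake_item_to_text(item: dict) -> str:
    return " ".join(item["text"].split())


class KeywordModel:
    """Stand-in for the remaining vectorizer + classifier pipeline."""

    def predict(self, texts: List[str]) -> List[bool]:
        return ["password" in text.lower() for text in texts]


wrapped = PreprocessedModel([fake_html_to_item, fake_item_to_text], KeywordModel())
predictions = wrapped.predict(["<p>Enter your password</p>", "<p>Latest news</p>"])
```

The pickled object would then contain only the scikit-learn estimators, while the preprocessing functions live in regular code that is imported at load time.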
We discussed another option with @lucywang000 - to store the model in a more scikit-learn version independent way, explicitly storing the model's coefficients, tfidf weights, vectorizer dict and hyperparameters. @kmike do you know by chance any useful library/code snippet which can help with that?
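The version-independent storage idea could look roughly like the sketch below: the fitted numbers go into JSON, and scoring is re-implemented in plain Python. This is only an illustration under several assumptions: a binary classifier, default vectorizer settings (lowercasing, unigram tokens of two or more characters), and l2-normalized tf-idf without sublinear tf. In scikit-learn the values would presumably come from `vocabulary_`, `idf_`, `coef_` and `intercept_`, converted to plain lists and numbers; the function names and the numbers here are made up.

```python
import json
import math
import re
from collections import Counter
from typing import Dict, List

# Same shape as the default CountVectorizer token pattern (2+ word chars).
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def export_params(vocabulary: Dict[str, int], idf: List[float],
                  coef: List[float], intercept: float) -> str:
    """Serialize the fitted numbers as JSON instead of pickling objects."""
    return json.dumps({
        "vocabulary": vocabulary,
        "idf": idf,
        "coef": coef,
        "intercept": intercept,
    })


def decision_function(text: str, params: dict) -> float:
    """Score one text: term counts -> l2-normalized tf-idf -> linear model."""
    vocabulary = params["vocabulary"]
    idf = params["idf"]
    coef = params["coef"]
    tokens = TOKEN_RE.findall(text.lower())
    # Count only tokens the vectorizer knows, keyed by feature index.
    counts = Counter(vocabulary[t] for t in tokens if t in vocabulary)
    weights = {i: n * idf[i] for i, n in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        # No known tokens: only the intercept contributes.
        return params["intercept"]
    return sum(coef[i] * w / norm for i, w in weights.items()) + params["intercept"]


# Made-up parameters, for illustration only.
saved = export_params(
    vocabulary={"login": 0, "password": 1, "news": 2},
    idf=[1.0, 1.0, 1.0],
    coef=[2.0, 2.0, -3.0],
    intercept=-0.5,
)
params = json.loads(saved)
login_score = decision_function("Login password", params)
news_score = decision_function("news news", params)
```

A positive score would map to the positive class. Any non-default vectorizer or tf-idf option would have to be mirrored by hand, which is the per-release maintenance cost of this approach.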
> do you know by chance any useful library/code snippet which can help with that?
There are limited solutions like https://github.com/jpmml/sklearn2pmml, though I'd prefer not to use them - likely they won't work out of box, and this particular package is AGPL-licensed.
Regarding saving only weights and hyperparameters - I think it may help, though it'll be less useful than with pytorch, tf, etc., because scikit-learn objects are more complex / less granular. There will still be work involved on each sklearn release, to make sure the saved parameters work with the new version, and that they're enough. An advantage is that re-training the model shouldn't be necessary in many cases - unless some pre-processing code changes.
The basic solution (re-train the model and use the latest sklearn release) doesn't look too bad to me either.
@lopuhin you could use some of the tools used for exporting models for deployment/production:
- https://github.com/jpmml/sklearn2pmml
- https://github.com/nok/sklearn-porter
- https://github.com/scikit-learn/scikit-learn/issues/10319#issuecomment-351902767
- http://nbviewer.jupyter.org/github/rasbt/python-machine-learning-book/blob/master/code/bonus/scikit-model-to-json.ipynb
- https://cmry.github.io/notes/serialize
Based on the discussion so far, it looks like we have the three options below, listed in order of increasing effort:
pass_y is deprecated in sklearn 0.19
My vote is option 2 to fix this quickly. What would you suggest? @lopuhin @kmike
@lucywang000 I think option 2 is good!
Btw, scikit-learn 0.20 was just released :) http://scikit-learn.org/stable/whats_new.html#version-0-20
I'd not go with (3) for now; 1 and 2 sound fine.
@lopuhin Let's close it since #16 is merged?
@lucywang000, right, closing it. Thanks for the fix! 👍
I very much appreciate this model. I tried to use it in a project (Python 3.5, scikit-learn 0.19.0), and got multiple warnings/errors:
Of course I could try to downgrade my sklearn, but it would be great if "at least" the unpickling would survive minor sklearn updates (see #13 ;)