TeamHG-Memex / soft404

A classifier for detecting soft 404 pages

soft404 doesn't work with scikit-learn 0.18+ #3

Closed · kmike closed this 7 years ago

kmike commented 7 years ago

For me, the model fails to load:

sklearn/tree/_tree.pyx:632: KeyError
------------------------------------------------------------ Captured stderr call -------------------------------------------------------------
/Users/kmike/envs/deepdeep/lib/python3.5/site-packages/sklearn/base.py:315: UserWarning: Trying to unpickle estimator SGDClassifier from version pre-0.18 when using version 0.18.1. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
/Users/kmike/envs/deepdeep/lib/python3.5/site-packages/sklearn/base.py:315: UserWarning: Trying to unpickle estimator LogOddsEstimator from version pre-0.18 when using version 0.18.1. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
____________________________________________________________ test_predict_function ____________________________________________________________

    def test_predict_function():
>       assert probability('<h1>page not found, oops</h1>') > 0.9

tests/test_predict.py:11: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
soft404/predict.py:43: in probability
    default_classifier = Soft404Classifier()
soft404/predict.py:15: in __init__
    vect_params, vect_vocab, text_clf, clf = joblib.load(filename)
../../envs/deepdeep/lib/python3.5/site-packages/sklearn/externals/joblib/numpy_pickle.py:573: in load
    return load_compatibility(fobj)
../../envs/deepdeep/lib/python3.5/site-packages/sklearn/externals/joblib/numpy_pickle_compat.py:226: in load_compatibility
    obj = unpickler.load()
/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/pickle.py:1039: in load
    dispatch[key[0]](self)
../../envs/deepdeep/lib/python3.5/site-packages/sklearn/externals/joblib/numpy_pickle_compat.py:177: in load_build
    Unpickler.load_build(self)
/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/pickle.py:1510: in load_build
    setstate(state)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   KeyError: 'max_depth'

I think it makes sense either to upgrade the model to scikit-learn 0.18.1, or to put the training corpus into the repository so that the model can be re-trained on the client.
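For context, a minimal, self-contained sketch (not the project's actual training pipeline; the toy data and file name are made up) of why the pickle breaks: joblib persists the estimator's internal state, which changed between pre-0.18 and 0.18+, so the bundle has to be dumped and loaded under the same scikit-learn version.

# Toy round-trip of a text classifier with joblib, roughly the kind of bundle
# soft404 loads in predict.py; NOT the project's real training code.
from sklearn.externals import joblib  # joblib as bundled with scikit-learn 0.18.x
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

texts = ['<h1>page not found, oops</h1>', '<h1>welcome to our homepage</h1>']
labels = [1, 0]  # 1 = soft 404, 0 = normal page

vect = HashingVectorizer()
clf = SGDClassifier(loss='log')  # log loss so probabilities are available
clf.fit(vect.transform(texts), labels)

# Dump and load must happen under the same scikit-learn version; otherwise the
# unpickled estimator state may not match (e.g. the KeyError: 'max_depth' above).
joblib.dump((vect, clf), 'soft404-toy.joblib')  # hypothetical file name
vect, clf = joblib.load('soft404-toy.joblib')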

lopuhin commented 7 years ago

The current corpus is too big (about 1 GB compressed), unfortunately. I'll check whether it can be made smaller, and I'll put it on S3 anyway. Re-training the model now; I'll put it in a branch at first.

kmike commented 7 years ago

Yeah, 1 GB is way too much. S3 may not be good as a long-term solution because it costs money; maybe we can use http://academictorrents.com/ or something like that? Someone still needs to seed, though.

kmike commented 7 years ago

Do you recall how long it takes to run a crawl and get a similar dataset?

lopuhin commented 7 years ago

Pushed the model in d066986

Do you recall how long it takes to run a crawl and get a similar dataset?

The dataset is 117,484 pages, so at 500 requests per minute it should take just under 4 hours (117,484 / 500 ≈ 235 minutes). But I have a note that crawling got much slower after some time due to scheduling issues I never solved, so the actual time was more than a day, I think.

lopuhin commented 7 years ago

It now works with scikit-learn 0.18+, although the model is still serialized with joblib - see issue #13 about that, and #12 about training a classifier from scratch.
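As a quick sanity check after upgrading, something along these lines should work; the import path follows the traceback above, where probability lives in soft404/predict.py, and the expected value mirrors the assertion in tests/test_predict.py.

from soft404.predict import probability

# The toy soft-404 page from tests/test_predict.py; the classifier should give
# it a high soft-404 probability (the test asserts > 0.9).
print(probability('<h1>page not found, oops</h1>'))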

kmike commented 7 years ago

@lopuhin if the problem with crawling speed is the usual "all requests returned by the scheduler are for the same domain, so we hit the downloader limits and do nothing", then something like https://github.com/TeamHG-Memex/linkdepth/blob/master/queues.py could help. To use it, set the 'scheduler_slot' request.meta key (like this: https://github.com/TeamHG-Memex/linkdepth/blob/b5c18819f61a25e586347c04c116bcabc44067af/linkdepth.py#L98) and tell Scrapy to use these custom queues:

SCHEDULER_PRIORITY_QUEUE='queues.RoundRobinPriorityQueue'
SCHEDULER_DISK_QUEUE='queues.DiskQueue'
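A rough sketch of the spider side, assuming the meta key is read as in the linked linkdepth example (the spider name and URLs here are made up):

from urllib.parse import urlparse

import scrapy


class Soft404CrawlSpider(scrapy.Spider):  # hypothetical spider
    name = 'soft404-crawl'
    start_urls = ['http://example.com/']

    def start_requests(self):
        for url in self.start_urls:
            request = scrapy.Request(url, callback=self.parse)
            # Group requests by domain so RoundRobinPriorityQueue can rotate
            # between domains instead of draining one domain at a time.
            request.meta['scheduler_slot'] = urlparse(url).netloc
            yield request

    def parse(self, response):
        pass  # extract pages / follow links here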

Another option is to use frontera; it uses a thing called OverusedBuffer to fight this issue.

lopuhin commented 7 years ago

Yes, I think that was the problem. Thanks for the pointers!