Closed siavash9000 closed 7 years ago
We don't have any immediate plans to upgrade it. IMO this is one of the major problems using sklearn, that serialization isn't supported in the basic API and therefore the pickled models are prone to breaking when upgrading versions.
Our training and test data is available at https://github.com/seomoz/dragnet_data if you wanted to retrain with version 0.18. I'd expect the performance to be similar if you use the same hyperparameter settings and models. We could then support both old and new versions of sklearn by including the existing models and new ones and checking sklearn.__version__
to determine which to load. Do you know if the current models work with sklearn version 0.17? I've only checked them up to 0.16 (as in the requirements.txt
).
ps -- we currently have several different model versions included in this release but only two are necessary to maintain going forward (the models loaded from kohlschuetter_weninger_readability_content_model.pickle.gz
and kohlschuetter_weninger_readability_content_comments_model.pickle.gz
)
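A minimal sketch of the version check suggested above, for illustration only. The function names and the two directory names are assumptions, not part of dragnet; in real use the argument would be sklearn.__version__.

```python
import re


def parse_version(version):
    """Return (major, minor) as ints from a string such as '0.18.1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    return int(match.group(1)), int(match.group(2))


def model_subdir(sklearn_version):
    """Pick the directory of pickled models that matches the installed sklearn.

    The directory names are hypothetical placeholders.
    """
    if parse_version(sklearn_version) >= (0, 18):
        return "sklearn_0.18"
    return "sklearn_pre_0.18"


# In real use: model_subdir(sklearn.__version__)
print(model_subdir("0.16.1"))
print(model_subdir("0.18.1"))
```

Comparing integer tuples rather than raw strings avoids the trap where "0.9" sorts after "0.18" lexicographically.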
k, thanks for the explanation. I will try to retrain with 0.18 and open a pull request.
Where can I find the hyperparameter settings, and which features should I use? I am currently trying to train by following the example from the README, but I am unsure which features to use. The README suggests using 'kohlschuetter' and 'weninger', but the resulting file is named 'kohlschuetter_weninger_content_model.pickle' instead of 'kohlschuetter_weninger_readability_content_model.pickle'.
Adding the feature 'readability' results in the following error:
AttributeError: 'cython_function_or_method' object has no attribute 'nfeatures'
in the init of NormalizedFeature.
The code for training:
from sklearn.ensemble import ExtraTreesClassifier
from dragnet.model_training import train_models
datadir = '/home/sefidrodi/src/dragnet_data'
outdir = '/home/sefidrodi/src/dragnet/dragnet/pickled_models/sklearn_0.18'
features_to_use = ['kohlschuetter', 'weninger', 'readability']
content_or_comments = 'both' # or 'content'
model = ExtraTreesClassifier(max_features=None, min_samples_leaf=75, n_jobs=4)
train_models(datadir, outdir, features_to_use, model, content_or_comments)
You can find the current state in my fork: https://github.com/siavash9000/dragnet
Just want to +1 this issue... I'd love to use dragnet
in a project, but the old version of scikit-learn
is incompatible with other dependencies.
Including the readability-based features was found to improve model performance, so we probably should have them, right? I'm happy to investigate the error, but it'll take a bit because I'm not familiar with the code. Thanks for the fork! :)
I think you are right regarding the readability-based features, but it is not clear to me how to add them without triggering the error mentioned above. I hope you can find out. I will also try to invest some time if possible.
I'll take a look at this later today if time permits. I thought the hyperparameters were documented in code but after looking around couldn't find them which means I'll need to dig them up elsewhere. Readability features do improve the performance for the content only case so we'd like to bring them along to the new sklearn version.
@siavash9000 do you still get the same AttributeError
when trying to load the readability features on your fork?
yes, I still get the error. there seems to be a problem with the class ReadabilityFeatures in readability.pyx. i am unfortunately not familiar enough with cython and the code to understand what the problem is.
It looks like the AttributeError
is explained here: https://github.com/seomoz/dragnet/blob/master/dragnet/readability.pyx#L72
Unfortunately, the cython fix of using a class staticmethod plus a class attribute doesn't seem to work, since NormalizedFeature(readability_features)
is unable to find the nfeatures
attribute as it does for NormalizedFeature(kohlschuetter_features)
.
I did a little test in a Jupyter notebook:
Would it be a pain to refactor features.NormalizedFeature
so that it takes nfeatures
as a separate input to __init__
? Then they could be imported from the readability and kohlschuetter features modules in whatever form works.
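To make the proposal concrete, here is a rough pure-Python sketch of a normalizer that receives nfeatures explicitly instead of reading it off the feature function. The class name ExplicitNormalizedFeature, the toy feature function, and the mean_std layout are all assumptions for illustration; this is not dragnet's actual NormalizedFeature.

```python
class ExplicitNormalizedFeature(object):
    """Hypothetical variant of NormalizedFeature that is told nfeatures."""

    def __init__(self, feature_func, nfeatures, mean_std=None):
        self._feature_func = feature_func
        # Passed in by the caller, so the wrapped function needs no attribute.
        self.nfeatures = nfeatures
        # Assumed layout: {'mean': [...], 'std': [...]}, one entry per feature.
        self._mean_std = mean_std

    def __call__(self, blocks):
        rows = self._feature_func(blocks)
        if self._mean_std is None:
            return rows
        mean, std = self._mean_std['mean'], self._mean_std['std']
        return [[(x - m) / s for x, m, s in zip(row, mean, std)]
                for row in rows]


def toy_features(blocks):
    """Stand-in for a compiled feature function with no nfeatures attribute."""
    return [[float(len(b)), float(len(b.split()))] for b in blocks]


normalized = ExplicitNormalizedFeature(
    toy_features, nfeatures=2,
    mean_std={'mean': [10.0, 2.0], 'std': [5.0, 1.0]})
result = normalized(["hello world", "a b c"])
print(result)
```

With this shape, it would not matter whether the wrapped callable is a plain function, a cython function, or a class instance.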
Hey @matt-peters , just checking in — have you had a chance to check this out? Would what I proposed last week be feasible? If not, what would you recommend doing?
Hey @bdewilde, did you test your proposed approach?
Haven't worked on it yet, was working on another project while waiting for a response from the code's maintainer. Will try to make some progress on this in the next week or so...
For updating to a new version of sklearn, the only public models we need to maintain in dragnet.models
are:
content_extractor
content_comments_extractor
content_and_content_comments_extractor
The monkey patching of the blockifiers is important to maintain for performance.
Here's a snippet to get and display the hyper-parameters for the trained sklearn models:
vagrant@vagrant-ubuntu-trusty-64:~$ ipython
Python 2.7.13 |Continuum Analytics, Inc.| (default, Dec 20 2016, 23:09:15)
Type "copyright", "credits" or "license" for more information.
IPython 5.1.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: from dragnet.models import content_extractor, \
...: content_comments_extractor
In [2]: content_extractor._block_model._skmodel.get_params()
Out[2]:
{'bootstrap': False,
'class_weight': None,
'criterion': 'gini',
'max_depth': None,
'max_features': None,
'max_leaf_nodes': None,
'min_samples_leaf': 75,
'min_samples_split': 2,
'min_weight_fraction_leaf': None,
'n_estimators': 10,
'n_jobs': 1,
'oob_score': False,
'random_state': None,
'verbose': 0,
'warm_start': None}
In [3]: type(content_extractor._block_model._skmodel)
Out[3]: sklearn.ensemble.forest.ExtraTreesClassifier
In [4]: content_comments_extractor._block_model._skmodel.get_params()
Out[4]:
{'bootstrap': False,
'class_weight': None,
'criterion': 'gini',
'max_depth': None,
'max_features': None,
'max_leaf_nodes': None,
'min_samples_leaf': 65,
'min_samples_split': 2,
'min_weight_fraction_leaf': None,
'n_estimators': 10,
'n_jobs': 1,
'oob_score': False,
'random_state': None,
'verbose': 0,
'warm_start': None}
In [5]: type(content_comments_extractor._block_model._skmodel)
Out[5]: sklearn.ensemble.forest.ExtraTreesClassifier
I expect the overall performance with different versions of sklearn should closely match the values with 0.16.1
reported here.
Unfortunately I didn't commit the final script to train/evaluate the models end-to-end but all the important pieces are in source control. In particular, dragnet.model_training.train_models
will train the models and dragnet.model_training.evaluate_models_tokens
will compute the F1 scores.
For both the content_extractor
and content_comments_extractor
I used ['kohlschuetter', 'weninger', 'readability']
as the features (features_to_use
argument to train_models
).
Hello! So, I've been trying to train new dragnet models, but am running into an error that doesn't make sense to me. Now that the readability.nfeatures issue is no longer a blocker (using the latest code in master), I instead hit a bug where the features matrix is not the same shape as the labels and weights vectors, which raises a ValueError
. Here's an example to reproduce:
from dragnet.model_training import train_models, evaluate_models_tokens
from sklearn.ensemble import ExtraTreesClassifier
datadir = '/path/to/dragnet_data'
output_dir = '/path/to/outputdir'
features = ['kohlschuetter', 'weninger', 'readability']
content_or_comments = 'content'
model = ExtraTreesClassifier(n_estimators=10, n_jobs=1, oob_score=False, bootstrap=False,
class_weight=None, criterion='gini', max_depth=None, max_features=None,
max_leaf_nodes=None, min_samples_leaf=75, min_samples_split=2,
min_weight_fraction_leaf=None, random_state=None, verbose=0, warm_start=None)
train_models(datadir, output_dir, features, model, content_or_comments=content_or_comments)
Output:
Reading the training and test data...
..done!
Got 966 training, 415 test documents
Initializing features
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-eb087a472072> in <module>()
5 min_weight_fraction_leaf=None, random_state=None, verbose=0, warm_start=None)
6
----> 7 train_models(datadir, output_dir, features, model, content_or_comments=content_or_comments)
/Users/burtondewilde/Desktop/github/dragnet/dragnet/model_training.py in train_models(datadir, output_dir, features_to_use, model, content_or_comments)
385 # initialize it
386 model_init = ContentExtractionModel(Blkr, [f], None)
--> 387 features, labels, weights = trainer.make_features_from_data(data, model_init, train=True)
388 mean_std = f.init_params(features)
389 f.set_params(mean_std)
/Users/burtondewilde/Desktop/github/dragnet/dragnet/model_training.py in make_features_from_data(self, data, model, training_or_test, return_blocks, train)
104
105 if features.shape[0] != len(labels) or len(labels) != len(weights):
--> 106 raise ValueError("Number of features, labels and weights do not match!")
107
108 if not return_blocks:
ValueError: Number of features, labels and weights do not match!
While iterating over the training html files, the ndarray shapes get out of whack — apparently, this happens at the 129th file. Here are the shapes before and after:
i = 127
len(blocks) = 69
training_features.shape = (69, 6)
features.shape = (15864, 6)
len(labels) = 15864
len(weights) = 15864
this_labels = [0 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
i = 128
len(blocks) = 63
training_features.shape = (63, 6)
features.shape = (15927, 6)
len(labels) = 15927
len(weights) = 15927
this_labels = [0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
i = 129
len(blocks) = 352
training_features.shape = (352, 6)
features.shape = (16279, 6)
len(labels) = 15928
len(weights) = 15928
this_labels = [0]
Somehow, only a single label and weight is being added for 352 blocks in this file. @matt-peters , do you have any idea why this might be? It happens regardless of features included, and I don't understand why. I realize this is only tangentially related to this issue, so would be happy to open a new one.
Strange indeed. My guess is something went wrong in the initial pre-processing step in parsing the HTML and creating the "block corrected" files. I'd spot check the file that is giving you the issues. What does the HTML and block corrected file look like? How many lines are in the block corrected file?
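One way to do that spot check across the whole data set is sketched below. It assumes the files live in a single directory and end in .block_corrected.txt, as in the paths quoted elsewhere in this thread; the function names are made up for this example.

```python
import os


def count_lines(path):
    """Count lines in binary mode so odd encodings cannot raise on decode."""
    with open(path, 'rb') as f:
        return sum(1 for _ in f)


def suspicious_block_files(block_dir, min_lines=2):
    """Return sorted (filename, line_count) pairs with fewer than min_lines."""
    flagged = []
    for name in sorted(os.listdir(block_dir)):
        if not name.endswith('.block_corrected.txt'):
            continue
        n = count_lines(os.path.join(block_dir, name))
        if n < min_lines:
            flagged.append((name, n))
    return flagged


# Example call (the path is a placeholder):
# print(suspicious_block_files('/path/to/dragnet_data/block_corrected'))
```

A file that collapses to a single line would show up immediately in the returned list.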
@matt-peters Sure enough, the block-corrected file for this particular html file is a single line and deeply messed up:
0.0 0.0 â ¼ä½ å å â ç æ ã ¾ç æ æ æ¹ ãµ æ â ã ¾æ æ ã ¾æ æ ã¹ ä â â ³æ½ ç ³ç ç ç â ³æ ç æ ç ç â æ½ æ½ â ²ç æ ²æ ç â ä¼ ç ç æ ç æ½ ç ²âµ³ä ä å æ½ ç ²ã ³ç ç æ ã ¾æ æ ç æ â ½ç ³æ ¹ç æ ç â ç æ â ½æ ç æ ç ³â ç æ â ½æ â½ ã ã ²ã ã â ã ç ¼ç ç â ç æ â ½æ ç æ ç ç ç ç â ç æ ²â ½æ ⽠㠳㠲㠵㠵ã â¼¼æ ³æ ²ç ã ¾æ ³æ ²ç ç ç ¹ãµ ç ç â½ æ ªæ æ ³æ ²ç â ç ³ãµ â¼ æ ã ã ã ã â ã ¾ç ç ç ã¹ ç ¼ç ç â ç æ â ½æ ç æ ç ç ç ç â ç ...
In contrast to a normal file:
0.0 0.0 Transplanted lungs didn t come from Colo victims despite reports Vitals
0.0 0.0 MSN Hotmail More Autos My MSN Video Careers Jobs Personals Weather Delish Quotes White Pages Games Real Estate Wonderwall Horoscopes Shopping Yellow Pages Local Edition Traffic Feedback Maps Directions Travel Full MSN Index Bing
...
No idea why or how this is. For what it's worth, the corresponding html file is 9.html in the dragnet_data repo, and it looks "fine" when rendered in a browser.
Unfortunately I can't reproduce this. I just now created a fresh vagrant box and processed 9.html
to make 9.block_corrected.txt
and it looks fine.
vagrant@vagrant-ubuntu-trusty-64:/vagrant/data/dragnet_data$ ipython
Python 2.7.13 |Continuum Analytics, Inc.| (default, Dec 20 2016, 23:09:15)
Type "copyright", "credits" or "license" for more information.
IPython 5.1.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: from dragnet.data_processing import extract_gold_standard
In [2]: extract_gold_standard('./', '9')
Got all tokens for 9. 1944 in all blocks, 453 in gold standard content
Got all tokens for 9. 1944 in all blocks, 0 in gold standard comments
In [3]: exit
vagrant@vagrant-ubuntu-trusty-64:/vagrant/data/dragnet_data$ head block_corrected/9.block_corrected.txt
0.0 0.0 NBC s Costas says he plans to honor Israelis Olympic sports NBC Sports
0.0 0.0 Skip navigation
0.0 0.0 NBC Sports
0.0 0.0 Site powered by nbcnews com
0.0 0.0 Latest news
0.0 0.0 NBCNews com Top NBCNews com headlines Phelps sets record for most Olympic medals
0.0 0.0 Olympics
0.0 0.0 Sections
0.0 0.0 OlympicTalk
0.0 0.0 Event schedules
vagrant@vagrant-ubuntu-trusty-64:/vagrant/data/dragnet_data$
Maybe try removing your block_corrected
directory and re-running extract_gold_standard_all_traning_data
?
Okay, I added a condition in the loop that iterates over files during training/testing that skips over files for which the array shapes don't match, and... everything worked! There are some usability issues, so maybe I'll rummage around and submit a PR. :)
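For anyone following along, the skip condition is roughly of this form. This is a simplified stand-alone sketch using plain lists, not the actual change to model_training.py; the function name and the tuple layout are assumptions.

```python
def collect_training_rows(per_file_data):
    """Accumulate features, labels and weights, skipping inconsistent files.

    per_file_data yields (file_id, feature_rows, labels, weights) tuples.
    Returns the three accumulated lists plus the ids of skipped files.
    """
    features, labels, weights, skipped = [], [], [], []
    for file_id, f_rows, f_labels, f_weights in per_file_data:
        # One label and one weight are expected per block (feature row).
        if not (len(f_rows) == len(f_labels) == len(f_weights)):
            skipped.append(file_id)
            continue
        features.extend(f_rows)
        labels.extend(f_labels)
        weights.extend(f_weights)
    return features, labels, weights, skipped


data = [
    ("8", [[0.1, 0.2], [0.3, 0.4]], [0, 1], [1.0, 1.0]),
    ("9", [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]], [0], [1.0]),  # mismatched
]
features, labels, weights, skipped = collect_training_rows(data)
print(len(features), len(labels), len(weights), skipped)
```

Returning the skipped ids makes it easy to report how many files were dropped, which matters when judging whether the workaround is hiding a larger data problem.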
The bigger problem — specific to this issue — is that there's no way to train new models on newer versions of scikit-learn
because upon import dragnet tries to load the existing models, but the old ExtraTreesClassifier
class is incompatible with newer versions:
/Users/burtondewilde/.pyenv/versions/2.7.10/envs/dragnet-py2/lib/python2.7/site-packages/sklearn/base.py:315: UserWarning: Trying to unpickle estimator ExtraTreeClassifier from version pre-0.18 when using version 0.18.1. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-2-466117d43890> in <module>()
1 import os
2
----> 3 from dragnet.model_training import train_models, evaluate_models_tokens
4 from sklearn.ensemble import ExtraTreesClassifier
/Users/burtondewilde/Desktop/github/dragnet/dragnet/__init__.py in <module>()
6 from dragnet.weninger import weninger_features_kmeans
7 from dragnet.readability import readability_features
----> 8 from dragnet.models import content_extractor, content_comments_extractor
9
10
/Users/burtondewilde/Desktop/github/dragnet/dragnet/models.py in <module>()
19 mode='rb') as fin:
20 if PY2 is True:
---> 21 content_extractor = pickle.load(fin)
22 else:
23 content_extractor = pickle.load(fin, encoding='bytes') # TODO: which encoding?
sklearn/tree/_tree.pyx in sklearn.tree._tree.Tree.__setstate__ (sklearn/tree/_tree.c:8128)()
KeyError: 'max_depth'
Any ideas? I'm worried that this will require a lot of refactoring...
Nice, glad you got it working! How many files did it have to skip? That's a little concerning though that I can't reproduce the issue with file 9.html
so I believe there is still a lurking problem.
As far as the import error -- is there any particular reason you need to import dragnet.models
when retraining? The sklearn models should only need to be deserialized if that is imported.
Sorry, should've mentioned: It only skipped that one file, but multiple times.
There's no reason that one would need to deserialize the models unless one intended to use them, of course. That said, the package isn't currently set up like that: see https://github.com/seomoz/dragnet/blob/master/dragnet/__init__.py#L8
I'm just not sure if this is going to be a rabbit hole or not. :)
Hey @matt-peters , after digging more deeply into the code, I'm trying to decide between a hacked solution and a more substantial refactor of the package that would 1. lazy-load models as needed, 2. rely more heavily on scikit-learn
for, say, feature normalization and model serialization, 3. be fully Py2/3 compatible and more flexible wrt versions of scikit-learn
, and 4. probably change the user-facing API to something along the lines of
from dragnet import ContentExtractor
content_extractor = ContentExtractor()
just_the_text = content_extractor(html_string)
Before (and if) I embark on the latter path, I'd love some feedback from you about these potential changes as well as clarification around the package's current structure, unused models and code, and the user-facing API. I'm happy to do that here, on another issue or PR, or via email (burtdewilde AT gmail.com), if you have some time. Let me know!
Can you expand a little more about why this is necessary for PY2/3 compatibility? This is the only potential reason that I can see that would make these changes worthwhile. I'm all for code cleanliness and agree the current API has some warts but time is probably better spent improving the model accuracy or improving performance in other ways.
There is a fair bit of code at Moz that depends on the current API so changing it in non-trivial ways is likely a no-go. Pending any potential blockers to support PY3 (which is crucial), my recommendation is to do the lazy loading / introspection on sklearn version when dragnet.models
is imported and leave everything else unchanged.
As for the import of the various models in __init__.py
: my inclination is to do one of two things:
SklearnVersionNotSupported
) that is raised if someone tries to import it without a supported version of sklearn. Then a user could explicitly trap the exception to use other parts of dragnet without the models if necessary (e.g. in your use case to train models).
The second is backward compatible of course and will be needed anyway in dragnet.models
since we'll only support some but not all versions of sklearn so that's my preference.
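A rough sketch of how the exception-based option might look. Only the exception name comes from the comment above; the set of supported versions and the helper names are illustrative assumptions, not a statement of what dragnet supports.

```python
class SklearnVersionNotSupported(Exception):
    """Raised when the installed sklearn cannot load the pickled models."""


# Illustrative only: (major, minor) pairs assumed to have matching models.
SUPPORTED_MINOR_VERSIONS = {(0, 16), (0, 17), (0, 18)}


def major_minor(version):
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def require_supported_sklearn(version):
    """Raise SklearnVersionNotSupported unless version is in the supported set."""
    if major_minor(version) not in SUPPORTED_MINOR_VERSIONS:
        raise SklearnVersionNotSupported(
            "no pickled models for sklearn %s" % version)
    return version


def can_load_models(version):
    """Caller-side pattern: trap the exception and carry on without models."""
    try:
        require_supported_sklearn(version)
    except SklearnVersionNotSupported:
        return False
    return True


print(can_load_models("0.18.1"))
print(can_load_models("0.25.0"))
```

A dedicated exception type lets callers who only want the training code catch exactly this failure without masking unrelated import errors.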
Hi @matt-peters , I'm in a bit of a bind because of the way the code is structured and the user-facing API you've established. Without some of the refactoring I alluded to yesterday, I think my only option is # 2 — raising and catching an exception upon failed model load so that somebody can import the rest of the package (for purposes of training a new model, or whatever) without having to load the models. I guess I'll do that, then try to make progress on training new models for newer versions of sklearn and, maybe, Python 3.
Okay, I've made progress: Content and content+comments models have been trained using sklearn <= 0.17.1 and >= 0.18.0, and they go in separate directories as @siavash9000 did in his fork. The package automatically loads the models appropriate for a user's env when possible, otherwise issues a UserWarning
and sets the default content_extractor
and content_comments_extractor
to None
. Before submitting a PR, I'd like to confirm that the performance is as expected, based on your benchmarks.
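In outline, the load-or-warn behaviour described here follows the pattern below. This is a simplified sketch with stand-in loaders, not the code from the eventual PR; load_or_none and both loader functions are hypothetical names.

```python
import warnings


def load_or_none(loader, name):
    """Call loader(); on any failure emit a UserWarning and return None."""
    try:
        return loader()
    except Exception as exc:
        warnings.warn(
            "Could not load %s (%s: %s); setting it to None"
            % (name, type(exc).__name__, exc),
            UserWarning)
        return None


def broken_loader():
    # Mimics the unpickling failure quoted earlier in the thread.
    raise KeyError('max_depth')


def working_loader():
    return "a model object"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    content_extractor = load_or_none(broken_loader, "content_extractor")
    content_comments_extractor = load_or_none(
        working_loader, "content_comments_extractor")

print(content_extractor, content_comments_extractor, len(caught))
```

Code that uses the extractors then needs a None check, but importing the rest of the package (for example, to train new models) no longer fails.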
Here are the token-level evaluation plots generated by model_training.evaluate_models_tokens()
, for content and content+comments:
That function also outputs "block errors", which are distinct from the previous evaluation metrics:
Training errors for final model (block level):
{'accuracy': 0.93372062632166708,
'auc': 0.7271612698021415,
'f1': 0.78510072801509956,
'precision': 0.82311606493821177,
'recall': 0.75044182621502209}
Test errors for final model (block level):
{'accuracy': 0.8482968456583232,
'auc': 0.4892596825186389,
'f1': 0.57430629669156874,
'precision': 0.62810037934053109,
'recall': 0.52899975423937085}
Training errors for final model (block level):
{'accuracy': 0.88429918977356425,
'auc': 0.7610870233498009,
'f1': 0.84736305932136979,
'precision': 0.86978764478764481,
'recall': 0.82606569900687543}
Test errors for final model (block level):
{'accuracy': 0.81810834581283132,
'auc': 0.6531392306943882,
'f1': 0.78203156155642906,
'precision': 0.81562685680332736,
'recall': 0.75109433136353687}
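As a quick sanity check on the numbers above, F1 is the harmonic mean of precision and recall, so each reported f1 can be recomputed from the other two values. The labels "first set" and "second set" are neutral placeholders, since the output itself does not say which model produced which block.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# (label, precision, recall, reported f1), copied from the output above.
reported = [
    ("first set, training", 0.82311606493821177, 0.75044182621502209, 0.78510072801509956),
    ("first set, test", 0.62810037934053109, 0.52899975423937085, 0.57430629669156874),
    ("second set, training", 0.86978764478764481, 0.82606569900687543, 0.84736305932136979),
    ("second set, test", 0.81562685680332736, 0.75109433136353687, 0.78203156155642906),
]

for label, p, r, f1 in reported:
    print("%s: recomputed %.6f, reported %.6f" % (label, f1_from(p, r), f1))
```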
Is all of this sufficiently in line with your previous results, @matt-peters ?
Great! Those F1 scores are actually a little higher than the ones I reported in that blog post. I wonder if there have been improvements in the sklearn classifier between versions that give them a boost? In any case, your description of the code changes sounds good. I'll look forward to the PR.
Resolved in #38
Hi,
I am using dragnet in combination with scikit-learn. I would like to use scikit-learn 0.18 together with dragnet, which is a little inconvenient, since dragnet depends on an older version of scikit-learn and has problems loading the pretrained models.
Do you plan an update?
Thanks in advance, Siavash Sefid Rodi