dragnet-org / dragnet

Just the facts -- web page content extraction

scikit-learn update #33

Closed siavash9000 closed 7 years ago

siavash9000 commented 8 years ago

Hi,

I am using dragnet in combination with scikit-learn, and I would like to use scikit-learn 0.18 together with dragnet. This is a bit inconvenient, since dragnet depends on an older version of scikit-learn and has problems loading the pretrained models.

Do you plan an update?

Thanks in advance, Siavash Sefid Rodi

matt-peters commented 8 years ago

We don't have any immediate plans to upgrade it. IMO this is one of the major problems with using sklearn: serialization isn't supported in the basic API, so the pickled models are prone to breaking when upgrading versions.

Our training and test data is available at https://github.com/seomoz/dragnet_data if you wanted to retrain with version 0.18. I'd expect the performance to be similar if you use the same hyperparameter settings and models. We could then support both old and new versions of sklearn by including the existing models and new ones and checking sklearn.__version__ to determine which to load. Do you know if the current models work with sklearn version 0.17? I've only checked them up to 0.16 (as in the requirements.txt).
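A minimal sketch of that version check (the pickled-model directory names here are hypothetical):

import sklearn

# parse major/minor so that e.g. '0.9' doesn't compare greater than '0.18'
SKLEARN_VERSION = tuple(int(x) for x in sklearn.__version__.split('.')[:2])
if SKLEARN_VERSION >= (0, 18):
    model_dir = 'pickled_models/sklearn_0.18'  # hypothetical directory name
else:
    model_dir = 'pickled_models/sklearn_0.16'  # hypothetical directory name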

matt-peters commented 8 years ago

PS -- we currently have several different model versions included in this release, but only two need to be maintained going forward (the models loaded from kohlschuetter_weninger_readability_content_model.pickle.gz and kohlschuetter_weninger_readability_content_comments_model.pickle.gz).

siavash9000 commented 8 years ago

k, thanks for the explanation. I will try to retrain with 0.18 and open a pull request.

siavash9000 commented 8 years ago

Where can I find the hyperparameter settings, and which features should I use? I am currently trying to train, following the example from the README, but I'm unsure which features to use. The README suggests using 'kohlschuetter' and 'weninger', but the resulting file is named 'kohlschuetter_weninger_content_model.pickle' instead of 'kohlschuetter_weninger_readability_content_model.pickle'.

Adding the feature 'readability' results in the following error:

AttributeError: 'cython_function_or_method' object has no attribute 'nfeatures'

in the init of NormalizedFeature.

The code for training:

from sklearn.ensemble import ExtraTreesClassifier
from dragnet.model_training import train_models

datadir = '/home/sefidrodi/src/dragnet_data'
outdir = '/home/sefidrodi/src/dragnet/dragnet/pickled_models/sklearn_0.18'
features_to_use = ['kohlschuetter', 'weninger', 'readability']
content_or_comments = 'both'   # or 'content'

model = ExtraTreesClassifier(max_features=None, min_samples_leaf=75, n_jobs=4)
train_models(datadir, outdir, features_to_use, model, content_or_comments)

You can find the current state in my fork: https://github.com/siavash9000/dragnet

bdewilde commented 8 years ago

Just want to +1 this issue... I'd love to use dragnet in a project, but the old version of scikit-learn is incompatible with other dependencies.

Including the readability-based features was found to improve model performance, so we probably should have them, right? I'm happy to investigate the error, but it'll take a bit because I'm not familiar with the code. Thanks for the fork! :)

siavash9000 commented 8 years ago

I think you are right regarding the readability-based features, but it is not clear to me how to add them without provoking the mentioned error. Hope you can find out. I will also try to invest some time if possible.

matt-peters commented 8 years ago

I'll take a look at this later today if time permits. I thought the hyperparameters were documented in the code, but after looking around I couldn't find them, which means I'll need to dig them up elsewhere. Readability features do improve the performance for the content-only case, so we'd like to bring them along to the new sklearn version.

@siavash9000 do you still get the same AttributeError when trying to load the readability features on your fork?

siavash9000 commented 8 years ago

Yes, I still get the error. There seems to be a problem with the class ReadabilityFeatures in readability.pyx. Unfortunately, I am not familiar enough with cython and the code to understand what the problem is.

bdewilde commented 8 years ago

It looks like the AttributeError is explained here: https://github.com/seomoz/dragnet/blob/master/dragnet/readability.pyx#L72

Unfortunately, the cython fix of using a class staticmethod plus a class attribute doesn't seem to work, since NormalizedFeature(readability_features) is unable to find the nfeatures attribute as it does for NormalizedFeature(kohlschuetter_features).
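The failing pattern is roughly this (a paraphrase, not the actual source):

# NormalizedFeature reads nfeatures off the feature callable in its __init__
class NormalizedFeature(object):
    def __init__(self, feature, mean_std=None):
        self._feature = feature
        self._nfeatures = feature.nfeatures  # AttributeError raised here

# a plain Python function can carry the attribute, e.g.
#     kohlschuetter_features.nfeatures = 6
# but attributes can't be set on a compiled cython function, so
# readability_features has no nfeatures for NormalizedFeature to find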

bdewilde commented 8 years ago

I did a little test in a Jupyter notebook:

[screenshot: Jupyter notebook test]

Would it be a pain to refactor features.NormalizedFeature so that it takes nfeatures as a separate input to __init__? Then the features could be imported from the readability and kohlschuetter modules in whatever form works.
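Something like this hypothetical signature is what I have in mind:

# hypothetical refactor: nfeatures becomes an explicit argument
class NormalizedFeature(object):
    def __init__(self, feature, nfeatures, mean_std=None):
        self._feature = feature
        self._nfeatures = nfeatures
        self._mean_std = mean_std

# then both of these work, whether or not the callable carries attributes:
# NormalizedFeature(kohlschuetter_features, nfeatures=6)
# NormalizedFeature(readability_features, nfeatures=...)  # readability's count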

bdewilde commented 7 years ago

Hey @matt-peters , just checking in — have you had a chance to check this out? Would what I proposed last week be feasible? If not, what would you recommend doing?

siavash9000 commented 7 years ago

Hey @bdewilde, did you test your proposed approach?

bdewilde commented 7 years ago

Haven't worked on it yet, was working on another project while waiting for a response from the code's maintainer. Will try to make some progress on this in the next week or so...

matt-peters commented 7 years ago

For updating to a new version of sklearn, the only public models we need to maintain in dragnet.models are content_extractor and content_comments_extractor.

The monkey patching of the blockifiers is important to maintain for performance.

Here's a snippet to get and display the hyper-parameters for the trained sklearn models:

vagrant@vagrant-ubuntu-trusty-64:~$ ipython
Python 2.7.13 |Continuum Analytics, Inc.| (default, Dec 20 2016, 23:09:15)
Type "copyright", "credits" or "license" for more information.

IPython 5.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from dragnet.models import content_extractor, \
   ...:     content_comments_extractor

In [2]: content_extractor._block_model._skmodel.get_params()
Out[2]: 
{'bootstrap': False,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_samples_leaf': 75,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': None,
 'n_estimators': 10,
 'n_jobs': 1,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': None}

In [3]: type(content_extractor._block_model._skmodel)
Out[3]: sklearn.ensemble.forest.ExtraTreesClassifier

In [4]: content_comments_extractor._block_model._skmodel.get_params()
Out[4]: 
{'bootstrap': False,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_samples_leaf': 65,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': None,
 'n_estimators': 10,
 'n_jobs': 1,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': None}

In [5]: type(content_comments_extractor._block_model._skmodel)
Out[5]: sklearn.ensemble.forest.ExtraTreesClassifier

I expect the overall performance with different versions of sklearn should closely match the values with 0.16.1 reported here.

Unfortunately I didn't commit the final script to train/evaluate the models end-to-end but all the important pieces are in source control. In particular, dragnet.model_training.train_models will train the models and dragnet.model_training.evaluate_models_tokens will compute the F1 scores.

For both the content_extractor and content_comments_extractor I used ['kohlschuetter', 'weninger', 'readability'] as the features (features_to_use argument to train_models).

bdewilde commented 7 years ago

Hello! So, I've been trying to train new dragnet models, but am running into an error that doesn't make sense to me. Now that the readability.nfeatures issue is no longer a blocker (using the latest code in master), I instead hit a bug where the features matrix is not the same shape as the labels and weights vectors, which raises a ValueError. Here's an example to reproduce:

from dragnet.model_training import train_models, evaluate_models_tokens
from sklearn.ensemble import ExtraTreesClassifier

datadir = '/path/to/dragnet_data'
output_dir = '/path/to/outputdir'
features = ['kohlschuetter', 'weninger', 'readability']

content_or_comments = 'content'
# hyperparameters copied from the get_params() output posted above
# (note: sklearn's own defaults for min_weight_fraction_leaf and warm_start
# are 0.0 and False, not None)
model = ExtraTreesClassifier(n_estimators=10, n_jobs=1, oob_score=False, bootstrap=False,
                             class_weight=None, criterion='gini', max_depth=None, max_features=None,
                             max_leaf_nodes=None, min_samples_leaf=75, min_samples_split=2,
                             min_weight_fraction_leaf=None, random_state=None, verbose=0, warm_start=None)

train_models(datadir, output_dir, features, model, content_or_comments=content_or_comments)

Output:

Reading the training and test data...
..done!
Got 966 training, 415 test documents
Initializing features
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-eb087a472072> in <module>()
      5                              min_weight_fraction_leaf=None, random_state=None, verbose=0, warm_start=None)
      6 
----> 7 train_models(datadir, output_dir, features, model, content_or_comments=content_or_comments)

/Users/burtondewilde/Desktop/github/dragnet/dragnet/model_training.py in train_models(datadir, output_dir, features_to_use, model, content_or_comments)
    385             # initialize it
    386             model_init = ContentExtractionModel(Blkr, [f], None)
--> 387             features, labels, weights = trainer.make_features_from_data(data, model_init, train=True)
    388             mean_std = f.init_params(features)
    389             f.set_params(mean_std)

/Users/burtondewilde/Desktop/github/dragnet/dragnet/model_training.py in make_features_from_data(self, data, model, training_or_test, return_blocks, train)
    104 
    105         if features.shape[0] != len(labels) or len(labels) != len(weights):
--> 106             raise ValueError("Number of features, labels and weights do not match!")
    107 
    108         if not return_blocks:

ValueError: Number of features, labels and weights do not match!

While iterating over the training html files, the ndarray shapes get out of whack — apparently, this happens at the 129th file. Here are the shapes before and after:

i = 127
len(blocks) = 69
training_features.shape = (69, 6)
features.shape = (15864, 6)
len(labels) = 15864
len(weights) = 15864
this_labels = [0 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

i = 128
len(blocks) = 63
training_features.shape = (63, 6)
features.shape = (15927, 6)
len(labels) = 15927
len(weights) = 15927
this_labels = [0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

i = 129
len(blocks) = 352
training_features.shape = (352, 6)
features.shape = (16279, 6)
len(labels) = 15928
len(weights) = 15928
this_labels = [0]

Somehow, only a single label and weight is being added for 352 blocks in this file. @matt-peters , do you have any idea why this might be? It happens regardless of features included, and I don't understand why. I realize this is only tangentially related to this issue, so would be happy to open a new one.

matt-peters commented 7 years ago

Strange indeed. My guess is something went wrong in the initial pre-processing step when parsing the HTML and creating the "block corrected" files. I'd spot-check the file that is giving you the issues. What do the HTML and block-corrected files look like? How many lines are in the block-corrected file?

bdewilde commented 7 years ago

@matt-peters Sure enough, the block-corrected file for this particular html file is a single line and deeply messed up:

0.0 0.0 â ¼ä½ å å â ç æ ã ¾ç æ æ æ¹ ãµ æ â ã ¾æ æ ã ¾æ æ ã¹ ä â â ³æ½ ç ³ç ç ç â ³æ ç æ ç ç â æ½ æ½ â ²ç æ ²æ ç â ä¼ ç ç æ ç æ½ ç ²âµ³ä ä å æ½ ç ²ã ³ç ç æ ã ¾æ æ ç æ â ½ç ³æ ¹ç æ ç â ç æ â ½æ ç æ ç ³â ç æ â ½æ â½ ã ã ²ã ã â ã ç ¼ç ç â ç æ â ½æ ç æ ç ç ç ç â ç æ ²â ½æ ⽠㠳㠲㠵㠵ã â¼¼æ ³æ ²ç ã ¾æ ³æ ²ç ç ç ¹ãµ ç ç â½ æ ªæ æ ³æ ²ç â ç ³ãµ â¼ æ ã ã ã ã â ã ¾ç ç ç ã¹ ç ¼ç ç â ç æ â ½æ ç æ ç ç ç ç â ç ...

In contrast to a normal file:

0.0 0.0 Transplanted lungs didn t come from Colo victims despite reports Vitals      
0.0 0.0 MSN Hotmail More Autos My MSN Video Careers Jobs Personals Weather Delish Quotes White Pages Games Real Estate Wonderwall Horoscopes Shopping Yellow Pages Local Edition Traffic Feedback Maps Directions Travel Full MSN Index Bing         
...

No idea why or how this is. For what it's worth, the corresponding html file is 9.html in the dragnet_data repo, and it looks "fine" when rendered in a browser.

matt-peters commented 7 years ago

Unfortunately I can't reproduce this. I just now created a fresh vagrant box and processed 9.html to make 9.block_corrected.txt and it looks fine.

vagrant@vagrant-ubuntu-trusty-64:/vagrant/data/dragnet_data$ ipython
Python 2.7.13 |Continuum Analytics, Inc.| (default, Dec 20 2016, 23:09:15) 
Type "copyright", "credits" or "license" for more information.

IPython 5.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from dragnet.data_processing import extract_gold_standard

In [2]: extract_gold_standard('./', '9')
Got all tokens for 9.  1944 in all blocks, 453 in gold standard content
Got all tokens for 9.  1944 in all blocks, 0 in gold standard comments

In [3]: exit
vagrant@vagrant-ubuntu-trusty-64:/vagrant/data/dragnet_data$ head block_corrected/9.block_corrected.txt 
0.0 0.0 NBC s Costas says he plans to honor Israelis Olympic sports NBC Sports       
0.0 0.0 Skip navigation      
0.0 0.0 NBC Sports       
0.0 0.0 Site powered by nbcnews com      
0.0 0.0 Latest news      
0.0 0.0 NBCNews com Top NBCNews com headlines Phelps sets record for most Olympic medals         
0.0 0.0 Olympics         
0.0 0.0 Sections         
0.0 0.0 OlympicTalk      
0.0 0.0 Event schedules      
vagrant@vagrant-ubuntu-trusty-64:/vagrant/data/dragnet_data$ 

Maybe try removing your block_corrected directory and re-running extract_gold_standard_all_training_data?
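e.g., something like this (assuming the batch helper takes the top-level data directory, the way extract_gold_standard above takes it plus a file root):

from dragnet.data_processing import extract_gold_standard_all_training_data

datadir = '/path/to/dragnet_data'
# regenerates the block_corrected/ files for every document in the corpus
extract_gold_standard_all_training_data(datadir)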

bdewilde commented 7 years ago

Okay, I added a condition in the loop that iterates over files during training/testing that skips over files for which the array shapes don't match, and... everything worked! There are some usability issues, so maybe I'll rummage around and submit a PR. :)
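The guard looks roughly like this (variable names as in the debug output above; a sketch, not the exact diff):

# inside the per-file loop of make_features_from_data: skip any file whose
# feature rows don't line up with its labels
if training_features.shape[0] != len(this_labels):
    print('skipping file {}: {} feature rows vs {} labels'.format(
        i, training_features.shape[0], len(this_labels)))
    continue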

The bigger problem — specific to this issue — is that there's no way to train new models on newer versions of scikit-learn because upon import dragnet tries to load the existing models, but the old ExtraTreesClassifier class is incompatible with newer versions:

/Users/burtondewilde/.pyenv/versions/2.7.10/envs/dragnet-py2/lib/python2.7/site-packages/sklearn/base.py:315: UserWarning: Trying to unpickle estimator ExtraTreeClassifier from version pre-0.18 when using version 0.18.1. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-2-466117d43890> in <module>()
      1 import os
      2 
----> 3 from dragnet.model_training import train_models, evaluate_models_tokens
      4 from sklearn.ensemble import ExtraTreesClassifier

/Users/burtondewilde/Desktop/github/dragnet/dragnet/__init__.py in <module>()
      6 from dragnet.weninger import weninger_features_kmeans
      7 from dragnet.readability import readability_features
----> 8 from dragnet.models import content_extractor, content_comments_extractor
      9 
     10 

/Users/burtondewilde/Desktop/github/dragnet/dragnet/models.py in <module>()
     19         mode='rb') as fin:
     20     if PY2 is True:
---> 21         content_extractor = pickle.load(fin)
     22     else:
     23         content_extractor = pickle.load(fin, encoding='bytes')  # TODO: which encoding?

sklearn/tree/_tree.pyx in sklearn.tree._tree.Tree.__setstate__ (sklearn/tree/_tree.c:8128)()

KeyError: 'max_depth'

Any ideas? I'm worried that this will require a lot of refactoring...

matt-peters commented 7 years ago

Nice, glad you got it working! How many files did it have to skip? It's a little concerning, though, that I can't reproduce the issue with file 9.html, so I believe there is still a lurking problem.

As far as the import error -- is there any particular reason you need to import dragnet.models when retraining? The sklearn models should only need to be deserialized if that is imported.

bdewilde commented 7 years ago

Sorry, should've mentioned: It only skipped that one file, but multiple times.

There's no reason that one would need to deserialize the models unless one intended to use them, of course. That said, the package isn't currently set up like that: see https://github.com/seomoz/dragnet/blob/master/dragnet/__init__.py#L8

I'm just not sure if this is going to be a rabbit hole or not. :)

bdewilde commented 7 years ago

Hey @matt-peters , after digging more deeply into the code, I'm trying to decide between a hacked solution and a more substantial refactor of the package that would 1. lazy-load models as needed, 2. rely more heavily on scikit-learn for, say, feature normalization and model serialization, 3. be fully Py2/3 compatible and more flexible wrt versions of scikit-learn, and 4. probably change the user-facing API to something along the lines of

from dragnet import ContentExtractor

content_extractor = ContentExtractor()
just_the_text = content_extractor(html_string)

Before (and if) I embark on the latter path, I'd love some feedback from you about these potential changes as well as clarification around the package's current structure, unused models and code, and the user-facing API. I'm happy to do that here, on another issue or PR, or via email (burtdewilde AT gmail.com), if you have some time. Let me know!

matt-peters commented 7 years ago

Can you expand a little more on why this is necessary for PY2/3 compatibility? That is the only potential reason I can see that would make these changes worthwhile. I'm all for code cleanliness, and I agree the current API has some warts, but time is probably better spent improving the model accuracy or improving performance in other ways.

There is a fair bit of code at Moz that depends on the current API so changing it in non-trivial ways is likely a no-go. Pending any potential blockers to support PY3 (which is crucial), my recommendation is to do the lazy loading / introspection on sklearn version when dragnet.models is imported and leave everything else unchanged.

matt-peters commented 7 years ago

As for the import of the various models in __init__.py: my inclination is to do one of two things:

  1. Just remove this altogether and raise an exception with a helpful error message if someone tries to access it.
  2. Hack it by defining a new exception (e.g. SklearnVersionNotSupported) that is raised if someone tries to import it without a supported version of sklearn. Then a user could explicitly trap the exception to use other parts of dragnet without the models if necessary (e.g. in your use case, to train models).

The second is backward compatible, of course, and will be needed anyway in dragnet.models, since we'll only support some but not all versions of sklearn, so that's my preference.
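A sketch of option 2 (placement and version bounds are placeholders; the exception class would need to live somewhere importable without triggering the model load):

class SklearnVersionNotSupported(Exception):
    # raised when no pretrained models exist for the installed sklearn
    pass

# in dragnet/models.py, before unpickling anything:
import sklearn

SKLEARN_VERSION = tuple(int(x) for x in sklearn.__version__.split('.')[:2])
if not ((0, 15) <= SKLEARN_VERSION < (0, 19)):  # placeholder bounds
    raise SklearnVersionNotSupported(
        'no pretrained dragnet models for scikit-learn %s' % sklearn.__version__)

A user retraining models could then trap it explicitly:

try:
    from dragnet.models import content_extractor, content_comments_extractor
except SklearnVersionNotSupported:
    content_extractor = content_comments_extractor = None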

bdewilde commented 7 years ago

Hi @matt-peters , I'm in a bit of a bind because of the way the code is structured and the user-facing API you've established. Without some of the refactoring I alluded to yesterday, I think my only option is option 2: raising and catching an exception upon failed model load, so that somebody can import the rest of the package (for purposes of training a new model, or whatever) without having to load the models. I guess I'll do that, then try to make progress on training new models for newer versions of sklearn and, maybe, Python 3.

bdewilde commented 7 years ago

Okay, I've made progress: content and content+comments models have been trained using sklearn <= 0.17.1 and >= 0.18.0, and they go in separate directories, as @siavash9000 did in his fork. The package automatically loads the models appropriate for a user's environment when possible; otherwise, it issues a UserWarning and sets the default content_extractor and content_comments_extractor to None. Before submitting a PR, I'd like to confirm that the performance is as expected, based on your benchmarks.
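The loading logic amounts to the following (_load_pickled_models is a hypothetical helper; the real code just unpickles from the directory matching the installed version):

import warnings

import sklearn

try:
    # hypothetical helper: unpickle from the version-matched directory
    content_extractor, content_comments_extractor = _load_pickled_models(
        sklearn.__version__)
except (IOError, KeyError):
    warnings.warn(
        'no pretrained dragnet models for scikit-learn {}; extractors set '
        'to None'.format(sklearn.__version__), UserWarning)
    content_extractor = content_comments_extractor = None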

Here are the token-level evaluation plots generated by model_training.evaluate_models_tokens(), for content and content+comments:

[figure: token-level evaluation, content model]

[figure: token-level evaluation, content + comments model]

That function also outputs "block errors", which are distinct from the previous evaluation metrics:

Training errors for final model (block level):
{'accuracy': 0.93372062632166708,
 'auc': 0.7271612698021415,
 'f1': 0.78510072801509956,
 'precision': 0.82311606493821177,
 'recall': 0.75044182621502209}
Test errors for final model (block level):
{'accuracy': 0.8482968456583232,
 'auc': 0.4892596825186389,
 'f1': 0.57430629669156874,
 'precision': 0.62810037934053109,
 'recall': 0.52899975423937085}
Training errors for final model (block level):
{'accuracy': 0.88429918977356425,
 'auc': 0.7610870233498009,
 'f1': 0.84736305932136979,
 'precision': 0.86978764478764481,
 'recall': 0.82606569900687543}
Test errors for final model (block level):
{'accuracy': 0.81810834581283132,
 'auc': 0.6531392306943882,
 'f1': 0.78203156155642906,
 'precision': 0.81562685680332736,
 'recall': 0.75109433136353687}

Is all of this sufficiently in line with your previous results, @matt-peters ?

matt-peters commented 7 years ago

Great! Those F1 scores are actually a little higher than the ones I reported in that blog post. I wonder if there have been improvements to the sklearn classifier between versions that give them a boost? In any case, your description of the code changes sounds good. I'll look forward to the PR.

matt-peters commented 7 years ago

Resolved in #38