dragnet-org / dragnet

Just the facts -- web page content extraction
MIT License

updating / moving the training dataset? #85

Closed bdewilde closed 4 years ago

bdewilde commented 5 years ago

Is there any interest in moving the dragnet_data repository from seomoz to dragnet-org (this) GitHub account? It would be nice to have the two repos together and under the same administrative control.

On a related note, is there any interest in updating the training data (and retraining the various models)? The HTML in the current data is quite old at this point, so the trained models don't know how to learn from, say, HTML5's new syntactic features. I'm sure content extraction performance on newer webpages suffers. I don't know what the legal issues are (if any) of compiling a new dataset, but if somebody could advise, I would be interested in taking on some of the work.

Lastly, if we opted to compile a new training dataset from scratch, we wouldn't have to move the old repository and could, instead, just make a new one alongside this.

matt-peters commented 5 years ago

New data would be amazing; as you said, the web has changed substantially in the last few years. I'm up for moving the dragnet_data repo over to this GitHub org. I'd recommend adding the new data to the existing data instead of replacing it completely, since that will almost certainly make any model trained on the dataset more robust across different types of markup.

b4hand commented 5 years ago

FYI, I can't do the org transfer anymore, but we can just clone/fork the repo to the new org. I think it would make sense as well.

bdewilde commented 5 years ago

Hi folks! I'm finally ready to move forward on this task. First things first: I'm not able to create a new dragnet-data repository under dragnet-org. Would somebody (want to) give me those permissions?

Next big question is, do we actually want to keep the same "content + comments" setup as before? I had some back-and-forth with @matt-peters a couple(!) years ago — https://github.com/seomoz/dragnet_data/issues/2 — and my needs are the same as then: comments aren't useful (plus, these days they're usually generated via javascript, so don't show up in the raw html), and content could be split into "body text" and "metadata" (byline, pubdate, maybe even image captions, etc.). What do y'all think?

bdewilde commented 5 years ago

Currently looking into using (a small subset of) Common Crawl data to build a new training dataset. It should be possible to write code that pulls down a sample of crawled web pages' HTML and text content; manually cleaning up the latter shouldn't be too hard. Since new pages are crawled regularly, we could have a basically endless supply of training data. :) Will keep y'all posted.
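
For anyone curious about the mechanics: a minimal sketch of pulling one page's HTML out of Common Crawl, assuming the public CDX index API and WARC byte-range access described in Common Crawl's docs. The crawl ID, URLs, and helper name below are placeholders, not the actual dragnet_data code.

```python
# Hypothetical sketch: fetch one page's raw HTML from a Common Crawl capture.
# Assumes `requests` is installed; crawl ID and bucket URL are placeholders.
import gzip
import io
import json

import requests

CDX_API = "https://index.commoncrawl.org/CC-MAIN-2019-35-index"  # example crawl
WARC_BASE = "https://commoncrawl.s3.amazonaws.com/"  # public WARC storage

def fetch_html(url):
    # Ask the crawl's CDX index where this URL's capture lives.
    resp = requests.get(CDX_API, params={"url": url, "output": "json", "limit": 1})
    resp.raise_for_status()
    record = json.loads(resp.text.splitlines()[0])
    offset, length = int(record["offset"]), int(record["length"])
    # Fetch just that record's byte range from the WARC file.
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    warc_bytes = requests.get(WARC_BASE + record["filename"], headers=headers).content
    # The range is a gzipped WARC record: WARC headers, HTTP headers, then the body.
    with gzip.open(io.BytesIO(warc_bytes)) as f:
        raw = f.read().decode("utf-8", errors="replace")
    return raw.split("\r\n\r\n", 2)[-1]

if __name__ == "__main__":
    print(fetch_html("https://example.com/")[:500])
```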

matt-peters commented 5 years ago

@bdewilde Thanks for the update; this would be awesome if implemented. A refresh of the training data would probably improve the model significantly for newer web pages. If you have the bandwidth to update the training and model code to handle multiple different types of content (e.g. body text and metadata), then I'm 100% supportive. If not, including additional types of content in the "content" label (author, publication date) makes sense, but that would be incompatible with the old annotations, so you'd probably need to annotate at least the same number of pages (~1000) to match the performance of the existing model for that type of content. In any case, thanks for your continued work on this project 💯

bdewilde commented 5 years ago

Hi @matt-peters , happy to help! For simplicity's sake, I'm leaning towards lumping everything — title, byline, captions, and main article text — into "content", and skipping comments altogether. There's a case for splitting the metadata out, but it's definitely secondary, and I think it can wait.

The only thing I need now (besides time to pull a new training dataset together 😉) is GitHub permissions to create a new dragnet-data repo under the dragnet-org account, or someone else to do it for me and add me as an author. Could you do that, or point me to someone who can? Thanks a ton!

bdewilde commented 5 years ago

Update: I've manually compiled a training dataset of 200 (html, text) pairs on modern, news-y web pages from a variety of sites and in a variety of languages. The gold-standard, content-only text extractions include

and do not include

Current block-level classification performance is F1 ~0.92. If I combine this dataset with CleanEval (which includes 680 examples), I get up to F1 ~0.95, but I'm not convinced it does a better job on the sort of modern, news-y web pages dominating my dataset. HTML really has changed in the past 10 years!
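
(For clarity, "block-level" here means each page is split into text blocks and every block gets a binary content-vs-boilerplate label; the F1 score is computed over those per-block labels. A toy illustration with made-up labels:)

```python
# Toy illustration of block-level F1: compare predicted per-block labels
# (1 = content, 0 = boilerplate) against the gold-standard labels.
# These labels are made up, purely for illustration.
from sklearn.metrics import f1_score

gold = [1, 0, 1, 1, 0, 0, 1]  # gold-standard block labels for one page
pred = [1, 0, 1, 0, 0, 0, 1]  # model's predicted block labels
print(f1_score(gold, pred))   # ~0.857 for this toy example
```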

I'd like to get to ~500 examples, but this is a slow, not-fun process. Will keep y'all posted.

b4hand commented 5 years ago

FYI, @bdewilde: Since I added you to the org, you should be able to create repos in it now, so feel free to create the new dragnet-data repo.

ovidiususan commented 4 years ago

Any news on this?

bdewilde commented 4 years ago

hi @ovidiususan , i've recently begun another attempt at this — pandemic lockdown has given me some extra free time 🙃 — using a different, more scalable method for pulling together high-quality training data. will post an update here when i have news to share, or just create a new dragnet data repo as discussed above. appreciate your patience, i left this on the back burner much longer than planned.

bdewilde commented 4 years ago

hi folks, i've made some decent progress on this task, and in fact have set up a work-in-progress repo w/ an initial iteration of the data and data-generating code: https://github.com/bdewilde/dragnet_data

i want to finish a few key to-do's — documentation, tests, and actually cleaning / filling in most of the gold-standard texts — then will see about transferring or duplicating the code and data over to this org. will keep y'all posted.

nehalecky commented 4 years ago

hey @bdewilde! hope all goes well, how's this effort looking?

bdewilde commented 4 years ago

Hi @nehalecky , very sorry about the delay on this. I've built up a training dataset of ~300 (html, text) pairs out of a total of ~1400, but progress is slow, and I keep getting detoured by other side projects. 🤦‍♂️ I'll try to push other projects aside so I can focus on finishing this project over the next few weeks. Will let you know here how it goes... 🙏

nehalecky commented 4 years ago

Hi @bdewilde! Wow, thanks for the quick reply, and totally sympathize: data labeling is hard work. 😓
This quarter, we're (at https://github.com/bomboradata) putting effort into enhancing our content extraction performance, and looking to contribute labeled data. We'd like to help with this effort, and had a few questions:

  1. Appreciate your description of the workflows here (https://github.com/bdewilde/dragnet_data#methodology-and-data), but we're curious how they might be enhanced by a data labeling and annotation tool such as https://github.com/heartexlabs/label-studio or https://prodi.gy/?
  2. There is still no license on https://github.com/bdewilde/dragnet_data, and in particular, we wanted to know how you and https://github.com/dragnet-org would view granting the labeled data a license that allows commercial use (e.g., https://creativecommons.org/publicdomain/zero/1.0/), which would make it much easier for us to contribute.
  3. Finally, I'd like to ask you or the community whether you know of any other efforts to advance state-of-the-art open-source work around content extraction use cases?

Thanks much, appreciated!

bdewilde commented 4 years ago

Oh gosh, thank you tons for the offer to help! I set the code up in such a way that it's locked me into the original ~1400 pages — a large fraction of which are about the early days of the covid-19 pandemic, so both repetitive and bleak — but I've been meaning to restructure so that I or multiple people could extract gold-standard texts in more manageable chunks. Will try to implement a good method for this ASAP.

As for your questions:

  1. I've looked into both of those options, but neither seems to have a good built-in solution for this particular task. This is almost in the same ballpark, but I couldn't figure out how to adapt it. I tried to automate part of the task by programmatically extracting page metadata/text from the HTML when possible, but too large a fraction of those extractions are noisy and not of "gold-standard" quality. The manual method I wrote up and follow is slower but much safer. If you know of a good, managed workflow tool, please point me to it!
  2. To be totally honest, I've always been confused by code/data licensing, but am inclined toward a very permissive license. I don't know if that'll be CC0 1.0, or MIT, or something else. Input and expertise would be appreciated!
  3. I've scouted broadly but not deeply into more recent / "state-of-the-art" methods for the HTML content extraction task, but nothing has really struck me as a huge improvement over what dragnet does. Excluding the methods that fully render a page (JavaScript and all!) then use computer vision to identify main body text, many methods seem to "blockify" text and perform binary classification on blocks based on a mix of structural and content-based features (a rough sketch of that pattern follows below). That's not to say we can't and shouldn't improve upon the existing algo, just that we may not have to reinvent the wheel here. :) I have some ideas on this, but have been waiting on a fresh training dataset (again, sorry about the delay...) before committing to any new methods.
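
To make that concrete, here's a rough, hypothetical sketch of the "blockify, then binary-classify" pattern using lxml and scikit-learn. The tag set, features, and function names are illustrative only, and much simpler than dragnet's actual pipeline.

```python
# Illustrative sketch of "blockify + binary classification"; not dragnet's
# actual features or model. Assumes lxml and scikit-learn are installed.
from lxml import html as lxml_html
from sklearn.ensemble import ExtraTreesClassifier

BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "blockquote", "td"}

def blockify(html_string):
    """Naively split a page into text blocks with simple features."""
    tree = lxml_html.fromstring(html_string)
    blocks = []
    for el in tree.iter():
        if el.tag not in BLOCK_TAGS:  # skip non-block elements (comments, scripts, inline tags)
            continue
        text = " ".join(el.text_content().split())
        if not text:
            continue
        n_words = len(text.split())
        link_density = len(el.findall(".//a")) / n_words
        blocks.append({"text": text, "features": [n_words, link_density]})
    return blocks

def train_block_classifier(labeled_pages):
    """labeled_pages: iterable of (html_string, per-block 0/1 labels)."""
    X, y = [], []
    for html_string, block_labels in labeled_pages:
        for block, label in zip(blockify(html_string), block_labels):
            X.append(block["features"])
            y.append(label)
    # Binary content/boilerplate classifier over block-level features.
    return ExtraTreesClassifier(n_estimators=100).fit(X, y)
```
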
matt-peters commented 4 years ago

Nice to see work on this front. FWIW, I'm all for changing the license on dragnet_data to something else that allows permissive re-use. EDIT: I see the original dragnet_data is still owned by the seomoz org, and wasn't moved over to the dragnet org. In that case, we'd need to work with someone there to change the license, as I'm no longer a member.

bdewilde commented 4 years ago

Hi @nehalecky , I've made some changes to the code so that it's easier to do and track gold-standard extractions in batches, which should be nice for me alone, but with a couple minor git additions (branching, rebasing, and pull requests!), should also work for multiple annotators. I think. Here is the process — does it make sense?

bdewilde commented 4 years ago

Hi all, I've finished compiling 500 (HTML, gold-standard text extraction) example pairs, have added permissive licenses for both the associated code and data, and have transferred ownership of the project to this group: https://github.com/dragnet-org/dragnet_data

The project readme includes thorough instructions on how to add additional examples to the dataset, if one were so inclined, but I think 500 is a fine enough start. There are no tests, for which I hope you'll forgive me.

Given all this, I'm going to close this issue out. Mission accomplished!

shakeelsoogun commented 3 years ago

As a user of this library, I greatly appreciate the hard work done here - anything to improve the quality is always a bonus, and I definitely sympathise with the pain of manually building the gold standard! Just wondering: since there was already a model bundled with this library, are there any plans to push a newly trained version of the model (either just to this repo or to PyPI as well), or is the recommendation now to do this ourselves? It would be nice to be able to just consume one already done, but equally I don't mind doing this myself, since the instructions to train it aren't too difficult.
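
For context, a minimal sketch of how we consume the bundled model today, via dragnet's top-level API (the URL is just a placeholder):

```python
# Rough sketch of consuming the bundled model via dragnet's top-level API.
import requests
from dragnet import extract_content

resp = requests.get("https://example.com/some-article")  # placeholder URL
content = extract_content(resp.content)
print(content)
```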

bdewilde commented 3 years ago

@shakeelsoogun Thanks for asking :) I've hacked a bit on a major revamp of dragnet's underlying models / methodology, since there's been some progress on this task over the past few years, but honestly I haven't had the bandwidth to do much. It would be a lower lift to adapt the current setup to the new data, but that would still require changes to dragnet's code base, owing to changes in the structure and content of the new training dataset. This is on my to-do list, but I don't have any guesses on timelines. Don't let me deter you from training your own!