dragnet-org / dragnet

Just the facts -- web page content extraction
MIT License

updating / moving the training dataset? #85

Closed bdewilde closed 4 years ago

bdewilde commented 5 years ago

Is there any interest in moving the dragnet_data repository from seomoz to dragnet-org (this) GitHub account? It would be nice to have the two repos together and under the same administrative control.

On a related note, is there any interest in updating the training data (and retraining the various models)? The HTML in the current data is quite old at this point, so the trained models don't know how to learn from, say, HTML5's new syntactic features. I'm sure content extraction performance on newer webpages suffers. I don't know what the legal issues are (if any) of compiling a new dataset, but if somebody could advise, I would be interested in taking on some of the work.

Lastly, if we opted to compile a new training dataset from scratch, we wouldn't have to move the old repository and could, instead, just make a new one alongside this.

matt-peters commented 5 years ago

New data would be amazing; as you said, the web has changed substantially in the last few years. I'm up for moving the dragnet_data repo over to this GitHub org. I'd recommend adding the new data to the existing data instead of replacing it completely, since that will almost certainly make any model trained on the dataset more robust across different types of markup.

b4hand commented 5 years ago

FYI, I can't do the org transfer anymore, but we can just clone/fork the repo to the new org. I think it would make sense as well.

bdewilde commented 5 years ago

Hi folks! I'm finally ready to move forward on this task. First things first: I'm not able to create a new dragnet-data repository under dragnet-org. Would somebody (want to) give me those permissions?

Next big question is, do we actually want to keep the same "content + comments" setup as before? I had some back-and-forth with @matt-peters a couple(!) years ago — https://github.com/seomoz/dragnet_data/issues/2 — and my needs are the same as then: comments aren't useful (plus, these days they're usually generated via javascript, so don't show up in the raw html), and content could be split into "body text" and "metadata" (byline, pubdate, maybe even image captions, etc.). What do y'all think?

bdewilde commented 5 years ago

Currently looking into using (a small subset of) Common Crawl data to build a new training dataset. It should be possible to write code that pulls down a sample of crawled web pages' HTML and text content; manually cleaning up the latter shouldn't be too hard. Since new pages are crawled regularly, we could have a basically endless supply of training data. :) Will keep y'all posted.
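
For anyone curious about the mechanics: a minimal sketch of pulling one page's HTML out of Common Crawl, assuming the public CDX index API and WARC byte-range access described in Common Crawl's docs. The crawl ID, URLs, and helper name below are placeholders, not the actual dragnet_data code.

```python
# Hypothetical sketch: fetch one page's raw HTML from a Common Crawl capture.
# Assumes `requests` is installed; crawl ID and bucket URL are placeholders.
import gzip
import io
import json

import requests

CDX_API = "https://index.commoncrawl.org/CC-MAIN-2019-35-index"  # example crawl
WARC_BASE = "https://commoncrawl.s3.amazonaws.com/"  # public WARC storage

def fetch_html(url):
    # Ask the crawl's CDX index where this URL's capture lives.
    resp = requests.get(CDX_API, params={"url": url, "output": "json", "limit": 1})
    resp.raise_for_status()
    record = json.loads(resp.text.splitlines()[0])
    offset, length = int(record["offset"]), int(record["length"])
    # Fetch just that record's byte range from the WARC file.
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    warc_bytes = requests.get(WARC_BASE + record["filename"], headers=headers).content
    # The range is a gzipped WARC record: WARC headers, HTTP headers, then the body.
    with gzip.open(io.BytesIO(warc_bytes)) as f:
        raw = f.read().decode("utf-8", errors="replace")
    return raw.split("\r\n\r\n", 2)[-1]

if __name__ == "__main__":
    print(fetch_html("https://example.com/")[:500])
```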

matt-peters commented 5 years ago

@bdewilde Thanks for the update; this would be awesome if implemented. A refresh of the training data would probably improve the model significantly for newer web pages. If you have the bandwidth to update the training and model code to handle multiple different types of content (e.g. body text and metadata), then I'm 100% supportive. If not, including additional types of content in the "content" label (author, publication date) makes sense, but that would be incompatible with the old annotations, so you'd probably need to annotate at least the same number of pages (~1000) to match the performance of the existing model for that type of content. In any case, thanks for your continued work on this project 💯

bdewilde commented 5 years ago

Hi @matt-peters , happy to help! For simplicity's sake, I'm leaning towards lumping everything — title, byline, captions, and main article text — into "content", and skipping comments altogether. There's a case for splitting the metadata out, but it's definitely secondary, and I think it can wait.

The only thing I need now (besides time to pull a new training dataset together 😉) is GitHub permissions to create a new dragnet-data repo under the dragnet-org account, or someone else to do it for me and add me as an author. Could you do that, or point me to someone who can? Thanks a ton!

bdewilde commented 5 years ago

Update: I've manually compiled a training dataset of 200 (html, text) pairs on modern, news-y web pages from a variety of sites and in a variety of languages. The gold-standard, content-only text extractions include

and do not include

Current block-level classification performance is F1 ~0.92. If I combine this dataset with CleanEval (which includes 680 examples), I get up to F1 ~0.95, but I'm not convinced it does a better job on the sort of modern, news-y web pages dominating my dataset. HTML really has changed in the past 10 years!
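
(For clarity, "block-level" here means each page is split into text blocks and every block gets a binary content-vs-boilerplate label; the F1 score is computed over those per-block labels. A toy illustration with made-up labels:)

```python
# Toy illustration of block-level F1: compare predicted per-block labels
# (1 = content, 0 = boilerplate) against the gold-standard labels.
# These labels are made up, purely for illustration.
from sklearn.metrics import f1_score

gold = [1, 0, 1, 1, 0, 0, 1]  # gold-standard block labels for one page
pred = [1, 0, 1, 0, 0, 0, 1]  # model's predicted block labels
print(f1_score(gold, pred))   # ~0.857 for this toy example
```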

I'd like to get to ~500 examples, but this is a slow, not-fun process. Will keep y'all posted.

b4hand commented 5 years ago

FYI, @bdewilde: Since I added you to the org, you should be able to create repos in it now, so feel free to create the new dragnet-data repo.

ovidiususan commented 4 years ago

Any news on this?

bdewilde commented 4 years ago

hi @ovidiususan , i've recently begun another attempt at this — pandemic lockdown has given me some extra free time 🙃 — using a different, more scalable method for pulling together high-quality training data. will post an update here when i have news to share, or just create a new dragnet data repo as discussed above. appreciate your patience, i left this on the back burner much longer than planned.

bdewilde commented 4 years ago

hi folks, i've made some decent progress on this task, and in fact have set up a work-in-progress repo w/ an initial iteration of the data and data-generating code: https://github.com/bdewilde/dragnet_data

i want to finish a few key to-do's — documentation, tests, and actually cleaning / filling in most of the gold-standard texts — then will see about transferring or duplicating the code and data over to this org. will keep y'all posted.

nehalecky commented 4 years ago

hey @bdewilde! hope all goes well, how's this effort looking?

bdewilde commented 4 years ago

Hi @nehalecky , very sorry about the delay on this. I've built up a training dataset of ~300 (html, text) pairs out of a total of ~1400, but progress is slow, and I keep getting detoured by other side projects. 🤦‍♂️ I'll try to push other projects aside so I can focus on finishing this project over the next few weeks. Will let you know here how it goes... 🙏

nehalecky commented 4 years ago

Hi @bdewilde! Wow, thanks for the quick reply, and totally sympathize: data labeling is hard work. 😓
This quarter, we're (at https://github.com/bomboradata) putting effort into enhancing our content extraction performance, and looking to contribute labeled data. We'd like to help with this effort, and had a few questions:

  1. Appreciate your description of the workflows here (https://github.com/bdewilde/dragnet_data#methodology-and-data), but we're curious how they might be enhanced by a data labeling and annotation tool such as https://github.com/heartexlabs/label-studio or https://prodi.gy/?
  2. There is still no license on https://github.com/bdewilde/dragnet_data, and in particular, we wanted to know how you and https://github.com/dragnet-org would view granting the labeled data a license that allows commercial use (e.g., https://creativecommons.org/publicdomain/zero/1.0/), which would make it much easier for us to contribute.
  3. Finally, I'd like to ask you or the community whether you know of any other efforts to advance state-of-the-art open-source work around content extraction use cases?

Thanks much, appreciated!

bdewilde commented 4 years ago

Oh gosh, thank you tons for the offer to help! I set the code up in such a way that it's locked me into the original ~1400 pages — a large fraction of which are about the early days of the covid-19 pandemic, so both repetitive and bleak — but I've been meaning to restructure so that I or multiple people could extract gold-standard texts in more manageable chunks. Will try to implement a good method for this ASAP.

As for your questions:

  1. I've looked into both of those options, but neither seems to have a good built-in solution for this particular task. This is almost in the same ballpark, but I couldn't figure out how to adapt it. I tried to automate part of the task by programmatically extracting page metadata/text from the HTML when possible, but too large a fraction of those extractions are noisy and not of "gold-standard" quality. The manual method I wrote up and follow is slower but much safer. If you know of a good, managed workflow tool, please point me to it!
  2. To be totally honest, I've always been confused by code/data licensing, but am inclined toward a very permissive license. I don't know if that'll be CC0 1.0, or MIT, or something else. Input and expertise would be appreciated!
  3. I've scouted broadly but not deeply into more recent / "state-of-the-art" methods for the HTML content extraction task, but nothing has really struck me as a huge improvement over what dragnet does. Excluding the methods that fully render a page (JavaScript and all!) then use computer vision to identify main body text, many methods seem to "blockify" text and perform binary classification on blocks based on a mix of structural and content-based features (a rough sketch of that pattern follows below). That's not to say we can't and shouldn't improve upon the existing algo, just that we may not have to reinvent the wheel here. :) I have some ideas on this, but have been waiting on a fresh training dataset (again, sorry about the delay...) before committing to any new methods.
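
To make that concrete, here's a rough, hypothetical sketch of the "blockify, then binary-classify" pattern using lxml and scikit-learn. The tag set, features, and function names are illustrative only, and much simpler than dragnet's actual pipeline.

```python
# Illustrative sketch of "blockify + binary classification"; not dragnet's
# actual features or model. Assumes lxml and scikit-learn are installed.
from lxml import html as lxml_html
from sklearn.ensemble import ExtraTreesClassifier

BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "blockquote", "td"}

def blockify(html_string):
    """Naively split a page into text blocks with simple features."""
    tree = lxml_html.fromstring(html_string)
    blocks = []
    for el in tree.iter():
        if el.tag not in BLOCK_TAGS:  # skip non-block elements (comments, scripts, inline tags)
            continue
        text = " ".join(el.text_content().split())
        if not text:
            continue
        n_words = len(text.split())
        link_density = len(el.findall(".//a")) / n_words
        blocks.append({"text": text, "features": [n_words, link_density]})
    return blocks

def train_block_classifier(labeled_pages):
    """labeled_pages: iterable of (html_string, per-block 0/1 labels)."""
    X, y = [], []
    for html_string, block_labels in labeled_pages:
        for block, label in zip(blockify(html_string), block_labels):
            X.append(block["features"])
            y.append(label)
    # Binary content/boilerplate classifier over block-level features.
    return ExtraTreesClassifier(n_estimators=100).fit(X, y)
```
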
matt-peters commented 4 years ago

Nice to see work on this front. FWIW, I'm all for changing the license on dragnet_data to something else that allows permissive re-use. EDIT: I see the original dragnet_data is still owned by the seomoz org, and wasn't moved over to the dragnet org. In that case, we'd need to work with someone there to change the license, as I'm no longer a member.

bdewilde commented 4 years ago

Hi @nehalecky , I've made some changes to the code so that it's easier to do and track gold-standard extractions in batches, which should be nice for me alone, but with a couple minor git additions (branching, rebasing, and pull requests!), should also work for multiple annotators. I think. Here is the process — does it make sense?

bdewilde commented 4 years ago

Hi all, I've finished compiling 500 (HTML, gold-standard text extraction) example pairs, have added permissive licenses for both the associated code and data, and have transferred ownership of the project to this group: https://github.com/dragnet-org/dragnet_data

The project readme includes thorough instructions on how to add additional examples to the dataset, if one were so inclined, but I think 500 is a fine enough start. There are no tests, for which I hope you'll forgive me.

Given all this, I'm going to close this issue out. Mission accomplished!

shakeelsoogun commented 3 years ago

As a user of this library, I greatly appreciate the hard work done here - anything to improve the quality is always a bonus, and I definitely sympathise with the pain of manually building the gold standard! Just wondering: since there was already a model bundled with this library, are there any plans to push a newly trained version of the model (either just to this repo or to PyPI as well), or is the recommendation now to do this ourselves? It would be nice to be able to just consume one already done, but equally I don't mind doing this myself, since the instructions to train it aren't too difficult.
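
For context, a minimal sketch of how we consume the bundled model today, via dragnet's top-level API (the URL is just a placeholder):

```python
# Rough sketch of consuming the bundled model via dragnet's top-level API.
import requests
from dragnet import extract_content

resp = requests.get("https://example.com/some-article")  # placeholder URL
content = extract_content(resp.content)
print(content)
```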

bdewilde commented 3 years ago

@shakeelsoogun Thanks for asking :) I've hacked a bit on a major revamp of dragnet's underlying models / methodology, since there's been some progress on this task over the past few years, but honestly I haven't had the bandwidth to do much. It would be a lower lift to adapt the current setup to the new data, but that would still require changes to dragnet's code base, owing to changes in the structure and content of the new training dataset. This is on my to-do list, but I don't have any guesses on timelines. Don't let me deter you from training your own!