dalab / web2text

Source code for the paper "Web2Text: Deep Structured Boilerplate Removal", full paper @ ECIR'18
MIT License

how to improve accuracy? #10

Closed (whatsdis closed 4 years ago)

whatsdis commented 4 years ago

I've tested it on several websites so far, and it only grabs tiny excerpts of what it thinks is the main content. The extracted text is inside the main content, but the rest of the main content is ignored.

I've used the recipe to generate the final output text. How can I tweak this so that it can grab the expected main content text?

By default, is it using pre-trained weights? How can I "teach" it so that its accuracy will improve?

So far I tested:

https://news.ycombinator = grabs only the first submission

https://openai.com/blog/openai-pytorch/ = grabs only "In the past, we implemented projects in many frameworks depending on their relative strengths. We’ve now chosen to standardize to make it easier for our team to create and share optimized implementations of our models.", missing the first sentence and the rest of the text.

tvogels commented 4 years ago

I assume the model you used was trained on the CleanEval dataset. While this dataset is (or was) popular in academia, it is quite old (more than 10 years), so websites from back then probably look quite different from current ones. To achieve high accuracy on modern websites, I would recommend training the model on a more modern dataset. In another GitHub issue, someone suggested Dragnet.
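
Before retraining, it may help to put a number on how much of the expected text is being missed, so that a model trained on a newer dataset can be compared against the current one. The snippet below is only a minimal sketch and is not part of Web2Text: `tokenize` and `token_prf` are hypothetical helper names, the metric is a simple bag-of-words precision/recall/F1, and the two strings are made-up stand-ins for a hand-labelled page and the extractor's output.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplification)."""
    return re.findall(r"\w+", text.lower())


def token_prf(extracted, expected):
    """Bag-of-words precision, recall and F1 of extracted vs. expected text."""
    ext = Counter(tokenize(extracted))
    exp = Counter(tokenize(expected))
    # Multiset intersection: tokens that appear in both texts.
    overlap = sum((ext & exp).values())
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(exp.values()) if exp else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: the extractor returned only the second sentence.
expected = "We are standardizing on one framework. In the past we used many frameworks."
extracted = "In the past we used many frameworks."

precision, recall, f1 = token_prf(extracted, expected)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

High precision with low recall, as in this made-up case, would match the symptom described above: what is extracted is main content, but most of the main content is dropped.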