whatsdis closed this issue 4 years ago
I assume the model you used was trained on the CleanEval dataset. While this dataset is (or was) popular in academia, it is more than 10 years old, so websites from back then probably look quite different from current ones. To achieve high accuracy on modern websites, I would recommend training the model on a more recent dataset. In another GitHub issue, someone suggested Dragnet.
I've tested several different websites so far, and it only grabs tiny excerpts of what it thinks is the main content. The extracted text does come from the main content, but the rest of the main content is ignored.
I've used the recipe to generate the final output text. How can I tweak this so that it can grab the expected main content text?
By default, is it using pre-trained weights? How can I "teach" it so that its accuracy will improve?
So far I tested:
https://news.ycombinator = grabs only the first submission
https://openai.com/blog/openai-pytorch/ = grabs only "In the past, we implemented projects in many frameworks depending on their relative strengths. We’ve now chosen to standardize to make it easier for our team to create and share optimized implementations of our models.", missing the first sentence and the rest of the text.
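
To put a number on "only tiny excerpts", one option is to compare the extracted text against a hand-copied version of the expected main content using token-level precision, recall, and F1. The sketch below is not part of this project: the function name `token_scores` and the whitespace tokenization are assumptions made purely for illustration. High precision with low recall would match the behaviour described above (the extracted text is correct, but most of the main content is missing).

```python
from collections import Counter


def token_scores(extracted: str, expected: str) -> dict:
    """Token-level precision/recall/F1 of extracted text vs. expected text.

    Hypothetical helper for illustration; tokens are whitespace-separated
    and compared as multisets (bag of words).
    """
    ext = Counter(extracted.split())
    exp = Counter(expected.split())

    # Multiset intersection: tokens present in both, counted with multiplicity.
    overlap = sum((ext & exp).values())

    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(exp.values()) if exp else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy strings, not real extractor output.
    expected = "a b c d e f g h"
    extracted = "c d"
    print(token_scores(extracted, expected))
```

Running the same comparison before and after retraining on a newer dataset would show whether recall on modern pages actually improves.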