Open KadeMorton opened 3 weeks ago
Found the cause of the content getting out of order; it's an easy fix. The problem was in the web_svc.py
code and not in Newspaper. I'll post an MR shortly.
@KadeMorton Can you give me an example URL for a page that duplicates data? I can't reproduce.
We've chatted over email, but in case anyone is looking over tickets I don't want it to look like we ignored you! https://www.cisa.gov/news-events/cybersecurity-advisories/aa23-242a is the URL that I passed over.
There are not a lot of duplicated sentences for this URL when you run it through Thread, but there are a couple, and I've checked the original source page: the duplicated text is definitely only there once. There are some reports where the dupes are quite prevalent, some where it's just a few sentences, and some where there are none.
You've indicated you're well on your way and we can keep chatting over Slack.
Describe the bug
Since adjustments were made to enhance the comprehensiveness of text extraction in Thread, utilizing the Newspaper library, new issues have arisen, including text duplication and incorrect ordering of content. This affects the data extracted from websites.
To Reproduce
Environment set-up:
Steps to reproduce the behaviour:
Expected behavior
Text should not be duplicated, and it should appear in the same order as on the source page. It's understandable if some text is not brought over, given the vast variety of websites, but what is brought over should be in order and free of duplicates.
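To make the expected behaviour concrete, here is a minimal sketch of order-preserving de-duplication. It assumes the extracted text is available as a list of strings (sentences or paragraphs). The function name `dedupe_preserving_order` is hypothetical and is not taken from Thread's web_svc.py or from Newspaper; it only illustrates the behaviour described above.

```python
import re


def dedupe_preserving_order(blocks):
    """Drop repeated text blocks, keeping the first occurrence of each.

    Hypothetical helper for illustration only; not part of Thread or Newspaper.
    """
    seen = set()
    result = []
    for block in blocks:
        # Normalise whitespace and case so trivially different copies match.
        key = re.sub(r"\s+", " ", block).strip().lower()
        if not key or key in seen:
            continue  # skip empty blocks and anything already emitted
        seen.add(key)
        result.append(block)
    return result
```

Note that a filter like this would also drop text that legitimately appears twice on the source page, so it is probably better suited to a regression check than to a substitute for fixing the extraction itself.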
Thread details (please complete the following information):
git rev-parse --short HEAD
Desktop (please complete the following information):
Acceptance Criteria