arachne-threat-intel / thread

Thread is a tool for analysts to map finished reports and articles to MITRE ATT&CK®.
https://arachne.digital/thread
Apache License 2.0

Address Content Extraction Issues in Thread Using Newspaper Library #91

Open KadeMorton opened 3 weeks ago

KadeMorton commented 3 weeks ago

Describe the bug

Since text extraction in Thread was adjusted to be more comprehensive using the Newspaper library, new issues have arisen, including duplicated text and content appearing out of order. This affects the data extracted from websites.

To Reproduce

Environment set-up:

Steps to reproduce the behaviour:

  1. Go to the Thread homepage
  2. Click Enter Reports
  3. Fill in the associated information. Any URL will do
  4. Wait for the report to be processed
  5. Once the report is in News Review, click Analyse
  6. Look through the text Thread presents and compare it to the original website
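For step 6, eyeballing a long report is error-prone, so a small helper can flag candidate duplicates first. This is only a rough sketch, not part of Thread: `find_duplicate_sentences` and its `min_words` threshold are hypothetical, and it assumes you have copied the text Thread presents into a plain string.

```python
import re
from collections import Counter


def find_duplicate_sentences(text: str, min_words: int = 5) -> dict[str, int]:
    """Return sentences that occur more than once, with their counts.

    Hypothetical helper for spot-checking extracted text; not Thread code.
    """
    # Naive split after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Collapse whitespace and lowercase so formatting differences still match.
    normalised = [" ".join(s.split()).lower() for s in sentences]
    # Ignore very short fragments, which repeat legitimately (e.g. "Read more.").
    counts = Counter(s for s in normalised if len(s.split()) >= min_words)
    return {s: n for s, n in counts.items() if n > 1}
```

Any sentence it reports still needs checking against the original page, since some sources really do repeat text.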

Expected behavior

Text should not be duplicated and should appear in its original order. It's understandable if some text is not brought over, given the vast variety of websites, but whatever is brought over should appear only once and in order.
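To make the expectation concrete, here is a minimal sketch of order-preserving de-duplication. It is an illustration only and not how web_svc.py or Newspaper works: `dedupe_preserving_order` is a hypothetical name, and it assumes the extracted text is already available as a list of blocks (paragraphs or sentences) and that duplicates are exact repeats once whitespace and case are normalised.

```python
def dedupe_preserving_order(blocks: list[str]) -> list[str]:
    """Keep the first occurrence of each block, in the order it was seen.

    Hypothetical helper, not Thread code.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for block in blocks:
        # Compare on a normalised key, but keep the original text in the output.
        key = " ".join(block.split()).lower()
        if not key or key in seen:
            continue  # skip empty blocks and anything already emitted
        seen.add(key)
        unique.append(block)
    return unique
```

One trade-off with this approach: text that the source page genuinely repeats would also be collapsed to a single copy.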

Thread details (please complete the following information):

Desktop (please complete the following information):

Acceptance Criteria

cvallance commented 3 weeks ago

Found the issue with things getting out of order; easy fix. It was in the web_svc.py code, not Newspaper. Will post an MR shortly.

@KadeMorton Can you give me an example URL for a page that duplicates data? I can't reproduce.

KadeMorton commented 3 weeks ago

We've chatted over email, but in case anyone is looking over tickets I don't want it to look like we ignored you! https://www.cisa.gov/news-events/cybersecurity-advisories/aa23-242a is the URL that I passed over.

There are not a lot of duplicated sentences for this URL when you run it through Thread, but there are a couple. I've checked the original source page, and the duplicated text is definitely only there once. There are some reports where the dupes are quite prevalent, some where it's just a few sentences, and some where there are none.

You've indicated you're well on your way and we can keep chatting over Slack.