counterdata-network / story-processor

Story discovery engine for the Counterdata Network. Grabs relevant stories from various APIs, runs them against bespoke classifier models, and posts results to a central server.
Apache License 2.0

assess overlap in project story lists and remove duplicates #29

Status: Closed (rahulbot closed 5 months ago)

rahulbot commented 1 year ago

Right now we fetch all the stories matching each project's query independently. However, many of the projects include overlapping keywords and sources, so I strongly suspect that many stories show up in more than one project's list. If even 20% of the URLs overlap across projects within one day's run, then we are creating a lot of duplicate fetching work for ourselves later on, when we fetch HTML (newscatcher) or parsed text (wayback-machine).

We should:

  1. enable some logging to measure this overlap on a day or two
  2. if it is a noticeable percentage, we should design an approach to make sure we don't refetch a story's content if we've already fetched it in that day's run

This can be assessed without actually running the page fetching, since it only depends on comparing URL lists between projects, and those lists are fetched pretty quickly.
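
A rough sketch of how the overlap measurement could look, assuming each project's URLs for the day are already available as a plain list. The function name `url_overlap_stats` and the `urls_by_project` mapping are hypothetical and not taken from the story-processor code; URLs are compared as exact strings, with no normalization.

```python
from collections import Counter
from typing import Dict, List, Union


def url_overlap_stats(urls_by_project: Dict[str, List[str]]) -> Dict[str, Union[int, float]]:
    """Summarize how much URL overlap exists across projects in one run.

    urls_by_project is a hypothetical mapping of project id -> URLs matched
    by that project's query for the day.
    """
    # For each URL, count the number of distinct projects that matched it.
    projects_per_url: Counter = Counter()
    for urls in urls_by_project.values():
        # set() so a URL repeated inside one project is only counted once
        projects_per_url.update(set(urls))

    total = sum(projects_per_url.values())  # fetches if every project fetches on its own
    unique = len(projects_per_url)          # fetches if each URL is fetched once
    shared = sum(1 for n in projects_per_url.values() if n > 1)
    return {
        "total_urls": total,
        "unique_urls": unique,
        "shared_urls": shared,
        # share of fetches that would be avoided by fetching each URL once
        "duplicate_fraction": (total - unique) / total if total else 0.0,
    }
```

Logging the returned dictionary once per run for a day or two would show whether `duplicate_fraction` is anywhere near the 20% mentioned above.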

My gut tells me this will speed up the fetching process because it will reduce the total number of URLs we need to fetch content for. The key is that we need to remember the N projects that a particular URL was associated with, and make sure that content goes into the classifier N times (once per project) after the content is fetched.
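
One possible shape for this, as a minimal sketch rather than the approach the project actually adopted: invert the per-project lists into a URL-to-projects map, fetch each unique URL once, then run the classifier once per associated project. The names `group_projects_by_url` and `fetch_once_classify_per_project`, and the `fetch` and `classify` callables, are placeholders for illustration only.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Set, Tuple


def group_projects_by_url(urls_by_project: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Invert project -> URLs into URL -> set of projects that matched it."""
    projects_by_url: Dict[str, Set[str]] = defaultdict(set)
    for project_id, urls in urls_by_project.items():
        for url in urls:
            projects_by_url[url].add(project_id)
    return dict(projects_by_url)


def fetch_once_classify_per_project(
    urls_by_project: Dict[str, List[str]],
    fetch: Callable[[str], str],          # hypothetical: URL -> story content
    classify: Callable[[str, str], Any],  # hypothetical: (project id, content) -> result
) -> List[Tuple[str, str, Any]]:
    """Fetch each unique URL once, then classify it once per associated project."""
    results: List[Tuple[str, str, Any]] = []
    for url, project_ids in group_projects_by_url(urls_by_project).items():
        content = fetch(url)  # single fetch, however many projects matched the URL
        for project_id in sorted(project_ids):
            # the content goes into the classifier N times, once per project
            results.append((project_id, url, classify(project_id, content)))
    return results
```

With this layout the number of fetches equals the number of unique URLs, while the number of classifier calls stays the same as before.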

rahulbot commented 1 year ago

@gopigof reports that, at first glance, almost half of the stories are repeated, and that this is causing the slowdown in #30.

rahulbot commented 5 months ago

Closing based on prior fixes to URL-based deduplication.