assess overlap in project story lists and remove duplicates

rahulbot commented 1 year ago

Right now we fetch all the stories matching each project's query independently. However, many of the projects include overlapping keywords and sources, so I strongly suspect that their story lists overlap heavily. If even 20% of the URLs overlap across projects within one day's run, then we are creating significant duplicate fetching work for ourselves later on, when we fetch HTML (newscatcher) or parsed text (wayback-machine).

We should:

enable some logging to measure this overlap over a day or two
if it is a noticeable percentage, design an approach to make sure we don't refetch a story's content if we've already fetched it in that day's run

This can be assessed without actually running the page fetching, since it only depends on comparing URL lists between projects, and those lists are fetched pretty quickly.
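As a rough sketch of what that measurement could look like (not the code in this repo): assuming we can build a dict mapping each project id to the list of story URLs its query returned, something like the following would give us a number to log. The name `measure_url_overlap` and the shape of `urls_by_project` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def measure_url_overlap(urls_by_project: Dict[int, Iterable[str]]) -> Dict[str, float]:
    """Summarize how many story URLs are shared across projects in one run.

    `urls_by_project` is an assumed structure: project id -> URLs for that project.
    """
    # Count how many projects each URL shows up in (dedupe within a project first)
    project_count_by_url: Counter = Counter()
    for urls in urls_by_project.values():
        project_count_by_url.update(set(urls))

    total_fetches = sum(project_count_by_url.values())  # fetches if done per project
    unique_urls = len(project_count_by_url)              # fetches if done once per URL
    shared_urls = sum(1 for n in project_count_by_url.values() if n > 1)
    redundant = total_fetches - unique_urls

    return {
        "total_fetches": total_fetches,
        "unique_urls": unique_urls,
        "shared_urls": shared_urls,
        "redundant_fraction": redundant / total_fetches if total_fetches else 0.0,
    }
```

Logging `redundant_fraction` for a day or two should be enough to tell whether the 20% threshold is being hit.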

My gut tells me this will speed up the fetching process because it will reduce the total number of URLs we need to fetch content for. The key is that we need to remember the N projects that a particular URL was associated with, and make sure that content goes into the classifier N times (once per project) after the content is fetched.
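A minimal sketch of that idea, under the assumption that fetching and classifying can be passed in as plain callables. `fetch_content` and `classify` are hypothetical stand-ins here, not the actual newscatcher / wayback-machine or classifier calls in this repo.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def index_projects_by_url(urls_by_project: Dict[int, Iterable[str]]) -> Dict[str, List[int]]:
    """Invert project -> URLs into URL -> list of the N projects that matched it."""
    projects_by_url: Dict[str, List[int]] = defaultdict(list)
    for project_id, urls in urls_by_project.items():
        for url in set(urls):  # ignore repeats within a single project
            projects_by_url[url].append(project_id)
    return dict(projects_by_url)


def fetch_once_classify_per_project(
    urls_by_project: Dict[int, Iterable[str]],
    fetch_content: Callable[[str], str],    # hypothetical: URL -> story text
    classify: Callable[[int, str], float],  # hypothetical: (project id, text) -> score
) -> List[Tuple[int, str, float]]:
    """Fetch each unique URL once, then run the classifier once per associated project."""
    results = []
    for url, project_ids in index_projects_by_url(urls_by_project).items():
        content = fetch_content(url)  # one fetch per URL, however many projects want it
        for project_id in project_ids:
            results.append((project_id, url, classify(project_id, content)))
    return results
```

The point of the inverted index is that the fetch count drops to the number of unique URLs, while the classifier still sees the content N times.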

rahulbot commented 1 year ago

@gopigof reports that, at first glance, almost half of the stories are repeated, and that this is causing the slowdown in #30

rahulbot commented 5 months ago

Closing based on prior fixes to URL-based deduplication.

counterdata-network / story-processor

assess overlap in project story lists and remove duplicates #29