Story discovery engine for the Counterdata Network. Grabs relevant stories from various APIs, runs them against bespoke classifier models, and posts results to a central server.
assess overlap in project story lists and remove duplicates #29
Right now we fetch all the stories matching a project query independently. However, many of the projects include overlapping keywords and sources, which means I have a strong suspicion that a lot of them overlap. If even 20% of the URLs overlap within one day's run across projects, then we are creating significantly duplicative fetching work for ourselves later on when we fetch HTML (newscatcher) or parsed text (wayback-machine).
We should:
- enable some logging to measure this overlap over a day or two
- if it is a noticeable percentage, design an approach to make sure we don't refetch a story's content if we've already fetched it in that day's run
This can be assessed without actually running the page fetching, since it just depends on comparing URL lists between projects, which are fetched pretty quickly.
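A minimal sketch of that measurement step could look like the following, assuming `project_urls` is a hypothetical mapping from project ID to the set of story URLs its query returned for one day's run:

```python
import logging
from itertools import combinations

logger = logging.getLogger(__name__)

def log_url_overlap(project_urls: dict[str, set[str]]) -> None:
    """Log how much the per-project URL lists overlap within one run."""
    all_urls = [url for urls in project_urls.values() for url in urls]
    unique = set(all_urls)
    if all_urls:
        dupe_pct = 100 * (len(all_urls) - len(unique)) / len(all_urls)
        logger.info("%d total URLs, %d unique (%.1f%% duplicative)",
                    len(all_urls), len(unique), dupe_pct)
    # Pairwise overlap helps identify which projects share keywords/sources.
    for (proj_a, urls_a), (proj_b, urls_b) in combinations(project_urls.items(), 2):
        shared = urls_a & urls_b
        if shared:
            logger.info("%s / %s share %d URLs", proj_a, proj_b, len(shared))
```

If the overall duplicative percentage lands above the ~20% threshold mentioned above, the dedup work is probably worth it.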
My gut tells me this will speed up the fetching process because it will reduce the total number of URLs we need to fetch content for. The key is that we need to remember the N projects that a particular URL was associated with, and make sure that content goes into the classifier N times (once per project) after the content is fetched.
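One way to keep that URL-to-projects association would be to invert the per-project lists into a URL-keyed map before fetching. This is a sketch only; `fetch` and `classify` are hypothetical stand-ins for the existing newscatcher/wayback-machine fetching and per-project classifier steps:

```python
from collections import defaultdict
from typing import Callable

def fetch_once_classify_per_project(
    project_urls: dict[str, set[str]],
    fetch: Callable[[str], str],
    classify: Callable[[str, str], None],
) -> None:
    """Fetch each URL once, then classify its content once per associated project."""
    # Invert project -> URLs into URL -> [project IDs] so the association survives dedup.
    url_projects: dict[str, list[str]] = defaultdict(list)
    for project_id, urls in project_urls.items():
        for url in urls:
            url_projects[url].append(project_id)

    for url, project_ids in url_projects.items():
        content = fetch(url)  # content fetched exactly once per run
        for project_id in project_ids:
            classify(content, project_id)  # classified N times, once per project
```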