counterdata-network / story-processor

Story discovery engine for the Counterdata Network. Grabs relevant stories from various APIs, runs them against bespoke classifier models, and posts results to a central server.
Apache License 2.0

investigate newscatcher story limit logic #35

Closed rahulbot closed 11 months ago

rahulbot commented 1 year ago

Our fetchers send an update each night with total story counts to the #feminicides-story-update channel. The newscatcher update indicates that it is fetching 300k+ stories each night, which seems wrong. The logic for MAX_STORIES_PER_PROJECT in queue_newscatcher_stories.py looks sound, and the limit is currently set to 2,000 stories per project. It is only checking 59 projects, so it should never check more than about 120,000 stories in total (59 × 2,000 = 118,000). So why is it reporting so many stories, with some projects fetching more than the 2k limit? Is the counting wrong?
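For reference, a rough sketch of the sanity check described above: compute the ceiling implied by the per-project limit and flag any project whose reported count exceeds it. The function names `expected_ceiling` and `projects_over_limit` are hypothetical and not taken from the repo; the counts are a few lines copied from the Oct 10, 2023 report in this issue.

```python
MAX_STORIES_PER_PROJECT = 2000  # limit quoted in this issue
PROJECT_COUNT = 59              # "Checking 59 projects." in the nightly report


def expected_ceiling(project_count, max_per_project=MAX_STORIES_PER_PROJECT):
    """Upper bound on stories per run if the per-project limit is honoured."""
    return project_count * max_per_project


def projects_over_limit(counts, max_per_project=MAX_STORIES_PER_PROJECT):
    """Return {project_id: count} for projects reporting more than the limit."""
    return {pid: n for pid, n in counts.items() if n > max_per_project}


# A few per-project counts copied from the Oct 10, 2023 report
reported = {21: 94, 207: 20163, 93: 6135, 213: 1961}

print(expected_ceiling(PROJECT_COUNT))  # 118000
print(projects_over_limit(reported))    # {207: 20163, 93: 6135}
```

The reported total of 326,422 is well above the 118,000 ceiling, so either the limit is not being applied where the count is taken or the count itself is inflated.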

Excerpt from the Oct 10, 2023 report below. Note the total in the title and how Project 207 says it fetched more than 20k stories.

FEMINICIDE NEWSCATCHER UPDATE: 326422 STORIES (1324.33 MINS) - V3.6.0
Checking 59 projects.
Project 21 - Feminicidio Uruguay: 94 stories
Project 207 - Feminicídio na Região Norte: 20163 stories
Project 93 - AAPF Test Project, all US Media: 6135 stories
Project 177 - 한국어 테스트 (Korean Model Test): 610 stories
Project 212 - Cobertura de feminicídios: 20169 stories
Project 183 - 한국 페미사이드 기록: 610 stories
Project 181 - Basurización de los cuerpos de las mujeres: 129 stories
Project 185 - Trans femicide: 0 stories
Project 213 - Cobertura de feminicídios na ESPANHA: 1961 stories
Project 29 - Genetic Genealogy Cold Cases: 174 stories

rahulbot commented 1 year ago

After some discussion on Slack, I pushed a fresh release to prod yesterday. Attached are the full log file from that run and the summary update posted to Slack.

rahulbot commented 11 months ago

I found the root cause: a lack of deduplication when fetching HTML for URLs and then relinking that HTML content with the story's project metadata. I found this via a debug breakpoint in fetch_text in both WM and NC. Fixed in 338a5e5.
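To illustrate how this kind of bug can inflate counts, here is a minimal sketch. It is not the code from 338a5e5. The story dictionaries, `relink_naive`, `relink_deduplicated`, and the `fetch` callable are all hypothetical stand-ins, and the assumption is that the same URL can be queued for more than one project.

```python
from collections import defaultdict


def relink_naive(stories, fetched_pages):
    """Buggy pattern: each fetched page is matched to every story sharing its URL."""
    linked = []
    for page in fetched_pages:  # one page per queued story, duplicates included
        for story in stories:
            if story["url"] == page["url"]:
                linked.append({**story, "html": page["html"]})
    return linked


def relink_deduplicated(stories, fetch):
    """Fetch each unique URL once and link it to each (project, URL) pair once."""
    by_url = defaultdict(list)
    for story in stories:
        by_url[story["url"]].append(story)

    linked = []
    for url, group in by_url.items():
        html = fetch(url)  # a single fetch per unique URL
        seen_projects = set()
        for story in group:
            if story["project_id"] in seen_projects:
                continue  # skip a repeat of the same URL within one project
            seen_projects.add(story["project_id"])
            linked.append({**story, "html": html})
    return linked


# Hypothetical data: one URL queued for two projects
stories = [
    {"project_id": 207, "url": "https://example.com/a"},
    {"project_id": 212, "url": "https://example.com/a"},
]
pages = [{"url": s["url"], "html": "<html></html>"} for s in stories]

fetch_calls = []


def fake_fetch(url):
    fetch_calls.append(url)
    return "<html></html>"


naive = relink_naive(stories, pages)
deduped = relink_deduplicated(stories, fake_fetch)
print(len(naive), len(deduped), len(fetch_calls))  # 4 2 1
```

In the naive version, two queued stories become four linked results because every page matches every story with the same URL. Keying on the URL for fetching and on the (project, URL) pair for linking keeps the count at two.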