Closed rahulbot closed 11 months ago
After some discussion on Slack, I pushed a fresh release to prod yesterday. Attached are the full log file from that run and the summary update posted to Slack.
I found the root cause: a lack of deduplication when fetching HTML for URLs and then relinking that HTML content with the story's project metadata. I found this via a debug breakpoint in fetch_text in both WM and NC. Fixed in 338a5e5.
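To illustrate the kind of deduplication described above, here is a minimal sketch. It is not the code from 338a5e5; the function name fetch_and_relink, the fetch_html callable, and the "project_id" / "url" / "html" dictionary keys are all hypothetical and chosen only for illustration. The idea is to fetch each distinct URL once and emit exactly one record per unique (project, URL) pair, so counts cannot inflate during relinking.

```python
from typing import Callable, Dict, List


def fetch_and_relink(stories: List[dict], fetch_html: Callable[[str], str]) -> List[dict]:
    """Fetch each distinct URL once, then attach the HTML to each unique story.

    Assumes (hypothetically) that every story dict has "project_id" and "url" keys.
    """
    # Drop repeated (project_id, url) pairs so a story is not counted twice per project.
    seen = set()
    unique_stories = []
    for story in stories:
        key = (story["project_id"], story["url"])
        if key not in seen:
            seen.add(key)
            unique_stories.append(story)

    # Fetch HTML once per distinct URL, even when several projects share that URL.
    html_by_url: Dict[str, str] = {}
    for story in unique_stories:
        url = story["url"]
        if url not in html_by_url:
            html_by_url[url] = fetch_html(url)

    # Relink: exactly one output record per unique story, never a many-to-many join.
    return [{**story, "html": html_by_url[story["url"]]} for story in unique_stories]
```

With this shape, the number of records returned can never exceed the number of unique (project, URL) pairs queued, which makes an inflated nightly total easier to spot.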
Our fetchers send an update each night with total story counts to the #feminicides-story-update channel. The newscatcher update indicates that it is fetching 300k+ stories each night, which seems wrong. The logic for MAX_STORIES_PER_PROJECT looks pretty sound in queue_newscatcher_stories.py, and it is currently set at 2,000 stories max per project. It is only checking 59 projects, so it shouldn't ever check more than about 120,000 stories total across all the projects (59 × 2,000 = 118,000). So why is it reporting so many stories, with some projects fetching more than the 2k limit? Is the counting wrong?
Excerpt from the Oct 10, 2023 report below. Note the total in the title and how Project 207 says it fetched more than 20k stories.