counterdata-network / story-processor

Story discovery engine for the Counterdata Network. Grabs relevant stories from various APIs, runs them against bespoke classifier models, and posts results to a central server.
Apache License 2.0

projects using "US - State & Local" collection for MC failing #79

Closed rahulbot closed 1 month ago

rahulbot commented 1 month ago

There are 13 projects right now that use the "United States – state and local" collection (38379429). This is making their queries fail due to an internal issue on the Media Cloud side (use of url_search_str makes ES queries too slow right now).

In the short term, we can hack it so they use a slightly modified collection that works better (262985212). I suggest we modify the Media Cloud fetcher to check the collections for each project after it fetches the projects list from the central server, replacing any use of the original collection with the modified one. And of course, add a big comment saying this is a temporary fix. This feels kind of ugly, but I think it is a reasonable solution that reduces the number of changes needed while making the projects work better.

This can be tested on projects 23, 24, 26, 29, 30, 84, 93, 128, 145, 146, 185, 186, 217.
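
A minimal sketch of the swap described above, not the code that was actually deployed. The two collection IDs come from this issue; the function name `swap_slow_collection` and the `media_collections` key are assumptions about how the project list from the central server is shaped, so adjust them to the real payload.

```python
# TEMPORARY WORKAROUND (see issue #79): remove once url_search_str queries
# are fast again on the Media Cloud side.
SLOW_COLLECTION_ID = 38379429          # "United States - State & Local"
REPLACEMENT_COLLECTION_ID = 262985212  # slightly modified copy that queries faster


def swap_slow_collection(projects, collections_key="media_collections"):
    """Replace the slow collection ID in each project's collection list.

    `projects` is assumed to be a list of dicts, each optionally holding a
    list of collection IDs under `collections_key` (a hypothetical name).
    Projects are modified in place and the same list is returned.
    """
    for project in projects:
        collections = project.get(collections_key)
        if not collections:
            continue  # nothing to rewrite for this project
        swapped = []
        for collection_id in collections:
            if int(collection_id) == SLOW_COLLECTION_ID:
                collection_id = REPLACEMENT_COLLECTION_ID
            # Avoid listing the replacement twice if it was already present.
            if collection_id not in swapped:
                swapped.append(collection_id)
        project[collections_key] = swapped
    return projects
```

Calling this right after the projects list is fetched keeps the change in one place, so the project definitions on the central server stay untouched and the workaround is easy to remove later.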

rahulbot commented 1 month ago

I deployed this earlier today. @math4humanities can you check tomorrow to see if the projects that were failing and use US State and Local succeeded at a higher rate?

math4humanities commented 1 month ago

From the last two runs, we are averaging a 6.5% failure rate. Performance is a whole lot better compared to the runs pre-fix with failure rates averaging around 40%.

rahulbot commented 1 month ago

Fantastic. Do you have actions to pursue for the remaining projects that are still failing regularly? Do you think they are connected to the query syntax issue unearthed in the work on newscatcher (#78)? Or are they the ones that still have url_search_str use? Some other still undetermined problem?

math4humanities commented 1 month ago

Yes, I have a couple of things to explore. There is little overlap between the failing newscatcher projects and the ones timing out on mediacloud, so I believe the issues are unrelated. However, there is a far greater intersection with projects that use collections with url search strings. Anaelle and I will be making judgement calls on whether the use of url search strings is necessary in their associated collections. Also, since we've narrowed down the failures, I want to try manually running each of the project queries, building up to their full collection sets, to try to catch issues we may not yet be privy to. I am hoping that by the next Counterdata Network meeting we will have a concrete diagnosis and solution.
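
For anyone wanting to reproduce the overlap comparison, here is a rough sketch using plain sets. The helper name `overlap_share` is hypothetical, and the project IDs in the usage example are placeholders, not the real failing projects.

```python
def overlap_share(failing, candidate):
    """Return the fraction of `failing` project IDs that also appear in `candidate`.

    Both arguments are any iterables of project IDs. Returns 0.0 when there
    are no failing projects, to avoid dividing by zero.
    """
    failing, candidate = set(failing), set(candidate)
    if not failing:
        return 0.0
    return len(failing & candidate) / len(failing)


# Placeholder IDs for illustration only.
timing_out_on_mediacloud = {1, 2, 3, 4}
uses_url_search_strings = {2, 3, 9}
share = overlap_share(timing_out_on_mediacloud, uses_url_search_strings)
```

Running the same helper against the failing newscatcher projects and against the projects whose collections use url search strings gives two numbers that can be compared directly.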