Scraping lists are attached; here are the relevant notes:
The minus_2023_fcmt list is a list of domains that meet the following criteria:
However, that may not be the best approach: it's possible for one of these sites to be in the FCMT this year but also have earlier ClaimReview data embedded… So the full list of sites that meet criteria 1 and 2 is also attached.
We'll have to write a one-off exception for AfricaCheck to handle this format:
{"@context"=>"https://schema.org/",
"@graph"=>
[{"@type"=>"ClaimReview",
"claimReviewed"=>""In total, Kenya exports about US$890 million in goods to the US each year."",
"url"=>"http://africacheck.org/fact-checks/reports/us-ambassador-kenya-margaret-meg-whitman-loves-muck-around-data-do-her-numbers",
"itemReviewed"=>
{"@type"=>"CreativeWork", "name"=>"YouTube", "url"=>"https://youtu.be/w4FLWYe4Tqc?t=2777", "datePublished"=>"2023-09-01T14:00:00+0200", "author"=>{"@type"=>"Organization", "name"=>"Meg Whitman", "sameAs"=>"https://ke.usembassy.gov/ambassador-margaret-meg-whitman/"}},
"datePublished"=>"2023-10-27T12:42:20+0200",
"dateModified"=>"2023-10-27T12:57:35+0200",
"author"=>"Organization Africa Check http://africacheck.org/about/authors/makinia-juma-sylvia https://twitter.com/ Array",
"reviewRating"=>{"@type"=>"Rating", "ratingValue"=>"1", "alternateName"=>"Incorrect", "bestRating"=>"6", "worstRating"=>"1"}}]}```
Scraping for Insights is going to be fundamentally different from what Zenodotus already does for MediaReview. Instead of being pushed what we want to scrape, we have to go out and find it. That requires something bigger, more nimble, and more ubiquitous: something that can take any URL and crawl the whole site looking for any ClaimReview or MediaReview we can find.
While I'm a fan of monoliths, this kind of thing is already handled by other projects. The best known and most widely used is probably Scrapy, a Python project designed for exactly this situation that can be deployed to some pretty great infrastructure out of the box. We write a spider describing what we're looking for (in Python, but we can all handle that) and then let it loose on our list of URLs. From there we can have it push anything it finds to Zenodotus for saving/archiving. That way Zenodotus still just receives pushes, and if our scraping infrastructure goes down, nothing else does.
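To make that concrete, here's a minimal sketch of the kind of spider I have in mind. It assumes we seed it from the attached domain lists; the Zenodotus hand-off isn't shown and would live in an item pipeline (the endpoint is ours to define), so treat this as the shape of the idea rather than a finished implementation.

```python
import json

import scrapy


class ReviewSpider(scrapy.Spider):
    """Crawl a site and yield every ClaimReview/MediaReview found in JSON-LD."""

    name = "review_spider"
    # In practice these would be loaded from the attached domain lists.
    allowed_domains = ["africacheck.org"]
    start_urls = ["https://africacheck.org/"]

    def parse(self, response):
        # Grab every JSON-LD block on the page.
        for blob in response.css('script[type="application/ld+json"]::text').getall():
            try:
                data = json.loads(blob)
            except json.JSONDecodeError:
                continue
            # Some sites wrap reviews in an @graph array (see the AfricaCheck example above).
            items = data.get("@graph", [data]) if isinstance(data, dict) else data
            for item in items:
                if isinstance(item, dict) and item.get("@type") in ("ClaimReview", "MediaReview"):
                    yield {"source_url": response.url, "review": item}

        # Follow internal links so the whole site gets covered;
        # Scrapy's offsite filtering and request de-duplication keep this sane.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

An item pipeline would then POST each yielded item to Zenodotus, which keeps Zenodotus push-only exactly as it is today.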