Scraping lists are attached; here are the relevant notes:
The minus_2023_fcmt list is a list of domains that meet the following criteria:
However, that may not be the best approach: it's possible for one of these sites to be in the FCMT this year but also have earlier ClaimReview data embedded… So the full list of sites that meet criteria 1 and 2 is also attached.
We'll have to write a one-off exception for AfricaCheck to handle this format:
{"@context"=>"https://schema.org/",
"@graph"=>
[{"@type"=>"ClaimReview",
"claimReviewed"=>""In total, Kenya exports about US$890 million in goods to the US each year."",
"url"=>"http://africacheck.org/fact-checks/reports/us-ambassador-kenya-margaret-meg-whitman-loves-muck-around-data-do-her-numbers",
"itemReviewed"=>
{"@type"=>"CreativeWork", "name"=>"YouTube", "url"=>"https://youtu.be/w4FLWYe4Tqc?t=2777", "datePublished"=>"2023-09-01T14:00:00+0200", "author"=>{"@type"=>"Organization", "name"=>"Meg Whitman", "sameAs"=>"https://ke.usembassy.gov/ambassador-margaret-meg-whitman/"}},
"datePublished"=>"2023-10-27T12:42:20+0200",
"dateModified"=>"2023-10-27T12:57:35+0200",
"author"=>"Organization Africa Check http://africacheck.org/about/authors/makinia-juma-sylvia https://twitter.com/ Array",
"reviewRating"=>{"@type"=>"Rating", "ratingValue"=>"1", "alternateName"=>"Incorrect", "bestRating"=>"6", "worstRating"=>"1"}}]}```
Scraping for Insights is going to be fundamentally different from what Zenodotus already does for MediaReview. Instead of being pushed what we want to scrape, we have to go out and find it. That requires something bigger, more nimble, and more ubiquitous: something that can take any URL and crawl the whole site looking for any ClaimReview or MediaReview we can find.
While I'm a fan of monoliths, this kind of thing is already handled by other projects. The best known and most widely used is probably Scrapy, a Python project designed for exactly this situation that can be deployed to some pretty great infrastructure out of the box. We write a spider describing what we're looking for (in Python, but we can all handle that) and then let it loose on our list of URLs. From there we can have it push anything it finds to Zenodotus for saving/archiving. That way Zenodotus still just receives pushes, and if our scraping infrastructure goes down, nothing else does.
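To make that concrete, here's a minimal sketch of the kind of spider I have in mind. It assumes we seed it from the attached domain lists; the Zenodotus hand-off isn't shown and would live in an item pipeline (the endpoint is ours to define), so treat this as the shape of the idea rather than a finished implementation.

```python
import json

import scrapy


class ReviewSpider(scrapy.Spider):
    """Crawl a site and yield every ClaimReview/MediaReview found in JSON-LD."""

    name = "review_spider"
    # In practice these would be loaded from the attached domain lists.
    allowed_domains = ["africacheck.org"]
    start_urls = ["https://africacheck.org/"]

    def parse(self, response):
        # Grab every JSON-LD block on the page.
        for blob in response.css('script[type="application/ld+json"]::text').getall():
            try:
                data = json.loads(blob)
            except json.JSONDecodeError:
                continue
            # Some sites wrap reviews in an @graph array (see the AfricaCheck example above).
            items = data.get("@graph", [data]) if isinstance(data, dict) else data
            for item in items:
                if isinstance(item, dict) and item.get("@type") in ("ClaimReview", "MediaReview"):
                    yield {"source_url": response.url, "review": item}

        # Follow internal links so the whole site gets covered;
        # Scrapy's offsite filtering and request de-duplication keep this sane.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

An item pipeline would then POST each yielded item to Zenodotus, which keeps Zenodotus push-only exactly as it is today.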