TechAndCheck / tech-and-check-alerts

Daily tip sheet for fact checkers
MIT License

Optimize CNN portal crawler to only scrape calendar #278

Open emilyliu7321 opened 5 years ago

emilyliu7321 commented 5 years ago

We should change the CNN portal crawler to only return URLs that are in the calendar portion of the CNN transcript website instead of scraping the entire page.

Presumably, we will not need to scrape for transcripts prior to a certain date.

slifty commented 5 years ago

This feels like a good optimization -- two things to consider.

Considerations

Scraping farther back in history

One tradeoff is that we would lose the ability to scrape transcripts older than whatever CNN shows on the calendar (which appears to be the past 10 days).

[Screenshot: CNN transcript calendar]

Here's the example calendar we would be scraping -- notice how it only shows Nov 17 to Nov 26. I imagine tomorrow it will show Nov 18 to 27.

Coupling to CNN format

The current crawler is minimally coupled to the current page structure -- it crawls ANY "transcript" URL (i.e. any URL that has the form /TRANSCRIPT/{...}).

This change would require us to either define a slightly more specific URL check (e.g. /TRANSCRIPT/{dd-dd-dddd}) or scrape a specific section of the DOM (e.g. $('.cnnTransCal a')).

In either case, if CNN changed the DOM or the URL format for the calendar links, the more specific crawl logic would break.

This seems unlikely, but still.
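
To make the "more specific URL check" concrete, here is a minimal sketch of the first approach. The regex simply follows the /TRANSCRIPT/{dd-dd-dddd} shape used as an example above, so it is an assumption and not a verified description of CNN's real calendar links. `CALENDAR_URL_PATTERN`, `isCalendarUrl`, and `filterCalendarUrls` are hypothetical names, not existing helpers in this repo.

```javascript
// Hypothetical pattern: mirrors the /TRANSCRIPT/{dd-dd-dddd} shape from the
// discussion above. The real CNN calendar URL format should be checked first.
const CALENDAR_URL_PATTERN = /\/TRANSCRIPT\/\d{2}-\d{2}-\d{4}(?:[/?#.]|$)/i;

// True when a URL looks like a dated calendar link rather than any transcript link.
const isCalendarUrl = (url) => CALENDAR_URL_PATTERN.test(url);

// Keep only the calendar-style links from a list of already-scraped URLs.
const filterCalendarUrls = (urls) => urls.filter(isCalendarUrl);
```

A filter like this could run on the list of links the crawler already collects, which keeps the change small. The DOM-based alternative (`$('.cnnTransCal a')`) would instead narrow the selection before any URL check happens.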

Options

I think item 1 (the history thing) is the only real tradeoff concern here. I'm not as worried about CNN changing their link format; if they do, we can react.

So, our options...

  1. Implement the optimization directly (i.e. the scrape horizon cannot be greater than 10 days).
  2. Invoke this as an optimization based on the date horizon (e.g. if the date horizon is greater than 10 days, do a full scrape; otherwise only scrape the calendar).
  3. Do nothing.

I think #2 is a reasonable strategy, and it isn't out of place to put that in the CnnPortalCrawler worker directly before it returns the list of links (or as a util).
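
A rough sketch of what option 2 might look like as a util. The 10-day window is taken from the calendar observation above and is an assumption about CNN's page, and `CALENDAR_WINDOW_DAYS` and `selectLinksForHorizon` are hypothetical names, not existing code in the CnnPortalCrawler.

```javascript
// Assumed size of the CNN calendar window, based on the screenshot discussed
// above (Nov 17 to Nov 26). Adjust if CNN shows a different range.
const CALENDAR_WINDOW_DAYS = 10;

// Hypothetical util: choose which links the crawler returns.
// - horizonDays: how many days back the caller wants to scrape
// - allLinks: every transcript link found on the page (full scrape)
// - calendarLinks: only the links found in the calendar section
const selectLinksForHorizon = ({ horizonDays, allLinks, calendarLinks }) => (
  horizonDays > CALENDAR_WINDOW_DAYS ? allLinks : calendarLinks
);
```

The worker could call this right before it returns its list of links, so the full scrape remains available for longer horizons.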