carpentries-incubator / lc-webscraping

Introduction to web scraping

Update Canadian URLs & associated screenshots. #29

Closed pansapiens closed 5 years ago

pansapiens commented 5 years ago

This PR fixes the Canadian Parliament URLs which were inconsistent between the text and example XPath code ( vs., and updates one URL to the latest redirect.

The page structure for Canadian Parliament pages had also changed slightly - screenshots were updated to reflect the current versions so that the XPath displayed in the Scraper screenshots matches the current state.

pansapiens commented 5 years ago

A related note: while this PR fixes the material (in episode 3) to match the current state of the target pages, I don't think the lesson is maintainable long-term in its current state, because the underlying input data (web pages at URLs we don't control) is subject to frequent changes (eg, see issues and

@RichardPBerry has suggested using the stable snapshots provided by the Internet Archive. Unfortunately, these present a few problems (namely robots.txt preventing Scrapy from working with default settings). We should really find a way to make the Internet Archive work for our use case, or use similar static snapshots hosted somewhere, to prevent continual breakage of the workshop material.

libcce commented 5 years ago

@pansapiens we recently launched a Curriculum Advisory Committee and one of the items that was discussed in our first meeting was to schedule targeted sprints. I can see us either addressing the content being scraped or exploring the use of IA like you suggest in the targeted sprint. Do you think a targeted sprint would help?

pansapiens commented 5 years ago

@libcce Yes, this would seem like a good focused task to address in a sprint. There are tradeoffs with the various approaches (technical and pedagogical), so it would be worthwhile exploring to get some kind of consensus and a stable solution.

My current assessment of the snapshot hosting options looks something like this - it might make a useful starting point.

| Hosting option | Unmodified content | Stable content | Renders like original in browser | No robots.txt * | 'Real URL' ** |
|---|---|---|---|---|---|
| Internet Archive (normal link) | no | yes | yes | no | substring |
| Internet Archive (id_ link) | yes | yes | no | no | substring |
| Direct link | yes | no | yes | yes | yes |
| Static file on Github | yes | yes | yes? | yes? | substring possible |
| Other snapshot hosting | ? | ? | ? | ? | ? |

* - Scrapy can be configured to ignore robots.txt, but this isn't ideal because of 1) the added complexity in the workshop code and 2) ethical considerations: scraping content that authors have explicitly flagged as not to be scraped is a grey area and probably shouldn't be encouraged.

** - I feel like there's value in having a URL that reflects the original, so that workshop participants get a sense of scraping a 'real' data source.
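
To make the 'substring' entries in the table concrete, here is a minimal sketch of how the two Internet Archive link styles are usually formed. The timestamp and the example.org address are placeholders, not URLs from the lesson, and the URL layout is an assumption based on how Wayback Machine links commonly look.

```python
# Assumed Wayback Machine URL layout; verify against a real snapshot link.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url, timestamp, raw=False):
    """Build a Wayback Machine style link for original_url.

    raw=False gives the normal link (page as rendered by the archive).
    raw=True appends the 'id_' flag, which requests the unmodified capture.
    """
    flag = "id_" if raw else ""
    return f"{WAYBACK_PREFIX}/{timestamp}{flag}/{original_url}"


# Hypothetical values, for illustration only.
original = "https://www.example.org/members"
timestamp = "20190101000000"

normal_link = wayback_url(original, timestamp)
raw_link = wayback_url(original, timestamp, raw=True)

print(normal_link)
print(raw_link)
# In both cases the original URL survives only as a substring of the link.
print(original in normal_link, original in raw_link)
```

Either way, participants would paste a web.archive.org address rather than the original one, which is the tradeoff the 'Real URL' column is trying to capture.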

libcce commented 5 years ago

@pansapiens this is helpful! And looping in @jt14den @ccronje @katrinleinweber @erikamias @laufers so they know. The Curriculum Advisory Committee is still fairly new, so nothing yet on when/how we might consider sprints.

katrinleinweber commented 5 years ago

I'm a bit confused by the "No robots.txt" column. Does "no" mean there is such a filter, or not? How about renaming it to "robots.txt blocks scrapers?" and answering yes or no?

Could we also host a verbatim copy of the site through this repo? The Carpentries/assessment repo, for example, has at least one HTML page directly included in learner-assessment/code, which is rendered here.

I presume we can mimic that by grabbing an Internet-Archive-d copy of a site, committing it to this repo and maybe updating its hyperlinks to relative, no?
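
As a rough illustration of the "updating its hyperlinks to relative" step, the sketch below rewrites absolute `href` and `src` values that point at the snapshotted site into paths relative to the page being edited. The function name, the example.org URLs and the regex-based approach are all assumptions made for brevity; a real snapshot would probably want a proper HTML parser, and this version drops query strings and fragments.

```python
import posixpath
import re
from urllib.parse import urlsplit

# Matches double-quoted href="..." and src="..." attributes only.
ATTR_RE = re.compile(r'(?P<attr>\b(?:href|src))="(?P<url>[^"]*)"')


def relativize_links(html, site_root, page_path):
    """Rewrite links under site_root so they are relative to page_path.

    site_root: hypothetical original site, e.g. "https://example.org"
    page_path: path of the page inside the snapshot, e.g. "/members/index.html"
    """
    root = site_root.rstrip("/")
    page_dir = posixpath.dirname(page_path) or "/"

    def replace(match):
        url = match.group("url")
        if not url.startswith(root + "/"):
            return match.group(0)  # external or already relative: leave as-is
        target_path = urlsplit(url).path  # query string and fragment are dropped
        relative = posixpath.relpath(target_path, start=page_dir)
        return f'{match.group("attr")}="{relative}"'

    return ATTR_RE.sub(replace, html)


# Hypothetical page content, for illustration only.
page = (
    '<a href="https://example.org/members/list.html">Members</a>'
    '<img src="https://example.org/img/logo.png">'
    '<a href="https://other.org/">Elsewhere</a>'
)
rewritten = relativize_links(page, "https://example.org", "/members/index.html")
print(rewritten)
```

Links to other sites are left untouched, so the committed copy would still point outward wherever the original did.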

pansapiens commented 5 years ago

@katrinleinweber a 'yes' in the 'No robots.txt' column means there is no robots.txt preventing scraping (so Scrapy will work using default settings). Sorry about the double negative - the purpose was to make the 'yes' in each cell mean "this suits our purposes best" to allow different methods to be more easily assessed.
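
For anyone wanting to fill in that column for a candidate host, here is a minimal sketch using Python's standard `urllib.robotparser` to test whether a set of robots.txt rules would block a given URL. The rules, user agent and URLs below are hypothetical and are not the actual rules of the Internet Archive or any other site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /web/
"""


def is_allowed(rules_text, user_agent, url):
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" and the example.org URLs are placeholders.
snapshot_ok = is_allowed(EXAMPLE_RULES, "ExampleBot", "https://example.org/web/page.html")
about_ok = is_allowed(EXAMPLE_RULES, "ExampleBot", "https://example.org/about.html")
print(snapshot_ok, about_ok)
```

A host would earn a 'yes' in the column only if the pages used in the lesson come back as allowed. Scrapy's own `ROBOTSTXT_OBEY` setting controls whether it honours these rules at all, which is the extra configuration the table footnote is wary of.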

We could keep a verbatim copy in the repo. This is the 'Static file on Github' option.