carpentries-incubator / lc-webscraping

Introduction to web scraping
https://carpentries-incubator.github.io/lc-webscraping/

Update Canadian URLs & associated screenshots. #29

Closed · pansapiens closed this 5 years ago

pansapiens commented 5 years ago

This PR fixes the Canadian Parliament URLs, which were inconsistent between the text and the example XPath code (ontl.on.ca vs. ola.org), and updates one URL to follow the latest redirect.

The page structure of the Canadian Parliament pages had also changed slightly, so the screenshots were updated to reflect the current versions and the XPath shown in the Scraper screenshots now matches the current state of the pages.

pansapiens commented 5 years ago

A related note: while this PR fixes the material (in episode 3) to match the current state of the target pages, I don't think the lesson is maintainable long-term in its current state, because the underlying input data (web pages at URLs we don't control) is subject to frequent changes (e.g. see issues https://github.com/LibraryCarpentry/lc-webscraping/issues/23 and https://github.com/LibraryCarpentry/lc-webscraping/issues/24).

@RichardPBerry has suggested using the stable snapshots provided by the Internet Archive; unfortunately these present a few problems (namely a robots.txt that prevents Scrapy from working with default settings). We should really find a way to make the Internet Archive work for our use case, or use similar static snapshots hosted elsewhere, to prevent continual breakage of the workshop material.

libcce commented 5 years ago

@pansapiens we recently launched a Curriculum Advisory Committee, and one of the items discussed in our first meeting was scheduling targeted sprints. I can see us either addressing the content being scraped or exploring the use of the Internet Archive, as you suggest, in a targeted sprint. Do you think a targeted sprint would help?

pansapiens commented 5 years ago

@libcce Yes, this seems like a good focused task to address in a sprint. There are tradeoffs with the various approaches (technical and pedagogical), so it would be worthwhile exploring them to reach some kind of consensus and a stable solution.

My current assessment of the snapshot hosting options looks something like this - it might make a useful starting point.

| Hosting option | Unmodified content | Stable content | Renders like original in browser | No robots.txt * | 'Real URL' ** |
| --- | --- | --- | --- | --- | --- |
| Internet Archive (normal link) | no | yes | yes | no | substring |
| Internet Archive (`id_` link) | yes | yes | no | no | substring |
| Direct link | yes | no | yes | yes | yes |
| Static file on GitHub | yes | yes | yes? | yes? | substring possible |
| Other snapshot hosting | ? | ? | ? | ? | ? |

* - Scrapy can be configured to ignore robots.txt (see the sketch after these notes); however, this isn't ideal due to 1) added complexity in the workshop code and 2) ethical considerations: scraping content that authors have explicitly flagged they don't want scraped is a grey area and probably shouldn't be encouraged.

** - I feel like there's value in having a URL that reflects the original, so that workshop participants get a sense of scraping a 'real' data source.
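
For concreteness, a minimal sketch of the setting involved, assuming a standard project generated by `scrapy startproject` (the layout is illustrative, not part of the lesson):

```python
# settings.py: the scrapy startproject template enables robots.txt compliance
# (ROBOTSTXT_OBEY = True), which is why the Wayback Machine's robots.txt stops
# the default configuration. Flipping it tells Scrapy to ignore robots.txt
# entirely; see the ethical caveat in the first note above.
ROBOTSTXT_OBEY = False
```

The same override can also be set per spider via the `custom_settings` class attribute, but either way it's extra ceremony to explain to learners.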

libcce commented 5 years ago

@pansapiens this is helpful! And looping in @jt14den @ccronje @katrinleinweber @erikamias @laufers so they know. The Curriculum Advisory Committee is still fairly new, so nothing yet on when/how we might consider sprints.

katrinleinweber commented 5 years ago

I'm a bit confused by the 'No robots.txt' column. Does 'no' mean there is such a filter, or not? How about renaming it to 'robots.txt blocks scrapers?' with plain yes or no values?

Could we also host a verbatim copy of the site through this repo? The [carpentries/assessment](https://github.com/carpentries/assessment/) repo, for example, has at least one HTML page directly included (in learner-assessment/code), which is rendered here.

I presume we can mimic that by grabbing an Internet-Archived copy of a site, committing it to this repo, and perhaps updating its hyperlinks to be relative, no?
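
Roughly speaking, grabbing a single page for the repo could look like the sketch below; the URL, output path, and the note about link rewriting are assumptions on my part, and a dedicated mirroring tool would handle the link rewriting more thoroughly.

```python
# Fetch one page and save it under snapshots/ so it can be committed to the repo.
# Hyperlinks and asset references would still need to be rewritten to relative
# paths (by hand or with a mirroring tool) so the copy renders like the original.
import pathlib
import urllib.request

url = "https://www.ola.org/en/members/current"  # illustrative target page
out = pathlib.Path("snapshots/members-current.html")
out.parent.mkdir(parents=True, exist_ok=True)

with urllib.request.urlopen(url) as response:
    out.write_bytes(response.read())
```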

pansapiens commented 5 years ago

@katrinleinweber a 'yes' in the 'No robots.txt' column means there is no robots.txt preventing scraping (so Scrapy will work with its default settings). Sorry about the double negative; the intent was for 'yes' in each cell to mean "this suits our purposes best", so the options can be compared more easily.

We could keep a verbatim copy in the repo; that's the 'Static file on GitHub' option in the table above.
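
Purely as an illustration (the snapshot URL and XPath below are hypothetical), the lesson's examples could then point at the stable copy instead of the live site:

```python
# A minimal spider whose start URL is a snapshot served from GitHub Pages
# (hypothetical URL) rather than the live, frequently changing source site.
import scrapy


class MppSpider(scrapy.Spider):
    name = "mpps"
    start_urls = [
        "https://carpentries-incubator.github.io/lc-webscraping/snapshots/mpps.html"
    ]

    def parse(self, response):
        # The XPath here stands in for the kind of extraction the lesson teaches.
        for name in response.xpath("//td/a/text()").getall():
            yield {"name": name.strip()}
```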