carpentries-incubator / lc-webscraping

Introduction to web scraping
https://carpentries-incubator.github.io/lc-webscraping/

Links #23

Open RichardPBerry opened 6 years ago

RichardPBerry commented 6 years ago

Hi,

More broken links:
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L81
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L82
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L198
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L220
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L221
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L288
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L303
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L337
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L365
https://github.com/LibraryCarpentry/lc-webscraping/blame/gh-pages/_episodes/04-scrapy.md#L405

Given that the Canadian parliament is currently dissolved, there is also not much information under current members, so it might be better to point here: https://www.ola.org/en/members/parliament-41

libcce commented 6 years ago

Does it make sense to change the links now since the pages will ultimately be updated with new information? Or should we link to a particular snapshot in time using the Wayback Machine, for example: https://web.archive.org/web/20170715183551/http://www.ontla.on.ca/web/members/members_current.do?locale=en

RichardPBerry commented 6 years ago

I think it would be worth updating the link to the Wayback version, because it future-proofs against both the dissolving of the parliament and website changes. In fact it's probably a great example of one of the perils of web scraping, in that your code can break if the website updates!

RichardPBerry commented 6 years ago

Actually I just tried to scrape the IA page with scrapy and received a "Forbidden by robots.txt" error. Currently it is set to deny all user agents bar the ia_archiver. I guess this might require someone from LC contacting IA to gain permission for scraping for training purposes?
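For anyone wanting to reproduce the check offline, here is a minimal sketch using Python's standard `urllib.robotparser`. The `ROBOTS_TXT` content is a hypothetical reconstruction of the rule described above (deny everyone except `ia_archiver`), not a copy of the real file, and `is_allowed` is just an illustrative helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt matching the description above:
# ia_archiver may fetch everything, every other user agent is denied.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


SNAPSHOT = (
    "https://web.archive.org/web/20170715183551/"
    "http://www.ontla.on.ca/web/members/members_current.do?locale=en"
)
```

Under these assumed rules, `is_allowed(ROBOTS_TXT, "ia_archiver", SNAPSHOT)` is true while `is_allowed(ROBOTS_TXT, "Scrapy", SNAPSHOT)` is false, which would be consistent with the error above.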

pansapiens commented 6 years ago

I've hit similar issues in episode 3 (PR https://github.com/LibraryCarpentry/lc-webscraping/pull/29).

Another issue with Internet Archive snapshots is that an extra <div> is injected, which changes the page structure and often the XPath generated by the Scraper browser plugin. The workaround is to use the id_ variant of the IA URL, which gives the original unmodified page (e.g., http://webarchive.parliament.uk/20150218214039/http://www.parliament.uk/mps-lords-and-offices/mps/ vs. http://webarchive.parliament.uk/20150218214039id_/http://www.parliament.uk/mps-lords-and-offices/mps/ ). Unfortunately this results in broken links to CSS and images, making the page look broken, but the content can still be reliably scraped using Scraper.
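As a rough sketch of that URL rewrite, assuming the snapshot timestamp is always a 14-digit path segment as in the two example URLs (the function name `to_unmodified_snapshot` is made up for illustration):

```python
import re

# Assumption: the snapshot timestamp is a 14-digit path segment (YYYYMMDDhhmmss).
_TIMESTAMP_SEGMENT = re.compile(r"/(\d{14})/")


def to_unmodified_snapshot(url: str) -> str:
    """Append 'id_' to the first timestamp segment of an archive URL.

    URLs that already use the id_ variant, or that have no 14-digit
    timestamp segment, are returned unchanged.
    """
    return _TIMESTAMP_SEGMENT.sub(r"/\1id_/", url, count=1)


archived = (
    "http://webarchive.parliament.uk/20150218214039/"
    "http://www.parliament.uk/mps-lords-and-offices/mps/"
)
print(to_unmodified_snapshot(archived))
```

Only the first match is rewritten, so a timestamp-like segment inside the original page's own URL would be left alone.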

With regard to robots.txt: Scrapy can be configured to ignore this (https://www.simplified.guide/scrapy/ignore-robots), but it could be a controversial workaround with regard to the ethics of web scraping.
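For reference, the setting involved is `ROBOTSTXT_OBEY`, which lives in the Scrapy project's `settings.py`. The fragment below is only a sketch of what that change looks like; whether it is appropriate for the lesson is exactly the ethical question raised above.

```python
# settings.py of the Scrapy project (fragment, not a complete settings file)

# Projects generated with "scrapy startproject" set this to True.
# Setting it to False makes Scrapy skip the robots.txt check entirely,
# so it should only be done with the site owner's permission.
ROBOTSTXT_OBEY = False
```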

I think we do need to use stable snapshots of pages hosted somewhere, though it may be that the Internet Archive isn't the solution to that.