Review web scraping lesson and get it ready for publication

ostephens commented 7 years ago

The web scraping lesson https://github.com/data-lessons/library-webscraping was initially developed by @timtomch

The contents of @timtomch's lesson have been copied to this repository to be reviewed and amended before the lesson is ready for publication as part of the Library Carpentry materials.

Use https://github.com/data-lessons/library-webscraping/issues for issues to be worked on during the sprint

drjwbaker commented 7 years ago

When someone gets a minute, please report back on progress during the sprint and close this issue. Ta!

ldko commented 7 years ago

During the 2017 sprint we worked on setup and README files and outlined a plan to:

Remove the Chrome extensions section, to focus more on scraping with Python as a way to learn and apply programming concepts rather than on an existing tool that may not be supported for very long.
Add CSS selector examples alongside the existing XPath examples, since they lead into the episodes about selecting items with BeautifulSoup.
Replace the Scrapy instructions with Requests and BeautifulSoup (#6, #14).
Teach BeautifulSoup (#11, #12) along a general outline that follows the structure of a scraping tutorial used by the University of Oklahoma, but uses the UN site to get data about Security Council resolutions for the examples.

We also discussed modifying where and how ethics are brought up in the lesson, and the possible benefits of using URLs from the archive.org Wayback Machine for the scraping examples: those should be static, whereas a live production site may change at any time. We also set up a GitHub Project in the library-webscraping repo for tracking progress.

Because fewer people worked on the web scraping lesson on day 2 of the sprint, few of the content structure changes proposed on the first day were actually made.

data-lessons / librarycarpentry

Review web scraping lesson and get it ready for publication #35