Closed jnothman closed 6 years ago
Thanks so much for this write up @jnothman
It feels like there could be room for different web scraping lessons here - an 'intro to web scraping with tools' - focus on a tool, include introduction to HTML/CSS; and a more advanced lesson - possibly 'web scraping with Python'.
I could see this being multiple episodes within a single lesson - but it would have to be clear that the intention wasn't to use all the episodes in one teaching session. (@drjwbaker has suggested a similar approach in the OpenRefine lesson to me previously)
I feel that any tool introduced should follow the selectors we are teaching - so if we are teaching CSS selectors it seems odd to then use a tool that uses a similar but different selector syntax.
Was there any feedback from the participants in terms of how useful they found it and whether it met with their expectations?
Yes, having some optional components makes sense, but in any case a lesson will be cropped to fit its schedule and audience when it is presented. One question is whether the optional components are alternatives (hard to maintain) or extensions (which still have their challenges).
I did not collect feedback in an organised manner but hope to get a list of participant emails to ask for feedback after the fact. (In focusing on developing the lesson I didn't prepare enough for that aspect.) I had one strong "I'm struggling" response between CSS selectors and visual scraping. I had the sense that most other people were following along well and asking appropriate questions about the exercises, and I got a couple of positive comments.
Great workshop, and great summary @jnothman, I think you picked out all the key points. I will add that having attended with absolutely 0 web scraping experience I got a lot out of it!
I agree that this could be broken down into basic (visual) and advanced (code-based) lessons. The mechanics of how to do that best I would leave to you... :)
The only things I would add are: a) some diagrams, or perhaps looking at the structure of a very simple webpage using the element inspector, might be a good way to give those not familiar with HTML some more solid grounding (looking at the structure of the course material page can be a bit overwhelming)
b) maybe after introducing the concept of one or two selectors it would be good to jump straight into the visual scraper tool and try this out on the simple webpage. This could be followed up by the more in-depth discussion of various CSS selectors and the UNSC example. I think this would help cement the concept and break up the theoretical discussion at the start.
Last point, personally I think the UNSC example is really good. The quirks of this site show how difficult good scraping could be.
This afternoon, I had 3h (including a 10 min break) to present web scraping. I presented from https://ctds-usyd.github.io/2017-07-03-resbaz-webscraping/. I am not a trained SWC instructor, and not used to the narrative format of SWC lessons. I am also an experienced software engineer, so while I am used to some amount of teaching, it was hard for me to recall how much groundwork there is to this topic. In the context of ResBaz, I was presenting to a group of research students, librarians, possibly academics, etc. from Sydney universities. I did not get anything in the way of a survey, but hope to ask the ResBaz organisers to email students for their comments.
There were about 22 students, though 40 had signed up. Despite the Library Carpentry resolutions of a few weeks ago to focus on coding scrapers, I had decided to make something accessible to non-coders. In the end, we did not cover the coding part at all. I don't think we suffered greatly for this.
What we managed to cover
We covered, perhaps, half the material:
Good points
Things deserving attention
Overall
There is far too much narrative before getting hands dirty. Even so, students seemed to appreciate the "what web scraping is not" at least to some extent. Could probably be moved to conclusion.
Students who were not well grounded in the structure of web pages struggled.
I had two projector screens. Even so, it is challenging to set up a visual projection that covers: the lesson, the page being scraped, source code or element inspector for a page being scraped, the scraping tool or code...
I think it would be good to focus on a visual scraper, but then have a number of scripts in several scraping frameworks and languages available as supplementary material to the lesson. A discussion of the nuances of coding these things by hand can be left brief, or available with more description for an extended lesson.
I feel that visual scrapers are a good way to demonstrate what we're up to with little coding competence required, and are in practice a useful technology to grok.
The key thing we need to consider is to what extent we make this available with a "choose your own adventure: CSS vs XPath; visual vs requests/lxml vs scrapy" approach, or as a single well-honed curriculum that works for most people.
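To make the CSS vs XPath choice a bit more concrete, here is a rough sketch that pairs a few CSS selectors with XPath expressions I would consider roughly equivalent. It only runs the XPath side, using Python's standard-library `xml.etree.ElementTree`, which supports a limited XPath subset and no CSS at all. The sample markup, the `PAIRS` table and the `counts` name are made up for illustration and are not from the lesson.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from the lesson).
SAMPLE = """
<html><body>
  <div class="resolution"><a href="/doc/1">First</a></div>
  <div class="resolution"><a href="/doc/2">Second</a></div>
  <div class="other"><a href="/doc/3">Third</a></div>
</body></html>
"""

# CSS selector -> roughly equivalent XPath in ElementTree's limited subset.
# Caveat: [@class='resolution'] is an exact attribute match, whereas the CSS
# .resolution would also match class="resolution draft".
PAIRS = {
    "div": ".//div",
    "div.resolution > a": ".//div[@class='resolution']/a",
    "a[href]": ".//a[@href]",
}

root = ET.fromstring(SAMPLE)

# Count how many elements each XPath expression selects.
counts = {css: len(root.findall(xpath)) for css, xpath in PAIRS.items()}

for css, xpath in PAIRS.items():
    print(f"{css:22} {xpath:32} {counts[css]} match(es)")
```

A real lesson would more likely use lxml or a visual scraper against live HTML; the point here is only that the two syntaxes express the same selections in different ways.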
CSS selectors
`<catfood>` example is poorer for only having one of each tag name.
Visual scraping
`:nth-of-type` which refers to tag name.
`href`, and spoke of machine-readable publication dates (with microdata) in news sites. Also could have mentioned `a`'s `title` attr. What else? Worth writing a paragraph on in the lesson, perhaps.
I'll offer my lessons across to this repo shortly.
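On the `:nth-of-type` point above, here is a minimal sketch of what "refers to tag name" means in practice. The standard library has no CSS support, so it uses the positional predicate `p[2]` from `xml.etree.ElementTree`, which counts among siblings with the same tag, as I understand `p:nth-of-type(2)` to do. The sample markup and variable names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample markup, only to illustrate how siblings are counted.
SAMPLE = """
<div>
  <h2>Title</h2>
  <p>first paragraph</p>
  <span>aside</span>
  <p>second paragraph</p>
</div>
"""

root = ET.fromstring(SAMPLE)

# Like p:nth-of-type(2): the second <p> among its <p> siblings.
second_of_type = root.find("./p[2]")

# Like p:nth-child(2): the second child of any tag, kept only if it is a <p>.
children = list(root)
second_child = children[1] if children[1].tag == "p" else None

print(second_of_type.text)  # second paragraph
print(second_child.text)    # first paragraph
```

With only one of each tag name in an example, the two ways of counting cannot be told apart, which may be why a sample with repeated tags works better for teaching this.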
Anything to add, @nikzadb, @anushi, @RichardPBerry?