data-lessons / library-webscraping-DEPRECATED

Webscraping lesson for librarians NOW MOVED > https://github.com/LibraryCarpentry/lc-webscraping

Review web scraping lesson structure and content #7

Closed ostephens closed 6 years ago

ostephens commented 7 years ago

Need to do an overall review of the structure & content of this session and decide if it is the right stuff for a Library Carpentry lesson on web scraping. Suggest we use an Etherpad to agree on a syllabus for a lesson and review the existing lesson against it

weaverbel commented 7 years ago

Ok @ostephens - shall I create the pad or will you? Easy for me to do.

ostephens commented 7 years ago

Pls go ahead @weaverbel

weaverbel commented 7 years ago

Here you go @ostephens http://pad.software-carpentry.org/scrape

ostephens commented 7 years ago

OK - looking at this and thinking about it this morning, my view is that we should remove the Python part of the lesson (either completely, or separate it into its own 'advanced' lesson).

There is really useful and relevant stuff in the lesson without leaping to programming tools - and I think probably enough content for 3hrs anyway. I think the additional overhead and barriers introduced by going to programming mean we potentially lose people who would otherwise take some really good concepts from the lesson. It would also give more time to spend on XPath (and potentially CSS or jQuery selectors)

So I'm proposing we take out Episode 04 entirely. Any views?

kimpham54 commented 7 years ago

When we've taught this lesson in the past we've requested that people come with some knowledge of programming (not necessarily Python), but that didn't always happen and people were still able to go through the exercises. I agree that the web scraping Python lesson is advanced - it makes a lot of assumptions that people understand the basic concepts of programming, including knowledge of object-oriented programming.

Although we've been able to deliver the full lesson in 3 hours, it is definitely possible to go more in-depth and provide more instruction in each of the sections. The original XPath lesson I developed was more generic and was not aimed at web scraping - it addressed using XPath for XML documents, and also provided a brief introduction to XQuery.

Here are some suggested structures for a web scraping syllabus:

1

2

kimpham54 commented 7 years ago

pinging @timtomch for your thoughts as well!

ostephens commented 7 years ago

I think a Library Carpentry intro to web scraping shouldn't require programming knowledge as a pre-requisite.

I think XPath is a really useful thing to learn and makes complete sense in this context and so I'm in favour of keeping this in.

I really like all the surrounding material - what is web scraping/HTML DOM/ethics etc. all really good IMO

runderwood commented 7 years ago

From a scholarly point of view, I would say that a non-programmatic approach to web scraping is not likely to be all that useful, at least in my experience. If something can be scraped without programming, typically folks just do it through brute force.

As for XPATH, while useful, it is a rather narrow DSL. I'd be more interested in making sure people are introduced to the idea of document structures/trees, markup syntax/semantics, and selectors, broadly. For munging HTML, I've yet to do any heavy lifting in web scraping without the use of a parsing library like BeautifulSoup, so, for my part, I'd like us to really consider retaining something along those lines in the lesson.

alixk commented 7 years ago

I agree that Python/programming knowledge shouldn't be a pre-req for an LC workshop. I like the idea of the bulk of the lesson being non-programming oriented, although it might be useful for learners to see/play around with BeautifulSoup at the end. I think it's pretty approachable and exciting for people to see how it works. (I added this on the etherpad, but I've taught the ProgHist lesson on BeautifulSoup and it worked very well: http://programminghistorian.org/lessons/intro-to-beautiful-soup)

I also agree that having a conversation about ethics in the introduction of the lesson is really important.

ostephens commented 7 years ago

@runderwood My experience differs in terms of the usefulness of a non-programmatic approach to web scraping - but it's about having the right tool.

I agree that the more general concepts are important, but XPath and XSLT are something we've been asked for repeatedly as part of Library Carpentry and this seems like a good place to introduce that syntax.

ostephens commented 7 years ago

I should be clear that I'm not against a separate lesson that has programming as a pre-requisite - I just think that it doesn't belong in the 'intro' lesson.

runderwood commented 7 years ago

I have never used XSLT in web scraping. I've written mountains of the stuff, but I just can't imagine it being something we'd wade into here, especially if we're avoiding programming.

ostephens commented 7 years ago

@runderwood - sorry, wasn't being clear - I meant 'introducing the XPath syntax', not 'introducing XSLT' - I'm not suggesting adding XSLT into this lesson

runderwood commented 7 years ago

@ostephens OK! Phew.

I like XPATH, but one could argue that CSS-like selectors are more relevant in the web world. And I think those point directly toward XPATH.

BeautifulSoup's document traversal facilities are arguably more comprehensible than XPATH proper. Alternatively, LXML's HTML parsers/traversers are more XPATH-oriented.
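To make the comparison concrete, here is a minimal sketch (not from the lesson itself) showing the same extraction done with BeautifulSoup's method-based traversal and with an lxml XPath query; the HTML snippet is invented for illustration:

```python
from bs4 import BeautifulSoup
from lxml import html

doc = """
<table>
  <tr><td class="td1"><a href="/record/1">Record 1</a></td></tr>
  <tr><td class="td1"><a href="/record/2">Record 2</a></td></tr>
</table>
"""

# BeautifulSoup: readable, method/selector-based traversal
soup = BeautifulSoup(doc, "html.parser")
bs_links = [a["href"] for a in soup.select("td.td1 a")]

# lxml: the same selection expressed as an XPath query
tree = html.fromstring(doc)
lxml_links = tree.xpath("//td[@class='td1']//a/@href")

# Both approaches yield the same list of hrefs
assert bs_links == list(lxml_links)
print(bs_links)
```

Which reads better is largely a matter of taste, but the BeautifulSoup version is arguably easier for non-programmers to pick apart step by step.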

Whether we teach XPATH or not, I still think some basic programming is a necessity if this is to be maximally useful to researchers. Also, the non-programmatic resources mentioned so far are all proprietary and severely limiting.

ostephens commented 7 years ago

@runderwood Agreed that CSS selectors are relevant in the web world and I think we should consider adding these to the lesson - but alongside XPath IMO

I still think some basic programming is a necessity if this is to be maximally useful to researchers.

But this is not aimed at researchers, but at librarians.

Also, the non-programmatic resources mentioned so far are all proprietary and severely limiting.

I don't disagree. Particularly, I know better tools for doing scraping (without programming), but I think you end up teaching the tool - which I think is a bad idea. I think the tools used in this lesson are lightweight enough that the focus becomes teaching the concepts, not the tool.

I think the question for me is whether introducing the basics here is possible and useful without breaking out into programming. I'm inclined to think it is, but I think I'm currently outnumbered in that regard on here...

ostephens commented 7 years ago

@alixk when you taught http://programminghistorian.org/lessons/intro-to-beautiful-soup - how long did you have available and what experience did people coming to the lesson have?

alixk commented 7 years ago

Yeah, in my (limited) experience, it seems as though tools like browser extensions or proprietary tools tend not to have a very long lifespan, whereas Python/BeautifulSoup is more dependable - and builds upon the shell lesson.

What about a two-part lesson similar in style to spreadsheets + OpenRefine? 1.5 hours for What is Web Scraping, Ethics, Landscape of resources; and then 1.5 hours for Python, BeautifulSoup.

runderwood commented 7 years ago

@ostephens I definitely like your idea of working XPATH into the lesson. As for non-programmatic approaches, I think there is definitely a place for that. I'm just not sure this lesson is it. The LC ethos, as I understand it, doesn't seem compatible with purely conceptual lessons nor does it seem in line with proprietary, non-programmatic approaches to scraping. The notion here, I think, is that these lessons knit together all these different more-or-less UNIX-ish tools to do interesting things and empower librarians/researchers. If this is our end, I'd think we'd want to move this lesson in a direction not too radically different from its present form, generally speaking -- programmatic and very concrete and hands-on.

ostephens commented 7 years ago

@runderwood I don't entirely agree with that summation of the LC ethos, but I am in favour of concrete rather than abstract

ostephens commented 7 years ago

@alixk @runderwood @kimpham54 OK from this discussion, and having reviewed http://programminghistorian.org/lessons/intro-to-beautiful-soup you are starting to convince me.

Rather than a 1.5/1.5 split I think I might go for a 1 hr/2 hr split - which reflects the current structure I think?

ostephens commented 7 years ago

Are we all agreed that BS is a better starting point than Scrapy for this lesson?

ostephens commented 7 years ago

OK - so if we are saying we are going to use Python with BS here, do you think that we should dive straight in? That is, making the Python/BS install the starting point rather than using the Chrome extension as a starting point?

ostephens commented 7 years ago

BTW just to show I'm not entirely making up the idea you can do useful scraping without Python (IMO), here is a tutorial I wrote using Google Sheets to scrape data from the Early English Short Title catalogue - based on a real world use case http://www.meanboyfriend.com/overdue_ideas/2015/06/using-google-sheets-with-estc/

runderwood commented 7 years ago

@ostephens That is really, really interesting, both the target (the short title catalogue) and the approach.

But I would note that, as effective as this seems to have been, it is a) using a proprietary tool b) in a way that amounts to scripting/programming.

Something like this:

=importXml(concat("http://estc.bl.uk/",A2),"//td[@class=td1]//a[contains(@href,'&set_entry')]/@href")

...while technically mostly declarative, isn't necessarily more comprehensible than Python code (in this case, actually, I think it's less so).

These same techniques, in any case, can be applied in a Python environment, and with more potential for broader application -- one will hit a hard limit on the utility of the Google Sheets approach long before you'll find something you can't script in Python.
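As a rough sketch of that point, here is the spreadsheet formula above translated into Python with lxml. The live fetch is only hinted at in a comment (the real ESTC page markup is an assumption here); the code runs against a stand-in HTML snippet instead:

```python
from lxml import html
# import requests  # uncomment for a live fetch

# The XPath from the spreadsheet formula (class value quoted for valid XPath)
XPATH = "//td[@class='td1']//a[contains(@href,'&set_entry')]/@href"

# Stand-in for a fetched ESTC results page (structure is hypothetical)
sample_page = """
<table>
  <tr><td class="td1"><a href="/F?func=find&amp;set_entry=000001">Entry 1</a></td></tr>
  <tr><td class="td1"><a href="/other">skip me</a></td></tr>
</table>
"""

def scrape_entry_links(page_html):
    tree = html.fromstring(page_html)
    return tree.xpath(XPATH)

# Live equivalent of concat("http://estc.bl.uk/", A2):
# page_html = requests.get("http://estc.bl.uk/" + cell_a2).text
links = scrape_entry_links(sample_page)
print(links)
```

The Python version is barely longer than the formula, and unlike the spreadsheet it can be looped, rate-limited, cached, and unit-tested.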

copystar commented 7 years ago

Just a small suggestion. While I think the Short Title catalogue is a great example for librarians to work on scraping, I would also like to suggest that we use Wikipedia as a potential source of scraping material.

ostephens commented 7 years ago

@copystar why wikipedia?

runderwood commented 7 years ago

@copystar Are there applications for scraping of Wikipedia not covered by its API?

ostephens commented 7 years ago

@runderwood I agree - the approach has definite shortcomings! It's just that I've found this approach a way of introducing key concepts in a concrete, hands-on way, without having to get people started with Python and without having to install any s/w locally.

runderwood commented 7 years ago

@ostephens Understood. People familiar with Software Carpentry and LC have indicated to me that Bash and Python are generally taken as a given, so I feel like we wouldn't be breaking with precedent there.

But I'm definitely sharing your blog post with colleagues.

ostephens commented 7 years ago

Re API vs Scraping - how far is this differentiation important? Using a simple API would be easier than doing a difficult scrape. Could we start with an API example - well structured data, and then move onto scraping - more difficult HTML parsing?

ostephens commented 7 years ago

@runderwood in case it's of interest, I also did a similar one for introducing APIs http://www.meanboyfriend.com/overdue_ideas/2016/06/introduction-to-apis-using-iiif/

runderwood commented 7 years ago

@ostephens I think they're very different, in practice. An API implies that the data is being made easy to obtain, with documentation, a nod to standards, etc. Web scraping is usually working around in-built or incidental barriers to aggregating data.

I think the approach in the existing lesson makes a lot of sense. I'd be interested in taking its approach, and even its use case, as a core for reworking.

copystar commented 7 years ago

@ostephens For an introduction for beginners, using Wikipedia as an example would allow them to use a source that they are already familiar with. When I first tried working with XPath, I found using the structure of Wikipedia very straightforward. And once you can master webscraping a column of data from a long list from Wikipedia (or Wikidata), you now have the ability to draw on every subject matter.

But that being said, I understand what you mean about not needing to use web scraping because Wikipedia already provides an API. But for beginners, they likely don't have this option yet.

Again, this is just a suggestion. I'm more than happy to work with the examples already selected.

ostephens commented 7 years ago

@copystar no examples selected so far. The existing lesson uses members of parliament - but I'm not convinced this is a good set of examples for LC.

Wikipedia/wikidata seem reasonable examples to me

ostephens commented 7 years ago

I've got to go now - will be back online either later this evening or tomorrow

copystar commented 7 years ago

Ok. I'm going to start working on a version of https://github.com/qut-dmrc/web-scraping-intro-workshop. While this example does make use of Python, I think its approach of using Requests and Beautiful Soup is less intimidating than Scrapy.

runderwood commented 7 years ago

@copystar Requests and BS are super important packages, so I'm all for using them here. :+1:

Maybe before we start hacking away, we could nail down the structure. I think if we can get that hammered out, since we have a general sense of the toolset we'd like to use, we can then talk about what use case we'd like to pursue.

copystar commented 7 years ago

@runderwood Good idea! Will hold off. Thanks

runderwood commented 7 years ago

@copystar Sweet. When we're ready, I can fork so we can do pull requests there. No need to start from scratch!

ldko commented 7 years ago

@copystar we will go ahead and lay out the structure on this etherpad http://pad.software-carpentry.org/scrape and if that structure is agreeable, we can claim parts to work on--also the etherpad has a chat built in if we want to discuss anything via chat.

ldko commented 7 years ago

@copystar we have outlined a tentative new structure on the etherpad - do you have any comments or changes you want to add?

alixk commented 7 years ago

Sorry for being MIA--back now and checking out the proposed structure!

copystar commented 7 years ago

@ldko It looks good to me!

ldko commented 7 years ago

This was the structure proposed in the etherpad during the Sprint:

Proposed Structure:
1. Intro: What is web scraping? & brief intro to data ethics
2. Document structure & Selectors
   2.1 XPath (content looks fine; after developing the BeautifulSoup lessons, we should revisit this to align with them as needed)
   2.2 CSS Selectors
3. Introduction to scraping with BeautifulSoup (based on existing OU lesson plan, but use the Security Council Resolutions - http://www.un.org/en/sc/documents/resolutions/)
4. Advanced web scraping using Python and BeautifulSoup (UN Security Council Resolutions)
   A. Count total number of Security Council resolutions per year and print totals
   B. Generate a CSV with a row for each resolution, including year, resolution #, description, and link to PDF
5. Conclusion, including group conversation on ethics of web scraping
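Exercises A and B from the proposed structure might look roughly like the sketch below. The real markup of the UN resolutions page is an assumption here, so the code runs against an invented stand-in snippet; a live version would fetch and parse the actual page instead:

```python
import csv
import io
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical stand-in for the UN Security Council resolutions listing
sample_page = """
<table>
  <tr><td><a href="/res/2231.pdf">S/RES/2231 (2015)</a></td><td>Iran nuclear issue</td></tr>
  <tr><td><a href="/res/2249.pdf">S/RES/2249 (2015)</a></td><td>Threats to peace</td></tr>
  <tr><td><a href="/res/2272.pdf">S/RES/2272 (2016)</a></td><td>Peacekeeping</td></tr>
</table>
"""

soup = BeautifulSoup(sample_page, "html.parser")
rows = []
for tr in soup.find_all("tr"):
    link = tr.find("a")
    # e.g. "S/RES/2231 (2015)" -> "S/RES/2231" and "(2015)"
    number, year = link.text.split(" ")
    rows.append({
        "year": year.strip("()"),
        "resolution": number,
        "description": tr.find_all("td")[1].text,
        "pdf": link["href"],
    })

# Exercise A: count resolutions per year and print totals
counts = Counter(row["year"] for row in rows)
for year, total in sorted(counts.items()):
    print(year, total)

# Exercise B: write a CSV with one row per resolution
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["year", "resolution", "pdf", "description"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Both exercises build on the same parsed rows, which keeps the "advanced" episode a natural extension of the intro scrape rather than a fresh start.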