Ok @ostephens - shall I create the pad or will you? Easy for me to do.
Pls go ahead @weaverbel
Here you go @ostephens http://pad.software-carpentry.org/scrape
OK - looking at this and thinking about it this morning, my view is that we should remove the Python part of the lesson (either completely, or separate it into its own 'advanced' lesson).
There is really useful and relevant stuff in the lesson without leaping to programming tools - and I think probably enough content for 3hrs anyway. The additional overhead and barriers introduced by going to programming mean we potentially lose people who would otherwise take some really good concepts from the lesson. It would also give more time to spend on XPath (and potentially CSS or jQuery selectors).
So I'm proposing we take out Episode 04 entirely. Any views?
When we've taught this lesson in the past we've requested that people come with some knowledge of programming (not necessarily Python), but that didn't always happen and people were still able to go through the exercises. I agree that the web scraping Python lesson is advanced - it makes a lot of assumptions that people understand the basic concepts of programming, including knowledge of object-oriented programming.
Although we've been able to deliver the full lesson in 3 hours, it is definitely possible to go more in-depth and provide more instruction in each of the sections. The original XPath lesson I developed was more generic and was not aimed at web scraping - it addressed using XPath for XML documents, and also provided a brief introduction to XQuery.
Here are some suggested structures for a web scraping syllabus:
[two attached suggested structures, not rendered here]
pinging @timtomch for your thoughts as well!
I think a Library Carpentry intro to web scraping shouldn't require programming knowledge as a pre-requisite.
I think XPath is a really useful thing to learn and makes complete sense in this context and so I'm in favour of keeping this in.
I really like all the surrounding material - what is web scraping/HTML DOM/ethics etc. all really good IMO
From a scholarly point of view, I would say that a non-programmatic approach to web scraping is not likely to be all that useful, at least in my experience. If something can be scraped without programming, typically folks just do it through brute force.
As for XPATH, while useful, it is a rather narrow DSL. I'd be more interested in making sure people are introduced to the idea of document structures/trees, markup syntax/semantics, and selectors, broadly. For munging HTML, I've yet to do any heavy lifting in web scraping without the use of a parsing library like BeautifulSoup, so, for my part, I'd like us to really consider retaining something along those lines in the lesson.
I agree that Python/programming knowledge shouldn't be a pre-req for an LC workshop. I like the idea of the bulk of the lesson being non-programming oriented, although it might be useful for learners to see/play around with BeautifulSoup at the end. I think it's pretty approachable and exciting for people to see how it works. (I added this on the etherpad, but I've taught the ProgHist lesson on BeautifulSoup and it worked very well: http://programminghistorian.org/lessons/intro-to-beautiful-soup)
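For concreteness, part of the appeal is that a first BeautifulSoup example can be this small - a sketch with an invented HTML snippet (not taken from the ProgHist lesson), no network access required:

```python
# A minimal taste of BeautifulSoup: parse an HTML string and pull out elements.
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Reading list</h1>
  <ul>
    <li><a href="/books/1">Book One</a></li>
    <li><a href="/books/2">Book Two</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.h1.text)                 # -> Reading list
for link in soup.find_all("a"):
    print(link.text, link["href"])  # -> Book One /books/1 ...
```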
I also agree that having a conversation about ethics in the introduction of the lesson is really important.
@runderwood My experience differs in terms of the usefulness of a non-programmatic approach to web scraping - but it's about having the right tool.
I agree that the more general concepts are important, but XPath and XSLT are something we've been asked for repeatedly as part of Library Carpentry, and this seems like a good place to introduce that syntax.
I should be clear that I'm not against a separate lesson that has programming as a pre-requisite - I just think that it doesn't belong in the 'intro' lesson.
I have never used XSLT in web scraping. I've written mountains of the stuff, but I just can't imagine it being something we'd wade into here, especially if we're avoiding programming.
@runderwood - sorry, wasn't being clear - I meant 'introducing the XPath syntax' not 'introducing XSLT' - I'm not suggesting adding XSLT to this lesson
@ostephens OK! Phew.
I like XPATH, but one could argue that CSS-like selectors are more relevant in the web world. And I think those point directly toward XPATH.
BeautifulSoup's document traversal facilities are arguably more comprehensible than XPATH proper. Alternatively, LXML's HTML parsers/traversers are more XPATH-oriented.
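To make that comparison concrete, a minimal sketch (invented snippet) showing the same extraction both ways:

```python
# Same extraction, two styles: BeautifulSoup's traversal methods
# versus an XPath expression evaluated by lxml.
from bs4 import BeautifulSoup
from lxml import html

page = '<div><p class="title">Hello</p><p>World</p></div>'

# BeautifulSoup: named methods and keyword arguments
soup = BeautifulSoup(page, "html.parser")
print(soup.find("p", class_="title").text)           # -> Hello

# lxml: one declarative XPath expression
tree = html.fromstring(page)
print(tree.xpath('//p[@class="title"]/text()')[0])   # -> Hello
```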
Whether we teach XPATH or not, I still think some basic programming is a necessity if this is to be maximally useful to researchers. Also, the non-programmatic resources mentioned so far are all proprietary and severely limiting.
@runderwood Agreed that CSS selectors are relevant in the web world and I think we should consider adding these to the lesson - but alongside XPath IMO
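BeautifulSoup's .select() understands CSS selectors, so the two syntaxes could even be shown side by side - a sketch with an invented snippet:

```python
# The same selection as a CSS selector and as XPath. Note the semantics
# differ slightly: the CSS class selector matches any element whose class
# list contains "td1", while this XPath tests the full attribute value.
from bs4 import BeautifulSoup
from lxml import html

page = '<table><tr><td class="td1"><a href="/x">x</a></td></tr></table>'

# CSS selector via BeautifulSoup's .select()
soup = BeautifulSoup(page, "html.parser")
print([a["href"] for a in soup.select("td.td1 a")])   # -> ['/x']

# Equivalent XPath via lxml
tree = html.fromstring(page)
print(tree.xpath('//td[@class="td1"]//a/@href'))      # -> ['/x']
```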
> I still think some basic programming is a necessity if this is to be maximally useful to researchers.

But this is not aimed at researchers - it is aimed at librarians.

> Also, the non-programmatic resources mentioned so far are all proprietary and severely limiting.

I don't particularly disagree - I know better tools for doing scraping (without programming) - but I think you end up teaching the tool, which I think is a bad idea. The tools used in this lesson are lightweight enough that the focus becomes teaching the concepts, not the tool.
I think the question for me is whether introducing the basics here is possible and useful without breaking out into programming. I'm inclined to think it is, but I think I'm currently outnumbered in that regard on here...
@alixk when you taught http://programminghistorian.org/lessons/intro-to-beautiful-soup - how long did you have available and what experience did people coming to the lesson have?
Yeah, in my (limited) experience, it seems as though tools like browser extensions or proprietary tools tend not to have a very long lifespan, whereas Python/BeautifulSoup is more dependable. And it builds upon the shell lesson.
What about a two-part lesson similar in style to spreadsheets + OpenRefine? 1.5 hours for What is Web Scraping, Ethics, Landscape of resources; and then 1.5 hours for Python, BeautifulSoup.
@ostephens I definitely like your idea of working XPATH into the lesson. As for non-programmatic approaches, I think there is definitely a place for that. I'm just not sure this lesson is it. The LC ethos, as I understand it, doesn't seem compatible with purely conceptual lessons nor does it seem in line with proprietary, non-programmatic approaches to scraping. The notion here, I think, is that these lessons knit together all these different more-or-less UNIX-ish tools to do interesting things and empower librarians/researchers. If this is our end, I'd think we'd want to move this lesson in a direction not too radically different from its present form, generally speaking -- programmatic and very concrete and hands-on.
@runderwood I don't entirely agree with that summation of the LC ethos, but I am in favour of concrete rather than abstract
@alixk @runderwood @kimpham54 OK from this discussion, and having reviewed http://programminghistorian.org/lessons/intro-to-beautiful-soup you are starting to convince me.
Rather than a 1.5/1.5 split I think I might go for a 1 hr/2 hr split - which reflects the current structure I think?
Are we all agreed that BS is a better starting point than Scrapy for this lesson?
OK - so if we are saying we are going to use Python with BS here, do you think that we should dive straight in? That is making the Python/BS install the starting point rather than using the Chrome extension as a starting point?
BTW just to show I'm not entirely making up the idea that you can do useful scraping without Python (IMO), here is a tutorial I wrote using Google Sheets to scrape data from the Early English Short Title catalogue - based on a real-world use case http://www.meanboyfriend.com/overdue_ideas/2015/06/using-google-sheets-with-estc/
@ostephens That is really, really interesting, both the target (the short title catalogue) and the approach.
But I would note that, as effective as this seems to have been, it is a) using a proprietary tool b) in a way that amounts to scripting/programming.
Something like this:
=importXml(concat("http://estc.bl.uk/",A2),"//td[@class='td1']//a[contains(@href,'&set_entry')]/@href")
...while technically mostly declarative, isn't necessarily more comprehensible than Python code (in this case, actually, I think it's less so).
These same techniques, in any case, can be applied in a Python environment, and with more potential for broader application -- you will hit a hard limit on the utility of the Google Sheets approach long before you find something you can't script in Python.
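For instance, a rough Python equivalent of the formula above, using requests + lxml - the XPath is the one from the Sheets example, while the function and its name are just illustrative scaffolding:

```python
# Sketch: the IMPORTXML formula re-expressed in Python.
# 'entry' plays the role of cell A2 in the spreadsheet version.
import requests
from lxml import html

def estc_links(entry):
    """Fetch an ESTC page and return its set_entry link hrefs."""
    resp = requests.get("http://estc.bl.uk/" + entry)
    tree = html.fromstring(resp.content)
    return tree.xpath(
        "//td[@class='td1']//a[contains(@href,'&set_entry')]/@href"
    )
```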
Just a small suggestion. While I think the Short Title catalogue is a great example for librarians to work on scraping, I would also like to suggest that we use Wikipedia as a potential source of scraping material.
@copystar why wikipedia?
@copystar Are there applications for scraping of Wikipedia not covered by its API?
@runderwood I agree - the approach has definite shortcomings! It's just that I've found this approach a way of introducing key concepts in a concrete, hands-on way, without having to get people started with Python and without having to install any s/w locally.
@ostephens Understood. People familiar with Software Carpentry and LC have indicated to me that Bash and Python are generally taken as a given, so I feel like we wouldn't be breaking with precedent there.
But I'm definitely sharing your blog post with colleagues.
Re API vs scraping - how important is this distinction? Using a simple API would be easier than doing a difficult scrape. Could we start with an API example (well-structured data) and then move on to scraping and the more difficult HTML parsing?
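Something like this progression, as a sketch (Wikipedia's REST API is used purely for illustration):

```python
# Step 1: an API hands us documented, structured JSON.
# Step 2: scraping the equivalent HTML page, where we must
# discover the structure ourselves.
import requests
from bs4 import BeautifulSoup

title = "Web_scraping"

# API: well-structured data
summary = requests.get(
    f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
).json()
print(summary["title"], "-", summary["extract"][:80])

# Scraping: parse the rendered page instead
page = requests.get(f"https://en.wikipedia.org/wiki/{title}")
soup = BeautifulSoup(page.text, "html.parser")
print(soup.find("h1").get_text())
```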
@runderwood in case it's of interest, I also did a similar one for introducing APIs http://www.meanboyfriend.com/overdue_ideas/2016/06/introduction-to-apis-using-iiif/
@ostephens I think they're very different, in practice. An API implies that the data is being made easy to obtain, with documentation, a nod to standards, etc. Web scraping is usually working around in-built or incidental barriers to aggregating data.
I think the approach in the existing lesson makes a lot of sense. I'd be interested in taking its approach, and even its use case, as a core for reworking.
@ostephens For an introduction for beginners, using Wikipedia as an example would allow them to use a source that they are already familiar with. When I first tried working with XPath, I found using the structure of Wikipedia very straightforward. And once you can master web scraping a column of data from a long list on Wikipedia (or Wikidata), you have the ability to draw on every subject matter.
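For example, a sketch of that 'column from a long list' exercise - the specific article and its table layout are assumptions, but any 'List of ...' page with a wikitable behaves similarly:

```python
# Pull the first column out of the first wikitable on a list article.
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_largest_cities"  # illustrative
soup = BeautifulSoup(requests.get(url).text, "html.parser")

table = soup.find("table", class_="wikitable")   # first data table
for row in table.find_all("tr")[1:11]:           # skip header, sample 10 rows
    cell = row.find(["th", "td"])
    if cell:
        print(cell.get_text(strip=True))
```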
But that being said, I understand what you mean about not needing to use web scraping because Wikipedia already provides an API. Beginners, though, likely don't have that option yet.
Again, this is just a suggestion. I'm more than happy to work with the examples already selected.
@copystar no examples selected so far. The existing lesson uses members of parliament - but I'm not convinced this is a good set of examples for LC.
Wikipedia/wikidata seem reasonable examples to me
I've got to go now - will be back online either later this evening or tomorrow
Ok. I'm going to start working on a version of https://github.com/qut-dmrc/web-scraping-intro-workshop. While this example does make use of Python, I think its approach of using Requests and Beautiful Soup is less intimidating than Scrapy.
@copystar Requests and BS are super important packages, so I'm all for using them here. :+1:
Maybe before we start hacking away, we could nail down the structure. I think if we can get that hammered out, since we have a general sense of the toolset we'd like to use, we can then talk about what use case we'd like to pursue.
@runderwood Good idea! Will hold off. Thanks
@copystar Sweet. When we're ready, I can fork so we can do pull requests there. No need to start from scratch!
@copystar we will go ahead and lay out the structure on this etherpad http://pad.software-carpentry.org/scrape and if that structure is agreeable, we can claim parts to work on--also the etherpad has a chat built in if we want to discuss anything via chat.
@copystar we have outlined a tentative new structure out on the etherpad, do you have any comments or changes you want to add on the etherpad?
Sorry for being MIA--back now and checking out the proposed structure!
@ldko It looks good to me!
This was the structure proposed in the etherpad during the Sprint:

Proposed structure:
1. Intro: What is web scraping? Brief intro to data ethics
2. Document structure & selectors
   2.1 XPath (content looks fine; after developing the BeautifulSoup lessons, we should revisit this to align with them as needed)
   2.2 CSS selectors
3. Introduction to scraping with BeautifulSoup (based on the existing OU lesson plan, but using the Security Council resolutions - http://www.un.org/en/sc/documents/resolutions/)
4. Advanced web scraping using Python and BeautifulSoup (UN Security Council resolutions)
   A. Count the total number of Security Council resolutions per year and print the totals
   B. Generate a CSV with a row for each resolution, including year, resolution #, description, and link to PDF
5. Conclusion, including group conversation on the ethics of web scraping
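A very rough sketch of what 4A/4B could look like - the per-year URL pattern and the table layout are unverified assumptions about the UN site, to be checked when the episode is written:

```python
# Exercise sketch: count Security Council resolutions per year (4A)
# and dump one year to CSV (4B). Page structure is assumed, not verified.
import csv
import requests
from bs4 import BeautifulSoup

BASE = "http://www.un.org/en/sc/documents/resolutions/"

def resolution_rows(year):
    """Return the table rows listing resolutions for one year (assumed layout)."""
    page = requests.get(f"{BASE}{year}.shtml")        # URL pattern assumed
    soup = BeautifulSoup(page.text, "html.parser")
    table = soup.find("table")
    return table.find_all("tr")[1:] if table else []  # skip header row

# 4A: totals per year
for year in range(2014, 2017):
    print(year, len(resolution_rows(year)))

# 4B: one row per resolution, written to CSV
with open("resolutions_2016.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["year", "resolution", "description", "pdf_link"])
    for row in resolution_rows(2016):
        cells = row.find_all("td")
        link = row.find("a")
        if len(cells) >= 2:
            writer.writerow([
                2016,
                cells[0].get_text(strip=True),
                cells[1].get_text(strip=True),
                link["href"] if link else "",
            ])
```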
Need to do an overall review of the structure & content of this session and decide if it is the right stuff for a Library Carpentry lesson on web scraping. Suggest we use an Etherpad to agree a syllabus for the lesson and review the existing lesson against it.