davclark opened this issue 8 years ago
Thanks, these are terrific points and exactly the kind of information we're looking for on how best to develop content.
Much of data science in fact does come down to the 'data wrangling' of database queries, text mining, and basically extracting and cleaning to get the data that you need. We are definitely focused on teaching this part of the skill set with OpenRefine, SQL and text-mining lessons. I think we're working on addressing the first two points, but the third is new to me and would be a great topic to include.
In this hackathon, we explicitly want to identify topics like this that are more specific to working with different social science data types, and also to develop the lessons so they're using datasets of interest and relevance to the audience.
There's definitely not just one type of social scientist. Social scientists work with a broad range of data types and questions, so there is unlikely to be just one version of a 'Social Sciences Data Carpentry'. Instead it would be great to have different modules that can be mixed and matched depending on the needs of the audience.
From John D at UC Davis:
In no particular order, folks who might be interested in this would be the usual suspects, but there’s no one in the DSS to the best of my knowledge who is into ‘data carpentry’ per se. Of course the distinction is moot in my mind because the basic skills for analytical work are fundamental, no matter the label. How available any of these folks are is the big question, but you could try reaching out. Colin especially I know has put together several tutorials similar to the ones on datacarpentry.org.
I might be interested in attending. Perhaps we can talk about it Thursday. I’ve been meaning to go up to UCB to talk to the big data and DLab folks, so maybe I can leverage this event.
The datacarpentry.org site has some good material which you could lift, but I’m unclear if this is meant to be just a software tutorial with social science examples or teaching data analytics to neophyte social scientist researchers. The former is easy because you can just lift the most likely candidates from datacarpentry.org, i.e. Excel, R, Python, and maybe SQL (others seem a bit too esoteric), throw in some social science data (survey data is mostly universal) and Bob’s your uncle. Some basic best practices could be included (especially in the Excel one, because one must keep an eye on data conversion) and shared computing environments (UNIX & RDP) might be a topic depending on the audience. Length of the workshop is determined by the number of applications you teach and the depth you go into each. You could even manage flow, for example, by using Excel for data entry, Python for data cleaning, and R for analysis.
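As one illustration of that flow, here is a minimal sketch of the cleaning step that sits between data entry and analysis. Everything here is invented for the example (the three-row survey export, the -99 missing-data sentinel); it just shows the kind of normalization a Python cleaning lesson would cover.

```python
import csv
import io

# Hypothetical raw survey export from the Excel data-entry step:
# inconsistent capitalization, stray whitespace, and -99 marking missing age.
raw = """respondent_id,gender,age
001,  Female ,34
002,male,-99
003,FEMALE,41
"""

def clean_row(row):
    """Normalize one survey record: trim whitespace, lowercase the
    categorical variable, and convert the -99 sentinel to None."""
    age = int(row["age"])
    return {
        "respondent_id": row["respondent_id"].strip(),
        "gender": row["gender"].strip().lower(),
        "age": None if age == -99 else age,
    }

rows = [clean_row(r) for r in csv.DictReader(io.StringIO(raw))]
print(rows)
```

The cleaned rows could then be written back out as CSV for the analysis step in R.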
If you are concentrating on the latter, that is, teaching data analytics from beginning to end, this is much more ambitious. Keeping peoples’ attention over 2-4 days would also be pretty tough, I imagine. Regardless, I would probably break the tasks down as: data entry/acquisition, data manipulation/cleaning, analysis, and housekeeping. Topics to be covered might look like this:
data entry/acquisition – topics could range widely from accessing deposition, data transformation, types of data, scraping, flat-file v. relational databases, historical data sources, OCR
data manipulation/cleaning – a vastly underrated and very time-consuming step that would include topics like harmonization of measurement, merging disparate data sources, reshaping cross-sectional data as longitudinal, missing data/imputation, weighting schemes
analytical – classic approach v. data-mining, visualization, interpretation, hypothesis testing; a tough topic because it is highly dependent on training/discipline
end-project housekeeping – documentation, data archive/management/publishing, GitHub?
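To make the manipulation/cleaning bucket concrete, the "merging disparate data sources" topic can be previewed in a few lines. This is only a sketch with made-up figures: two hypothetical sources code the same countries differently, and a small crosswalk table does the harmonization before the merge.

```python
# Two hypothetical sources that code the same units differently:
# one uses full country names, the other short codes.
gdp = {"United States": 21.4, "Germany": 3.8}     # trillions, invented
polls = {"US": 0.52, "DE": 0.61}                  # approval shares, invented

# A small crosswalk table handles the harmonization of measurement.
crosswalk = {"US": "United States", "DE": "Germany"}

# Merge the two sources on the harmonized country name.
merged = {
    crosswalk[code]: {"gdp": gdp[crosswalk[code]], "approval": share}
    for code, share in polls.items()
}
print(merged)
```

In a real lesson the crosswalk would itself be a dataset (country codes are a classic source of merge errors), which is exactly why this step deserves its own teaching time.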
Considering the length of the workshops, hands-on tutorials would be useful to keep interest up. A recurring problem with this sort of workshop is removing the technology as an impediment. Learning R is fine for graduate students, but I’d prefer something more ubiquitous for undergrads, such as Excel or maybe a web based app.
From Richard M:
I don't have strong opinions about teaching basic data literacy, as I've not really taught it. The standard workshops seem useful enough, when I've asked students who enrolled in them.
Beyond basic materials, seems important to impress upon students that correlations among variables are commonplace. We shouldn't be surprised to find correlations. So whatever they end up doing with data, they need to remember that finding correlations is easy. But finding causes is hard. Data science cannot be a substitute for theory.
This is a basic meta-theoretical literacy issue, as it seems like so much data science assumes the answer is in there and can be fished out with enough computation. More commonly, I think, the answer is not in there, but some correlations can be found and some religion can be built around them.
In the most succinct form, the lesson is: Correlation is everyplace, but rarely does it mean anything.
I'm not sure how to impress this point on beginners. But Tyler Vigen's website is a good start, at least. http://www.tylervigen.com/spurious-correlations
Maybe finding some silly correlations in the General Social Survey would be a useful exercise?
From Psychology and Center for Mind and Brain:
I agree with CMB that our grad students and faculty wouldn't be likely to have need of the kinds of introductory level matters that are highlighted on the website. More advanced quantitative procedures -- data mining; longitudinal data analyses; network analyses; etc. -- would be of greater interest, but probably would require more than 4 days (and there are already groups that run 1-2 week summer workshops on such topics).
I suppose that one topic that might be useful would be on data archiving for accessibility to other researchers. Several funding agencies require this now, and I know that some faculty are unfamiliar with how and where to do this.
It's possible that some undergrads would benefit from other kinds of introductory modules listed, but between the courses we already offer and the internships that many students get in labs, I think that most of the interested students are already getting the training...
Hi All: I like @dumit 's typology. I would place special emphasis on the first two items: data acquisition and data manipulation. In my department (political science), there is a lot of interest in getting data, particularly off the web through webscraping and APIs. @ckrogs and I have put together materials on webscraping and APIs for our Computational Social Science course. We also have stuff on information retrieval, text analysis, etc.
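The course materials themselves live elsewhere, but the core extraction step a webscraping lesson builds toward can be shown in miniature. This toy sketch (Python stdlib only, invented HTML fragment) pulls tabular values out of markup; real lessons would fetch the page first and use a fuller parser.

```python
from html.parser import HTMLParser

# An invented HTML fragment standing in for a scraped page.
html = (
    "<table>"
    "<tr><td>Alice</td><td>0.72</td></tr>"
    "<tr><td>Bob</td><td>0.41</td></tr>"
    "</table>"
)

class CellCollector(HTMLParser):
    """Collect the text content of every <td> cell, in document order."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []
    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
    def handle_data(self, data):
        if self.in_cell:
            self.cells.append(data)

parser = CellCollector()
parser.feed(html)
print(parser.cells)
```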
I would actually downvote the analytics section, because 1) it is very discipline-specific, and 2) it is already covered well in home departments.
I agree with @rochelleterman's comments on the typology, and add that I've found undergrads with no prior command line experience can quickly pick up enough R to be useful. So I'm optimistic that an open source scripting language like R or Python can be central to this project, even for social science undergrads with no prior experience.
I'd recast 'end-project housekeeping' (which makes these topics sound like low-status afterthoughts) as 'reproducible and open research' and include discussion of repositories, software licensing and copyright options for publications.
From Kim, who teaches sociology:
Here is a list of "basic data skills" often required for social science research:
Finding, “cleaning” and using datasets
– What types of data are used in social science research (data already collected, data the researcher generates/collects)
– What are the pros and cons of different kinds of data
– Where to look for existing datasets and other sources of data
– Manipulating and cleaning data
  ∗ recoding and creating variables
  ∗ reshaping, combining, and collapsing datasets
– doing all of these in a way that is reproducible
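The reshaping skill in particular tends to trip students up, so a minimal sketch may help. The two-respondent income panel below is entirely invented; it just shows the wide-to-long transformation (one row per respondent becomes one row per respondent-year) in plain Python.

```python
# Hypothetical wide-format panel: one row per respondent,
# one income column per survey wave.
wide = [
    {"id": 1, "income_2019": 40_000, "income_2020": 42_000},
    {"id": 2, "income_2019": 55_000, "income_2020": 53_000},
]

# Reshape to long format: one row per respondent-year,
# with the year recovered from the column name.
long_rows = [
    {"id": row["id"], "year": int(col.split("_")[1]), "income": row[col]}
    for row in wide
    for col in ("income_2019", "income_2020")
]
print(long_rows)
```

In practice students would do this with tools like tidyr in R or pandas in Python, but seeing the bare transformation once makes those tools much less magical.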
From computer code to tables and figures
– Outputting results efficiently
– Graphics and data visualization (both for data exploration and for presentation of results from statistical analyses)
Topics in data analysis
– Using weights
– Missing data
– Power calculations
In terms of programs:
– Excel is very useful and its utility is usually underestimated by social science students
– Stata is very commonly used
– but R is probably going to overtake commercial packages and may be the best investment for budding social scientists
I hope this helps. Thanks very much for investing in this! Best, Kim