dlab-trainings / social-data-carpentry-2015

Inaugural hackathon for Social Software Carpentry

Ideas from Mark G. #6

Open davclark opened 8 years ago

davclark commented 8 years ago

Some limitations of the data carpentry approach come to mind, based on my experience working with students. Every student needs a unique set of tools. I do not find it very believable that there is a common tool-kit for social science data. Thus a data carpentry workshop risks giving students the impression that there are only a few ways to interact with data-- this could artificially constrain their thinking. I could be persuaded by the idea of promoting general data- and computing literacy as a way of preparing students to build the unique tool-kit they will need.

Here are a few particulars about social science data, based on on-the-ground work with students:

  • [ ] Social science databases may be very awkwardly structured and hard to work with. The Panel Study of Income Dynamics comes to mind. I worked with an E-wing grad student on a one-quarter project with the PSID. All of our efforts were spent understanding the structure of the database and figuring out how to extract what she needed. We did not do any analysis, sadly, and it was frustrating for both of us. The skills needed for dealing with datasets like the PSID are database queries, text processing and the like.
  • [ ] Missing data are very common in the social sciences. Often the worst thing that can be done is to throw out cases that have missing elements-- this strongly risks introducing bias. Probabilistic imputation ("makin' stuff up"-- in a principled way) is the current standard of practice. These issues usually come up at the analysis stage, but students should be made aware of how to handle missing data early in their training (a minimal sketch follows this list). The book Statistical Analysis with Missing Data by Little and Rubin lays out the main ideas very well.
  • [ ] Students should learn about how to display and analyze ordered categorical variables (also called ordinal data), as these are very common in social science datasets. They take the form of questions where people answer "strongly disagree", "disagree somewhat" and so on. The impulse that should be nipped in the bud is to code the ordered categories as integer values, 1, 2... and then apply statistical methods for numeric observations. We have good methods that allow ordered categories to stay just as they are. You may hear from other social scientists that methods for ordered categories are a lot of extra fuss, but in my experience the end product is far more satisfying if ordered categories are analyzed properly (a second sketch follows this list).
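
The two items above lend themselves to small code illustrations. First, on missing data: a minimal sketch in Python/pandas, with made-up column names and values, of why listwise deletion is risky and what a bare-bones model-based imputation looks like. A real analysis would use principled multiple imputation (e.g. R's `mice` or an equivalent), along the lines laid out in Little and Rubin.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical survey extract with missing income values
df = pd.DataFrame({
    "age":        [34, 51, 29, 42, 60],
    "educ_years": [16, 12, 14, 18, 16],
    "income":     [52000, np.nan, 31000, np.nan, 78000],
})

# Listwise deletion throws away 2 of 5 cases and risks bias if
# missingness is related to the variables of interest
complete_cases = df.dropna()

# A very simple model-based imputation: predict missing income from
# the observed covariates (multiple imputation repeats this with
# added noise to propagate uncertainty)
observed = df.dropna(subset=["income"])
model = LinearRegression().fit(observed[["age", "educ_years"]], observed["income"])
to_fill = df["income"].isna()
df.loc[to_fill, "income"] = model.predict(df.loc[to_fill, ["age", "educ_years"]])
```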
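Second, on ordered categories: a small sketch, again with made-up responses, showing that pandas' ordered `Categorical` type lets Likert-style answers stay as ordered categories instead of being coerced to 1, 2, 3... For modeling, ordinal regression (e.g. statsmodels' `OrderedModel` or R's `MASS::polr`) keeps that structure as well.

```python
import pandas as pd

likert = pd.CategoricalDtype(
    categories=["strongly disagree", "disagree somewhat", "neutral",
                "agree somewhat", "strongly agree"],
    ordered=True,
)

# Hypothetical responses to a single survey item
responses = pd.Series(
    ["agree somewhat", "strongly disagree", "neutral",
     "agree somewhat", "strongly agree"],
    dtype=likert,
)

# Counts in their natural order, plus an ordered comparison --
# no pretence that the categories are equally spaced numbers
print(responses.value_counts().reindex(likert.categories))
print((responses >= "neutral").mean())
```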
tracykteal commented 8 years ago

Thanks, these are terrific points and exactly the kind of information we're looking for on how best to develop content.

Much of data science in fact does come down to the 'data wrangling' of database queries, text mining and basically extracting and cleaning to get the data that you need. We are definitely focused on teaching this part of the skill set with OpenRefine, SQL and text-mining lessons. The first two points I think we're working on addressing, but the third is the first I'd heard of, and it's a great topic to include.
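
As a rough illustration of the 'database queries' piece, the kind of extraction step a SQL lesson targets looks like this; the table and columns below are invented for the example, not taken from an existing lesson.

```python
import sqlite3

# Build a throwaway in-memory database standing in for a deposited survey
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE responses (id INTEGER, year INTEGER, region TEXT, income REAL);
    INSERT INTO responses VALUES
        (1, 2014, 'west', 52000), (2, 2014, 'east', NULL),
        (3, 2015, 'west', 61000), (4, 2015, 'east', 48000);
""")

# Pull out only what the analysis needs, filtering missing values
# and aggregating at the query stage
query = """
    SELECT year, region, AVG(income) AS mean_income
    FROM responses
    WHERE income IS NOT NULL
    GROUP BY year, region
"""
for row in conn.execute(query):
    print(row)
```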

In this hackathon, we explicitly want to identify topics like this that are more specific to working with different social science data types, and also to develop the lessons so they're using datasets of interest and relevance to the audience.

tracykteal commented 8 years ago

There's definitely not just one type of social scientist. Social scientists work with a broad range of data types and questions. There is unlikely to be just one version of a 'Social Sciences Data Carpentry'. Instead it would be great to have different modules that can be mixed and matched depending on the needs of the audience.

dumit commented 8 years ago

From John D at UC Davis:

In no particular order, folks who might be interested in this would be the usual suspects, but there's no one in the DSS, to the best of my knowledge, who is into 'data carpentry' per se. Of course the distinction is moot in my mind because the basic skills for analytical work are fundamental, no matter the label. How available any of these folks are is the big question, but you could try reaching out. Colin especially, I know, has put together several tutorials similar to the ones on datacarpentry.org.

I might be interested in attending. Perhaps we can talk about it Thursday. I’ve been meaning to go up to UCB to talk to the big data and DLab folks, so maybe I can leverage this event.

The datacarpentry.org site has some good material which you could lift, but I'm unclear if this is meant to be just a software tutorial with social science examples or teaching data analytics to neophyte social science researchers. The former is easy because you can just lift the most likely candidates from datacarpentry.org, i.e. Excel, R, Python, and maybe SQL (others seem a bit too esoteric), throw in some social science data (survey data is mostly universal) and Bob's your uncle. Some basic best practices could be included (especially in the Excel one, because one must keep an eye on data conversion) and shared computing environments (UNIX & RDP) might be a topic depending on the audience. The length of the workshop is determined by the number of applications you teach and the depth you go into each. You could even manage flow, for example, by using Excel for data entry, Python for data cleaning, and R for analysis.
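
A minimal sketch of that Excel-to-Python-to-R hand-off, with purely illustrative file and column names: read the data-entry spreadsheet, clean it in pandas, and write a tidy CSV for R to analyze.

```python
import pandas as pd

# File and column names are placeholders for whatever the data-entry
# sheet actually contains
raw = pd.read_excel("survey_entry.xlsx", sheet_name=0)

cleaned = (
    raw
    .rename(columns=str.lower)                 # consistent column names
    .drop_duplicates(subset="respondent_id")   # one row per respondent
    .assign(income=lambda d: pd.to_numeric(d["income"], errors="coerce"))
)

cleaned.to_csv("survey_clean.csv", index=False)  # hand off to R
```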

If you are concentrating on the latter, that is, teaching data analytics from beginning to end, this is much more ambitious. Keeping people's attention over 2-4 days would also be pretty tough, I imagine. Regardless, I would probably break the tasks down as: data entry/acquisition, data manipulation/cleaning, analysis, and housekeeping. Topics to be covered might look like this:

  1. data entry/acquisition – topics could range widely: accessing deposition, data transformation, types of data, scraping, flat-file v. relational databases, historical data sources, OCR
  2. data manipulation/cleaning – a vastly underrated and very time-consuming step that would include topics like harmonization of measurement, merging disparate data sources, reshaping cross-sectional data as longitudinal (a sketch follows this list), missing data/imputation, and weighting schemes
  3. analysis – classic approaches v. data mining, visualization, interpretation, hypothesis testing; a tough topic because it is highly dependent on training/discipline
  4. end-project housekeeping – documentation, data archive/management/publishing, GitHub?
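
For item 2, a minimal sketch of the cross-sectional-to-longitudinal reshape in pandas, using an invented wide-format panel extract:

```python
import pandas as pd

# Hypothetical wide format: one row per respondent, one income column per wave
wide = pd.DataFrame({
    "id": [1, 2, 3],
    "income2013": [41000, 56000, 38000],
    "income2015": [43500, 58000, 39500],
})

# Long ("longitudinal") format: one row per respondent-year
long = (
    pd.wide_to_long(wide, stubnames="income", i="id", j="year")
    .reset_index()
    .sort_values(["id", "year"])
)
print(long)
```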

Considering the length of the workshops, hands-on tutorials would be useful to keep interest up. A recurring challenge with this sort of workshop is removing the technology itself as an impediment. Learning R is fine for graduate students, but I'd prefer something more ubiquitous for undergrads, such as Excel or maybe a web-based app.

dumit commented 8 years ago

From Richard M:

I don't have strong opinions about teaching basic data literacy, as I've not really taught it. Judging from the students I've asked who enrolled in them, the standard workshops seem useful enough.

Beyond basic materials, it seems important to impress upon students that correlations among variables are commonplace. We shouldn't be surprised to find correlations. So whatever they end up doing with data, they need to remember that finding correlations is easy. But finding causes is hard. Data science cannot be a substitute for theory.

This is a basic meta-theoretical literacy issue, as it seems like so much data science assumes the answer is in there and can be fished out with enough computation. More commonly, I think, the answer is not in there, but some correlations can be found and some religion can be built around them.

In the most succinct form, the lesson is: Correlation is everyplace, but rarely does it mean anything.

I'm not sure how to impress this point on beginners. But Tyler Vigen's website is a good start, at least. http://www.tylervigen.com/spurious-correlations

Maybe finding some silly correlations in the General Social Survey would be a useful exercise?
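
One way such an exercise could be scaffolded (with simulated data standing in for a GSS extract, since the point is precisely that unrelated variables still produce "findings"):

```python
import numpy as np
import pandas as pd

# Simulated stand-in for a survey extract: eight unrelated variables
rng = np.random.default_rng(0)
fake_gss = pd.DataFrame(rng.normal(size=(200, 8)),
                        columns=[f"var{i}" for i in range(8)])

# Rank the pairwise correlations; the "strongest" ones mean nothing,
# which is exactly the lesson
corr = fake_gss.corr()
off_diag = corr.where(~np.eye(len(corr), dtype=bool))
print(off_diag.abs().stack().sort_values(ascending=False).head(5))
```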

dumit commented 8 years ago

From Psychology and Center for Mind and Brain:

I agree with CMB that our grad students and faculty wouldn't be likely to have need of the kinds of introductory level matters that are highlighted on the website. More advanced quantitative procedures -- data mining; longitudinal data analyses; network analyses; etc. -- would be of greater interest, but probably would require more than 4 days (and there are already groups that run 1-2 week summer workshops on such topics).

I suppose that one topic that might be useful would be on data archiving for accessibility to other researchers. Several funding agencies require this now, and I know that some faculty are unfamiliar with how and where to do this.

It's possible that some undergrads would benefit from other kinds of introductory modules listed, but between the courses we already offer and the internships that many students get in labs, I think that most of the interested students are already getting the training...

rochelleterman commented 8 years ago

Hi All: I like @dumit's typology. I would place special emphasis on the first two items: data acquisition and data manipulation. In my department (political science), there is a lot of interest in getting data, particularly off the web through webscraping and APIs. @ckrogs and I have put together materials on webscraping and APIs for our Computational Social Science course. We also have stuff on information retrieval, text analysis, etc.
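
For the API side of data acquisition, the basic pattern is short enough to sketch here; the endpoint and query parameters below are placeholders, not the actual examples from those course materials.

```python
import pandas as pd
import requests

# Placeholder endpoint and query parameters
url = "https://example.org/api/v1/events"
params = {"country": "US", "year": 2014, "format": "json"}

resp = requests.get(url, params=params, timeout=30)
resp.raise_for_status()     # fail loudly on HTTP errors
records = resp.json()       # assumes the API returns a JSON list of records

# Flatten the JSON records into a table for analysis
events = pd.json_normalize(records)
print(events.head())
```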

I would actually downvote the analytics section, because 1) it is very discipline-specific, and 2) it is already covered well in home departments.

benmarwick commented 8 years ago

I agree with @rochelleterman's comments on the typology, and add that I've found undergrads with no prior command line experience can quickly pick up enough R to be useful. So I'm optimistic that an open source scripting language like R or Python can be central to this project, even for social science undergrads with no prior experience.

I'd recast 'end-project housekeeping' (which makes these topics sound like low-status afterthoughts) as 'reproducible and open research' and include discussion of repositories, software licensing and copyright options for publications.

dumit commented 8 years ago

From Kim, who teaches sociology:

Here is a list of "basic data skills" often required for social science research:

Finding, “cleaning” and using datasets

- What types of data are used in social science research (data already collected, data the researcher generates/collects)
- What are the pros and cons of different kinds of data
- Where to look for existing datasets and other sources of data
- Manipulating and cleaning data
  - recoding and creating variables
  - reshaping, combining, and collapsing datasets
- doing all of these in a way that is reproducible

From computer code to tables and figures

- Outputting results efficiently
- Graphics and data visualization (both for data exploration and for presentation of results from statistical analyses)

Topics in data analysis

- Using weights
- Missing data
- Power calculations
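
Two of these topics fit in a few lines; a sketch with made-up numbers, using numpy for a weighted mean and statsmodels for a two-sample power calculation:

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

# Using weights: survey design weights change the estimate
income  = np.array([31000, 52000, 78000, 45000])
weights = np.array([1.6, 0.9, 0.5, 1.4])        # illustrative design weights
print(income.mean())                             # unweighted mean
print(np.average(income, weights=weights))       # weighted mean

# Power calculation: sample size per group for a two-sample t-test
n_per_group = TTestIndPower().solve_power(effect_size=0.3, alpha=0.05, power=0.8)
print(round(n_per_group))
```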

In terms of programs:

- Excel is very useful and its utility is usually underestimated by social science students
- Stata is very commonly used
- but R is probably going to overtake commercial packages and may be the best investment for budding social scientists

I hope this helps. Thanks very much for investing in this! Best, Kim