data-lessons / gapminder-R

http://data-lessons.github.io/gapminder-R/

Davis pilot workshop reflection #7

Open michaellevy opened 8 years ago

michaellevy commented 8 years ago

Feedback

In general, I thought it went well, and learners seemed happy with it. Day 1 especially (spreadsheets, OpenRefine, and R lessons 1-4) is, I think, pretty solid as is.

Day 2 could use some tinkering. Single-table dplyr took the whole morning. Afternoon consisted of tidyr::gather, statistical modeling, writing functions, and dynamic documents, in that order.

By the end of day 2, students were pretty fried. I don't know if there is a way around that: Forging new neural connections for two days is just exhausting. But the (my) tendency to cram material in the second afternoon needs to be avoided. Reserving space for a capstone exercise might help with this, or students might be too spent to do that kind of independent work at the end. An alternative is a showcase of possible next steps: Here's the kind of natural language processing you can do in R (showing without teaching) and your first resource to start learning it, and the same for social network analysis, structural equation modeling, etc.

Lesson 5 - dplyr

People like learning dplyr, understandably so. It handles most of what most people do. The basic structure is good, I think.

Piping of data.frames to the first argument of the subsequent function didn't sink in with some students, even though I felt like I went over it quite a few times. In exercises, several students would name the data.frame inside functions that were already receiving it from a pipe. I think this is a symptom of the various arguments to dplyr functions not being clear enough, and the issue below about the structure of mutate and summarise being different from the others is part of this. Introducing all the verbs with intermediate assignment and then introducing piping at the very end might help.
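For reference, a minimal sketch of the distinction (using the gapminder package's data.frame as a stand-in for the lesson's data):

```r
library(dplyr)
library(gapminder)  # or load the lesson's gapminder csv

# The pipe supplies the data.frame as the first argument, so it is not named again:
gapminder %>%
  filter(year == 2007)

# The common learner mistake: naming the data.frame inside the piped call,
# which passes it twice and errors.
# gapminder %>%
#   filter(gapminder, year == 2007)
```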

The structure of mutate and summarise is different from the other verbs because they contain a colName = that the others don't. Maybe pointing explicitly to that syntactical difference a couple of times -- "these two functions create new columns, and we give those columns names with colName =" -- would help. Assigning to columns within piped functions while also assigning the resulting data.frame to a variable is complicated, and at least some learners have a hard time grokking the component parts.
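Something like the following contrast might make the difference concrete (a sketch, again using the gapminder data.frame):

```r
library(dplyr)
library(gapminder)  # or load the lesson's gapminder csv

# filter() and select() only refer to existing columns:
gapminder %>%
  filter(year == 2007) %>%
  select(country, gdpPercap, pop)

# mutate() and summarise() create new columns, which is why they need colName =
gapminder %>%
  mutate(gdp = gdpPercap * pop) %>%
  group_by(continent) %>%
  summarise(meanGdp = mean(gdp))
```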

Piping to head at the end of dplyr chains inevitably leads to students copying and pasting code and assigning the head of a data.frame to a variable. head should be taught, with str and summary, but we can keep it separate from piping by using tbl_df's nice printing. Maybe start the dplyr lesson with conversion to tbl_df: it's conceptually easy, would take only 30 seconds at the beginning of the lesson, and would avoid headaches further along.
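A sketch of what that opening might look like:

```r
library(dplyr)
library(gapminder)  # or load the lesson's gapminder csv

# Convert once at the start of the lesson; printing is then concise by default,
# so there is no need to pipe to head() just to tame the console output.
gapminder <- tbl_df(gapminder)
gapminder

# head(), str(), and summary() can then be taught on their own, away from piping:
head(gapminder)
str(gapminder)
summary(gapminder)
```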

Lesson 6 - tidyr

Something was missing from the gather part of lesson 6. I was trying to move quickly, so I gave a pretty quick explanation and worked one example before giving the students an exercise and moving on. I don't think the motivation was clear, and a lot of students had trouble with the various arguments (key and value especially) to gather. A second example, perhaps bigger and more realistic, would be useful. Separately from this lesson, a student asked about working with three-dimensional arrays in R (he had subject-by-time-by-electrode data); tidying a dataset like that could be cool. Making a stronger connection between tidy data and ggplot might help motivate the lesson. E.g., if you wanted to plot this wide data and map the various conditions to color, how would you do it in ggplot? You can't easily, but with gather you can convert it to the form that ggplot (and lm and more) expect.
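A sketch of that motivation, using a made-up wide dataset (the column and variable names below are just placeholders):

```r
library(tidyr)
library(ggplot2)

# A small, hypothetical wide dataset: one column per experimental condition
wide <- data.frame(
  subject   = 1:4,
  control   = c(2.1, 1.9, 2.4, 2.2),
  treatment = c(3.0, 2.8, 3.3, 3.1)
)

# key names the new column that holds the old column names;
# value names the new column that holds the measurements
long <- gather(wide, key = condition, value = score, control, treatment)

# Mapping condition to color is natural with the long (tidy) form,
# but awkward with the wide form:
ggplot(long, aes(x = subject, y = score, color = condition)) +
  geom_point()
```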

Lesson 9 - statistical modeling

The social scientists were hungry for this, as we rely heavily on statistical models. The content that is there worked well, and I love the connection with ggplot. Introducing a few more functions (t-tests, ANOVA) might be useful and low cost.
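For example, a couple of base-R functions that could be added with little extra time (a sketch, using the gapminder data.frame):

```r
library(gapminder)  # or load the lesson's gapminder csv

# Two-sample t-test: life expectancy in Africa vs. Europe in 2007
africa <- gapminder$lifeExp[gapminder$continent == "Africa" & gapminder$year == 2007]
europe <- gapminder$lifeExp[gapminder$continent == "Europe" & gapminder$year == 2007]
t.test(africa, europe)

# One-way ANOVA across all continents for a single year
fit <- aov(lifeExp ~ continent, data = subset(gapminder, year == 2007))
summary(fit)

# anova() also compares nested lm() models, which ties back to the lm content
m1 <- lm(lifeExp ~ gdpPercap, data = gapminder)
m2 <- lm(lifeExp ~ gdpPercap + continent, data = gapminder)
anova(m1, m2)
```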

Lesson 7 - writing functions

I rushed through this. Students were able to write their own (F_to_C) function and source their code/functions.R files, so they actually got quite a bit rather quickly. Some saw the payoff in terms of organization, but we need a better motivating function after the temperature conversion examples. Something that makes learners say "oh yeah, I do that over and over, it would be great to write one function and just be able to call that."
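For reference, the function from the lesson was roughly this simple, which is why it goes quickly:

```r
# The temperature-conversion function learners wrote (approximately):
F_to_C <- function(temp_F) {
  (temp_F - 32) * 5 / 9
}
F_to_C(212)  # 100

# ...saved in code/functions.R and then loaded with:
source("code/functions.R")
```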

Lesson 8 - dynamic documents

This lesson needs some improvement. Making our own custom .Rmd template will help; that way we can introduce students gently (the first code chunk in the default template is probably overwhelming!). I'd start with basics of markdown and later introduce code chunks and then code chunk options.
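For concreteness, one possible shape for such a pared-down template (the title, prose, and chunk contents below are just placeholders; the chunks assume the gapminder data.frame is loaded):

````
---
title: "My first dynamic document"
output: html_document
---

Plain *markdown* text comes first, so learners start with something familiar.

```{r}
# A tiny first chunk, far simpler than the default template's setup chunk
library(gapminder)  # or read in the lesson's gapminder csv
summary(gapminder$lifeExp)
```

```{r, echo=FALSE}
# Chunk options come last, e.g. hiding the code while keeping the output
plot(lifeExp ~ gdpPercap, data = gapminder)
```
````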

Part of the problem is that this forces a break from the model of the rest of the workshop, especially if the instructor has been piping a live-script to learners' browsers. Not sure what to do with that, but again the custom template might help by getting instructor and learners doing the same things in the same place.

What's missing?

jedaniels-ucd commented 8 years ago

Generally it went well, Michael, but here are some thoughts, and they should only be taken in the spirit of constructive criticism. While I think you did an excellent job covering the material outlined, and you obviously have a mastery of the material, I would have structured the lesson another way, because I think the lesson tended to over-emphasize advanced topics and under-emphasize base concepts.

First, I think OpenRefine was a red herring at best. An idiosyncratic Java package with a dubious future and questionable scalability is not something I would have spent time on. It would be much better, in my mind, to learn to do just a few of the tasks that OpenRefine handles directly in R (with grep, regular expressions, etc.). Yes, I know folks liked it, but they also like using Excel, and that is exactly what we are trying to move them away from.
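To illustrate the kind of thing I mean, a sketch of base-R equivalents for common OpenRefine cleaning steps (the messy vector below is made up):

```r
# A made-up vector of messy country names
countries <- c(" United States", "united states", "U.S.A.", "Canada ")

trimws(countries)                                # strip leading/trailing whitespace
tolower(countries)                               # standardize case
gsub("\\.", "", countries)                       # remove characters with a regular expression
grepl("united", countries, ignore.case = TRUE)   # find entries matching a pattern
```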

Day 1 covered most of the high points of an average introduction-to-stats-software course, although I think a bit more time could have been spent on the nature of data in R (data.frames versus scalars/vectors/matrices) than we did, and a bit less on logical data (an important topic, but only in the context of data manipulation/generation). Topics that might have been added include: importing data from other applications, transforming variables (not covered until Day 2), summary statistics (for categorical as well as continuous data), base functions (math, string, logical, dates), replacing/recoding data, and metadata (variable and value labels).

In some ways the instructor was handicapped by the choice of the gapminder data, and if you really want to focus on social science topics, you are going to want a more mixed data set (e.g. survey data). Looking through the tidyr material, I just noticed we skipped join altogether. I do not usually cover this in a short class, but it would have been a useful topic as well, although you would need a somewhat more substantive example to drive the lesson.
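Even a small illustration would convey the idea; a sketch with a hypothetical second table keyed by country:

```r
library(dplyr)
library(gapminder)  # or load the lesson's gapminder csv

# A hypothetical second table, e.g. survey metadata keyed by country
regions <- data.frame(
  country  = c("Afghanistan", "Albania", "Algeria"),
  surveyed = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# left_join() keeps every row of gapminder and adds the matching columns
joined <- left_join(gapminder, regions, by = "country")
```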

Day 2 was really a hodge-podge of marginally related topics, and the lesson flow suffered as a result. For the most part the dplyr section was on target, but I do have a general reservation about teaching idiosyncratic library functions. I realize that R is mostly just idiosyncratic library functions, but I'm always hesitant to teach foreign functions before the students have even a basic understanding of the underlying base language. I also realize that, given R's evolutionary development, what is a foreign library today may be in base tomorrow, and I do not have enough experience to judge the merits of dplyr versus the alternatives, but I did want this issue to be raised, whether it is considered or disregarded.

The function section was pretty much a throwaway. Yes, it is a useful topic, but it has very limited application for a novice R user. In my opinion, they would have gotten more out of a discussion of loops than out of the brief exposure to functions.

Similarly, the dynamic documents section was very cool from a programming point of view, but an unnecessary diversion from a teaching point of view. It is a very useful bit of technology, but not really something I would spend time teaching beginners.

The statistics section was fine, but without a bit on non-parametrics (frequencies and crosstabs at least) it felt somewhat lacking. There are a ton of other techniques you could have covered (ANOVA, t-tests, logistic regression), but I agree that there is only so much time you want to spend on this section. Perhaps a discussion of what is built in and what needs to be installed as a package might have been helpful.
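For what it's worth, the frequencies and crosstabs I have in mind are all in base R (no package needed); a sketch using the gapminder data.frame:

```r
library(gapminder)  # or load the lesson's gapminder csv

gap07 <- subset(gapminder, year == 2007)

table(gap07$continent)                           # one-way frequency table
with(gap07, table(continent, lifeExp > 70))      # crosstab

# Chi-squared test of independence on the crosstab
# (may warn about small expected counts; the point is just the functions)
chisq.test(with(gap07, table(continent, lifeExp > 70)))
```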

Versioning was a major issue throughout the course. Something has to be done, probably at the beginning of the class, to make sure everyone is using the same version of R and the libraries. This issue cropped up way too often, although with the nature of R, it might be unavoidable.

I am not sure how married the Data Carpentry program is to the two-day workshop, but my recommendation would be to pare down the class to its basics and squeeze it into a day. If you must have two days, then you may want to split the class into part I and part II, using part I as the prerequisite.

Finally, I think there is an underlying philosophical/pedagogical theme that runs through the course that I would encourage you to re-evaluate. This notion that course plans can and should be improvised is antithetical, in my mind, to a properly paced class that flows logically and keeps the students' interest. Perhaps what I say is heretical, flying in the face of a guiding principle of Data Carpentry, but my experience has shown me that the more defined the course, the more logical the process, and the tighter the presentation, the more effective the class. This is a guiding principle of education that we like to ignore in higher education because we are supposedly 'beyond' the need for such structure. My argument is that we are not; we just tend to be too lazy and rationalize when our shortcuts fail -- we all do it, myself included. Note that a good part of this problem unavoidably stems from the modular nature of the course. I think the first R lesson flows the best because the topics are better linked conceptually than in the later lessons, so the problem is surmountable.

In conclusion, I am very impressed with how developed the materials are and with Michael's ability to teach a very difficult topic to an 'unusual' audience. Feedback was requested so I thought I would add a few thoughts, in the hopes of refining the overall lesson, and it was not my intent to sound hypercritical, only to point out the shortcomings I perceived.

ErinBecker commented 8 years ago

Thanks, Michael, for getting the conversation started here. I greatly enjoyed the workshop and thought that it was exceptionally good, particularly for our first pilot of the social sciences material. I have no experience teaching R, but I do have training in instructional design and experience working with social scientists (education researchers), so I will limit my comments to those areas.

1) Based on DC's overall goals of promoting reproducible research and being accessible to a community of computational novices, OpenRefine is an excellent way to start out these lessons. This section allowed workshop learners to see immediately how they can improve the reproducibility of their workflow, in a highly accessible and non-intimidating way. That gives you buy-in (because learners are engaging in an authentic task) and enables you to move forward with more advanced topics.

2) A suggestion for transitioning from OpenRefine to R might be to have students write out/export the cleaned data from OpenRefine and read it into R. This would help people see what a good workflow might look like (clean data in OpenRefine, then migrate to R for analysis: plotting, stats, etc.).
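The R side of that hand-off is essentially a one-liner; a sketch (the file name below is hypothetical):

```r
# After exporting the cleaned table from OpenRefine as CSV:
cleaned <- read.csv("gapminder-cleaned.csv", stringsAsFactors = FALSE)
str(cleaned)  # confirm the cleaned columns came through as expected
```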

3) I agree with @ucd-ssds-jedaniels that Day 2 didn't flow as well as Day 1. I ended up leaving around 3pm the second day, so didn't get to see the dynamic documents lesson, but I would suggest that that lesson may not be as critical to the core goals of the workshop and could be expendable. If we moved in that direction, it would enable expansion of some of the other topics. It will be interesting to see which topics learners felt were the most critical to their research. Perhaps those sections could be expanded in lieu of the dynamic documents lesson.

4) I'd like to respond directly to @ucd-ssds-jedaniels 's statement that:

This notion that course plans can and should be improvised is antithetical, in my mind, to a properly paced class that flows logically and keeps the students' interest. Perhaps what I say is heretical, flying in the face of a guiding principle of Data Carpentry, but my experience has shown me that the more defined the course, the more logical the process, and the tighter the presentation, the more effective the class.

This isn't simply a "guiding principle" of Data Carpentry. There is an extensive body of education research literature showing that responding to learners' difficulties in real time (i.e., during class) is a more effective method of teaching than working through a pre-set lesson plan to cover the prescribed material without getting any feedback from learners. This is why our workshops are taught the way they are, and why the minute cards at the mid-point and end of each day are essential components of our teaching. I think Michael did a fantastic job of responding to learner feedback, both from minute cards and during class time, and I have no doubt that learners had a more useful learning experience from this than they would have if he had walked through a pre-defined set of examples without getting learner input. That being said, I think this was implemented better the first day than the second, which may have been due to trying to fit too much into day 2.

5) Dplyr was a great choice to cover with this group. It gives them an immediately useful tool for simplifying their workflow and making it more reproducible. This is fully in line with the Carpentry objectives and is much more important to us than teaching an appreciation of the structure of the language.

6) The statistics lesson was a great thing to include. Again, I heartily agree with what Michael said in class about us not being there to teach the learners statistics. The point of the statistics lesson is to demonstrate that, if you know the underlying statistics (which we expect you to learn elsewhere) and know how to troubleshoot, you can implement pretty much any statistical method in R. I think a very useful exercise here would be something like "Think of a statistical test that you use frequently in your research. Search for and install a package, or find the function in base R, that enables you to do that test." This builds on the "culture of lifelong learning" that DC is trying to promote and also makes the lesson more immediately applicable to the individual learner's research.

7) Functions are important to cover (more so than loops, especially for the social sciences crowd), but the temperature conversion example struck me as largely irrelevant to this group. Maybe a currency conversion? Someone with more social sciences background than me can probably provide a more appropriate example.
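If that direction seems promising, the replacement could stay just as simple as the temperature example (a sketch; the exchange rate below is made up and would be looked up or taken from the data in class):

```r
# A hypothetical currency-conversion function, analogous to F_to_C
usd_to_eur <- function(usd, rate = 0.9) {
  usd * rate
}

usd_to_eur(100)               # uses the default (made-up) rate
usd_to_eur(100, rate = 0.85)  # override the rate explicitly
```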

8) With respect to versioning, I agree that this was a bit of an issue, but it provides a great (and authentic) example to learners of when to care, and when not to care, about the scary red text that pops up on their screens. Better for them to see a versioning issue for the first time in a workshop than on their own.

Overall, I think the curriculum has some tweaking to do, but that it was a very solid pilot that provided a useful learning experience for the participants.

ghost commented 8 years ago

Hi, Michael,

I am going to echo Erin's comments that this was a great launch for the social sciences R workshop. A couple of notes I took during the class:

The "MCQ: Data Reduction" challenge was unclear to students because it asked them to look at countries with "per capita gdp less than a dollar a day", but then they had to report "the annual per-capita gdp". Consider changing the multiple-choice answer options to refer to the same gdp. I think you also wanted to move that same question to after the "mutate" section; I just looked at the lesson, and it looks like you have already done that.

You also wanted to add to the pre-workshop instructions that anyone who has previously installed R needs to update it prior to the workshop.

On another challenge question, "Challenge – Part 1: Calculate the variance – var() – of countries' gdps in each year. Is country-level GDP getting more or less equal over time?", I was confused about whether the question wants the variance of income in a single country over the course of multiple years or the variance in income between all the countries. It may have been just me, though.
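For what it's worth, the two readings correspond to two different groupings; a sketch using the gapminder data.frame (the first grouping seems to be the intended one, given the "more or less equal over time" phrasing, but the challenge text could say so explicitly):

```r
library(dplyr)
library(gapminder)  # or load the lesson's gapminder csv

# Reading 1: variance across countries, within each year
gapminder %>%
  group_by(year) %>%
  summarise(var_gdp = var(gdpPercap))

# Reading 2: variance across years, within each country
gapminder %>%
  group_by(country) %>%
  summarise(var_gdp = var(gdpPercap))
```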

I saw a suggestion in the student feedback about challenge questions for homework. I think that's a good idea. That way they can take as much time as they need and find out what is truly unclear vs. not having had a chance to process a lot of new information.

Again, looking through the student feedback, it looks like they were happy with the pace on the first day but had too much the second day. Would Data Carpentry consider moving pieces of the workshop around? For example, on day 1 start with OpenRefine, continue with intro to R, project management, subsetting and data.frames, and maybe some tidyr. On day 2, start with statistics and plotting, and finish with dynamic documents and spreadsheets and best practices, which are easier to absorb even when the students are tired. I have qualms about suggesting doing the applied stuff prior to best practices, especially since spreadsheets leads so nicely into OpenRefine, but on the other hand it is important to take into account the students' ability to learn as they get more tired.

Vessela

Myfanwy commented 8 years ago

Hi Michael,

Sorry for the delay in getting a response to this - I've been traveling for the past few weeks. Most of my comments have been mentioned above (and I was only present for Day 1), but I would support re-arranging concepts so that Day 2 doesn't end up being overwhelming. I liked OpenRefine (I hadn't been exposed to it before) and I can definitely see myself using it in the future - but as a learner I would also have liked knowing about the tools available to do the same thing in R (even just a mention of them, so I could look them up later if I wanted to streamline my workflow).

Myfanwy