Reproducible-Science-Curriculum / Reproducible-Science-Hackathon-Dec-08-2014

Workshop bringing together individuals interested in developing curriculum, workflows, and tools to strengthen reproducibility in research.

Low-tech steps towards reproducibility #13

Open jennybc opened 9 years ago

jennybc commented 9 years ago

Looking over the existing issues, I get all excited about the things I will learn next week!

But I am reminded of advice I got from someone much wiser re: teaching … if you think your material is really interesting, it's probably way too advanced.

I think we also need to give serious attention to basic habits that are useful, like naming and organizing files well, and to simple implementations of core concepts, like pseudo-Makefiles in whatever language someone prefers. Version control is important because that's how you work up the nerve to do `make clean` and see if you really can replay your analysis. Capturing the random seeds for anything stochastic is also vital.
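
To make that concrete, here's a minimal sketch of what a pseudo-Makefile could look like in R (file names are hypothetical; the idea, not the specifics, is the point):

```r
# make.R -- a "pseudo-Makefile": rerun the whole analysis from scratch, in order.
# File names below are hypothetical; adapt to your own project.

set.seed(20141208)             # capture the random seed for anything stochastic

source("01_import-data.R")     # raw data in, tidy data out
source("02_fit-models.R")      # tidy data in, fitted models out
source("03_make-figures.R")    # models in, figures out
source("04_write-report.R")    # everything in, report out
```

If running that one script on a fresh copy of the project regenerates everything, you really can replay the analysis -- that's the `make clean` test in low-tech form.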

Let's not take the basics for granted. I'd love to hear other people's suggestions of habits and tools with very high reproducibility "bang for your buck".

kbroman commented 9 years ago

This is of great interest to me. Version control is a tough sell; how do we at least get people moving in the right direction?

I'm reminded of “Baby steps for the Open Curious.” We could use a similar “Baby steps towards Reproducibility.”

pschloss commented 9 years ago

> But I am reminded of advice I got from someone much wiser re: teaching … if you think your material is really interesting, it's probably way too advanced.

This is a really good point. There have been a lot of good ideas posted already, but I worry about the intended audience for most of them. Question one should be: who is our audience? I think it should be people who are making the transition from bench science to desk science, or who are integrating desk science into their bench work (I just made that term up!). On this note, the items that I think would get the biggest bang for the buck would be...

  1. Develop RR tools for GUI/webtool users and convince them that GUIs/webtools are a problem. Like it or not, most of the people I run into who are getting going on desk science may have some experience with the command line, but are still using tools like Excel, Prism, etc. Yes, they need to learn R, Python, MySQL, whatever. But they need to be given reproducible tools that pull them into those languages so that they can form good habits early. As an example, it is very common for the bioinformatics in a microbiome paper to be distilled down to one or two sentences; when you scratch the surface, you realize that the citation doesn't match the method and that only processed, not raw, data are available.
  2. Develop real-world case studies for discussion that show the complexity and value of making one's research more reproducible. A situation that resonated with my students was a student graduating weeks before a paper was published, when we didn't have a digital notebook for the project (mea culpa). Now I'm stuck trying to help people through his analysis. Even in this simple case, there are many points that could be made for a lot of the stuff we want to push people towards. It would be great to develop a couple of introductory case studies so participants can see the importance of maintaining documentation for ensuring reproducibility.
  3. Emphasize the importance of making data and metadata available. Many people are amazingly resistant to the idea of opening up their data and methods. Some of the quotes in this Nature News article are pretty spot on with my experience. For instance: "The study is described so that it could be replicated by another expeditionary team who were willing to dive across Oman collecting rare fish under several hard-earned licences". Whether it's additional case-study discussions or sending people down the rabbit hole of trying to find raw data and metadata, we need to get people to see the openness issue from the side of the person trying to go beyond the original work, and to see the value of having others riff off their work without the largely unfounded paranoia that they'll be scooped.

I think it would be awesome if people could leave a 2-day tutorial on reproducibility appreciating that there is a problem, seeing that they can benefit from improved reproducibility, understanding that a lot of GUI and web-based tools are probably part of the problem, and carrying away some simple tools and workflows to improve the reproducibility of their own work.

jennybc commented 9 years ago

I think it's hard and unnecessary to leave GUIs and mice behind. Not to mention an impossible sell.

The real point is not to let that stuff get baked into the product. For example, I love using the RStudio IDE, but nothing about my finished work requires someone to have it. Even the shell has autocompletion, which aids the user but still results in full commands being expressed and executed. If a GUI helps you build a reproducible workflow, by all means take advantage. You just don't want the resulting workflow to rely on, e.g., faithfully executed mouse clicks.
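
As a small illustration (just a sketch, assuming an R Markdown report with the hypothetical name `analysis.Rmd`): the IDE's buttons are a convenience while you work, but the build step itself can be written down as a plain command that anyone can run without the IDE:

```r
# Rebuild the report from a script or the command line, no IDE required.
# "analysis.Rmd" is a hypothetical file name.
rmarkdown::render("analysis.Rmd")
```

The GUI stays a convenience during development; the command is what gets captured in the product.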

This is why I advocate graphical Git clients for novices. They don't preclude or interfere with command-line Git, but they can be much easier at the start. Why do we all love GitHub so much? After all, Git servers have been around a long time, enabling collaboration. I think the graphical interface of GitHub is a big contributor to its success.

Yes also to case studies, which then drive home the importance of data and metadata availability.

jennybc commented 9 years ago

This has really got me thinking … in the data world, we emphasize the power of our visual processing. "A picture is worth a thousand words" and all that. So I think it's misguided to do an about-face when it comes to actually doing computational work and project an attitude that GUIs are inherently flawed. The point is more subtle than that.

pschloss commented 9 years ago

Sorry! I should have been more specific - I don't think we should throw all GUIs under the bus. I think GitHub and RStudio are amazing services, and I'm constantly impressed by the little things they do to make things more user-friendly.

I'm really reacting to the observation of people using Excel, Prism, and websites like BLAST/MG-RAST/RDP to click options and do their analysis without a way to keep track of what they've done so others can reproduce it. Also, in the case of websites, the underlying databases may change in a non-transparent manner. Perhaps there are ways to improve the reproducibility of these types of analytical workflows, but they're just headaches in my world. My point is that we need to bring these issues up for people to appreciate, and if anyone has ideas on how to improve their reproducibility, that would be great.

jennybc commented 9 years ago

@pschloss No worries -- I didn't assume you were anti-GUI. It's an important point to think about, both for substance and, uh, evangelism. I've certainly been in situations where we veterans project an attitude of "GUI bad, CLI good", which makes things unnecessarily difficult for the people we are trying to persuade.

R has a GUI called R Commander that's not as slick and modern as RStudio. But it has a great feature: it writes the R code corresponding to your mouse clicks.

OTOH I hear enthusiasm about this data cleaning tool OpenRefine but AFAICT it is GUI only -- no way to capture / replay the cleaning.

I look forward to talking to some more of the experts next week, such as @tnabtaf from Galaxy, about this issue of good GUIs and responsible GUI usage.

tracykteal commented 9 years ago

I agree about meeting people where they are. Even motivated researchers won't act reproducibly if the technical overhead is too high. (Hmm, can we get "Act reproducibly" stickers?)

A key to using GUIs is logging and history. Tools like Galaxy produce logs, so there's a record of what the steps were. I know mothur from @pschloss has good logs too. OpenRefine has a history that can be exported along with the data and imported back into OpenRefine to reproduce the analysis. We've been demonstrating OpenRefine in Data Carpentry, but not discussing the history aspect of it, so it would be great to put together an OpenRefine reproducible-research module. It is a really useful tool; one of the minute cards said "OpenRefine changed my life!!!".

Also, on the motivation front, it seems like doing reproducible research for the "you of 6 months from now" is particularly motivating, whether or not people are as interested in making it reproducible for another researcher. Everyone remembers times when they've gone back to an analysis and couldn't remember what they did before.

I look forward to talking more about all of this next week!

jennybc commented 9 years ago

@tracykteal Glad to hear this about OpenRefine!

cboettig commented 9 years ago

Great thread!

I think it's important to frame this not simply as a high-tech/low-tech (or scripting vs. interface) discussion, but rather as "tools people already know & use" vs. new tools. For instance, I think there's a substantial graveyard of little-used GUIs for reproducibility (& other tasks), often built by competent academic software teams, on the (mistaken) grounds that GUIs are just easier to learn. Non-GUI tools can displace dominant GUI tools (I'm thinking of Mesquite -> R at the moment, but maybe I'm wrong about that example).

I really like @tracykteal's example of learning from when/why researchers did adopt a new tool, and what combination of training, ease of use, and value to the user is necessary there.

I think the other great theme here is that the primary problem for reproducibility is a lack of openness in sharing methods and data. I agree that we'd make much broader progress towards reproducibility if everyone published data along with their papers, rather than worrying about whether people still use Excel, etc. While it's easy to blame a culture of fear and data hugging, I think tools may play a key role. Victoria Stodden has found in her surveys that, when asked anonymously, researchers say they don't publish data or software because it is time-consuming to do so, not out of any fear of being scooped. Perhaps in public it is better to act as though one's data cannot be published because it is such a gold mine, while in reality the concern is more about what the data isn't than what it is. Can the right tools make it easier to maintain, share, and publish appropriately tidy data? It seems that more researchers share software on the user-friendly platforms of today, like GitHub, than did formerly on platforms like SourceForge. Can the right tools promote openness & reproducibility?

kbroman commented 9 years ago

I spent the day writing "Initial steps towards reproducible research."

It helped me to organize some of my thoughts for next week.