Reproducible-Science-Curriculum / Reproducible-Science-Hackathon-Dec-08-2014

Workshop bringing together individuals interested in developing curriculum, workflows, and tools to strengthen reproducibility in research.

Encouraging people along path towards reproducibility #15

Open kbroman opened 9 years ago

kbroman commented 9 years ago

For scientists who see the value in having a reproducible research workflow, it's gotten relatively easy to help them adopt improved practices. But it seems like the majority are just not interested.

How to get more scientists to put more effort into adopting better practices? Is there a need to further lower the barrier to entry, or is this a public relations issue: refining the "pitch"?

My course on tools for reproducible research was way under-subscribed last year, and there seems to be a similarly low level of interest this year. I'm not sure what to do about that.

dmlond commented 9 years ago

I agree with you that many researchers are just not that interested in this yet. One thing that this brings to my mind is the intended modularity of the output of this hackathon. We need to develop modules of instruction that target researchers with different data management needs, with different paths towards reproducibility. Each of these paths should encourage the researcher to move from simple behaviors to more complex engineering (but emphasize that all are at different points on the path to enlightenment).

Excel is really all I need: these researchers are happy storing all of their data in one or more Excel spreadsheets. Their data are relatively small, tabular in nature, and do not require complex compute infrastructure to analyze.

  1. switch from Excel to comma-separated text files of data analyzed in R or Python
  2. organize their data and code in standard directory structures
  3. use git to version control data and R/Python source
  4. publish data and R/Python source via GitHub
  5. learn how to use RStudio or IPython Notebooks
  6. learn how to use GitHub as a collaborative environment, as we are doing in this course
  7. learn how to use Vagrant or Docker to create a compute environment with their data and code that others can use to reproduce/reuse their analysis

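Step 1 of this path, moving from a spreadsheet to plain-text data analyzed by a re-runnable script, might look something like the sketch below (the column names and file path are made up for illustration; in practice the rows would come from a CSV exported from Excel):

```python
import csv
from statistics import mean

# Hypothetical rows, as if read from a CSV exported from Excel, e.g.:
#   with open("data/raw/weights.csv") as f:
#       rows = list(csv.DictReader(f))
rows = [
    {"sample_id": "s1", "treatment": "control", "weight": "12.1"},
    {"sample_id": "s2", "treatment": "drug", "weight": "14.3"},
]

# A scripted summary: unlike hand-edited spreadsheet formulas,
# this can be re-run verbatim on updated data.
by_treatment = {}
for row in rows:
    by_treatment.setdefault(row["treatment"], []).append(float(row["weight"]))

summary = {t: mean(ws) for t, ws in by_treatment.items()}
print(summary)
```

The point is less the analysis itself than that every step from raw data to result is recorded in code.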
I need special programs to analyze my data: these researchers have gotten to the point where they need to use, or even write, one or more programs in some programming language. This might be to implement a special algorithm they have developed (Burrows-Wheeler, etc.), or it might be to automate the analysis of data using existing tools.

  1. favor open source tools over tools with restrictive licenses
  2. favor command-line tools with parameters over graphical user interfaces
  3. make their own programs open source with a known open source license
  4. use git to version control their code along with documentation
  5. use good software engineering practices to organize and document their code
  6. maintain their code as a distinct package, separate from their research data
  7. learn how to use Vagrant or Docker to create a shareable machine image with their applications plus all required libraries
  8. follow/modify steps 1-7 of the Excel user's path to organize data and scripts for their actual research
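Step 2 above, preferring command-line tools with parameters over GUIs, is worth a small illustration: when every setting is a flag, the exact invocation can be recorded in a script or Makefile and re-run later. A minimal sketch (the tool, its flags, and the data are all hypothetical):

```python
import argparse

def build_parser():
    # Each setting is an explicit, recordable flag rather than a
    # checkbox clicked in a GUI session.
    p = argparse.ArgumentParser(
        description="Count data records in a text file (illustrative tool)."
    )
    p.add_argument("infile", help="input file")
    p.add_argument("--skip-header", action="store_true",
                   help="do not count the first line")
    return p

def count_records(lines, skip_header=False):
    n = sum(1 for line in lines if line.strip())
    return max(n - 1, 0) if skip_header else n

# Parsing an argument list directly, to show the recorded invocation:
args = build_parser().parse_args(["weights.csv", "--skip-header"])
n = count_records(["sample_id,weight", "s1,12.1", "s2,14.3"],
                  skip_header=args.skip_header)
print(n)  # 2
```

Anyone reading `mytool weights.csv --skip-header` in a script knows exactly how the result was produced, which is the reproducibility win over a GUI.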

I need massive parallelization for my data: These researchers need lots of compute infrastructure to analyze their data. They write complex code/scripts to analyze their data, using special systems such as Compute Grids, Hadoop, etc.

  1. organize code and data in standard directory structures
  2. automate movement of data and code to/from their parallelization infrastructure based on the standard directory structure, so that others can insert their data into the same structure and use their code to analyze it (assuming they have the right parallelization infrastructure available)
  3. steps 1-7 of the Excel user's path, modified for their own needs
  4. others?
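The staging idea in step 2 can be sketched in Python. The directory names here (`data/`, `code/`, `results/`) are assumptions standing in for whatever standard layout a group agrees on, not an established convention:

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical standard layout, per step 1:
#   myproject/
#     data/      raw inputs
#     code/      analysis scripts
#     results/   outputs pulled back from the cluster

def stage_for_cluster(project: Path, scratch: Path) -> Path:
    """Copy data/ and code/ into a scratch area for a cluster run.

    Because the layout is standardized, someone else can drop their
    own files into data/ and reuse this staging step unchanged.
    """
    run_dir = scratch / project.name
    for sub in ("data", "code"):
        src = project / sub
        if src.is_dir():
            shutil.copytree(src, run_dir / sub, dirs_exist_ok=True)
    (run_dir / "results").mkdir(parents=True, exist_ok=True)
    return run_dir

# Demonstrate with a throwaway project in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    proj = root / "myproject"
    (proj / "data").mkdir(parents=True)
    (proj / "code").mkdir()
    (proj / "data" / "input.csv").write_text("a,b\n1,2\n")
    run_dir = stage_for_cluster(proj, root / "scratch")
    staged = sorted(p.name for p in run_dir.iterdir())

print(staged)  # ['code', 'data', 'results']
```

The real version would rsync or scp to the cluster rather than copy locally, but the principle is the same: the transfer is driven by the agreed-upon structure, not by hand.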

Workflow Enactors: These researchers have discovered the utility of Taverna, Galaxy, etc. They already understand the benefits of these tools in making their analyses reproducible/reusable by others. My own proclivities towards scripting and pipelines show here, as I do not have a good idea for their path to enlightenment, but I know we have experts here.

jennybc commented 9 years ago

This is such an important point. I'll keep it short here, since we'll have the opportunity to discuss in person soon.

But one idea re: re-usable curriculum is to construct an ice-breaker, team-oriented exercise to be used at the beginning of two-day workshops. Like a scavenger hunt, but priming people for the nitty gritty of reproducible research :grin:

It could be lighthearted and memorable but still make not-so-subtle points about the value of documentation, metadata, and the power of the shell/R/Python. Teams could even be given materials accomplishing the same -- rather simple! -- analysis with various levels of reproducibility baked in. Then the post hoc discussion could cover what aspects of the task and materials combined to make things hard vs. easy.

This does not address @kbroman's point about pervasive ennui towards reproducible research, since it targets people who've already committed to a 2-day workshop.

kbroman commented 9 years ago

To fight the pervasive ennui, we might point to the example of Neville:

N is for Neville (from the GoreyStore)

cboettig commented 9 years ago

Yup, agree this is probably the key issue & looking forward to our discussions on it! I'd be keen to explore more options regarding 'the pitch.' My impression is that the pitch, heck even the term itself, implies other people reproducing your results, primarily after publication. Is that generalization accurate? Would it be more effective to emphasize more immediate benefits (perhaps a la SWC's "what you learn in a week will save you two weeks of work every year," or whatever it is) in efficiency, or ease of collaboration, or ease of teaching, etc.? (And is that actually true for the tools & approaches we'd be teaching? If not, why?)

Personally, I think some tools have a clear case of immediately allowing people to do something they couldn't do before, and others don't. Similarly, some have learning curves that are just too high (perhaps needlessly, due to design issues more than inherent ones).

Perhaps there's a question of scope here too, in that we seem to have a strong emphasis on computational reproducibility only.

tracykteal commented 9 years ago

I love @jennybc idea about the scavenger hunt at the beginning of the workshop. It would a very tangible way of pointing out what needs to be in place to reproduce research, even your own of 6 months ago, and would set the stage for the rest of the workshop. Very motivating! Although it doesn't help get people in the door. People do seem very motivated to learn how to use tools that make them more effective though. Could workshops be "Effective Data Analysis and Reproducible Research" or "Practical Approaches to Reproducible Research (for the you of 6 months from now)".