ReScience / ReScience-article

ReScience article repository

R-words #5

Open rougier opened 8 years ago

rougier commented 8 years ago

The idea is to converge on definitions of these words according to our respective scientific domains. At this point it is not even certain that all the words are relevant to all domains, and this may also depend on the kind of software we consider (see discussion in #4)

Rerunable:

Repeatable:

Replicable:

Reproducible:

Reusable:

Remixable:

Reimplementable:

oliviaguest commented 8 years ago

I will only attempt one because the rest seem too similar/unknown to me.

Reimplementable: Is there enough information in the specification (i.e., the journal article, or material referenced within the journal article) to recreate the model (within the theory or account) from scratch? If yes, then the model is reimplementable. If not, then even if the experiments can be carried out within the original (presumably opaque) codebase, the model cannot be reimplemented (given the current specification).

rougier commented 8 years ago

Rerunable: Is it possible to re-run the model (same computer, same system, same program) and get the exact same results? It may seem obvious that the answer is yes, but it is actually not that obvious. For example, if you're using a random number generator and did not set or record the seed, then you cannot guarantee a re-run. The same holds if you manually fed some parameters when starting your model without a mechanism to save them, or if you read them from a file that is changed after the run.
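For illustration only, a minimal sketch (in Python, with purely illustrative file names and parameters) of what "re-run insurance" could look like in practice: the seed and the parameters are written to disk next to the results, instead of being typed in by hand.

```python
# Illustrative sketch only: record the seed and parameters so the run can be repeated exactly.
import json
import numpy as np

params = {"seed": 42, "n_samples": 1000, "noise": 0.1}

# Save the parameters (including the seed) next to the results,
# instead of feeding them manually at startup.
with open("run_params.json", "w") as f:
    json.dump(params, f, indent=2)

rng = np.random.default_rng(params["seed"])  # seeded generator, not an unseeded global one
data = rng.normal(0.0, params["noise"], params["n_samples"])
np.save("results.npy", data)
```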

oliviaguest commented 8 years ago

What's the history/reason behind the capitalisation of rerun?

rougier commented 8 years ago

Replicable: Is it possible to re-run the model (same program) on a different computer, using a different system or a different version? Does your specification give enough information concerning the required libraries and their respective version numbers? Does your model rely on system-specific libraries? Does it correctly handle system-specific features (float precision, endianness, etc.)?
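Again purely as an illustration (the module choices and the output file are assumptions, not part of the definition): a run could record the software environment and platform characteristics it was produced with, so that someone on another machine can see what the results are supposed to depend on.

```python
# Illustrative sketch only: record the software environment and platform a result depends on.
import json
import platform
import sys

import numpy as np

environment = {
    "python": sys.version,
    "platform": platform.platform(),
    "byteorder": sys.byteorder,               # endianness of this machine
    "float_epsilon": sys.float_info.epsilon,  # float precision
    "numpy": np.__version__,                  # versions of key libraries
}

with open("environment.json", "w") as f:
    json.dump(environment, f, indent=2)
```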

rougier commented 8 years ago

@oliviaguest None, just correct it.

rougier commented 8 years ago

Reproducible = Reimplementable (for me)

oliviaguest commented 8 years ago

Ah, so for me it's more complex, like (although I might need to think about it more):

(Rerun + Reimplement) ~ Reproduce = Replicate

R-words on LHS are somehow weighted.

Edited. OK, not sure. But I think maybe best to nest them? I have seen other definitions out there.

oliviaguest commented 8 years ago

Something needs to be said about replicating the data also, in my opinion. If you are modelling data, then surely the experiment that produced the training data and testing data has to also be replicable.

khinsen commented 8 years ago

@oliviaguest Your definition of reimplementable looks fine to me, but it should be made clear that it applies to a human-readable document (such as a paper), not to software or computed results like the other R-words. I also think that this term matters to us, because it defines the ideal candidate paper for a replication to be published in ReScience.

khinsen commented 8 years ago

@rougier I am fine with your definition of "rerunable". The tricky part is the transition from there to "replicable".

The idea is that there are aspects of a computation that should be modifiable without affecting the results, according to expectations shared in the community. A computation is then called "replicable" if it satisfies those expectations. Typically the expectations include results independent of minor version changes in everything, and of the use of different compilers and operating systems.

The big problem is of course that these expectations are never written down explicitly, and it is unlikely that there is a complete consensus about them in any community. But without a clear list of criteria, it is impossible to verify if a computation is replicable. To make it worse, some people's expectations are about obtaining the exact same results at the bit level, whereas others consider it normal that "small" variations happen, though nobody ever seems to be able to define "small" in this context.
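To make the two notions of "same results" concrete, here is a small illustrative comparison (assuming both the original and the replication saved their results as NumPy arrays; the file names and tolerance are arbitrary):

```python
# Illustrative sketch only: two notions of "same results" for two saved result arrays.
import numpy as np

original = np.load("results_original.npy")
replication = np.load("results_replication.npy")

bitwise_identical = np.array_equal(original, replication)                    # exact, bit-level agreement
within_tolerance = np.allclose(original, replication, rtol=1e-10, atol=0.0)  # "small" variations allowed

print("bitwise identical:", bitwise_identical)
print("within tolerance :", within_tolerance)
```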

khinsen commented 8 years ago

Repeatable: = rerunable.

khinsen commented 8 years ago

Reproducible: the result of a computational study is reproducible if its human-readable description is reimplementable and if a reimplementation leads to results whose scientific interpretation is the same as for the original results.

Note that reproducibility can change over time, for two reasons:

  1. A new reimplementation can lead to different results than any previous one.
  2. The differences between the results of different implementations can become scientifically relevant if the state of the art of the field makes significant progress.

khinsen commented 8 years ago

Reusable: a piece of code or a dataset is reusable if its characteristics are sufficiently well described that it can safely be transferred to another context.

For a piece of code, there are two interesting relations to other R-words:

  1. Reusable code is well-documented code, meaning that its documentation is reimplementable.
  2. Reusable code is related to replicable computations through a clear statement of dependencies in the documentation.

oliviaguest commented 8 years ago

Where do changes that are not central to the theory, and so could be abstracted away in ideal circumstances, get relegated to? For example, I have come across cases where a non-theoretically important implementation detail which should not affect the model (e.g., quicksort vs. another sorting algorithm) ends up affecting the model because the authors were not careful. The type of sorting algorithm used is categorically not part of the theory and should not be, but was still integral to the replication of the results (because careful, consistent modelling was not carried out). It should ideally be part of the spec, but it was neither part of the spec nor abstracted away enough during investigations of the model, so the results ended up depending on a theoretically irrelevant point.

And - very relatedly - where do details that are important to the theory but have not been discovered as such belong in this r-hierarchy of words? It is a similar but importantly different case in which an implementation detail needs to be promoted to the theory-level because it is actually theoretically important, e.g., it is important for quicksort to be mentioned as the theory depends on it and not just because of sloppy modelling.

oliviaguest commented 8 years ago

PS: I mentioned "r-word" as a slightly flippant comment. I am now a little sorry it has caught on as it makes me feel conflicted.

khinsen commented 8 years ago

@oliviaguest I'd say that the cases you describe are outside of the R-word universe. They are well covered by traditional terms such as "mistake", "oversight", etc. Their symptom is usually non-reproducibility. In fact, I'd say that a major motivation to test for reproducibility is to catch situations such as those you describe.

khinsen commented 8 years ago

Some general comments about the R-word definitions:

  1. We should provide a short definition for clarity in our paper, and for uniformity of usage in the context of ReScience. But I'd leave it at that - discussing any of these concepts in depth can quickly turn into a dissertation on the philosophy of science.
  2. We should be careful to state what each word applies to: a piece of code, a complete computation, a result, a paper, ...

oliviaguest commented 8 years ago

I don't understand, if they are outside the words we are defining then I'm really confused. :confused:

khinsen commented 8 years ago

@oliviaguest Don't worry. It's a good shorthand for this discussion. I hope it won't end up in the text of our paper!

khinsen commented 8 years ago

@oliviaguest Perhaps "outside" isn't the best term. They are of course related, being specific cases of non-reproducibility. But I don't think we need a specific new term for each cause of non-reproducibility. We don't want to blow up the cost of future editions of the Oxford Dictionary.

oliviaguest commented 8 years ago

Aha! Now I see the confusion, @khinsen. I am asking which umbrella word they fit under, not to give them unique names! Which word that you are defining explains those cases? And if it is the same word - why? I am curious, as I do not know the answer and feel there are many similar sounding words to me. I use a very different way of talking about these issues, so I am trying to fit (re-describe) my experiences to match the general concepts I see you defining here.

khinsen commented 8 years ago

@oliviaguest The common category is "reproducibility" in my opinion.

Your second case is almost the textbook definition of a cause of non-reproducibility. Scientist A publishes a study. Scientist B tries to reproduce the scientific conclusion using a modified study, and fails. Comparison of the two studies then shows that something that everybody considered a technical detail actually is important and should be promoted to a part of the theory.

Your first case is very similar, except that the comparison of the two studies shows that study A was not designed carefully enough. The theory has survived another round.

So the common point is that a reproduction attempt fails, and the analysis of the failure improves everybody's understanding. Just the happy ending we need to keep our funders happy.

jsta commented 8 years ago

My feeling is that the term remixable is very dependent on the details of the license assigned to the work. However, it also has a practical aspect. It is very difficult to remix a model if the codebase is not made up of modular pieces (functions).

khinsen commented 8 years ago

@jsta My main question concerning remixable is: what is it about? Mixing suggests a large collection of things. What are those things? Functions in a library? If so, what is the mix resulting from mixing functions?

jsta commented 8 years ago

@khinsen I am not sure. A subset of the original? remixable may be a tough one!

oliviaguest commented 8 years ago

Is everything open source remixable?

khinsen commented 8 years ago

If you take "mixing" from a legal point of view, probably yes. Otherwise, we need to decide first what "remixable" really means!

gdetor commented 8 years ago

It's not always the case. Imagine that you have hybrid code (open source and proprietary); then you have to acquire the proprietary license as well. Otherwise you cannot mix the hybrid code with any other code. The use of the NAG library would be an example. So I think you have to verify that all of the parts being mixed are under an open-source license.

oliviaguest commented 8 years ago

Perhaps more important than or equally important to definitions: A metric?

How do you choose a reproducibility metric? by @IanHawke

khinsen commented 8 years ago

Very important indeed, but in my opinion this is a research topic for many years to come. At this time, we can do no more than mention the problem and refer to papers such as the one by Mesnard and Barba. Another reference along these lines is a recent paper in Science about reproducibility of DFT computations in materials science.

Note also that the problem concerns only computational models derived from continuous mathematics. That's of course a huge part of computational science, but not all of it. As a consequence, all the R-words can be defined independently of any such metric, pretending that all of science can be done using discrete maths. All science is based on simplifying assumptions, so this could be ours.

oliviaguest commented 8 years ago

What about the more general point of criteria?

khinsen commented 8 years ago

I'd say that at the level of generality we work at, these criteria follow from the definitions of each R-word, with one option being "domain-specific, we can't say any more here". We should probably say something about the criteria in each definition.

As an example, rerunable makes sense only if the criterion is bitwise identical results, not counting metadata such as time stamps. At the other extreme, the criteria for being reproducible are necessarily domain specific.
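For instance, one way to operationalise "bitwise identical, metadata aside" is to hash the output files after stripping lines that only record run-time metadata; the sketch below is an illustration under assumed file names and an assumed timestamp-line convention, not a prescribed procedure.

```python
# Illustrative sketch only: compare two text outputs bit for bit, skipping metadata lines.
import hashlib

def content_hash(path, ignore_prefix="# date:"):
    """Hash a text file, ignoring lines that only record when it was produced."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for line in f:
            if line.decode("utf-8", errors="replace").startswith(ignore_prefix):
                continue
            h.update(line)
    return h.hexdigest()

same = content_hash("output_run1.txt") == content_hash("output_run2.txt")
print("bitwise identical (metadata aside):", same)
```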

oliviaguest commented 8 years ago

I think some general meta principles might be required though... I might be talking cross-purposes with you, but I have a feeling that criteria or at least meta criteria (criteria for criteria) can be nailed down.

oliviaguest commented 8 years ago

This article gives definitions for replication and reproduction very clearly: http://biostatistics.oxfordjournals.org/content/10/3/405.full

oliviaguest commented 8 years ago

And here is another: Replicability is not Reproducibility: Nor is it Good Science

khinsen commented 8 years ago

@oliviaguest Thanks for those references! I remember the second one well, because I disagree with its conclusion, but its definitions of replicability vs reproducibility are indeed very clear. The first one seems to use the exact inverse definitions, and defines in detail only the one we call replicability. Unfortunately, in the criteria for replicability, there's again the "reasonable bounds for numerical tolerance", which is what Ian wrote about in his blog post.

oliviaguest commented 8 years ago

Slight tangent but I don't personally think any definition is Gospel from any paper nor do I like the idea of a prescriptive/normative definition-war. Principally, because I think modelling and non-modelling have differences that transcend these words and a modeller telling everybody what word to use just won't work anyway. The best we can do is define terms when we use them.

oliviaguest commented 8 years ago

I know that's not what's being attempted here, it's just (being Cypriot and reading above about the OED as if it's dictating as opposed to describing) I'm explicitly aware of language centralisation.

khinsen commented 8 years ago

I am not interested in prescribing anything either, assuming that we have the power to do so which I seriously doubt. I would like to see some more standardization of vocabulary, but that's beyond my influence. In the meantime, I just want to be clear about the definitions we use ourselves.

rougier commented 8 years ago

The ACM just issued an announcement, Result and Artifact Review and Badging, where they proposed some definitions.

labarba commented 8 years ago

The ACM adoption is unfortunate and ahistorical in the computing community.

My students and I have been working for a long time on a literature review to sort through the disarray of terminology on reproducibility. Here I will share some notes. (I have a draft of a blog post or essay, but it's abandoned for a few months now. These tidied-up notes will help.)

The phrase "reproducible research" in computational science is traced back to geophysicist Jon Claerbout at Stanford, who in the '90s started the tradition in his lab that all the figures and tables in their papers should be easily re-created, even by running just one command. The oldest published paper we found that addresses their method is:

Claerbout, Jon, and Martin Karrenbach. "Electronic documents give reproducible research a new meaning." Proc. 62nd Ann. Int. Meeting of the Soc. of Exploration Geophysics, pp. 601-604 (1992), doi: 10.1190/1.1822162 http://library.seg.org/doi/abs/10.1190/1.1822162

Some of the content of that paper is very outdated (it’s 1992 after all), but the way the “goals” of reproducible research are presented is interesting:

Claerbout relates some of the story of “reproducible research” coming out of Stanford in an essay on his website:

“Reproducible Computational Research: a history of hurdles, mostly overcome,” Jon Claerbout, http://sepwww.stanford.edu/sep/jon/reproducible.html

He mentions that with Matthias Schwab, they submitted an article to “Computers in Physics” about the reproducible-research concept, but it was rejected—the magazine was later bought by IEEE and turned into “Computing in Science and Engineering,” where it was eventually published years later as:

M. Schwab, M. Karrenbach, and J. Claerbout (2000), Making scientific computations reproducible, CiSE 2(6):61–67.

At Stanford, statistics professor David Donoho learned of Claerbout’s methods in the early 1990s, and began adopting (and later promoting) them. A well-cited early paper from his group is:

Buckheit, Jonathan B., and David L. Donoho. Wavelab and reproducible research, Volume 103, Lecture Notes in Statistics, pp 55-81. Springer New York, 1995. PDF as a Stanford Technical Report

This paper is often cited for the quote: “an article about computational result is advertising, not scholarship. The actual scholarship is the full software environment, code and data that produced the result” … but Donoho says that this statement was made paraphrasing Jon Claerbout, so it should not be solely attributed to Donoho when cited.

Buckheit and Donoho make the commitment: "When we publish articles containing figures which were generated by computer, we also publish the complete software environment which generates the figures.”

Citing the work of Claerbout, they say: “… reproducibility of experiments in seismic exploration requires having the complete software environment available in other laboratories and the full source code available for inspection, modification, and application under varied parameter settings.”

and later: “… publishing figures or results without the complete software environment could be compared to a mathematician publishing an announcement of a mathematical theorem without giving the proof.”

See also:

Donoho, D., Maleki, A., Rahman, I., Shahram, M., & Stodden, V. (2008). 15 years of reproducible research in computational harmonic analysis. Technical report. PDF at Stanford Reports

This last article has this interesting response to the imagined argument “True reproducibility means reproducibility from first principles”: "If you exactly reproduce my results from scratch, that is quite an achievement! But it proves nothing if your implementation fails to give my results since we won’t know why. The only way we’d ever get to the bottom of such discrepancy is if we both worked reproducibly.”

The influence of Claerbout and Donoho permeates through a large portion of the recent reproducibility movement in computational science. Victoria Stodden, de facto spokeswoman for reproducibility in the conference circuit, was Donoho’s PhD student.

I mentioned above a paper by Schwab, Karrenbach and Claerbout (2000), published in CiSE, a joint publication of the IEEE Computer Society and AIP (American Institute of Physics). The publication ran a Special Issue on Reproducible Research in 2009 that included several more-or-less well-cited papers:

Donoho, D. L., Maleki, A., Rahman, I. U., Shahram, M., & Stodden, V. (2009). Reproducible research in computational harmonic analysis. Computing in Science & Engineering, 11(1), 8-18.

LeVeque, R. J. (2009). Python tools for reproducible research on hyperbolic problems. Computing in Science & Engineering, 11(1), 19-27.

Peng, R. D., & Eckel, S. P. (2009). Distributed reproducible research using cached computations. Computing in Science & Engineering, 11(1), 28-34.

Stodden, V. (2009). The legal framework for reproducible scientific research: Licensing and copyright. Computing in Science & Engineering, 11(1), 35-40.

The use of the term “reproducible research” is consistent throughout them. “Reproducible computational research, in which all details of computations—code and data—are made conveniently available to others, is a necessary response to this crisis.” (Donoho et al.) "The idea of “reproducible research” in scientific computing is to archive and make publicly available all the codes used to create a paper’s figures or tables, preferably in such a manner that readers can download the codes and run them to reproduce the results.” (LeVeque) “Full replication of a study’s results with independent methods, data, equipment, and protocols has long been, and will continue to be, the standard by which scientific claims are evaluated. … an intermediate step … [a] minimum standard is reproducible research, which requires that datasets and computer code be made available to others for verifying published results and conducting alternate analyses.” (Peng and Eckel)

Enter Roger Peng. His paper makes a clear distinction between reproducible research and a full replication study. The distinction also appears in his earlier publication:

Peng, R. D., Dominici, F., & Zeger, S. L. (2006). Reproducible epidemiologic research. American journal of epidemiology, 163(9), 783-789, doi: 10.1093/aje/kwj093 http://aje.oxfordjournals.org/content/163/9/783.short

which says: “… because of the time, expense, and opportunism of many current epidemiologic studies, it is often impossible to fully replicate their findings. An attainable minimum standard is “reproducibility,” which calls for data sets and software to be made available for verifying published findings and conducting alternative analyses.”

And the distinction is accentuated in:

Peng, R. D. (2011). Reproducible research in computational science. Science, 334(6060), 1226-1227. DOI: 10.1126/science.1213847, http://science.sciencemag.org/content/334/6060/1226.full

(343 citations, checked today on Google Scholar) It says: "Replication is the ultimate standard by which scientific claims are judged. With replication, independent investigators address a scientific hypothesis and build up evidence for or against it. […] Researchers across a range of computational science disciplines have been calling for reproducibility, or reproducible research, as an attainable minimum standard for assessing the value of scientific claims, particularly when full independent replication of a study is not feasible … This standard falls short of full replication because the same data are analyzed again, rather than analyzing independently collected data.”

But why do we often see an emphasis on reproducing the figures, tables, etc. in a published computational study? I found a nice explanation in:

Kovacevic, J. (2007, April). How to encourage and publish reproducible research. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP'07 (Vol. 4, pp. IV-1273). IEEE. PDF on the author’s website

She writes: "Throughout history, scientific achievements have been roughly divided into two categories: theoretical and experimental. In either of these, the “reproducibility” was established in a specific way: In theoretical disciplines, such as mathematics for example, abstract results—theorems—are built starting from “given truths”—axioms, on which a logical pyramid is built—a proof. […] The issue of reproducibility is settled at that point; the proof allows anyone to reproduce the steps leading to the theorem. […] In experimental disciplines, such as biology for example, the reproducibility has another form. The biologist forms a hypothesis […] and then proceeds to prove or disprove the hypothesis by performing experiments. Thus, what mathematicians would call a proof would, in biology, be the methodology, the set of experiments as well as the resulting data and its interpretation, that would prove the hypothesis. While, when written, such works go through the same process of peer review, the result does not become a “theorem” until at least another independent group is able to perform the exact same experiments and confirm the results. Of course, to truthfully replicate the experiments, the paper has to provide enough specific detail about the experiments to allow another group to mimic it—the reproducibility criterion.” She goes on to explain that research in computational sciences has inherited from both theory and experiment, but did not clearly adopt a standard of reproducibility until the “reproducible research” movement, with Claerbout as one of the pioneers. (J. Kovacevic was a long-time Editor-in-Chief of IEEE Transactions on Image Processing at the time.)

OK, so why do we have the terms completely swapped in the ACM adoption?

At least within computational fields, the “swapped” terms are traced back to a (frankly, misguided and irate) paper by C. Drummond—already mentioned above in this Issue thread.

Drummond, C. (2009). Replicability is not reproducibility: nor is it good science. Proc. of the Evaluation Methods for Machine Learning Workshop at the 26th ICML, Montreal, Canada.

For background, bear in mind that the computing community publishes primarily in conferences, which are peer-reviewed. But within the conferences, there are workshops that have less selective review. This is a workshop paper, aimed at the machine-learning community.

Drummond admits that he is swapping one term for another one: “I use X for what others call Y.” He’s arbitrarily renaming “replication” … “Requiring a complete description or an on-line log would again suggest replication is the aim.” then declares exactly the opposite of what R. Peng says: “… replication is clearly at one end of the range." and exactly the opposite to what Donoho says: “… simply checking that they can reproduce the tables and graphs of the paper would seem to do little to validate the work.”

It seems that a lot of people have been influenced by the swapping of terms Drummond made in 2009—I will speculate that the ranting quality of his paper gave it some magnetism (like political news headlines these days).

See also:

Drummond, C. (2012). Reproducible Research: a Dissenting Opinion. Preprint: http://cogprints.org/8675/

and commentary by R. Peng in: http://simplystatistics.org/2012/11/15/reproducible-research-with-us-or-against-us-3/

But especially, read this:

Replicability vs. reproducibility — or is it the other way around?, October 31, 2015, by Mark Liberman: The language of science, http://languagelog.ldc.upenn.edu/nll/?p=21956

This is an essay by Mark Liberman, Christopher H. Browne Distinguished Professor of Linguistics at the University of Pennsylvania. He teaches introductory linguistics, as well as big data in linguistics, and computational analysis and modeling of biological signals and systems (among other topics). Regarding the confusion with the swapped terms, Liberman concludes: "As far as I can tell, it's a difference between people influenced by Drummond's provocative but deeply confused article, and everybody else in a dozen different fields.”

I found this blog post where the author corrected the swapped terminology after becoming aware of this! http://lgatto.github.io/rr-what-should-be-our-goals/


Additional references using a terminology that is consistent with Claerbout/Donoho/Peng are:

The “TOP Guidelines” (Transparency and Openness Promotion), Standards for Promoting Reproducible Research in the Social-Behavioral Sciences (2014), https://mfr.osf.io/render?url=https://osf.io/ud578/?action=download%26mode=render

Report of the National Science Foundation's Subcommittee on Replicability in Science: "Social, Behavioral, and Economic Sciences Perspectives on Robust and Reliable Science” (2015) PDF


I have more! But I will leave it there for now, as this Issue comment is already 2,000+ words.

benmarwick commented 8 years ago

On the question of definitions of 'reproducibility' and 'replicability', I think the idea of convergence on definitions noted above might be impossible because these terms have totally opposite definitions in different fields.

@labarba's comprehensive summary of the literature captures what I think is the common and widespread use, outside of the ACM, political science, and one or two other areas. Incidentally, it seems like the two terms are used synonymously in this paper in this sentence "However, good intention are not sufficient and a given computational results can be declared reproducible if and only if it has been actually replicated in a the sense of a brand new open-source and documented implementation."

The article What does research reproducibility mean? similarly summarises the prevailing definitions for most researchers in my field and related areas. They present reproducibility as

"the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results".

This is distinct from replicability:

"which refers to the ability of a researcher to duplicate the results of a prior study if the same procedures are followed but new data are collected."

They further define some new terms: methods reproducibility, results reproducibility, and inferential reproducibility.

But, as Lorena has noted (I'm looking forward to seeing the rest of her review!), the definitions in this Science paper, which are also consistent with a long history of discussions of scientific reproducibility, as noted in the linguistic analysis at the Language Log blog, are totally opposite to the ACM's, which takes its definitions from the International Vocabulary of Metrology. Here are the ACM definitions:

Reproducibility (Different team, different experimental setup) The measurement can be obtained with stated precision by a different team, a different measuring system, in a different location on multiple trials. For computational experiments, this means that an independent group can obtain the same result using artifacts which they develop completely independently.

Replicability (Different team, same experimental setup) The measurement can be obtained with stated precision by a different team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same or a different location on multiple trials. For computational experiments, this means that an independent group can obtain the same result using the author’s own artifacts.

The problem with these definitions is that the IVM is the wrong place to look for modern definitions of these terms. This is because it's exclusively concerned with measurement of physical properties. It does not engage at all with computational contexts. Computational contexts are a big part of the contemporary reproducibility discussion, thanks largely to the work of @victoriastodden.

We can also note with interest the recent Nature News articles Muddled meanings hamper efforts to fix reproducibility crisis and 1,500 scientists lift the lid on reproducibility. These report on the general problem of a lack of a common definition of reproducibility, despite a widespread recognition that it's a problem. Those are helpful to demonstrate that there is a range of definitions in common use.

The main point here is that any discussion of definitions of these terms needs to acknowledge this diversity as part of the challenge of promoting these values and behaviours in science broadly. If this diversity is neglected, and you are writing for audiences spanning many fields (I hope this for ReScience!), there is a risk of being irrelevant for researchers in fields that have different definitions to the ones you've adopted.

I understand that you do have to present some kind of definitions in this paper, and I guess that the ones you choose will depend on which research community you want to signify your affiliations with. There's no problem with that, so long as you note (perhaps with a brief comment and carefully chosen citations) that there is substantial diversity in how the terms are used across the sciences. It's great for science generally that more people are concerned with these issues, even if they don't agree on the definitions!

labarba commented 8 years ago

@benmarwick writes:

the definitions in this Science paper, which are also consistent with a long history of discussions of scientific reproducibility, as noted in the linguistic analysis at the Language Log blog, are totally opposite to the ACM's, which takes its definitions from the International Vocabulary of Metrology.

[...]

The problem with these definitions is that the IVM is the wrong place to look for modern definitions of these terms. This is because it's exclusively concerned with measurement of physical properties. It does not engage at all with computational contexts.

I wholeheartedly agree—going to IVM for inspiration on what definitions to adopt was misguided. (It is possible, too, that some folks in that committee were influenced by the Drummond papers. SIGH.)

The ACM is the Association for Computing Machinery. Although we may resign ourselves to the impossibility of a convergence of terminology across all disciplines, within computational disciplines there is a clear history of adoption. I have given more than a dozen references above, spanning 25 years, and there are many more.

(If anyone is adding a counter-example—like, "chemists use the opposite meaning"—please, do include a reference, rather than leaving it as hearsay.)

The Science paper @benmarwick cited:

Goodman, S. N., Fanelli, D., & Ioannidis, J. P. (2016). What does research reproducibility mean?. Science translational medicine, 8(341), 341ps12-341ps12. DOI: 10.1126/scitranslmed.aaf5027 http://stm.sciencemag.org/content/8/341/341ps12.full

... recognizes that: “… basic terms—reproducibility, replicability, reliability, robustness, and generalizability—are not standardized”, while clearly adopting the Claerbout/Donoho/Peng usage. They say:

“ … the modern use of ‘reproducible research’ was originally applied not to corroboration, but to transparency, with application in the computational sciences. Computer scientist [mistake: geophysicist] Jon Claerbout coined the term and associated it with a software platform and set of procedures that permit the reader of a paper to see the entire processing trail from the raw data and code to figures and tables.” [used in] “epidemiology, computational biology, economics and clinical trials…" [refs. provided]

Goodman et al. propose a new lexicon as a way out of the confusion: methods reproducibility, results reproducibility, and inferential reproducibility.

A good portion of this article derives from a talk given by Goodman at a workshop of the National Academy of Sciences, titled: "Statistical Challenges in Assessing and Fostering the Reproducibility of Scientific Results." Goodman gave there a useful clustering of disciplines into "groups with similar cultures."

When discussing the diversity of definitions that @benmarwick mentions, we can look at this clustering and see which group each usage falls in.

I already gave a dozen+ references for computational sciences.

In epidemiology and social science, the meaning is consistent with Claerbout/Donoho/Peng—cf. Peng, Dominici & Zeger (2006), on epidemiology, and the NSF 2015 [PDF] report for social sciences. In clinical research, there is "no clear consensus as to what constitutes a reproducible study" (Goodman et al., 2016) but the usage of the terms is consistent: one replicates the findings (while reproducibility refers to the process of investigations).

For the group of natural world-based sciences, I don't have in my notes references for astronomy or ecology (yet), but we heard from @benmarwick that the usage in archaeology is consistent with the above.

The pattern of usage is clear: reproducible study and replicable findings.

khinsen commented 8 years ago

While I agree that IVM is not the last word on terminology for science at large, I don't consider it absurd either to turn to it for "prior art" in choosing terms. Computational science has different issues than experimental science, but in the end, both are forms of doing science and their practitioners should be able to talk to each other. It makes more sense to me to extend the traditional terms from experimental science to computational scenarios where this is possible.

jsta commented 8 years ago

I wanted to follow-on from @labarba's most recent comment with a reference from natural-world sciences (ecology):

Cassey, P. and Blackburn, T.M., 2006. Reproducibility and repeatability in ecology. BioScience, 56(12), pp.958-959.

It seems that they follow the most common and widespread use of the terms detailed in @labarba's comprehensive summary except that they switch out the term Replicability for Repeatability.

labarba commented 7 years ago

I published this on Medium: "Barba group reproducibility syllabus"

It's not addressing terminology, but rather a summary of the top-10 references chosen in my group as the basic reading list on reproducibility. Topical to this thread as a complement of the lit review I started above.

rougier commented 7 years ago

I just added a link on: http://rescience.github.io/about/

heplesser commented 7 years ago

I only became aware of this discussion a few days ago. Even though the train left the station a while ago, I would like to add a few comments.

That terminology (reproducibility: re-run the same code; replicability: independent implementation) jars with my general (intuitive) understanding of the terms and was probably the reason why I based the terminology proposed in Crook, Davison and Plesser (2013) on Drummond (2009), while otherwise disagreeing with Drummond. Merriam-Webster differentiates "reproduction" and "replication" as follows (see Synonym Discussion section of the entry):

"reproduction implies an exact or close imitation of an existing thing. ... replica implies the exact reproduction of a particular item in all details ... but not always in the same scale."

A reproduction is an "exact or close imitation", while a replica is an "exact reproduction in all details"; thus a replica is a kind of reproduction that is particularly close to the original. Since re-doing the same study, running the same software on the same data, is closer to the original than an independent implementation, common usage, in my opinion, suggests that "replication" fits better for re-running using the same software as the original.

Furthermore, a quick Google search turned up 18.4 million hits for "reproducible research", but only 0.5 million for "replicable research". So "reproducible" seems to be the far more common term. Now I would think that the (scientific) public at large is first and foremost interested in whether we can trust scientific results, whether they are robust overall, reveal the laws of nature---whether they can be corroborated by independent experimentation. In view of this, is it really sensible to narrow "reproducing" to mean "running the same software on the same input data and obtaining the same results"?

In the pioneering work of Claerbout's group (Claerbout and Karrenbach, 1992, Claerbout, undated, Schwab et al, 2000), I haven't found any discussion of why they chose the term "reproducible" for their approach. I wish they had chosen differently, so that the rather young (25 years) reproducible research movement had not ended up with a terminology at odds with the significantly older metrology.

labarba commented 7 years ago

A reproduction is a "exact or close imitation", while a replica is an "exact reproduction in all details" …

While you place the emphasis on the phrase "in all details," I could place it on "exact" to make the same argument for "reproduction" instead of "replication." In the end, it is seldom practical to try to get help from the dictionary for discussions about terms of art.

Reproducibility is a spectrum of concerns. A most basic question is: can you run my code with my data and get my results? This is the minimum requirement, and often referred to as "reproducible research." I would wager that's why the search results for "reproducible research" are most numerous. Replications, in the sense of Peng and others, are (unfortunately) quite rare. But reproducible research is a pre-requisite for replication studies, because if replication fails—as Donoho points out—only if both author teams worked reproducibly is it possible to find the source of any discrepancies.


khinsen commented 7 years ago

Personally I don't care much about vague analogies to dictionary definitions that were clearly not written with research in mind. But I do agree with @heplesser's argument about the use of "reproducible" in a wider sense, applying to science in general rather than to the specific problems of computer-aided research.

I suspect that Claerbout's choice of the term "reproducible research" was meant to be provocative. All scientific research is supposed to be reproducible (in an ideal world), so what he was arguing for was "merely" to adopt this criterion for computer-aided research as well. Back then, 25 years ago, nobody discussed non-reproducibility in experimental contexts.

Today, "reproducibility" in a less well-defined sense has become a widespread concern. Most modern uses of the term can clearly not be interpreted in Claerbout's sense, because they don't refer to computation. In the long run, I doubt computational scientists will be able to claim the "reproducibility" label for their particular rather technical issue. But this question won't be settled before the scientific community at large, which is dominated by experimentalists, understands the various issues and agrees on a common terminology. I expect this to take at least another decade, during which ReScience can become anything from a mainstream journal to a relic of the past.

In the meantime, the definitions we currently use in ReScience are clear and have some historical justification, which is good enough for me.