UBC-MDS / opinionated-practices-for-teaching-reproducibility

https://arxiv.org/abs/2109.13656

Flesh out introduction #6

Closed ttimbers closed 3 years ago

ttimbers commented 3 years ago

We likely want to fully define data science and reproducibility in the introduction. For data science, I prefer this definition: "the process of extracting insight from data through reproducible and auditable methods". For reproducibility, I prefer: "reaching the same result given the same input, computational methods, and conditions". This is the definition from the National Academies of Sciences, Engineering, and Medicine (citation below).

National Academies of Sciences, Engineering, and Medicine. Reproducibility and Replicability in Science. The National Academies Press, Washington, DC, 2019. ISBN 978-0-309-48616-3. doi:10.17226/25303. URL https://www.nap.edu/catalog/25303/reproducibility-and-replicability-in-science.

ttimbers commented 3 years ago

We could also use this text from my teaching dossier:

The definition of data science that I adopt in my teaching and work is the process of extracting insight from data through reproducible and auditable methods. Using this definition requires that I also define what is meant by a reproducible and auditable analysis. To define reproducible analysis, I embrace the National Academies of Sciences, Engineering, and Medicine definition, which is reaching the same result given the same input, computational methods, and conditions (2019). For auditable, or transparent, analysis, I follow how it has been defined by Hilary Parker (2017) and Karthik Ram (2013), which is that there should be a readable record of the steps used to carry out the analysis (i.e., computer code) as well as a record of how the analysis methods evolved (i.e., a version-controlled project history). This history is important for recording how and why decisions to use one method or another were made, among other things.

The reason I embrace this definition of data science is that I believe data science work should bring insight (e.g., answer an important research question) and employ reproducible and auditable methods so that trustworthy results and data products can be created. Results and data products can be generated without reproducible and auditable methods; however, when they are built this way there can be little trust in how the results or products were created. This is because 1) they lack evidence that the results or product could be regenerated given the same input, computational methods, and conditions, 2) there is insufficient evidence of the steps taken during creation, and 3) there is an incomplete record of how and why analysis decisions were made.
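As a minimal sketch of what "the same result given the same input, computational methods, and conditions" can look like in practice, the Python snippet below reruns a small bootstrap analysis and compares the two runs. The function names (`run_analysis`, `fingerprint`), the data, and the seed are hypothetical and chosen only for illustration; here the random seed stands in for the "conditions".

```python
import hashlib
import json
import random
import statistics


def run_analysis(data, seed):
    """Bootstrap estimate of the mean; the seed is treated as part of the conditions."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    boot_means = [
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(200)
    ]
    return round(statistics.fmean(boot_means), 6)


def fingerprint(data, seed, result):
    """Hash the input, conditions, and result so that a rerun can be compared."""
    record = json.dumps(
        {"data": data, "seed": seed, "result": result}, sort_keys=True
    )
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


# Hypothetical input and conditions, for illustration only.
data = [2.1, 3.4, 1.9, 5.0, 4.2]
first = run_analysis(data, seed=2019)
second = run_analysis(data, seed=2019)

# Same input, same method, same conditions: the two runs should agree.
same_result = first == second
same_fingerprint = fingerprint(data, 2019, first) == fingerprint(data, 2019, second)
```

Keeping the generator local to the function and recording a fingerprint next to the result is one way, among many, to produce evidence that a result can be regenerated. It does not cover the auditability side, which depends on the code and its version-controlled history.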

In addition to contributing to the trustworthiness of data science work, employing reproducible and auditable methods and workflows brings additional benefits to data scientists. Data science is an inherently collaborative science, and the emphasis on reproducible and auditable methods in data science greatly facilitates the act of collaborating.

References

National Academies of Sciences, Engineering, and Medicine and others (2019). Reproducibility and replicability in science. National Academies Press.

Parker, H. (2017). Opinionated analysis development. PeerJ Preprints, 5:e3210v1.

Ram, K. (2013). Git can facilitate greater reproducibility and increased transparency in science. Source Code for Biology and Medicine, 8(1):1–8.

ttimbers commented 3 years ago

done