fosil-project / fosil-project.github.io

Literate Programming #2

Open isabelleschang opened 1 year ago

isabelleschang commented 1 year ago
jvcasillas commented 1 year ago

Include template example(s)?

Different frameworks? R, Rmd, qmd, Python, Jupyter notebooks

kparrish92 commented 1 year ago

@jvcasillas @isabelleschang - I apparently did not document what open datasets we talked about. Do either of you remember by chance?

jvcasillas commented 1 year ago

Yes, we talked about these (I think):

kparrish92 commented 1 year ago

Overview

Literate programming refers to the integration of code and prose in a reproducible document. This practice is not yet mainstream in linguistics, although it holds several advantages over traditional reporting methods. Traditionally, statistical analyses, plots, tables, citations and captions are created manually and then inserted into a manuscript. One potential issue with this approach is an increased probability of reporting errors. For example, a recent study found that...(Roettger analysis of Labphon). A literate programming approach to manuscript creation would plausibly reduce the number of these errors and would make the correct information easier to trace. Additionally, updates to the data would be (almost) automatically integrated into a given manuscript if the necessary scripts are run again.

The present tutorial will provide an example of literate programming specifically for linguists by using an open dataset in linguistics and reporting a mock analysis. While the emphasis of this tutorial will be on creating a simple working example in Rmarkdown, it is important to note that literate programming can also be applied within R to APA-style manuscripts (see the papaja package) and slideshows (see xaringan), and in other tools entirely (Quarto qmd files, Python, Jupyter notebooks).

Working example

Here, we talk through an example of literate programming using open linguistics data. In particular, we are using the durationsGe dataset in the languageR package. For our example, we will report differences in the duration of the Dutch prefix "ge" by speaker sex.

First, we load our libraries. Both tidyverse and languageR are available on CRAN. In general, all inline reporting in Rmarkdown occurs between backticks, as in `r expression`.

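library(languageR)   # provides the durationsGe dataset
library(tidyverse)   # data wrangling and plotting
library(brms)        # Bayesian regression models; not used in the examples below
klippy::klippy()     # adds copy-to-clipboard buttons to code chunks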

Reporting descriptive or summary statistics from a dataframe

In Rmarkdown, placing R code between backticks integrates it into your document. For instance, if we want to report the overall mean for DurationOfPrefix, we can simply put R code such as r mean(durationsGe$DurationOfPrefix) between two backticks, and it will return:

`r mean(durationsGe$DurationOfPrefix)`

There are several decimal places here, though! We probably don't want that, so if we haven't rounded the data previously, we can do so inline by using the round function:

`r round(mean(durationsGe$DurationOfPrefix), digits = 2)`
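One caveat: round() returns a number, so trailing zeros are dropped when the value is printed (0.10 shows up as 0.1). A minimal sketch of one way around this follows; `fmt_num` is a hypothetical helper name, not something from the tutorial or from any package.

```r
# Hypothetical helper (not from the tutorial): return a number as text with a
# fixed number of decimals, so trailing zeros survive printing.
fmt_num <- function(x, digits = 2) {
  formatC(x, format = "f", digits = digits)
}

round(0.1002, digits = 2)  # numeric result, prints as 0.1
fmt_num(0.1002)            # character result, "0.10"
```

Inline, the helper would be used in the same way as round, for example `r fmt_num(mean(durationsGe$DurationOfPrefix))`.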

We likely also want to report how many participants are in our dataset. We can do so by extracting this information from a dataframe using unique and length.

`r length(unique(durationsGe$Speaker))`
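When the same value appears in several places, it may be easier to compute it once in a code chunk and refer to it by name inline. The sketch below uses R's built-in `sleep` dataset as a stand-in so that it runs without extra packages; the object names `n_participants` and `group_means` are made up for illustration.

```r
# Stand-in data: R's built-in `sleep` dataset (one ID per participant).
# With the tutorial's data, `sleep$ID` would be `durationsGe$Speaker`.
n_participants <- length(unique(sleep$ID))

# Group means, stored once so the prose can refer to them by name.
group_means <- tapply(sleep$extra, sleep$group, mean)

n_participants      # 10
group_means[["1"]]  # 0.75
```

In the text, these objects can then be reported inline with `r n_participants` or `r group_means[["1"]]`.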

Reporting results of statistical models

We can also report the output of statistical models and tests. Typically, the results of these tests can be stored in an object in R and extracted. I will provide an example with a t-test in R. First, we will run t.test to see whether duration varies as a function of speaker sex:

t_test_object = t.test(DurationOfPrefix ~ Sex, data = durationsGe)
print(t_test_object)

For a t-test, APA guidelines call for reporting the degrees of freedom, the t-value, and the p-value. All of these are stored in the object we just created, so we can automate the reporting process. Note: the degrees of freedom in this dataset are inflated due to the nested structure of the data, so this t-test serves as an example only.

Degrees of Freedom

`r round(t_test_object$parameter, digits = 2)`

The t-value

`r round(t_test_object$statistic, digits = 2)`

The p-value

`r round(t_test_object$p.value, digits = 2)`

All together

t(`r round(t_test_object$parameter, digits = 2)`) = `r round(t_test_object$statistic, digits = 2)`, p = `r round(t_test_object$p.value, digits = 2)`
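The three pieces can also be wrapped in a small function so that the whole string comes from a single inline call. This is only a sketch: `report_t` is a hypothetical helper, and the built-in `sleep` dataset stands in for durationsGe so that the example runs without languageR. It also handles one edge case of plain rounding, namely that a very small p-value rounds to 0.

```r
# Hypothetical helper (not from the tutorial): turn an "htest" object, as
# returned by t.test(), into a single reporting string.
report_t <- function(test, digits = 2) {
  df <- formatC(test$parameter, format = "f", digits = digits)
  t_value <- formatC(test$statistic, format = "f", digits = digits)
  # A rounded p-value can print as 0, so report very small values as "< .001".
  p_text <- if (test$p.value < 0.001) {
    "p < .001"
  } else {
    paste0("p = ", formatC(test$p.value, format = "f", digits = 3))
  }
  paste0("t(", df, ") = ", t_value, ", ", p_text)
}

# Stand-in example using R's built-in `sleep` data
example_test <- t.test(extra ~ group, data = sleep)
report_t(example_test)
```

With the object created earlier, the inline call would be `r report_t(t_test_object)`.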

Citations

@article{kruschke2021bayesian,
  title={Bayesian analysis reporting guidelines},
  author={Kruschke, John K},
  journal={Nature Human Behaviour},
  volume={5},
  number={10},
  pages={1282--1291},
  year={2021},
  publisher={Nature Publishing Group UK London}
}

kparrish92 commented 1 year ago

We have a brief draft done! We have been thinking also about adding more examples and showing how to cite plots and tables.

For more examples, I was thinking about showing how to report nested model comparisons in R using literate programming and/or reporting Bayesian posteriors.

One more thing: I could not yet figure out how to render text between backticks without it running the R code.
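One possible approach, offered as a sketch rather than a tested fix for the site: have the inline expression return the code as literal text, so that knitr prints it instead of evaluating it. The helper name `show_inline` is hypothetical; knitr also ships `inline_expr()` for the same purpose.

```r
# Hypothetical helper: wrap code in backticks with the leading "r", returning
# plain text that an inline expression can print instead of evaluating.
show_inline <- function(code) {
  paste0("`r ", code, "`")
}

show_inline("mean(durationsGe$DurationOfPrefix)")
```

Calling the helper from an inline expression (that is, putting `show_inline("mean(durationsGe$DurationOfPrefix)")` between backticks with a leading r) should then display the code itself in the rendered document.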

jvcasillas commented 1 year ago

@kparrish92 @isabelleschang I pushed the draft to main. We have to make some changes to show the inline code. It's a bit tricky, but I think it looks ok now (I only fixed like half of it). You can edit the comment above with updates, or if you feel comfortable you can submit PRs.

https://fosil-project.github.io/posts/literate-programming/