benmarwick / rrtools

rrtools: Tools for Writing Reproducible Research in R

[discussion / feature] More structured analysis #86

Closed wolass closed 2 years ago

wolass commented 5 years ago

Hi guys! I've been using rrtools for every project/paper in the last 2 years.

I love it and it gets better and better.

The problem I have right now is that all my preliminary statistical analysis is bloating up the files quite a bit.

I do not want to increase the size of the paper.Rmd file unnecessarily, but on the other hand, I don't know if my preliminary statistical analysis will make it into the final version of the paper.

So one potential approach would be to split up the code into multiple files that correspond to the different parts of the data analysis and paper-writing process.

Here is an example:

Let's say that we want to do logistic regression on our database, fit multiple models and choose only one, and produce a preliminary figure and a publication-ready figure.

We start as usual with our rrtools process up to the make_analysis part.

NEXT (and this is the enhancement I am talking about),

  1. We create a bunch of .R files: load_data.R, data_wrangling.R, statistical_analysis.R, figures.R
  2. Each file should already come with a structure containing an example function. JUST REMEMBER that these files need to contain FUNCTIONS, so it would go like this:

The contents of load_data.R:

```r
load_data_from_source <- function() { # notice: no arguments
  # user-defined input goes here
  readr::read_csv("source_data.csv")
}
```

The above example shows that we could put functions in R scripts and easily call them from the paper.Rmd file.

Then we do the next step:

Content of the paper.Rmd:

```{r}
df <- load_data_from_source()
```

In our database we had `r nrow(df)` observations. Please see fig \@ref(fig:baseline)

```{r baseline, fig.cap = "Baseline figure"}
produce_publication_ready_figure_baseline()
```
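
To make the example self-contained, here is a minimal sketch of what figures.R could hold; the function name matches the chunk above, but the body (the ggplot2 calls and the column names age and outcome) is purely illustrative:

```r
# figures.R -- illustrative only; the column names are placeholders
produce_publication_ready_figure_baseline <- function(data = load_data_from_source()) {
  ggplot2::ggplot(data, ggplot2::aes(x = age, y = outcome)) +
    ggplot2::geom_point() +
    ggplot2::theme_minimal()
}
```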

I hope that I am making it somewhat clear :P... Basically I am saying that we could help users by providing some standardized functions in the R folder so that they can fill them out easily and not clutter their paper.Rmd.

Also, that makes compilation a lot more efficient, because we would only need to run the functions that are actually mentioned in the paper.Rmd file.

Right now I am running ALL the statistical models that I tried - and I don't want to erase them from my files, but it's suboptimal to run them each time I have a typo in my paper.Rmd file.

I think this approach is better than caching...

What do you guys think? Would it be worth it to make some structure in the R folder for newbies?

I'd suggest these:

  1. load_data.R
  2. data_cleaning.R
  3. statistical_analysis.R
  4. figures.R
  5. tables.R

Another approach would be to have only one .R file and the above sections within it.

I don't know which is better, but using functions that go into the paper.Rmd is definitely a good approach that we should encourage.

Please let me know what you think (even if you think that I'm rambling here without any sense...)

benmarwick commented 5 years ago

Yes, this is a very interesting idea, and I have heard of people doing similar things. That is, putting most of their code in script files, then invoking that code in the Rmd in some way (e.g. using source or a knitr function that I can't recall exactly). There's also this famous post on Stack Overflow: https://stackoverflow.com/a/1434424/1036500
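
For the source() route, one minimal way to wire it up (just a sketch, not rrtools API; the here package is an extra assumption here, plain relative paths would work too) is a setup chunk in paper.Rmd that sources every function script in the R/ folder:

```r
# setup chunk in paper.Rmd: load all functions defined in the R/ scripts
invisible(lapply(
  list.files(here::here("R"), pattern = "\\.R$", full.names = TRUE),
  source
))
```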

The challenge is to imagine the minimal set (and names) of script files that makes the most sense for most people, while keeping complexity and the barriers to entry as low as possible for beginners.

wolass commented 5 years ago

Oh... I was not aware of this Stack Overflow question ;)

But this is basically exactly what I tried to explain here. So having just a simple structure of "load, clean, func, do" would be of great help.

I would maybe exchange the 'do' for two files: 'eda' - exploratory data analysis, where every possible analysis lives, and 'final' - where only selected parts of the initial analysis are kept and the final figures are formed.

Or maybe it would be better to treat the paper.Rmd as the "final" file.

So yeah... "Load clean func do" should do the trick... Thanks for this. Shall I suggest something for a pull request?


benmarwick commented 5 years ago

Yes, let's see if any of our co-authors have any relevant thoughts on this topic, e.g. @nevrome @pkq @SCSchmidt @annakrystalli @MartinHinz @softloud @dakni

One possibility might be to have a function that allows the user to give file names as arguments. Because I'm thinking it might be a challenge to come up with a set of file names that will make sense to everyone. The ones you propose could be the default set, and then we also give the user the option to override with other file names.

For example, something like rrtools::use_scripts(c('load.R', 'clean.R', 'func.R', 'do.R'))
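
A minimal sketch of what such a helper could look like, purely to make the idea concrete (use_scripts() is only a proposal in this thread, not an existing rrtools function; everything below is illustrative):

```r
use_scripts <- function(files = c("load.R", "clean.R", "func.R", "do.R"),
                        path = "R") {
  # create the R/ folder if it does not exist yet
  dir.create(path, showWarnings = FALSE, recursive = TRUE)
  for (f in files) {
    target <- file.path(path, f)
    # only create skeleton files that are not already there
    if (!file.exists(target)) {
      writeLines(
        paste0("# ", f, ": define functions here and call them from paper.Rmd"),
        target
      )
    }
  }
  invisible(file.path(path, files))
}
```

Called with no arguments it would create the default set; users could pass their own file names, e.g. use_scripts(c("load_data.R", "figures.R")).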

MartinHinz commented 5 years ago

Thank you for raising this question! I am also in the process of writing a paper with rrtools, and I have established a similar structure for it. To keep the more or less standardized structure of a paper, the analyses are provided as separate scripts in the R folder, where there is a main.R file with the following structure:

```r
run_analysis <- function() {
  set_up_environment()
  this_data <- read_input_data() %>%
    prepare_data() %>%
    do_simulate(n_run = 1000)
  render_tables()
  render_plots()
}
```

The functions are planned to be contained in the following files:

This is pretty much the same structure you suggested in your first post. So, yes, I think it's a good idea to do this the way @benmarwick suggested: a standard structure that comes from rrtools::use_scripts(), but that can be overridden by specifying filenames.

nevrome commented 5 years ago

There seems to be a general hierarchy of analysis complexity: for papers with simple data analysis, it might be fine to put everything into the paper.Rmd. For more -- but not too -- extensive analysis code, the solution you suggest here, @wolass, might be appropriate. For even more (and computationally intensive) code, it might be necessary or desirable to separate paper rendering from running the code (example).

Didn't we already discuss these different approaches somewhere, @benmarwick?

annakrystalli commented 5 years ago

Really interesting discussions here.

Agree with everything said already re: stepping the approach according to computational complexity and focusing on functions. Here are some additional thoughts:

1) I totally agree that the R/ folder is best reserved for scripts containing functions only, consistent with what it's reserved for in an R package, and recommend using usethis::use_r() for creating said scripts, e.g. usethis::use_r("data_prep"). This unlocks the ability to set up tests for said functions, using e.g. usethis::use_test("data_prep"). I personally don't think we should prescribe what these should be, but perhaps motivate best practice and some of the useful ways to break functionality up with motivating examples in the docs.
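
For concreteness, the two usethis calls mentioned above and the files they create:

```r
usethis::use_r("data_prep")     # creates R/data_prep.R for the functions
usethis::use_test("data_prep")  # creates tests/testthat/test-data_prep.R for their tests
```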

2) Given the example workflows discussed by @wolass & @MartinHinz, it seems to me a good approach would be to use a drake plan, which could be defined in a single script that lives in the analysis/paper folder. Here's an example plan from the package README:

```r
plan <- drake_plan(
  raw_data = readxl::read_excel(file_in("raw_data.xlsx")),
  data = raw_data %>%
    mutate(Species = forcats::fct_inorder(Species)),
  hist = create_plot(data),
  fit = lm(Sepal.Width ~ Petal.Width + Species, data),
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  )
)
```

Rendering the paper then becomes the last step in the plan. To rerun the analysis & re-render the paper you would use:

```r
make(plan)
```

which would only re-run the parts of the analysis whose inputs have changed. Drake keeps track of such dependencies through a plan graph:

[drake dependency graph of the example plan]

The package has quite good documentation. Have a little look and see what you think.

benmarwick commented 4 years ago

We can update the main README to point the user to these various possibilities for how to organise a workflow; that might be the sweet spot between flexibility and prescriptiveness.