Motivation-and-Behaviour / sleepIPD_analysis

Analysis for the sleep and physical activity pooled study (https://osf.io/gzj9w/)

Forking pipeline, include log pa_volume #85

Closed · conig closed 1 year ago

conig commented 1 year ago

This PR ended up being more complex than I had hoped. We decided to log-transform pa_volume in an attempt to improve the models' residual distributions (issue #74). Under the protocol, this means we need to retain the variable in both its transformed and untransformed states. The most straightforward way to achieve this was to create two datasets, one with the transformed variable and one without, and then run each of those objects through the entire pipeline.
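
Roughly, the split looks like this (`dat` and `pa_volume` stand in for the pipeline's actual objects):

```r
# Hedged sketch: 'dat' and 'pa_volume' are stand-ins for the pipeline's
# actual dataset and column. Assumes pa_volume is strictly positive;
# zeros would need log1p() or an offset instead.
dat_raw <- dat
dat_log <- dat
dat_log$pa_volume <- log(dat_log$pa_volume)
# Both objects are then pushed through the same downstream targets.
```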

To accommodate this change, I had to carry out a fairly significant restructuring. On the bright side, the pipeline is now much more concise, and we can easily transform other variables without further complications.

The kurtosis values are much less extreme, although they are still all negative, which is worth looking into.
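
For reference, the check is along these lines (`fit` is a placeholder for one of the fitted models, and the moments package is an assumption, not necessarily what the pipeline uses):

```r
# Hedged sketch of the residual check; 'fit' is a placeholder for a fitted
# model from the pipeline, and the 'moments' package is an assumption.
library(moments)
kurtosis(residuals(fit)) - 3  # excess kurtosis; negative => platykurtic residuals
```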


tarensanders commented 1 year ago

> Under the protocol, this means we need to retain the variable in both its transformed and untransformed states. The most straightforward way to achieve this was to create two datasets, one with the transformed variable and one without, and then run each of those objects through the entire pipeline.

I'm struggling a bit to see the rationale for this, but maybe I'm just overlooking something? In my head, the simplest way to do this is just to append log-transformed versions of the variables to the same dataset. The model builder then returns two versions of each model: a log-transformed and a non-transformed one. But I'm happy to be convinced that there's something I'm missing.
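
Roughly what I have in mind (names are placeholders):

```r
# Hedged sketch: append a log-transformed copy instead of duplicating the
# whole dataset; 'dat' and 'pa_volume' are assumed names.
dat$pa_volume_log <- log(dat$pa_volume)

# The model builder would then emit two formulas per affected model, e.g.:
#   outcome ~ pa_volume + sleep       (untransformed)
#   outcome ~ pa_volume_log + sleep   (log-transformed)
```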

I'll have a look through the changes, but I'm a bit worried about a couple of things:

  1. This is rapidly becoming more and more complicated. We're accruing technical debt that I think will make doing revisions a year from now very hard. I think I would struggle to make changes to this without breaking things. It's starting to veer from complex to complicated, which seems bad.
  2. I don't love the volume of outputs, especially multiple versions of the manuscript. That suggests we don't know what should go in the manuscript, and deciding that would be a better place to start. We should also probably stop saving the originals of every image and instead just put them in the multiverse doc (i.e., load the ggplot object in the Rmd rather than saving a png). Every PR has 100+ files, and realistically we don't need all these images for anything.
  3. The pipeline has become so large it's basically unreadable. Rather than a custom target factory, it seems like it would be better to use the built-in features of targets, like dynamic branching. We're not solving a novel problem here: iterating over lots of different types of models is the use case dynamic branching was created for. In pseudocode it's basically just `tar_target(models, make_models(model_params), pattern = map(model_params))`, and then you only have a single node on the graph (see the sketch after this list). (The reason I didn't make this change before is that the names of the targets were being used in the multiverse doc, and dynamic branches are randomly named, but we could work around that.)
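
For concreteness, a dynamic-branching sketch (where `make_model_params()` and `fit_model()` are placeholder names, not functions from this repo):

```r
# _targets.R sketch using dynamic branching; make_model_params() and
# fit_model() are placeholders, not functions from this repo.
library(targets)

list(
  # One target holding every model specification
  # (one element per outcome/moderator/transformation combination).
  tar_target(model_params, make_model_params(), iteration = "list"),
  # A single node on the graph that fans out into one branch per spec.
  tar_target(
    models,
    fit_model(model_params),
    pattern = map(model_params),
    iteration = "list"
  )
)
```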

These changes are all super impressive, I'm just really worried about how complicated this all is.

conig commented 1 year ago

I agree that we don't want it to be any more complicated than it has to be. I started by trying to append the transformed variable to the dataset and then adding a model to the model_list. But there are eight models in which pa_volume features, meaning we would need to greatly increase the length of model_list. You then have to hardcode what goes where throughout the pipeline to make two variants of all the figures and tables. Then you need extra code to pull both into the supplementary materials. On top of that, if reviewers ask us to transform another variable, you're back to square one and the code grows in complexity again.
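
i.e., every affected entry needs a twin, something like:

```r
# Illustration only; the real model_list structure in this repo differs.
model_list <- list(
  m1     = outcome ~ pa_volume + sleep,
  m1_log = outcome ~ pa_volume_log + sleep
  # ...and the same doubling for the other seven pa_volume models,
  # plus duplicated figure/table code downstream.
)
```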


For those reasons I don't think we want to try to integrate the two variants with the previous system. However, I agree that my hack job of producing targets is not needed if there are more elegant ways to do it. I also agree there's no need for two versions of the manuscript/supps; there are better ways to handle that. In an ideal world we would specify which models end up in the manuscript and the rest would automatically appear in the supps (rough sketch below). Let's discuss how we want to go forward.
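
Something like tagging each spec and filtering, say (column names are made up):

```r
# Hedged sketch: flag which specs belong in the manuscript; the rest flow
# to the supplementary materials. All names here are illustrative.
model_specs <- data.frame(
  name          = c("primary_log", "primary_raw", "age_moderated"),
  in_manuscript = c(TRUE,          FALSE,         TRUE)
)
manuscript_specs <- model_specs[model_specs$in_manuscript, ]
supplement_specs <- model_specs[!model_specs$in_manuscript, ]
```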

tarensanders commented 1 year ago

Been thinking about this a bit more. I can't think of a reason why all of the models need to be run with both pa_volume and log(pa_volume). We should decide on what we think the correct approach is (log(pa_volume), based on the kurtosis values) and then just include a single model list which tests what would have happened to the primary analysis if we hadn't log-transformed.

That's what we did with the fixed-effect model - it only gets a single model list (moderated by age). We didn't test it on every moderator, so I don't think we need to change tactics for the transformation of one variable either.

Is there any reason to test every model with both versions of the variable?

conig commented 1 year ago

> Been thinking about this a bit more. I can't think of a reason why all of the models need to be run with both pa_volume and log(pa_volume). We should decide on what we think the correct approach is (log(pa_volume), based on the kurtosis values) and then just include a single model list which tests what would have happened to the primary analysis if we hadn't log-transformed.
>
> That's what we did with the fixed-effect model - it only gets a single model list (moderated by age). We didn't test it on every moderator, so I don't think we need to change tactics for the transformation of one variable either.
>
> Is there any reason to test every model with both versions of the variable?

I agree that it is perhaps overkill to run every variant of every model that we have put into the supplementary materials. I was trying to adhere strictly to this from the protocol:

> To increase transparency and check the robustness of our estimates, we will use a multiverse analytical approach. This involves using different reasonable data processing choices (e.g., different variable transformation approaches) to construct multiple datasets. Then, we will analyse each dataset in this data multiverse separately, leading to the multiverse of statistical results (Steegen et al., 2016).

But I can see that if we can justify an approach, then there is no need to include models which violate assumptions etc. in the supplementary materials. If you're around on Tuesday we can briefly discuss, then implement the changes we want.