Missing data files - Githubissues

sbfnk commented 3 years ago

Some data files appear to be missing from the repo, which means the Rmd doesn't render out of the box, e.g.

FPT/splines_TP+.Rdata
CP/fits/w25o/CP/fits/w25o/TC_APGHB117_simple_shiftw25o_sel3.Rdata

These can be re-created by selectively changing snippets to eval = TRUE, but this line fails nonetheless (should this file be created in the previous snippet?): https://github.com/VirologyCharite/SARS-CoV-2-VL-paper/blob/93070be084ff2b6cbc9059ff59054aeaae192328/ExtendedMethods.Rmd#L1546

Would it perhaps be possible to add the data files so this can run without manual intervention? Since no random seed is set this would also help with reproducibility.

Sorry to trouble if I'm missing something obvious and thanks for your help with this.
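For anyone hitting the same problem, a quick way to see which fit files are absent before rendering is sketched below. This is only an illustration: the two paths are the ones quoted in this issue, `find_missing` is a hypothetical helper name, and the check assumes it is run from the repository root.

```r
# Paths quoted in this issue, relative to the repository root.
expected_files <- c(
  "FPT/splines_TP+.Rdata",
  "CP/fits/w25o/CP/fits/w25o/TC_APGHB117_simple_shiftw25o_sel3.Rdata"
)

# Hypothetical helper: return the entries of `paths` that do not exist under `root`.
find_missing <- function(paths, root = ".") {
  paths[!file.exists(file.path(root, paths))]
}

missing_files <- find_missing(expected_files)
if (length(missing_files) > 0) {
  message("Missing fit files:\n", paste0("  ", missing_files, collapse = "\n"))
}
```

Any path reported here would have to be regenerated by the corresponding eval = FALSE snippet or separate script before the Rmd can render.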

gbiele commented 3 years ago

Sorry for the inconvenience.

As is written at the beginning of the document, some analyses (those in code blocks with eval = F) were done with separate scripts and not with the code in the R-Markdown doc. To still have everything in one place, I intended to copy the relevant parts of these scripts to the R-Markdown doc.

However, I forgot to copy the full script to the code block "estimate_time_model" in the R-Markdown document. So this line of code, which saves the results, is missing:

save(csf, draws, ss, sampler_diags,
     file = here(paste0("CP/fits/w25o/",model,"_sel3.Rdata")))

In addition, the code in this cell block will only estimate the parameters for people with at least 3 data points, and not for the smaller sub-samples with at least 4, at least 5, ... data points.

I'll also add this to the R-Markdown file.

Regarding data files: FPT/splines_TP+.Rdata and CP/fits/w25o/CP/fits/w25o/TC_APGHB117_simple_shiftw25o_sel3.Rdata are not data files. These files contain brmsfit and cmdstanr-fit objects, respectively. So adding those would not add much for reproducibility. It would show that one can reproduce the numbers/tables/figures from the model fit-files, but these would not seem to be the key reproducibility issue to me. Or am I misunderstanding something?

If you re-run the analyses, I would recommend not doing this from the R-Markdown, because it will take a day or more, depending on the hardware you have. (We estimated the models on a cluster.)

sbfnk commented 3 years ago

Thanks, this all makes sense. On the question of reproducibility, I agree that it’s not a key issue. I think there are two reasons it would be good to add all the .Rdata files that are generated here and subsequently used to produce the figures/tables/etc.:

1) People could inspect or further investigate the posteriors without having to run computationally intensive samplers themselves.

2) Reproducibility, in the sense that you’re running a random sampler without setting a seed, so anyone running this will get different results. While I would expect them to be similar enough that it doesn’t make any noticeable difference, if at any stage they weren’t, the researcher trying to reproduce the results would have to investigate whether there is an issue with the code or the way they ran it, or whether it is due to the randomness inherent in the analysis pipeline.

Definitely only a nice-to-have though and not fundamental to understanding what you’ve done, and I appreciate that e.g. file sizes might be an issue. None of this is taking away that it’s really great to see all this thorough analysis laid out in such detail.
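To illustrate the first point, here is a minimal sketch of how a shared .Rdata file could be inspected without re-running any sampler. The objects are small stand-ins created on the spot, not the real fits; only the names `draws` and `sampler_diags` are borrowed from the `save()` call quoted earlier in this thread, and the column names are made up.

```r
# Stand-in objects (assumed shapes, for illustration only).
draws <- data.frame(beta = c(0.1, 0.2, 0.3))
sampler_diags <- data.frame(divergent = c(0, 0, 0))

# Write them to a temporary file the same way the fit script saves its results.
fit_file <- tempfile(fileext = ".Rdata")
save(draws, sampler_diags, file = fit_file)

# Load into a separate environment so nothing in the workspace is overwritten.
fit_env <- new.env()
loaded_names <- load(fit_file, envir = fit_env)

print(loaded_names)                 # names of the objects stored in the file
print(summary(fit_env$draws$beta))  # summarise one (stand-in) parameter
```

Loading into a dedicated environment is a deliberate choice here: several fit files store objects under the same names, so loading them straight into the global environment would silently replace one fit with another.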

gbiele commented 3 years ago

Seeing this only now. I'll discuss with the others if we should upload everything. (That would be a bit more than 2.5G).

About reproducibility given seeds: I agree that the results should be close enough without seeds. I'd add two things. First, if there were noticeable differences in the results with different seeds, that would point to a problem with the estimation or the model. Second, the main reason I did not set seeds is that I remembered some discussions on the Stan forum saying that seeds alone are not sufficient to guarantee identical chains. A different OS, or even a different version of the same OS (not 100% sure about the latter), is sufficient to lead to slight differences in chains when the seed is constant.
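The distinction between identical draws and agreeing summaries can be sketched in base R. This is a toy stand-in, not Stan: `toy_sampler` is a hypothetical function that just draws from a normal distribution, and the factor of 5 standard errors is an arbitrary tolerance chosen for illustration.

```r
# Toy stand-in for a posterior sampler (NOT Stan): seeded normal draws.
toy_sampler <- function(seed, n = 4000) {
  set.seed(seed)
  rnorm(n, mean = 1, sd = 2)
}

same_a <- toy_sampler(seed = 1)
same_b <- toy_sampler(seed = 1)
other  <- toy_sampler(seed = 2)

# Same seed in the same R session: the draws are bit-for-bit identical.
print(identical(same_a, same_b))

# Different seeds: the draws differ, but the means should agree to within
# a few Monte Carlo standard errors if the estimate is stable.
mcse <- sd(same_a) / sqrt(length(same_a))
print(abs(mean(same_a) - mean(other)) < 5 * mcse)
```

For the real models, both brms and cmdstanr accept a seed argument, but as noted above a fixed seed only pins down the chains on a given platform, so a tolerance-based comparison of summaries is the more portable check.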

I also think it's useful to show all the details, especially because we are in a situation where pre-registration of the analysis was in my view not an option.

VirologyCharite / SARS-CoV-2-VL-paper

Missing data files #5