This work is published here: https://www.nature.com/articles/s41597-022-01143-6
get-dois
Code from get-dois enables communication with the Harvard Dataverse repository and collects DOIs of datasets that contain R code.
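Collecting the DOIs could be sketched against the public Dataverse Search API. This is an illustrative sketch, not the repository's actual code: the query string, the `type/x-r-syntax` MIME type, and the `dataset_persistent_id` field are assumptions about how an R-file search on a Dataverse instance would look.

```python
from urllib.parse import urlencode

BASE = "https://dataverse.harvard.edu/api/search"  # public Dataverse Search API

def build_query_url(start=0, per_page=100):
    # Search for files whose MIME type marks them as R source
    # (assumption: default Dataverse content-type indexing).
    params = {
        "q": "fileContentType:type/x-r-syntax",
        "type": "file",
        "start": start,
        "per_page": per_page,
    }
    return f"{BASE}?{urlencode(params)}"

def extract_dois(response_json):
    # Each matching file record is assumed to carry the persistent
    # identifier (DOI) of its parent dataset; deduplicate across files.
    items = response_json["data"]["items"]
    return sorted({item["dataset_persistent_id"] for item in items})
```

Paging with `start`/`per_page` and deduplicating at the dataset level yields one DOI per replication package, however many R files it contains.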
aws-cli
The list of DOIs is used to define jobs for AWS Batch. Code from aws-cli sends these jobs to the batch queue, where they wait until resources become available for their execution.
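Submitting one job per DOI could look like the following sketch, which builds an `aws batch submit-job` command line. The queue and job-definition names are illustrative placeholders, and passing the DOI through a container environment variable is an assumption about the job interface, not the repository's documented setup.

```python
import json
import re

def submit_job_command(doi, queue="r-jobs", jobdef="r-reexec"):
    """Build an `aws batch submit-job` invocation for one dataset DOI.

    Sketch only: queue/job-definition names are hypothetical, and the DOI
    is handed to the container via an environment variable override.
    """
    # AWS Batch job names allow letters, digits, hyphens, and underscores,
    # so sanitize the DOI before using it in the name.
    name = "reexec-" + re.sub(r"[^A-Za-z0-9_-]", "-", doi)
    overrides = {"environment": [{"name": "DOI", "value": doi}]}
    return [
        "aws", "batch", "submit-job",
        "--job-name", name,
        "--job-queue", queue,
        "--job-definition", jobdef,
        "--container-overrides", json.dumps(overrides),
    ]
```

Looping this over the collected DOI list fills the queue; AWS Batch then drains it as compute resources become available.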
docker
When a job leaves the queue, it instantiates a pre-installed Docker image containing code to retrieve a replication package, execute its R code, and collect data. Code from docker prepares the image.
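A minimal sketch of such an image might look like the Dockerfile below. The base image tag, system packages, and entrypoint script name are all illustrative assumptions, not the repository's actual build file.

```dockerfile
# Sketch of a re-execution image (names and versions are illustrative)
FROM rocker/r-ver:4.0.0

# System libraries commonly needed to compile R packages from source
RUN apt-get update && apt-get install -y --no-install-recommends \
        libcurl4-openssl-dev libssl-dev libxml2-dev \
    && rm -rf /var/lib/apt/lists/*

# Hypothetical entrypoint: downloads the replication package for the DOI
# passed in the environment, runs its R scripts, and saves the results
COPY run_analysis.sh /usr/local/bin/run_analysis.sh
ENTRYPOINT ["/usr/local/bin/run_analysis.sh"]
```

Baking the R interpreter and system libraries into the image means every queued job runs in the same controlled environment.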
analysis
All collected data is retrieved and analyzed in analysis.
install.packages() command. Is that right?

Yes, that's correct. More precisely, in the code cleaning step we add if (!require(lib)) install.packages(lib) for all detected libraries in the code. I also tested the code cleaning step by adding just install.packages(), or install.packages() together with library(), but require() performed best.
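A minimal sketch of such a cleaning step, assuming detection via a simple regex over library()/require() calls (the actual pipeline may detect libraries differently):

```python
import re

# Matches library(pkg), require(pkg), library("pkg"), require('pkg')
LIB_CALL = re.compile(r'(?:library|require)\s*\(\s*["\']?([A-Za-z0-9.]+)["\']?\s*\)')

def clean_code(r_source):
    """Prepend an install-if-missing prelude for every detected library.

    Sketch of the described cleaning step: for each library used in the
    script, emit `if (!require("lib")) install.packages("lib")` so a
    missing package is installed before the script runs.
    """
    libs = sorted(set(LIB_CALL.findall(r_source)))
    prelude = "".join(
        'if (!require("%s")) install.packages("%s")\n' % (lib, lib)
        for lib in libs
    )
    return prelude + r_source
```

Using require() in the guard (rather than an unconditional install.packages()) skips the installation when the package is already present, which is why it performed best in the comparison above.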
Yes! You can see how all the errors were classified here under the heading "Error type".
This is a good question and a limitation of our approach. I previewed a lot of the research code while creating the code cleaning step and didn't see Bioconductor or GitHub packages, so my intuition is that they are a small subset, but I cannot be sure.
So allocating a fixed time period for the re-execution on the cloud created the following problem in data collection: out of 10 scripts, the initial re-execution might produce a result for 9, but after code cleaning we would have a result for only 6 (as "fixed" code may take more time to re-execute). We therefore needed to match the 6 scripts re-executed in the second run to their results in the first run to see how the result had changed (Fig. 8 in the paper). That was done in this notebook. In the section "Constructing Sankey", you can see how the error changed before and after code cleaning for each file (i.e., those are result_x and result_y after the merge).
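The matching step can be sketched as an inner join on the file identifier. The notebook itself uses a pandas merge (which is what produces result_x/result_y column names); plain dicts keep this sketch dependency-free, and the column/key names are illustrative.

```python
def match_results(first_run, second_run):
    """Pair each file's re-execution result before and after code cleaning.

    Inner-join semantics: files that produced a result in the first run
    but not in the second (e.g. because the "fixed" code timed out) are
    dropped, mirroring the 9-before / 6-after situation described above.
    """
    return {
        path: (first_run[path], second_run[path])
        for path in first_run
        if path in second_run
    }
```

Each retained file then contributes one before/after pair, which is exactly what a Sankey diagram of error-type transitions needs.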