epiforecasts / covid-rt-estimates

National and subnational estimates of the time-varying reproduction number for Covid-19
https://epiforecasts.io/covid/
MIT License

Archive results history (post dataverse publication) #101

Closed: joeHickson closed this issue 3 years ago

joeHickson commented 4 years ago

I have looked at trying to siphon this off into dataverse but I can't do it in a nice way. You can specify the production date, but it is very slow to loop through each commit in turn, uploading the files and waiting for the dataset to publish. While that gives a full "production date" history across all the versions, the default listing is by publish date, which keeps the versions in the correct order but makes it hard to know where in the history you are without lucky dipping.
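To make the slowness concrete, the per-commit route looks roughly like this. This is a sketch only: it assumes the dataverse R client, and the DOI, file path, and fixed sleep are placeholders rather than the actual prototype code.

```r
library(dataverse)  # assumed client; placeholders throughout, not the real prototype

doi <- "doi:10.5072/FK2/EXAMPLE"  # placeholder dataset DOI
commits <- system("git rev-list master --reverse", intern = TRUE)

for (commit in commits) {
  # materialise this commit's results in the working tree
  system(paste0("git checkout ", commit, " -- subnational"))
  # upload the file and publish it as a new dataset version
  add_dataset_file("subnational/summary/summary_table.csv", dataset = doi)  # placeholder path
  publish_dataset(doi, minor = TRUE)
  # dataverse will not accept the next version until this one has published,
  # so every commit costs a full upload-and-publish round trip
  Sys.sleep(120)  # stand-in for polling the publish state
}
```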

I think the best solution here is to migrate national, region and subnational to a new repository. Something like https://gist.github.com/trongthanh/2779392 looks like it will keep the full history (sketched below); we can then strip those folders from the covid-rt-estimates repo.
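For reference, the gist's recipe boils down to rewriting history with git filter-branch in a throwaway clone. A sketch with placeholder URLs, wrapped in system() calls to match the prototype below; note that --subdirectory-filter also makes the chosen folder the new repository root:

```r
# throwaway clone so the rewrite cannot damage the main repository
system("git clone https://github.com/epiforecasts/covid-rt-estimates rt-history")
setwd("rt-history")

# keep only the commits that touch one folder (repeat per folder, or adapt to
# an --index-filter to keep national, region and subnational together)
system("git filter-branch --prune-empty --subdirectory-filter subnational -- --all")

# point the filtered history at the new archive repository and push
system("git remote set-url origin git@github.com:epiforecasts/NEW-ARCHIVE-REPO.git")  # placeholder
system("git push -u origin master")
```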

This does have the drawback of a break in the overall history at the switch-over (with some overlap on both sides), but hopefully that will only be a pain for a few months, until there is a decent history depth on dataverse.

joeHickson commented 4 years ago

This is the archiving part of #9

seabbs commented 4 years ago

Hmm, that is a shame, but it all sounds reasonable. I wonder if we could loop through the git commits once, compile the results into a single csv, and publish that to the dataverse as a single drop titled "archived results" or similar (adding an additional field based on the date of estimation)? The issue with just migrating the files somewhere else on git is that anyone wanting to use the estimates would need to do the git loop extraction themselves and may not be able to (me) or may not realise they can.
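A minimal sketch of that single-csv idea, reusing the commit loop from the prototype below. It assumes the summary_table.csv columns stay consistent across commits, the estimate_date field name is an invention, and de-duplication of unchanged commits (which the prototype handles via checksums) is left out:

```r
source("R/dataset-list.R")

commits <- system("git rev-list master --reverse", intern = TRUE)
archive <- list()
for (commit in commits) {
  commit_date <- as.Date(as.POSIXct(
    system(paste0("git show --no-patch --no-notes --pretty='%cd' ", commit),
           intern = TRUE),
    format = "%a %b %d %T %Y %z"
  ))
  for (dataset in datasets) {
    path <- file.path(dataset$summary_dir, "summary_table.csv")
    # read the file as it stood at this commit, without touching the work tree
    csv <- suppressWarnings(
      system(paste0("git show ", commit, ":", path), intern = TRUE)
    )
    if (length(csv) > 0) {
      tbl <- read.csv(text = paste(csv, collapse = "\n"))
      tbl$estimate_date <- commit_date  # the extra "date of estimation" field
      tbl$dataset <- dataset$name
      archive[[paste(dataset$name, commit)]] <- tbl
    }
  }
}
write.csv(do.call(rbind, archive), "archived_results.csv", row.names = FALSE)
```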

joeHickson commented 4 years ago

This is as far as I got when prototyping the move of the archive to dataverse. You should be able to use it to spin over the git commits and grab the results whenever they change; that could provide the basis for the archive. Note it currently only runs on a subset of datasets and a subset of commits ([1:2] / [600:650]).

source("R/dataset-list.R")
futile.logger::flog.threshold(futile.logger::DEBUG)
datasets <- datasets[1:2]

# empty list ready for previous checksum for each dataset
prev <- list()

# prime prev and get a complete list of summary dirs to git checkout
summary_dirs <- ""
for (dataset in datasets) {
  prev[[dataset$name]] <- ""
  summary_dirs <- paste(summary_dirs, dataset$summary_dir)
}

# get all the git commits for master, sorted by oldest to newest (--reverse)
commits <-
  system("git rev-list master --reverse", intern = TRUE)[600:650]
for (commit in commits) {
  try({
    # check out the summary dirs for every dataset at this commit;
    # checkout returns a non-zero status if none of the paths exist yet
    if (system(paste0("git checkout ", commit, " -- ", summary_dirs)) == 0) {
      # get the date for the current checkout commit
      commit_date <-
        as.POSIXct(system(
          paste0("git show --no-patch --no-notes --pretty='%cd' ", commit),
          intern = TRUE
        ), format = "%a %b %d %T %Y %z")
      for (dataset in datasets) {
        # get the checksum for the summary table
        this_ver <-
          system(
            paste0(
              "git ls-files -s ",
              dataset$summary_dir,
              "/summary_table.csv"
            ),
            intern = TRUE
          )
        futile.logger::flog.trace(this_ver)
        # ls-files returns character(0) if the file is absent at this commit,
        # so guard the comparison rather than erroring out of the try() block
        if (length(this_ver) == 1 && !identical(this_ver, prev[[dataset$name]])) {
          prev[[dataset$name]] <- this_ver
          futile.logger::flog.debug("publishing %s for date %s", dataset$name,
                                    format(commit_date, "%Y-%m-%d"))
          #publish_data(dataset, pub_date = as.Date(commit_date))

        } else {
          futile.logger::flog.debug("no change since last revision")
        }
      }
    } else {
      futile.logger::flog.debug("no checkout for changeset %s", commit)
    }
  })
}
```