To understand how 2DII could better manage and use data, I need to know how the work of database admins translates into better workflows downstream of the database. Today I would like to learn more about @AlexAxthelm's work, particularly how analysts and software developers may access different versions of the database.
This is what I learned:
VERSIONING
Unlike code, databases lack a free version-control system analogous to Git.
At any time, the database hosts a few versions of the data (perhaps from two or three quarters), plus a number of flat "master" files that result from dumping specific snapshots of the data.
Taylor makes the master files available to analysts by copying them into Dropbox.
It is difficult or impossible to use Git fluently with large files, say, 50 MB or more.
Alex recommends a workflow similar to the existing one:
The database dumps a snapshot of the data and makes it available somewhere -- best in Azure storage, but Dropbox might be okay.
Each new version gets a unique name so that it doesn't overwrite previous versions.
The entire history is always available, for anyone to serve themselves.
TESTING
Alex encourages analysts to unit-test their code. Most unit tests can run with toy data that exposes a specific problem the real data might present. Toy data may have just one row. Thus most tests should be able to run in public repos and take little time (a few seconds) to run.
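As a sketch of the idea: the function under test here is hypothetical (a stand-in for whatever an analyst's code does), but it shows how one row of toy data is enough to pin down a specific behavior, with no private data involved.

```python
# Hypothetical function under test: drop rows (dicts) with any missing value.
def drop_incomplete(rows):
    """Return only the rows whose values are all non-None."""
    return [row for row in rows if all(v is not None for v in row.values())]

# A unit test on one-row toy data that exposes one specific problem:
# a row with a missing sector should be dropped.
def test_drop_incomplete_removes_row_with_missing_value():
    toy = [{"company": "Acme", "sector": None}]
    assert drop_incomplete(toy) == []

test_drop_incomplete_removes_row_with_missing_value()
```

Because the data is fabricated and tiny, a test like this can live in a public repo and run in well under a second.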
How about integration tests with entire datasets? Some problems may surface only with whole datasets. You can then write a few tests against the entire private data. Write these tests so they can access the private data either from your local computer or from a server, e.g. Azure.
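One way to write such tests is to skip them automatically wherever the private data is unavailable, so the same suite runs everywhere. This is a sketch; the environment variable name and the duplicate-ID check are assumptions for illustration.

```python
import os
import unittest

# Path to the full private dataset, assumed to be supplied via an
# environment variable on machines (local or server) that hold the data.
PRIVATE_DATA = os.environ.get("PRIVATE_DATA_PATH", "")

class WholeDatasetTest(unittest.TestCase):
    # Skipped cleanly on machines without the private data,
    # e.g. on a public CI runner.
    @unittest.skipUnless(os.path.exists(PRIVATE_DATA),
                         "private data not available")
    def test_no_duplicate_ids_in_full_dataset(self):
        with open(PRIVATE_DATA) as f:
            ids = [line.split(",")[0] for line in f.readlines()[1:]]
        self.assertEqual(len(ids), len(set(ids)))

# Run with: python -m unittest <this_file>
```

On a developer's laptop or an Azure server with `PRIVATE_DATA_PATH` set, the test exercises the whole dataset; elsewhere it reports as skipped instead of failing.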
SECURITY
People might accidentally share secrets publicly. We need to manage that risk. One way is via Git hooks.
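For example, a pre-commit hook can scan the staged diff for secret-looking strings before a commit is created. This is a minimal sketch: the patterns are illustrative, not exhaustive, and a real setup would more likely use an existing tool built for this.

```python
#!/usr/bin/env python3
# Sketch of a pre-commit hook (saved as .git/hooks/pre-commit, made
# executable) that refuses commits whose staged diff looks like it
# contains secrets.
import re
import subprocess
import sys

SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                      # AWS access key id
    re.compile(r"(?i)(password|api[_-]?key)\s*[:=]\s*\S+"),
]

def find_secrets(text):
    """Return all secret-like matches found in text."""
    return [m.group(0) for p in SECRET_PATTERNS for m in p.finditer(text)]

def staged_diff():
    """Return the diff of what is about to be committed."""
    return subprocess.run(
        ["git", "diff", "--cached"], capture_output=True, text=True
    ).stdout

if __name__ == "__main__":
    try:
        hits = find_secrets(staged_diff())
    except FileNotFoundError:
        hits = []  # git not on PATH; nothing to scan
    if hits:
        print("Possible secrets found, refusing to commit:", hits)
        sys.exit(1)
```

Git aborts the commit when the hook exits non-zero, so the secret never reaches the repository history.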
RUNNING CODE ON SCHEDULE WITH WHOLE DATA
Similar to what's described in TESTING, you may also want your code to access private data, perhaps on a schedule, e.g. to update your analyses with the latest data. You should be able to access private data locally. To work on a remote computer, you have some options:
One simple way is to move the private data to the remote server where you run your code. We can create this environment on Azure and give you access to RStudio from your web browser.
Another way is to run the code separately from the data. You may run your code anywhere -- e.g. on an Azure server running RStudio from the web browser -- and access the data from somewhere else. Common data-storage platforms provide an API. For example, you may use R packages to access Dropbox online, Azure, and many other sources.
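The "code anywhere, data elsewhere" idea can be sketched as follows. The URL is a placeholder, not a real endpoint; in practice it could be an Azure Blob SAS URL or a Dropbox link obtained through the platform's API.

```python
import os
import urllib.request

# Placeholder URL standing in for a storage-platform API endpoint
# (e.g. an Azure Blob SAS URL or a Dropbox direct-download link).
DATA_URL = "https://example.com/snapshots/master_latest.csv"

def fetch_data(local_path, url=DATA_URL):
    """Return a path to the data: the local copy if one exists,
    otherwise download it from the storage API.

    The same script then works unchanged on a laptop holding the data
    and on a remote server that must fetch it, e.g. in a scheduled job.
    """
    if os.path.exists(local_path):
        return local_path
    urllib.request.urlretrieve(url, local_path)  # remote fallback
    return local_path
```

A scheduler (cron, or an Azure-hosted equivalent) can then call the analysis script on whatever cadence the analyses need, with data access handled uniformly.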