Add best practices for dissociating private/binary data from public/version-controlled code

NilsEnevoldsen commented 9 years ago

One issue I struggle with is what I should do with large datasets. Git is not wired up to do large binary diffs, and GitHub has a hard limit of 100 MB/file and 1 GB/repo. Right now I add raw-ish data to the repo under the assumption that I won't modify it, and processed data in a temp data folder that is .gitignored. Even my raw-ish data gets me dangerously close to GitHub's limit, though.

For some projects, the datasets are so large that GitHub is right out, so we store the datasets on Dropbox and manually (yeuch!) sync them with .gitignored data folders in the repo. (Referring to a Dropbox folder from within Stata code is tricky to do in a portable way, and storing the repo in Dropbox would be a disaster.)

I'm also divided on whether to store data as DTAs or CSVs. Some practitioners recommend using CSVs as far into the pipeline as possible, because they are plain text, so they are portable and diffable. On the other hand DTAs have features like labels and notes which are desirable.

Advice would be appreciated.

mbomby commented 9 years ago

Just some thoughts:

1) We store raw and final data in a separate directory (actually a Truecrypt archive) on dropbox which has worked fine. If the raw data are really not going to change, then you should have no problem with this method. If they are going to change, just keep good documentation of the changes and keep them infrequent. If you are making frequent changes that need to be tracked you'll need a better solution. There is a project called dat-data.com working to implement a git-style version control for datasets but it's still in development.

2) Another method for integrating later changes into the dataset (for example, if you discover new data entry errors while doing analysis) is to have a readreplace csv file tracked in your git repo. Then you can track only the changes you are making to the raw data (which probably won't fill up your github space), instead of the whole dataset.

3) Why is it tricky to refer to a Dropbox folder within Stata? We just make a global macro with the current user's dropbox folder and refer to $dropbox throughout the do-files. It might be a little annoying but we've never had problems with this. A better alternative could be to have all users of these do-files to define a global macro "dropbox" in their profile.do and this would eliminate any of this manual global definition.

matthew-white commented 9 years ago

One issue I struggle with is what I should do with large datasets. Git is not wired up to do large binary diffs, and GitHub has a hard limit of 100 MB/file and 1 GB/repo. Right now I add raw-ish data to the repo under the assumption that I won't modify it, and processed data in a temp data folder that is .gitignored. Even my raw-ish data gets me dangerously close to GitHub's limit, though.

For some projects, the datasets are so large that GitHub is right out, so we store the datasets on Dropbox and manually (yeuch!) sync them with .gitignored data folders in the repo. (Referring to a Dropbox folder from within Stata code is tricky to do in a portable way, and storing the repo in Dropbox would be a disaster.)

As with @mbomby, on projects I've worked on, the data usually sits outside the repository. We then refer to the data directories using globals that are defined in an initializing do-file that all do-files run. We also make heavy use of the SSC package fastcd. If you associate a directory with a fastcd code (let's say my_data), it's easy to add that directory to a global:

loc curdir "`c(pwd)'"
c my_data
glo my_data "`c(pwd)'"
cd `"`curdir'"'

This way we use only fastcd codes and relative paths.

I also tend to prefer project-specific globals to a single global like $dropbox. The directory structure of my Dropbox is different from my PI's, other RAs', and so on, so usually I need something like $my_data instead.

I'm also divided on whether to store data as DTAs or CSVs. Some practitioners recommend using CSVs as far into the pipeline as possible, because they are plain text, so they are portable and diffable. On the other hand DTAs have features like labels and notes which are desirable.

Stata datasets are portable as far as I know. It's nice to be able to diff, but I don't find myself doing it that often, and usually prefer cfout to do so — so in the end the cost of converting to/from .csv doesn't seem worth it. Also, some projects I'm on make heavy use of Stata metadata, especially characteristics, in which case it really isn't an option to lose that in a .csv save. I suppose you could save as XML so that it's diffable and you get the metadata. I think it just depends on what you find useful in your workflow. If you're not actually finding yourself wanting to diff a dataset with Git, maybe datasets are best left as Stata binaries.

NilsEnevoldsen commented 9 years ago

Thanks for your input, guys.

mbomby:

dat is interesting.
It's tricky to refer to Dropbox within Stata because somewhere the user needs to specify the hard path to the Dropbox directory. Furthermore, it must be specified in a file (e.g. a profile.do) that isn't under version control, or each user will overwrite the other user's settings. Furthermore, the documentation must explain (1) how to create that file and (2) how to request access to the relevant Dropbox. Users can also commit code that improperly escapes such paths, as they may contains spaces, parentheses, etc on other user's computers. None of these issues are insurmountable, but they all take some effort.

matthew-white:

What's the advantage of fastcd over the use of macros?
Sorry, "portable" was poor word choice. I meant portable between programs. It's not possible to browse a DTA file with Excel or GitHub.
The primary advantage to me of being diffable is that changes are stored efficiently, not that a human reviews them often.

I'm starting a project now for which raw data is kept on Dropbox, and a local macro in profile.do points to that location. The raw data is in whatever format. The cleaned and munged data is stored as DTAs in the project's tempdata folder and is .gitignored. Feels pretty good so far. It's no longer possible to just clone ⇒ build, but that's not a huge sacrifice.

Thanks again.

I do suggest writing up a condensed version of this advice, since it will confront (in my experience) nearly every researcher who makes the transition from (usually) Dropbox to GitHub. I'll change the title to reflect that, but I'll leave the issue open. Feel free to close it as you please.

matthew-white commented 9 years ago

It's tricky to refer to Dropbox within Stata because somewhere the user needs to specify the hard path to the Dropbox directory. Furthermore, it must be specified in a file (e.g. a profile.do) that isn't under version control, or each user will overwrite the other user's settings. Furthermore, the documentation must explain (1) how to create that file and (2) how to request access to the relevant Dropbox. Users can also commit code that improperly escapes such paths, as they may contains spaces, parentheses, etc on other user's computers. None of these issues are insurmountable, but they all take some effort.

What's the advantage of fastcd over the use of macros?

I think fastcd is easier than creating/updating profile.do, which is pretty hard to explain to a PI or new RA. Also, projects can have conflicting globals, e.g., projects X and Y can both define $datadir. You could get around this by defining $x_datadir and $y_datadir in profile.do. An alternative workflow on many of my projects is to have an initializing do-file for each project that all other do-files run; among other things, the initializing do-file defines the project globals. By using fastcd and converting fastcd codes to globals with short names, you only have one project's set of globals defined at once, and also don't get conflicting globals. As long as everyone uses fastcd, no source code (either in the repo or profile.do) needs to change.

This needs to be documented and fastcd needs to be initialized, but this doesn't take more than a few minutes. I also don't find Dropbox invitations that cumbersome, and as long as you enclose paths in double quotes, you're almost always OK (no further quoting/escaping necessary). It's definitely a balancing act though, and I think a future FAQ should describe all these as workable options.

One further downside of data on GitHub is that our team uses GitExtensions rather than the command line, and GitExtensions will sometimes run git clean at unexpected times, which may remove some .gitignored data. Maybe that just means we need a new GUI, but it's been part of the motivation for keeping data off GitHub.

The primary advantage to me of being diffable is that changes are stored efficiently, not that a human reviews them often.

What do you mean by "changes are stored efficiently"? Git always stores snapshots, not differences — not sure if that's what you mean.

I do suggest writing up a condensed version of this advice, since it will confront (in my experience) nearly every researcher who makes the transition from (usually) Dropbox to GitHub. I'll change the title to reflect that, but I'll leave the issue open. Feel free to close it as you please.

Fantastic idea. @hdiamondpollock, do you feel up to adding this to the FAQ?

NilsEnevoldsen commented 9 years ago

What do you mean by "changes are stored efficiently"? Git always stores snapshots, not differences — not sure if that's what you mean.

Conceptually, Git uses snapshots. But secretly, when it thinks nobody's watching, Git actually uses deltas. (Shhhh. Don't tell anyone.)

matthew-white commented 9 years ago

Ooh very interesting! With Stata datasets, what's the motivation to store changes efficiently? If your raw data almost never changes and you're .gitignoreing your clean datasets, there are few changes, right? Our datasets also aren't that big. Is the motivation to try to stay below GitHub's storage limits?

NilsEnevoldsen commented 9 years ago

The motivation is to start version controlling my cleaned data in a useful way. It would be a nice thing to have, if it were free. At the moment it requires too many sacrifices, so I don't, and I probably won't.

PovertyAction / github-training

Add best practices for dissociating private/binary data from public/version-controlled code #9