ATFutures / who-data-archived-report1

Contains the original git history used to generate results for first report

How to store and update data? #2

Open mpadge opened 6 years ago

mpadge commented 6 years ago

All the mucking about with my gitstructor script only ended up saving us about 200MB, and we've still got a 400MB .git in a 600MB repo. Files will need updating, but this kinda bloat is really undesirable. What to do?

The data need to sit inside a git repo, and they need to be directly accessible to other repos such as flowlayers. @Robinlovelace Can you see this working with piggyback? Other repos need direct access, not downloads via piggyback::pb_...() functions, yet with piggyback the data are not actually held in the repo itself. In my admittedly limited vision of things, this would require a who-data repo that was largely empty on github, but that locally pulled any modified versions via a series of piggyback calls. The local version would thus differ from the github version in having all files present in the repo and ready to use.
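Roughly what I have in mind, as a sketch only (the repo slug and tag below are placeholders, not a settled name):

```r
# pull all release assets into a local clone of the (hypothetical) who-data
# repo, so the files sit on disk even though github itself holds none of them
library(piggyback)

pb_download(repo = "ATFutures/who-data", # placeholder slug for the proposed repo
            tag  = "latest",             # assets attached to the most recent release
            dest = ".")                  # drop the files into the local repo root
```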

We'd then need some kind of hash system, so I guess we could use storr to control the updates. Each package which used who-data would then make an initial storr call to compare hashes, and in response to any changes would update the corresponding files in who-data via a piggyback call. A bit messy, but it should result in a tightly inter-woven and stable system for sharing a common repo of potentially very large data.
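A very rough sketch of how that check might look, assuming who-data ships a storr of expected file hashes that is refreshed whenever new release assets are uploaded (repo slug, paths, and file names are all illustrative only):

```r
library(piggyback)
library(storr)
library(digest)

# local key-value store holding the expected hash for each release asset
hashes <- storr_rds("who-data/.hashes")

sync_file <- function(f, repo = "ATFutures/who-data") {
    local_hash <- if (file.exists(f)) digest(f, file = TRUE) else NA_character_
    expected   <- if (hashes$exists(f)) hashes$get(f) else NA_character_
    if (!identical(local_hash, expected)) {
        # local copy missing or stale: refresh it from the current release
        pb_download(file = basename(f), repo = repo, dest = dirname(f))
    }
}

sync_file("who-data/some-data-file.Rds")  # illustrative file name only
```

The upload side would need a matching step that writes the new hashes into the store whenever assets change, but that's the general shape.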

Alternative

Re-start the repo (now that it has a roughly stable structure), let all the other repos stay as they are, and just periodically use gitstructor to clean things up. I'm not sure I'm in favour of that, because the first solution is likely to be more scalable and (ever our aim here) future-proof.

Thoughts?

Robinlovelace commented 6 years ago

I think a repo that is largely empty of data on GitHub / GitLab (but not necessarily of code), with the data held in the releases thanks to piggyback and kept up-to-date locally by an initial load script, is a good plan for open access data. I'm not sure about the alternative: periodically cleaning things up sounds like a maintenance burden and raises the question of how frequently it would need doing.

My understanding is that git was never intended to deal with binaries, so I'm keen on keeping any large files (~1 MB+) out. We had a mission cleaning up the geocompr repo and used bfg for that: https://rtyley.github.io/bfg-repo-cleaner/#download