Open mpadge opened 6 years ago
I think having a repo that is largely empty of data on GitHub / GitLab (but not necessarily of code) but that has the data in the releases thanks to piggyback and locally thanks to an initial load script that keeps it up-to-date is a good plan for open access data. I'm not sure about the alternative. Periodically cleaning things up sounds like a maintenance burden and raises the question: with what frequency?
My understanding is that Git was never intended to deal with binaries, so I'm keen on keeping any large (~1 MB+) files out. We had a mission cleaning the geocompr repo and used bfg for that: https://rtyley.github.io/bfg-repo-cleaner/#download
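For what it's worth, here is a minimal sketch of how one might list the files in a working tree that cross that ~1 MB mark, as candidates for moving into a release. It is only an illustration: the function name `find_large_files` and the default threshold are my own assumptions, and it only inspects the checked-out files, not anything already buried in the git history.

```python
import os


def find_large_files(root, threshold_bytes=1_000_000):
    """Return (size, relative_path) pairs at or above the threshold, largest first.

    Hypothetical helper: the name and the 1 MB default are illustrative only.
    """
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git's own object store; only the working tree is inspected.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # ignore symlinks, which may be dangling
            size = os.path.getsize(path)
            if size >= threshold_bytes:
                hits.append((size, os.path.relpath(path, root)))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    for size, path in find_large_files("."):
        print(f"{size / 1e6:8.1f} MB  {path}")
```

Run from the top of a repo, this prints the offending files with the largest first.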
All the mucking about with my `gitstructor` script only ended up saving us about 200MB, and we've still got a 400MB `.git` in a 600MB repo. Files will need updating, but this kinda bloat is really undesirable. What to do?

The data need to sit inside a git repo, and they need to be directly accessible by other repos such as `flowlayers`. @Robinlovelace Can you see this working with `piggyback`? Other repos need direct access, not downloading via any `piggyback::pb_...()` functions, but with `piggyback`, the data are not actually held directly in the repo. In my admittedly limited vision of things, this would require a `who-data` repo that was largely empty on github, but locally pulled any modified versions via a series of `piggyback` calls. The local version would thus differ from the `github` version in having all files present in the repo and ready to use.

We'd then need some kind of hash system, so I guess we could use `storr` to control the updates. Each package which used `who-data` would then make an initial `storr` call to compare hashes, and in response to any changes would update the corresponding files in `who-data` via a `piggyback` call. A bit messy, but should result in a tightly inter-woven and stable system for sharing a common repo of potentially very large data.

Alternative
Re-start repo (now that it has a roughly stable structure), allow all other repos to stay as they are, and just periodically use `gitstructor` to clean things up. I'm not sure I'm in favour of that, because the first solution is likely to be more scalable and (ever our aim here) future-proof.

Thoughts?
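To make the hash-comparison step of the first option a bit more concrete, here is a rough, language-agnostic sketch written in Python rather than with `storr` itself. It assumes a manifest that maps each data file's relative path to a SHA-256 hash, with one copy built locally and one published alongside the release. The function names (`file_hash`, `build_manifest`, `stale_files`) and the manifest layout are my own assumptions, not anything `storr` or `piggyback` provide. The download step itself is left out.

```python
import hashlib
from pathlib import Path


def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file's path (relative to data_dir) to its hash."""
    data_dir = Path(data_dir)
    return {
        p.relative_to(data_dir).as_posix(): file_hash(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def stale_files(local_manifest, remote_manifest):
    """Names listed remotely that are missing locally or whose hash differs."""
    return sorted(
        name
        for name, digest in remote_manifest.items()
        if local_manifest.get(name) != digest
    )
```

A package depending on the data repo would build the local manifest, fetch the published one, and re-download only the names returned by `stale_files`. Files that exist only locally are deliberately ignored here; whether they should be deleted is a separate decision.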