codeforkansascity / Property-Violations-Settlement

Analyzing Kansas City's open data on property violations

Adopt solution for storing and versioning large input datasets #35

Closed (buzwells closed this issue 8 years ago)

buzwells commented 8 years ago

It would be nice to have a simple solution for storing versions of our input datasets, which might be drawn from a wide variety of sources. Having all developers/analysts working from the same data snapshots would help to ensure that we have consistent results.

GitHub has some limitations on handling large files; the limits are described at https://help.github.com/articles/what-is-my-disk-quota/. Our dataset is just under the maximum file size (100 MB), but it could easily exceed that limit if we decide to take in more cases in the future. GitHub's Large File Storage alternative has some controversial billing practices: https://medium.com/@megastep/github-s-large-file-storage-is-no-panacea-for-open-source-quite-the-opposite-12c0e16a9a91#.4p4ysyhmh.

Eric Roche has suggested Google Drive as a viable alternative and pointed out that there may already be R packages tailored to it (see the sketch below). Here are a few links from a quick Google search:
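As a rough illustration of that suggestion, here is a minimal sketch of pulling a shared snapshot with the googledrive R package. This assumes the snapshot is shared publicly on Drive; the file ID and local path are placeholders, not the project's actual settings.

```r
# Minimal sketch: download a shared dataset snapshot from Google Drive.
# install.packages("googledrive")
library(googledrive)

drive_deauth()  # skip OAuth; works only if the file is shared publicly

# Hypothetical ID of the shared snapshot on Google Drive
snapshot_id <- as_id("FILE_ID_HERE")

drive_download(
  file = snapshot_id,
  path = "data/property_violations_snapshot.csv",
  overwrite = TRUE
)

violations <- read.csv("data/property_violations_snapshot.csv")
```

If everyone downloads the same file ID, we would all be working from the same snapshot regardless of where the source data originally came from.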

devoredevelops commented 8 years ago

I tried to load the current ~97 MB dataset into Google Sheets, but the file size prevented it from loading. :(

buzwells commented 8 years ago

We agreed to use a Google Drive share. I believe this issue should be closed.