International-Soil-Radiocarbon-Database / ISRaD

Repository for the development and release of ISRaD data and tools
https://international-soil-radiocarbon-database.github.io/ISRaD/

Flat files (.csv) exceed GitHub's recommended max file size #252

jb388 closed this issue 2 years ago

jb388 commented 3 years ago

@aahoyt @coreylawrence @SophievF @ShaneStoner

We've reached a point where the flat files we generate with ISRaD exceed the max recommended file size for GitHub (50 MB). I am not sure how useful these giant files are, especially considering they mostly contain NAs. In comparison, the compiled "list" form Excel workbook files are only 5-6 MB. Additionally, I worry that we will hit the limit of data storage on GitHub at some point in the near future, and that seems like a headache we should avoid if possible. Continually generating these giant .csv files really eats into our storage quota.

So, I suggest that we stop serving these files from GitHub. The data are all present in the Excel workbook for those users who prefer not to access or work with the data in R, and obviously anyone who does use R doesn't need the pre-compiled flat files either.

Thoughts?

bpbond commented 3 years ago

(Poking head in.)

You could try writing nothing for NA (na = ""), and eliminate any column-wise redundancies. That doesn't solve the fundamental problem but might buy you some time.

P.S. You could also compress the files, e.g. with R.utils::compressFile().
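A minimal sketch of both suggestions, for illustration only. The data frame, column names, and file names here are made up and are not the ISRaD flat files. To avoid a package dependency, it also uses base R's gzfile() for compression rather than R.utils::compressFile(), so treat it as one possible way to do this, not the project's actual workflow.

```r
# Toy stand-in for a sparse flat table (hypothetical columns, mostly NA).
set.seed(1)
n <- 1000
flat <- data.frame(
  entry_name = sprintf("entry_%04d", seq_len(n)),
  lyr_14c    = ifelse(runif(n) < 0.1, round(rnorm(n), 2), NA),
  frc_14c    = NA_real_,
  flx_14c    = NA_real_
)

out_dir   <- tempdir()
csv_na    <- file.path(out_dir, "flat_na.csv")
csv_empty <- file.path(out_dir, "flat_empty.csv")
csv_gz    <- file.path(out_dir, "flat_empty.csv.gz")

# Default behaviour: missing values are written as the string NA.
write.csv(flat, csv_na, row.names = FALSE)

# Suggested tweak: write missing values as empty fields instead.
write.csv(flat, csv_empty, row.names = FALSE, na = "")

# Compress while writing, through a gzip connection (base R).
con <- gzfile(csv_gz, "w")
write.csv(flat, con, row.names = FALSE, na = "")
close(con)

# Compare the three file sizes in bytes.
sizes <- file.size(c(csv_na, csv_empty, csv_gz))
names(sizes) <- c("NA strings", "empty fields", "empty fields + gzip")
print(sizes)

# read.csv() reads .gz files directly; empty fields come back as NA.
flat_back <- read.csv(csv_gz)
```

Users of the compressed file would not need to unzip it before reading it in R, since read.csv() handles gzip-compressed files transparently.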

jb388 commented 3 years ago

@bpbond Nice, thanks for the tips. Compressing the files would definitely be a good idea. The files shouldn't have column-wise redundancies, but taking out the NAs for the .csv files should also help with the file size.

The bigger issue, one might say, is that these flat files are necessarily large, since they are the product of joining tables. And of course, this is why we built ISRaD as a relational database to begin with: because these flat files are so unwieldy.

I'll see if I can adopt Ben's neat fixes for now. But going forward I think it's worth considering the idea that we stop serving the flat data.
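To illustrate why joined tables grow, here is a small sketch with two hypothetical tables. The table and column names are invented for the example and are not the ISRaD schema. Joining repeats the parent columns once per child row and pads unmatched rows with NA.

```r
# Hypothetical toy tables, not the real ISRaD structure.
layer <- data.frame(
  lyr_id  = c("L1", "L2", "L3"),
  lyr_top = c(0, 10, 20),
  lyr_bot = c(10, 20, 30)
)
fraction <- data.frame(
  lyr_id  = c("L1", "L1", "L1", "L2"),
  frc_id  = c("F1", "F2", "F3", "F4"),
  frc_14c = c(12.5, -30.1, -88.0, -45.2)
)

# Full outer join: every layer is kept, and the layer columns
# are repeated for each matching fraction row.
flat <- merge(layer, fraction, by = "lyr_id", all = TRUE)
print(flat)

# Count cells stored in the relational form versus the flat form.
cells_relational <- prod(dim(layer)) + prod(dim(fraction))
cells_flat       <- prod(dim(flat))
n_missing        <- sum(is.na(flat))
```

With more tables and more levels of nesting, the repetition and NA padding compound, which would be consistent with the flat files being much larger than the workbook.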

SophievF commented 3 years ago

I agree that it is probably not necessary to save the flat file on GitHub. For those who use the database in Excel (I don't know if anyone has done that yet), that one huge file will probably become too large to handle at some point anyway. And everyone else can either use one of the built-in ISRaD functions in R or write their own function for R or another program.

bpbond commented 3 years ago

@jb388 - another way would be to use GitHub Releases for your data distribution. This is what I do for COSORE. The nice part here is that (1) files can be up to 2 GB, (2) releases have no size limit, and (3) it doesn't count against your repository quota.

https://docs.github.com/en/github/administering-a-repository/about-releases

aahoyt commented 3 years ago

Revisiting this conversation -- what if we just serve the flat file for the ISRaD_extra layer data? (ISRaD_extra_flat_layer_vX.date.csv). Working with the flux, fraction, etc. data is probably easier in R anyway. But the flat layer data might be of interest to a wider group of users?

jb388 commented 2 years ago

Current solution: serve only the zipped versions of the flat (.csv) and Excel files.