ClimateImpactLab / DataFS

An abstraction layer for data storage systems

Provide a way to track multi-file archives #132

Open jrising opened 7 years ago

jrising commented 7 years ago

Multi-file archives are used for shapefiles (which include .shp, .shx, .dbf, .prj, and other files), IPUMS data (.dat, .sas, and .xml), binary grid data (.bil or .flt plus .hdr), and LaTeX documents (minimally .tex and .bib). These files will (1) always need to be downloaded together to be useful, and (2) require a path to the location containing all of them, rather than a single file pointer. Having these files split across multiple archives invites them going out of sync.

One possible solution is to store them as a .tar file and use the file handle with the tarfile module to extract the files into a temporary location. The disadvantages of this are that it is cumbersome for the user, and that the files need to be re-extracted on every use, even when they are already in the cache.
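For concreteness, that workaround would look roughly like this (a sketch only: the archive name is illustrative, and it assumes archive.open() yields a readable binary handle):

import shutil
import tarfile
import tempfile

import datafs

api = datafs.get_api()
archive = api.get_archive('gadm2-2.8')  # illustrative name for a tarred shapefile set

tmpdir = tempfile.mkdtemp()
try:
    with archive.open('rb') as f:
        # stream mode ('r|*') avoids requiring a seekable handle
        with tarfile.open(fileobj=f, mode='r|*') as tar:
            tar.extractall(tmpdir)
    # ... work with the extracted files under tmpdir ...
finally:
    shutil.rmtree(tmpdir)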

This is a request for a way to store and retrieve such multi-file archives.

delgadom commented 7 years ago

I think this makes a ton of sense, but I'm not sure how the interface would work. Some sample usage code would be really helpful.

Do you imagine that they would be self-extracting? Or that we would wrap tarfile/gzip and provide archive/extract commands? Maybe the archive/extract commands would be available from the command line but not from Python? In that case, why not tar the files in a separate step (with the tar tool)?

jrising commented 7 years ago

Here is some sample code:

import datafs
import shapefile

api = datafs.get_api()

archive = api.get_archive('gadm2-2.8')  # the GADM2 shapefile DB
assert archive.is_multifile()  # returns True for multi-file archives

# Use 'with' so the temporary files can be removed when done
with archive.multifile_path() as path:
    sf = shapefile.Reader(path)
    # Do stuff with the shapefile

When you say "self-extracting", it sounds like you're planning to use tarfiles or something similar on the backend. I don't think the user should be concerned with the backend implementation, or that the functions should have names like extract.

Potentially, the interface could be that you can always say with archive.temporary_path() as path:, and that this function works for both multifile and single-file archives. Then you can also say, "The archive.open() syntax also works for any kind of archive, but its functionality for multifile archives is undefined."
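As a rough sketch of how that could be implemented (assuming multifile archives are stored as tarballs behind the scenes; temporary_path() and is_multifile() are the names proposed in this thread, not existing DataFS API):

import contextlib
import os
import shutil
import tarfile
import tempfile

@contextlib.contextmanager
def temporary_path(archive):
    # Proposed behavior: yield a directory for multifile archives,
    # and a single temporary file path otherwise.
    tmpdir = tempfile.mkdtemp()
    try:
        if archive.is_multifile():
            with archive.open('rb') as f:
                with tarfile.open(fileobj=f, mode='r|*') as tar:
                    tar.extractall(tmpdir)
            yield tmpdir
        else:
            path = os.path.join(tmpdir, 'data')
            with archive.open('rb') as src, open(path, 'wb') as dst:
                shutil.copyfileobj(src, dst)
            yield path
    finally:
        shutil.rmtree(tmpdir)  # cleanup happens when the 'with' block exits

Either way, the user just gets a filesystem path inside the 'with' block, without ever seeing how the archive is stored.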