GSA / enterprise-data-inventory

The Enterprise Data Inventory is a CKAN-based system for managing private and public data.

How to handle compressed files #179

Closed by MikePulsiferDOL 8 years ago

MikePulsiferDOL commented 9 years ago

We have an agency that is compressing their large CSVs into ZIP files. The error logs for the inventory suggest there's a format mismatch because we stated that the data is in a CSV file. That is correct, but it's been compressed.

Shouldn't the data format matter more than the file format in what is presented to an end user looking for data? I ask because stating that it's a ZIP file isn't very informative: the ZIP could contain an agency's data file in a proprietary format (e.g. mdb). Listing the format the data takes in its readable form would be more informative.

format_mismatch,DOL-MSHA-225,Conferences,http://www.msha.gov/OpenGovernmentData/DataSets/Conferences.zip,200,application/x-zip-compressed,text/csv,2015-04-14T03:51:26-04:00
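One way to picture the behaviour being asked for is a check that looks inside the archive before reporting a mismatch. The following is only a rough sketch, not the inventory's actual checker: the column names in `FIELDS`, the extension table, and the helper functions are all assumptions made for illustration, since the log line above has no header.

```python
import csv
import io
import os
import zipfile

# Hypothetical column names; the log line in this issue carries no header.
FIELDS = ["status", "identifier", "title", "url", "http_code",
          "served_type", "declared_type", "checked_at"]

# Media types commonly reported for ZIP downloads.
ZIP_TYPES = {"application/zip", "application/x-zip-compressed"}

# Small illustrative extension table (an assumption, not a complete registry).
EXT_TYPES = {".csv": "text/csv", ".txt": "text/plain",
             ".json": "application/json", ".xml": "application/xml"}


def parse_log_row(line):
    """Split one comma-separated log line into a dict keyed by FIELDS."""
    values = next(csv.reader([line]))
    return dict(zip(FIELDS, values))


def member_types(zip_bytes):
    """Return the media types guessed from the archive's member file names."""
    types = set()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            ext = os.path.splitext(name)[1].lower()
            types.add(EXT_TYPES.get(ext, "application/octet-stream"))
    return types


def is_real_mismatch(row, zip_bytes=None):
    """Report a mismatch only if the declared type is absent from the archive."""
    if row["served_type"] in ZIP_TYPES and zip_bytes is not None:
        return row["declared_type"] not in member_types(zip_bytes)
    return row["served_type"] != row["declared_type"]
```

Under these assumptions, a ZIP that contains a `.csv` member would not be flagged when the declared format is `text/csv`, while a ZIP holding only an `.mdb` file still would be.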

JJediny commented 8 years ago

Given that the goal is to accurately determine the format/MIME type of files available for download or access, it's important to denote the initial format (.zip, .tar.gz, .7z) so users know the file must be extracted before it can be used. For CSV files, I can't imagine why the producing agency would need to compress such a lean, text-based format, which has no size limitations and minimal performance issues even at 2-4 GB or more...
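As a rough illustration of recording the container format while still telling users what is inside, a distribution entry could carry both pieces of information. This is a hypothetical sketch: the field names `downloadURL`, `mediaType`, and `format` are modeled on Project Open Data style metadata and are assumed here rather than taken from this thread, and `describe_distribution` is an invented helper.

```python
import os
from urllib.parse import urlparse

# Media types for the archive extensions mentioned above (illustrative table).
ARCHIVE_MEDIA_TYPES = {
    ".zip": "application/zip",
    ".gz": "application/gzip",  # also matches .tar.gz
    ".7z": "application/x-7z-compressed",
}


def describe_distribution(download_url, inner_format):
    """Build a distribution entry that names both container and contents."""
    suffix = os.path.splitext(urlparse(download_url).path)[1].lower()
    entry = {"downloadURL": download_url, "format": inner_format}
    media_type = ARCHIVE_MEDIA_TYPES.get(suffix)
    if media_type is not None:
        # The machine-readable type describes the download itself, while the
        # human-readable format says what the user gets after extraction.
        entry["mediaType"] = media_type
        entry["format"] = f"{inner_format} (compressed, extract before use)"
    return entry
```

For the dataset in this issue, `describe_distribution("http://www.msha.gov/OpenGovernmentData/DataSets/Conferences.zip", "CSV")` would record the download as a ZIP while still stating that the data inside is CSV.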

Agencies should first and foremost separate the raw data from the documentation that is usually bundled with a packaged CSV. The bundling is well-intentioned, but it is most useful to provide direct access to ready-to-use downloads/APIs such as CSV, rather than bundling everything together as a formatted Excel file or placing documents/metadata inside the compressed folder. Bundling prevents users from creating dynamic services/applications that access the data directly; instead they face the usual ETL, cleaning, and swimming through a compressed archive whose internal structure/organization can change more easily than a raw format's. The only format that makes sense to package is a shapefile, since it is a collection of interdependent files that together constitute a single dataset.