Standard complete-dataset-dump format and location

wardi commented 10 years ago

Let's settle on a single recommended location and format for the complete static dump of datasets that many sites offer.

My (biased) choice would be the format produced by ckanapi dump datasets.

rufuspollock commented 10 years ago

@wardi sounds good. Could you provide a tiny bit more info about the format of ckanapi dump datasets?

wardi commented 10 years ago

It's a JSON Lines format. Each line is exactly what package_show returns for one package, encoded as UTF-8 (non-ASCII characters are written as-is, not as escaped characters), one package per line.

ckanapi dump datasets and ckanapi load datasets use this line-based format so that they can run the jobs in parallel with multiple worker processes.
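To make the line-based format concrete, here is a minimal Python sketch of writing and reading such a file. It is not ckanapi's own code; the helper names iter_datasets and dump_dataset and the two sample records are made up for illustration, and the records only stand in for real package_show output.

```python
import io
import json


def dump_dataset(dataset):
    """Serialize one dataset as a single line, leaving non-ASCII text unescaped."""
    # json.dumps without indent never emits a raw newline, so one dataset = one line.
    return json.dumps(dataset, ensure_ascii=False) + "\n"


def iter_datasets(lines):
    """Yield one dataset dict per non-empty line, without loading the whole file."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical records standing in for package_show results.
sample = [
    {"name": "budget-2013", "title": "Budget 2013", "resources": []},
    {"name": "données-ouvertes", "title": "Données ouvertes", "resources": []},
]

# An in-memory stream stands in for a dump file opened with encoding="utf-8".
stream = io.StringIO("".join(dump_dataset(d) for d in sample))
names = [d["name"] for d in iter_datasets(stream)]
```

Because every line is an independent JSON document, a consumer can hand lines to separate workers or stop partway through without parsing the rest.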

rossjones commented 10 years ago

What about going with data.json like http://project-open-data.github.io/schema/ ?

wardi commented 10 years ago

Using an existing standard is a good idea, but could we manage a lossless conversion? With the flexibility that IDatasetForm provides I don't know how that would work.

davidread commented 10 years ago

@rossjones argh another proposed standard for metadata... I don't see why we can't provide a mapping in some way, but I think the aim of this bulk download should be to provide it in exactly CKAN's native format.

Using JSON seems a no-brainer. I quite like @wardi 's suggestion to use one line per dataset, rather than put them in a JSON list - that allows a script to consume them one by one, rather than loading them all into memory at once as the average JSON tool would do. However I guess it does mean that loading the whole lot into a standard JSON reader, such as Refine, wouldn't work. Maybe that's what the CSV version of the dump can be for.

So can I request that a CSV dump be provided too? I think people love loading this data into Excel for all sorts of report generation, custom filtering etc. The problem I encountered with CSV dumps was providing the dozen or so columns for each of resource-0, resource-1, etc. It gets unwieldy when you have, say, 200 resources: Excel 2003 has a max of 256 columns and LibreOffice 4.2 has a max of 1024! I'd suggest putting all the resources in a single column as a JSON blob... If we can decide how to do it, I'm happy to sort the code for this.
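A rough sketch of the single-column idea, assuming a handful of scalar fields are picked as CSV columns and the whole resources list goes into one extra column as JSON. The SCALAR_FIELDS choice, the datasets_to_csv helper and the sample record are all hypothetical; which columns a real dump should carry is still to be decided.

```python
import csv
import io
import json

# Hypothetical column choice; a real dump would need an agreed list of fields.
SCALAR_FIELDS = ["name", "title", "license_id"]


def datasets_to_csv(datasets, out):
    """Write one row per dataset, with all resources packed into one JSON column."""
    writer = csv.writer(out)
    writer.writerow(SCALAR_FIELDS + ["resources"])
    for dataset in datasets:
        row = [dataset.get(field, "") for field in SCALAR_FIELDS]
        # The column count stays fixed no matter how many resources a dataset has.
        row.append(json.dumps(dataset.get("resources", []), ensure_ascii=False))
        writer.writerow(row)


# Made-up dataset with two resources, standing in for package_show output.
sample = [
    {
        "name": "budget-2013",
        "title": "Budget 2013",
        "license_id": "uk-ogl",
        "resources": [
            {"url": "http://example.com/q1.csv", "format": "CSV"},
            {"url": "http://example.com/q2.csv", "format": "CSV"},
        ],
    }
]

buf = io.StringIO()
datasets_to_csv(sample, buf)
rows = list(csv.reader(io.StringIO(buf.getvalue())))
```

The csv module quotes the JSON blob as needed, so a spreadsheet sees four columns per row and a script can still recover the resources with json.loads on the last cell.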

rossjones commented 10 years ago

It's not so much a proposed standard as one already in use: US agencies are required to publish their catalog metadata in this format (for harvesting into data.gov). I guess their use-case is different as they want to support portals other than CKAN.

The problem with doing it in CKAN's native format is that the native format is different from instance to instance, so you'd still have to choose core metadata.

Perhaps DCAT/JSON-LD? (/me ducks)

wardi commented 10 years ago

I'm suggesting we embrace the differences and present each instance's data as-is. From there we can convert to whatever other formats we like, and do it efficiently, with a process external to ckan.
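As one example of such an external conversion, this sketch turns a line-per-dataset dump into a single JSON array for tools that expect one document, while still reading only one line at a time. The function name jsonl_to_json_array and the sample input are assumptions for illustration, not part of ckanapi.

```python
import io
import json


def jsonl_to_json_array(src, dst):
    """Copy datasets from a JSON Lines stream into one JSON array, line by line."""
    dst.write("[")
    first = True
    for line in src:
        line = line.strip()
        if not line:
            continue
        json.loads(line)  # fail early on a corrupt line instead of emitting bad JSON
        if not first:
            dst.write(",")
        dst.write("\n" + line)
        first = False
    dst.write("\n]\n")


# Made-up two-line dump standing in for a real file.
src = io.StringIO('{"name": "a", "resources": []}\n{"name": "b", "resources": []}\n')
dst = io.StringIO()
jsonl_to_json_array(src, dst)
converted = json.loads(dst.getvalue())
```

The same loop shape works for any other target format: read a line, transform the dataset, write it out, with memory use independent of the size of the dump.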
