dat-ecosystem / dat

:floppy_disk: peer-to-peer sharing & live synchronization of files via command line
https://dat.foundation
BSD 3-Clause "New" or "Revised" License

what if csv had a header with metadata? #52

Closed: dominictarr closed this issue 9 years ago

dominictarr commented 10 years ago

for the stuff I've been thinking about recently, like diffing, merging, and joining CSV, it would be way easier if CSV had a simple header - not just column names. I know an hcsv file could not be opened in Excel, although it would be very easy to remove the header.

just something like this: a line of --- to demarcate the header?

----------
HEADER
----------
CSV

then you could have a patch file, and the header would just say which version (hash) of the previous file it patches. You could have units for each column. You could have types for each column. You could specify which combination of columns forms the primary key. You could also keep this in a separate file and load that metadata as a CLI option in all the tools, but sometimes it would be much simpler to just have a header in the CSV stream.
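For a rough sense of what that might look like, here is a made-up example (field names and values are purely illustrative, not a proposed spec):

```
----------
patches: 2c26b46b68ffc68ff99b453c1d304134
primary-key: id
types: id=integer, price=number
units: price=USD
----------
id,price
1,9.99
2,24.00
```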

The header could be human-readable, and you could easily parse it out with Unix tools.

you could also put JSON inside the header, although it might be better to use INI instead, just because it has a similar legacy to CSV, so it fits better...
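As a sketch, pulling a header like that back out of the stream only takes a few lines; here is some throwaway Python (not dat code), assuming the dash-delimited, INI-ish header above:

```python
# Throwaway sketch (not dat code): split a dash-delimited, INI-style header
# off the top of a hypothetical "hcsv" file and hand back metadata + rows.
import configparser
import csv
import io

def read_hcsv(path):
    with open(path, newline="") as f:
        lines = f.read().splitlines(keepends=True)

    if not lines or not lines[0].startswith("---"):
        raise ValueError("expected an opening line of dashes")
    # Find the closing line of dashes; everything between is the header.
    closing = next(i for i, line in enumerate(lines[1:], start=1)
                   if line.startswith("---"))

    header_text = "".join(lines[1:closing])
    body_text = "".join(lines[closing + 1:])

    # Wrap in a dummy section so bare "key: value" lines parse as INI.
    parser = configparser.ConfigParser()
    parser.read_string("[header]\n" + header_text)
    metadata = dict(parser["header"])

    rows = list(csv.reader(io.StringIO(body_text)))
    return metadata, rows

if __name__ == "__main__":
    meta, rows = read_hcsv("patch.hcsv")  # hypothetical file name
    print(meta.get("patches"), "->", len(rows), "rows")
```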

max-mapper commented 10 years ago

IMO making the CSV any more complex than it currently is wouldn't be worth it, as the beauty of the format is in its extreme simplicity, editable with a wide range of tools. I'd lean towards using another file and having the 'primitive' be not a single CSV but rather a tarball.
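For example, if the primitive were a tarball, the CSV itself could stay completely plain and the metadata could sit next to it in a sidecar file (file names below are just an assumption, not a convention dat defines):

```python
# Rough sketch of the "tarball as the primitive" idea: leave data.csv untouched
# and ship the metadata alongside it.
import tarfile

with tarfile.open("dataset.tar", "w") as tar:
    tar.add("data.csv")          # the plain CSV, still usable in Excel etc.
    tar.add("datapackage.json")  # sidecar metadata file
```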

marks commented 10 years ago

Exactly. You're all probably aware that there is work by OKFN on precisely this, under the name of Data Packages: http://data.okfn.org/doc/tabular-data-package
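For reference, a minimal datapackage.json for a single CSV looks roughly like this (written from memory; see the spec above for the authoritative field names):

```json
{
  "name": "example-dataset",
  "resources": [
    {
      "path": "data.csv",
      "schema": {
        "fields": [
          { "name": "id", "type": "integer" },
          { "name": "price", "type": "number" }
        ],
        "primaryKey": "id"
      }
    }
  ]
}
```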

max-mapper commented 10 years ago

@marks thanks for the link! I actually helped write that spec, haha.

marks commented 10 years ago

+1, there you go. It definitely makes sense, IMHO, to adopt that or something close to it for this project, which I'm actively lurking on and watching with interest.

jalbertbowden commented 10 years ago

Odd timing... I have been using data-package.json on /datasets, and last night I came across civic.json: https://github.com/BetaNYC/civic.json, which is a CfA (Code for America) joint.
Being down with both groups, I'm not pushing one or the other... which do y'all prefer, and why? Does it matter as long as they're JSON?

marks commented 10 years ago

@jalbertbowden - I see them as two different ways to explain two different things. civic.json, to me, looks to be a standard way to describe civic projects whereas data-package.json describes datasets.

jalbertbowden commented 10 years ago

Yeah. I guess I was looking for something that's not there. My bad.

williamscraigm commented 10 years ago

Not the most exciting thing on earth, but in the Windows world schema.ini files can be used for this: http://msdn.microsoft.com/en-us/library/ms709353(v=vs.85).aspx
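A small schema.ini in the spirit of that page, sitting next to the CSV it describes (column names here are made up):

```ini
[data.csv]
ColNameHeader=True
Format=CSVDelimited
Col1=id Long
Col2=name Text
Col3=price Currency
```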

pezholio commented 10 years ago

Yup, datapackages would seem to be the thing to use here. We've done a fair bit of work with them on our (alpha) CSV validation tool over at http://csvlint.io/

okdistribute commented 9 years ago

I'm thinking we should allow people to specify a schema with a dataset. I think we can close this and get #390 done.