Closed dominictarr closed 9 years ago
IMO making the CSV any more complex than it currently is wouldn't be worth it, as the beauty of the format is in its extreme simplicity: it is editable with a wide range of tools. I'd lean towards using another file and having the 'primitive' be not a single CSV but rather a tarball.
Exactly. You all are probably aware that there is work by OKFN on exactly this under the name of data packages: http://data.okfn.org/doc/tabular-data-package
@marks thanks for the link! I actually helped write that spec, haha.
+1, there you go. It definitely makes sense, IMHO, to adopt that or something close to it for this project, which I'm actively lurking on and watching with interest.
odd timing...have been using data-package.json on /datasets, when i came across civic.json last night: https://github.com/BetaNYC/civic.json which is a cfa joint.
being down with both groups, i'm not pushing one or the other...which do y'all prefer? and why? does it matter as long as they're json?
@jalbertbowden - I see them as two different ways to explain two different things. civic.json, to me, looks to be a standard way to describe civic projects, whereas data-package.json describes datasets.
yeah. i guess i was looking for something that's not there. mb.
Not the most exciting thing on earth, but in the Windows world schema.ini files can be used for this: http://msdn.microsoft.com/en-us/library/ms709353(v=vs.85).aspx
Yup, datapackages would seem to be the thing to use here. We've done a fair bit of work with them on our (alpha) CSV validation tool over at http://csvlint.io/
I'm thinking we should allow people to specify a schema with a dataset. I think we can close this and get #390 done
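To make the "schema in a separate file" idea concrete, here is a minimal Python sketch. The descriptor below is only loosely modelled on the general shape of a Tabular Data Package (a list of resources, each with a path and a schema listing fields); the dataset name, file path, column names, and the `header_matches` helper are all hypothetical, and the exact field layout should be checked against the spec linked above rather than taken from this example.

```python
import csv
import io

# Hypothetical descriptor, loosely modelled on a Tabular Data Package.
# The key names here are an assumption for illustration; consult the spec.
DESCRIPTOR = {
    "name": "example-dataset",
    "resources": [
        {
            "path": "data.csv",
            "schema": {
                "fields": [
                    {"name": "id", "type": "integer"},
                    {"name": "height", "type": "number"},
                ],
                "primaryKey": "id",
            },
        }
    ],
}


def header_matches(csv_text, descriptor, resource_index=0):
    """Return True if the CSV's first row equals the schema's field names."""
    fields = descriptor["resources"][resource_index]["schema"]["fields"]
    expected = [field["name"] for field in fields]
    # Read only the first row; an empty file yields an empty header.
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return header == expected
```

The CSV itself stays untouched and opens in any tool; the metadata travels next to it, for example in the same tarball.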
for the stuff I've been thinking about recently, like diffing csv, merging and joining, it would be way easier if csv had a simple header - not just column names. I know an hcsv file could not be opened in excel, although it would be very easy to remove the header.
just something like this. a line of
---
to demarcate the header? Then you could have a patch file, and the header would just say which version (hash) of the previous file it patched. You could have units for each column. You could have types for each column. You could specify which combination of columns made up the primary key. You could also have this as a separate file and then load that metadata as a CLI option in all the tools, but sometimes it would be much simpler to just have a header in the csv stream.
The header could be human readable, and you could easily have unix tools to parse it out.
you could also put JSON inside the header - although it might be good to use INI instead just because that is a similar legacy to CSV, so it just fits better...
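A rough Python sketch of how such a header might be read, assuming an INI header followed by a single `---` line and then ordinary CSV. The section names (`file`, `types`, `units`), the keys (`patches`, `primary_key`), the sample hash, and the `parse_hcsv` function are all made up for illustration; nothing here is an agreed format.

```python
import configparser
import csv
import io

HEADER_DELIMITER = "---"  # the line that ends the header, as proposed above

# Hypothetical sample; section and key names are assumptions.
SAMPLE = """\
[file]
patches = abc123
primary_key = id
[types]
id = integer
height = number
[units]
height = m
---
id,height
1,1.80
2,1.65
"""


def parse_hcsv(text):
    """Split text into (metadata, rows).

    metadata maps INI section names to dicts; rows is a list of dicts
    keyed by the CSV column names. Text without a delimiter line is
    treated as plain CSV with empty metadata.
    """
    lines = text.splitlines()
    if HEADER_DELIMITER in lines:
        split_at = lines.index(HEADER_DELIMITER)
        header_lines, body_lines = lines[:split_at], lines[split_at + 1:]
    else:
        header_lines, body_lines = [], lines

    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep column names case-sensitive
    parser.read_string("\n".join(header_lines))
    metadata = {name: dict(parser[name]) for name in parser.sections()}

    body = io.StringIO("\n".join(body_lines), newline="")
    rows = list(csv.DictReader(body))
    return metadata, rows
```

Calling `parse_hcsv(SAMPLE)` gives the metadata sections and the data rows separately, and dropping everything up to and including the `---` line leaves a file Excel can open. One caveat of this sketch: a headerless CSV that happens to contain a bare `---` line would be misread as having a header.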