frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.
https://datapackage.org
The Unlicense

[jts] Refine field/column data types in JSON Table Schema #159

Closed rufuspollock closed 9 years ago

rufuspollock commented 9 years ago

This is a discussion issue for potentially refining the list of column / field data types in JSON Table Schema:

Also worth comparing against the in-progress W3C spec (which has been informed by this spec ...): http://w3c.github.io/csvw/metadata/index.html#datatypes

pwalsh commented 9 years ago

cc @tryggvib

Suggestion:

Several data types are just format patterns for strings.

JSON Schema has a set of "built-in" formats, and "allows" for custom formats.

Getting JSON Table Schema closer to JSON Schema in this regard would potentially be useful (3rd party tools for validation, etc.).

For built-in formats of the string type in JSON Schema, see:

http://spacetelescope.github.io/understanding-json-schema/reference/string.html#format

I propose the following:

I believe that this simplifies JSON Table Schema, by correctly differentiating between a type and a format that the type should conform with.

Following from this, there is another opportunity to simplify the spec with regard to geopoint and geojson. Again, these are really formats for (array or object or string), and I suggest they would be better represented as such:

eg:

rufuspollock commented 9 years ago

@pwalsh need to think about this one. I'm not sure how much we get moving the "complexity" down into "format" (though maybe, as you suggest, we can reuse json schema stuff more ...). I also acknowledge that format is currently underspecified generally.

Could you also summarize in a list what the new set of types (+ formats) would be.

/cc @paulfitz @jpmckinney

pwalsh commented 9 years ago

I'll give one example of where tightening the type/format relationship could be useful for code that implements the spec:

As a user, if I have an object that is geojson, I'd define it as type geojson in the current spec.

With the current spec, the validation library would have to explicitly support geojson as a type - I don't think any do at present (may be wrong though..).

However, if the user was able to declare the object as type object and format geojson, we might expect that the validation library could at least validate it as a validly formed object, even if it doesn't support the geojson format.
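A minimal sketch of that fallback behaviour, in Python. Everything here is an assumption for illustration: the `validate` function, the `TYPE_CHECKS` and `FORMAT_CHECKS` registries, and the deliberately shallow `is_geojson` check are hypothetical and do not come from any existing JTS library.

```python
# Hypothetical mapping from primitive type names to Python types.
TYPE_CHECKS = {
    "object": dict,
    "array": list,
    "string": str,
}


def is_geojson(value):
    # Deliberately shallow check, for illustration only:
    # a GeoJSON object always carries a "type" member.
    return isinstance(value, dict) and "type" in value


# Hypothetical registry of the (type, format) pairs this validator knows.
FORMAT_CHECKS = {
    ("object", "geojson"): is_geojson,
}


def validate(value, field):
    """Return (ok, note) for a value checked against a field descriptor."""
    expected = TYPE_CHECKS.get(field["type"])
    if expected is None:
        # e.g. "type": "geojson" in a library that never implemented it
        return False, "unsupported type"
    if not isinstance(value, expected):
        return False, "wrong type"

    fmt = field.get("format")
    if fmt is None:
        return True, "type checked"

    check = FORMAT_CHECKS.get((field["type"], fmt))
    if check is None:
        # Unknown format: the primitive type has still been validated.
        return True, "type checked, format not supported"
    if check(value):
        return True, "type and format checked"
    return False, "wrong format"
```

With this shape, a field declared as `{"type": "object", "format": "topojson"}` still gets its primitive type checked by a validator that has never heard of `topojson`, whereas `{"type": "geojson"}` can only be rejected as unsupported.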

By reducing the primitive type set, I think it makes it easier in the following areas:

So in general I'm suggesting:

jpmckinney commented 9 years ago

+1 for @pwalsh's three bulleted reasons, plus it avoids an unnecessary difference with JSON Schema.

Separately: why are there three serializations of a geopoint?

pwalsh commented 9 years ago

@jpmckinney because the current JSON Table Schema spec prescribes these three forms for geopoint, each being valid.

Plus, it serves as an interesting example of formats > types - these three forms are quite common in the wild.

jpmckinney commented 9 years ago

@pwalsh I've seen the array and string forms in the wild - but the object form? I've seen all sorts of keys: lat/lng, lat/long, latitude/longitude, etc. The object form seems the least standardized.

pwalsh commented 9 years ago

@jpmckinney fair enough. In any event, they are here because they are all described in the current spec (obj form using lat and lon).
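For illustration, a small Python sketch of how code might normalise the three forms to one value. The `parse_geopoint` name is hypothetical, and the longitude-first ordering of the array and string forms is an assumption made for this sketch; only the `lat` and `lon` keys of the object form are taken from the discussion above.

```python
def parse_geopoint(value):
    """Normalise a geopoint to a (lon, lat) tuple of floats.

    Assumption: array and string forms are ordered longitude first.
    """
    if isinstance(value, dict):
        # Object form, using the lat and lon keys.
        lon, lat = value["lon"], value["lat"]
    elif isinstance(value, (list, tuple)):
        # Array form: exactly two items.
        lon, lat = value
    elif isinstance(value, str):
        # String form: two numbers separated by a comma.
        lon, lat = (part.strip() for part in value.split(","))
    else:
        raise TypeError("unsupported geopoint form: %r" % (value,))
    return float(lon), float(lat)
```

All three serializations of the same point then compare equal once parsed, which is the sense in which they behave as formats of one concept.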

rufuspollock commented 9 years ago

Note: re geopoint, I think we should drop the array version. The object and string versions are there to support very common usage (certainly the string version ...). I would be up for narrowing to just one, but the challenge is that whichever one you drop, you exclude a lot of existing use (and convenience, if you drop string).

pwalsh commented 9 years ago

So I guess we should first get general agreement on the type/format implementation I've suggested above.

But, about geopoint specifically, I'd keep all three:

jpmckinney commented 9 years ago

I'm fine with keeping all three. Going back to this comment https://github.com/dataprotocols/dataprotocols/issues/159#issuecomment-70809331 I agree with @pwalsh 's suggestion.

pwalsh commented 9 years ago

A slight amendment to the type/format list above:

binary (which is a base64 encoded string in JTS) should be a format of the string type, IMHO.
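As a rough sketch of what checking such a format could look like in Python: the `is_base64` helper is hypothetical, and it assumes the standard base64 alphabet with padding.

```python
import base64
import binascii


def is_base64(value):
    """Return True if value is a str holding valid standard base64."""
    if not isinstance(value, str):
        # The primitive type check comes first: binary is a string format.
        return False
    try:
        # validate=True rejects characters outside the base64 alphabet.
        base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True
```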

pwalsh commented 9 years ago

@rgrp how do we move forward with this (or not)? I'm just not sure of the process with the data protocols specifications.

rufuspollock commented 9 years ago

@pwalsh can we summarize again what the exact proposal is. Also I'm really not sure how this interacts with format field (which is somewhat underspecified).

Also clarity on pros / cons would be super-useful (obvious con is this is a substantial breaking change but that is not the end of the world). Once we have that we can resolve on this, trial it and then release into the spec.

pwalsh commented 9 years ago

Ok, summary below. BTW, if this is too radical, that's fine. I made the proposal based on (a) working with JTS in code, and (b) under the assumption that the spec is still rather fluid, and there is not necessarily a deep commitment to the stability of the spec as is. It could be that (b) is an incorrect assumption.

Proposal - Standardise field type/format for consistency, ease of use, and wide(r) compatibility with JSON Schema

I find the implementation of the type and format properties of field hashes to be inconsistent and unclear. I'm proposing to reduce the valid types in JSON Table Schema, and make the use of format consistent and explicit for each type.

Making format "consistent and explicit" means that formats, if declared, should always be a particular pattern or structure that type conforms to. In addition, this also means that something should not be a type if it is really a format of another type (e.g.: geojson is really just a format for object).

Relation to JSON Schema

The available types should be consistent with JSON Schema. formats can align with JSON Schema where possible, but also extend and deviate from those in JSON Schema. JSON Schema has a set of "built-in" formats, and "allows" for custom formats.

Getting JSON Table Schema closer to JSON Schema in this regard would potentially be useful for 3rd party tools for validation, etc.

For built-in formats of the string type in JSON Schema, see:

http://spacetelescope.github.io/understanding-json-schema/reference/string.html#format

I propose the following:

The change

The type/format relation I'm suggesting would still cover the entire spec as it stands today, while making the use of both type and format clearer.

User story

I'll give one example of where tightening the type/format relationship could be useful for code that implements the spec:

As a user, if I have an object that is geojson, I'd define it as type geojson in the current spec.

With the current spec, the validation library would have to explicitly support geojson as a type - I don't think any do at present (may be wrong though..).

However, if the user was able to declare the object as type object and format geojson, we might expect that the validation library could at least validate it as a validly formed object, even if it doesn't support the geojson format.

Pros/Cons

Cons

Pros

Reducing the primitive type set is an improvement in the following areas:

rufuspollock commented 9 years ago

This is wonderfully clear and a really great summary @pwalsh.

In general, I think there is a lot to be said for this change.

My only immediate thought is about "promoting" date-time (and I guess date and time) to a first-class type.

Why: it is a "first-class" type - very common. Also could allow us to introduce flexibility on the date-time format using the format string on date-time (default would still be full ISO8601).

Why not: dates are strings in JSON and CSV and are notoriously tricky. This system is consistent with treating them basically as strings with structure to them.

Overall I'm +1 on this change.

@jpmckinney @paulfitz @ldodds - thoughts, votes +1, -1, +0, -0 0.

pwalsh commented 9 years ago

The date/time thing with formatting is a bit tricky, and in order to provide custom formats, I see why the case can be made for date/time as types, and not formats on string.

I thought of suggesting "format modifiers" for this very problem, but held back because it may confuse the bigger issue of type/format clarity.

Basically, the idea of a format modifier would be that certain formats could take a custom pattern. Example for date: "type": "string", "format": "date:%d %b %Y"
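A minimal Python sketch of how a format modifier could be handled, assuming the part after the first colon is a strftime-style pattern such as `%d %b %Y` and that a bare `date` falls back to ISO 8601. The `split_format` and `parse_date` helpers and the `DEFAULT_PATTERNS` table are hypothetical names for this illustration.

```python
from datetime import datetime

# Assumed default when no modifier is given: ISO 8601 calendar date.
DEFAULT_PATTERNS = {"date": "%Y-%m-%d"}


def split_format(fmt):
    """Split 'name:pattern' into (name, pattern); pattern is None if absent."""
    # partition splits on the first colon only, so a pattern that itself
    # contains colons (e.g. '%H:%M') is kept intact.
    name, sep, pattern = fmt.partition(":")
    return name, (pattern if sep else None)


def parse_date(value, fmt):
    """Parse a string value according to a 'date' format with optional modifier."""
    name, pattern = split_format(fmt)
    if name != "date":
        raise ValueError("not a date format: %r" % fmt)
    return datetime.strptime(value, pattern or DEFAULT_PATTERNS["date"]).date()
```

Note that `%b` matches abbreviated month names in the current locale, so a pattern like this is only portable if the spec also pins the locale.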

jpmckinney commented 9 years ago

+1 to proposed changes. We can continue thinking about dates/times, but I think the proposal is in the right direction in that regard.

paulfitz commented 9 years ago

I generally like this, especially for describing tables that are themselves encoded in JSON. For describing types in an sql database, I'm less sure, mostly because of date-time/date/time.

I like that, ignoring the type/format hierarchy, all the geopoint representations are grouped (via format). I don't like that there's no grouping of date-time/date/time representations.

Geopoints showing up as strings, objects, or arrays reminds me of sqlite's notion of "affinities" for dealing with date-times (https://www.sqlite.org/datatype3.html#datetime), which can be stored as strings, reals, or integers.

What if format were retained for variations on a theme, while adding a logicalType field that captures the same basic idea being expressed different ways?

So a date might be "type": "string", "logicalType": "date-time", "format": "date", and a geopoint expressed as an array might be "type": "array", "logicalType": "geopoint" (no format needed until there are geopoints with lat/long in differing orders).
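To make the grouping concrete, here is a short Python sketch over a few field descriptors. The `logicalType` key is the hypothetical field proposed in this comment; the field names and the `group_by_logical_type` helper are made up for the example.

```python
# Example field descriptors using the proposed (hypothetical) logicalType key.
fields = [
    {"name": "created", "type": "string", "logicalType": "date-time", "format": "date"},
    {"name": "location", "type": "array", "logicalType": "geopoint"},
    {"name": "home", "type": "string", "logicalType": "geopoint"},
]


def group_by_logical_type(fields):
    """Group field names by logicalType, falling back to the primitive type."""
    groups = {}
    for field in fields:
        key = field.get("logicalType", field["type"])
        groups.setdefault(key, []).append(field["name"])
    return groups
```

Here the array and string geopoints land in the same group even though their primitive types differ, which is the grouping that format alone does not give for date-time/date/time.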

ldodds commented 9 years ago

I think I'm -1 on this. I'd prefer to see the direction of travel be alignment with the W3C metadata vocabulary. I've already advocated for use of XML Schema type system in the other type changes that I've proposed (and which have been partially adopted).

I'm not clear why alignment with JSON Schema is that desirable, as opposed to alignment with the W3C guidance.

rufuspollock commented 9 years ago

@pwalsh I've also been reflecting and I think that my major considerations would be:

This makes me think:

@ldodds there is a bit of circularity re W3C metadata vocab since that is in part coming from here too. Also, it sounds like you are +1 on a significant upgrade, but just in a different direction. Interested in your thoughts on this comment.

Aside: if we do a significant change we probably could fold this in with #126 (switch to datatype from type)

pwalsh commented 9 years ago

@ldodds: about the W3 XML Schema, I don't really see why XML Schema is preferable to JSON Schema either, but, the JTS spec as written certainly engages more with JSON Schema than W3 XML Schema. As a relative newcomer to JTS (me), that alone makes a significant difference. Also see #46.

@rgrp: ok now that SQL comes into the picture that changes things (for me) somewhat. I approached JTS as a way to declare schemas for text (CSV) and JSON sources, hence the emphasis on string formats. Perhaps the spec needs to be more explicit in the expected applications of JTS - the only SQL reference I picked out of the spec was to show contrast to a text-based table.

So, I would def. support date/time types if we are not just talking text/json-based data sources here.

About geocsv I didn't know it, but I'll check it out.

rufuspollock commented 9 years ago

@pwalsh point re SQL is that in terms of people using tabular data (and esp using CSV), SQL is a very obvious load target (plus things like bigquery, redshift etc tend to be close to SQL). To be clear, I'm not suggesting we support all of SQL, but an easy ability to map (pretty well) is def desirable.

ldodds commented 9 years ago

@pwalsh my reasoning here is that a specification should be as clear as possible. Rather than defining a new type system we can use one that already exists. I don't think the JSON Schema type system is particularly well defined. It defines types that are arguably unnecessary for a tabular data structure (JSON objects) and doesn't define other types that are useful (date, uri, etc). Rather than define those things in JTS, we can just refer to another specification.

As an aside I was wondering whether datapackage should just let a user say what type of schema it includes, which might be either a JSON Schema or the new W3C format. Then ditch JTS altogether. Then rather than try to align various specifications, we just let people choose what they want to use.

rufuspollock commented 9 years ago

@ldodds i have to say i'm -1 or -0 on just using full xsd types. Real aim for simplicity here. Also W3C format is a long way from done and is significantly reusing JTS :-) I also think JTS will be useful on its own.

As such I don't think making this "choose your own" is the way to go for Tabular Data Package per se - though I strongly +1 idea of allowing schema field to mean other things for other extensions of Data Package :-)

ldodds commented 9 years ago

I wasn't suggesting supporting all of XSD, just that we draw from its type system which is well defined. See also discussions in #124 & #96.

fwiw, csvlint already supports using some of the XSD schema types and people are creating & applying schemas using it. haven't found any particular complexity from an implementation point of view.

pwalsh commented 9 years ago

So, some great points have been brought up in the thread; here is a revised version of my type/format proposal. The goal is still to have explicit definitions of types and the supported formats per type.

rufuspollock commented 9 years ago

@pwalsh I think this looks great. The only thing in the spec would be to make clear that format is always optional (e.g. string does not require format). Re geopoint, could we lose the '-geopoint' in the format descriptions? Also, for string, the default should be lng then lat (I understand that to be an underlying standard), and I think we then add 'string-latlng' or similar as a format.

Would you be happy to submit a pull request with these changes to the spec and we can get it in (remember to note the changes at the top of the spec).

pwalsh commented 9 years ago

Sure. I'll do a pull request on the spec for this over the next few days.

jpmckinney commented 9 years ago

:+1: Re geojson formats, we might consider topojson?

pwalsh commented 9 years ago

Good idea. Seems practical. I'll add it to the pull request too.

rufuspollock commented 9 years ago

FIXED. This was fixed in PR #168.