Closed pwalsh closed 7 years ago
Yes, the example you take is indeed very unclear. It took me some time to understand that the repetition was for each field: at least all schema fields sections should be clearly visually separated from each other.
Nevertheless, there are some fields which are mandatory regardless of type (e.g. name
).
From the current presentation it's nearly impossible to understand that.
I guess the major problem is Table Schema fields. I have some points on that, but I'll wait for more feedback, and see what else comes up in general.
fields[].name
required is missed in descriptor section (only in overview)fields[].constraints.xxx
props feel like fields[].xxx
props)#
right to header){PATTERN}: The value can be parsed according to {PATTERN}, which MUST follow the date formatting syntax of C / Python strptime.
format: {mm-dd-yyyy}
or format: mm-dd-yyyy
. It's more common to use <PATTERN>
for values to replace by user but anyway some explicit note or example I suppose should be added to remove any possible confusionformat: mm-dd-yyyy
Here is decided to apply constraints only on cast values - https://github.com/frictionlessdata/specs/issues/296#issuecomment-268471008
It seems it's kind of mechanical mistake that pattern
constrains intended only for strings
have leaked to other types in current Table Schema v1.0.0rc1
@roll it was not decided there - it was a suggestion I made that was never confirmed by anyone else in that thread. It also needs to be field specific, as only applying on cast values cannot work for date/time fields at least. The solution in a call following that thread was to just be explicit on constraints for every single type.
by @pwalsh
there is a bug here (about
missingValues
defaults) -https://github.com/frictionlessdata/specs/blob/master/sources/dictionary/tableschema.yml#L167
From Table Schema
:
Form
The descriptor MUST be valid JSON, as described in RFC 4627, and SHOULD be in one of the following forms:
- A file named tableschema.json.
- An object, either on its own or nested in another data structure.
Doesn't it sounds too restrictive (ok it's only SHOULD not MUST but anyway)? Because I suppose the most common way to name your table schema after name of the data file like:
fields
paragraph is really shy on explanations - it's mostly an example and concrete types paragraphs further. So here we could have:PS.
We have basic concepts explanation (name, constraint etc) in the very beginning of the spec but there is a big distance in pages between it and fields
paragraph.
@pwalsh
So what I think we miss in current Table Schema
spec is something like Concepts
section inside the Specification
section with list of key spec concepts. And it could solve problem with clarity on constraints applying we was discussing in Slack. So something like this:
There is a note about tabular data but should we provide a quick introduction on what could be described by the spec?
What is data value that could be described by the field object. How it's related to field type/format. If data value conforms to a field it means that it must conform to type/format of field. Describe concept of raw and typed (cast, parsed) data value?
The same as in SQL null value is important concept of the spec. So we should clarify what does it mean if value is inside missingValues
. In SQL null is not an implementation concept but the spec concept. The same for Table Schema I suppose.
Field could have constraints but what does it mean? What is constraint value (related to data value)? It mean something like - data value only conforms to field if it satisfy all field constraint. Where satisfying a constraint means:
I do understand there is a good chance to touch some implementations details we don't want. But with good wording I suppose we could find a good balance between clarifying core concepts and not being implementation-specific.
In Table Schema spec the following example is wrong:
{ "name": "extra" "type": "object" }
.
A comma is missing.
@roll the number docs are all old docs, nothing new or changed there, so maybe the above comment is a good candidate for a distinct issue (it is not something I'd want to address as part of fixing v1 + display issues).
@pwalsh done!)
An identifier string. Lower case characters with ., _, - and / are allowed.
seems better to be An identifier string. It MUST contain only lower case characters with ., _, - and /.
.description (markdown is encouraged)
- plain text not encouraged?Boolean/Any
types misses format
propert (other types with only default
option still have it)PS. upd comment to don't spam people too much
author
is mentioned in one of the data packages examples, but it doesn't seem to be specified.
In data package properties, role
is not specified at all.
A Table Schema descriptor SHOULD include the following properties.
It's optional. Shouldn't it be MAY?
Items
Each item in the array is a string. The property is required, and other defined properties are optional.
Not clear what second sentence mean.
The whole section should be reviewed I suppose (just not finished).
HTML code is not valid, see:
@pwalsh Here are my 2 cents (or perhaps a bit more than 2) after a full read of the specs:
In general is really hard to navigate visually from one section to another on the Properties section (the most important one). There is no clear separation between different Field types and constraints, specially as all fields share a lot of the same text and properties. Things that could help:
Specification
Examples
Descriptor
Properties
Required
fields
string
number
...
Optional
primaryKey
foreignKey
missingValues
* Format field type titles differently, instead of `String Field` use either
1. `string` fields
2. String fields
In any case, I would start every section with:
> Fields of type `string` contain sequences of characters.
>
> Fields of type `geojson` contain a JSON object according to GeoJSON or TopoJSON
>
> ...
This would make clearer that we are talking about different types of the same object (a Table Schema Field).
* Indent constraint properties to show that they are a level below field types
Add anchors to the different sections, as this is really used on spec sites to link to specific items.
Agree with @akariv that if name
(or any other property) is mandatory that should be displayed on the properties list (I know this is mentioned on the "Specification" section but people will likely jump to the actual properties)
On name
if "This is ideally a url-usable and human-readable name" then it's probably best to explicitly say "This SHOULD
be a url-usable and human-readable name".
On enum
it says "Each enum item MUST
comply with the type and format of the property." but we probably need to add ", as well as any other specific constraint." (to ensure values adhere to minimum
, maximum
etc)
rdfType
: This needs much more detail and ideally an example. What is an RDF type? an RDF class, an IRI?
The description for currency
is really confusing: "A number that may include additional currency symbol". Do you need to provide an actual number? The currency symbol or name?
For Date
, DateTime
etc, it would be really good to show an example pattern and example ISO date so people don't need to click further links to get an idea of what's supported
minLength
and maxLenght
don't apply to date field, Object
fields, etc and yet they are part of their properties.
I agree with @roll that for "Optional properties" we should say "A Table Schema descriptor MAY
include the following properties."
The "Items" section on primaryKey
is confusing:
Each item in the array is a string. The property is required, and other defined properties are optional.
As I understand it the property is not required, but if present it MUST
contain at least one item right? And AFAICT there are no other defined properties for primaryKey
. My suggestion:
Each item in the array is a string. If present,
primaryKey
MUST
contain at least one item.
The foreingKey
reference
property is missing descriptions of its own properties (resource
, fields
)
missingValues
: Wouldn't it make sense to use null
as default value for string and non-string values? so ''
and null
for non-string fields and null
for string fields.
description
: "Markdown is encouraged". I think this is quite opinionated (and goes against the principle of simplicity). I personally much prefer metadata to be on plain text (specially as integrator). At least we could say "Markdown is supported" or something similar.
schema
: "A schema for this resource." This is a pretty critical property, it should have more details. What form of schema? Can this be an arbitrary schema defined by yourself, a JSON schema, a Table Schema ?homepage
does not have the properties described (like source
), ie name
, uri
.data
:
The dereferenced value of each referenced data source in the
data
arrayMUST
be commensurate with a native, dereferenced representation of the data the resource describes
That is quite a mouthful. Perhaps the use of "dereferenced" is justified, but perhaps replace "commensurate" with "match" or something similar?
Also this example is really confusing:
{
"data": [
"#/data/my-data",
"#/data/my-data2"
]
}
The data
in the JSON Pointer seems to reference the own data
property in the Resource itself. After some thought, and if I understood the specs correctly this complete example would be:
{
"name": "my-data-package",
"data": {
"my-data": [{...}],
"my-data2": [{...}]
}
"resources": [
{
"name": "my-data-resource",
"data": [
"#/data/my-data",
"#/data/my-data2"
]
}
]
}
I wonder if it would make more sense to show a simplified version of this that includes the actual property referenced, eg
{
"data": [
"#/resources/0/records/my-records1",
"#/resources/0/records/my-records2",
],
"records": {
"my-records1": [{...}],
"my-records2": [{...}]
}
}
data
property, but according to the docs "data
MUST
be an array of valid URIs."profile
property shows { "profile": "tabular-data-package" }
, which is confusing. I guess it should be { "profile": "tabular-data-resource" }
.dialect
: Link to the CSV Dialect specSame points as in Table Schema on difficulty to navigate the properties hierarchily
homepage
does not have the properties described (like source
), ie name
, uri
.
contributors
: role
property lacks description.
Specification: "Each resource
MUST
be a valid Tabular Data Resource". So if a Data Package has 3 resources, 2 CSVs and 1 PDF is not a Tabular Data Package? Or scripts, as suggested by the example a bit furhter down
On the examples it says:
A minimal Tabular Data Package on disk would be a directory containing a single file: datapackage.json
That would not be a valid Tabular Data Package according to the spec above.
profile
shows the generic blob for profiles, but if I understood correctly this should always be { "profile": "tabular-data-resource" }
@amercader
So if a Data Package has 3 resources, 2 CSVs and 1 PDF is not a Tabular Data Package
Hmm, good question. My understanding was that it would have 2 resource
s (the csv files) defined in the datapackage.json
, while the PDF file would be included in the bundle of file, but not be referred to in the JSON. But now that doesn't sound right to me.
@amercader @stevage
So if a Data Package has 3 resources, 2 CSVs and 1 PDF is not a Tabular Data Package
Correct, a Tabular Data Package requires that each resource is a Tabular Data Resource. Until v1, Data Resource was not a top-level concept, so, practically speaking, it would not have been possible to have non-tabular data as resources in a TDP. However, now that we have Data Resources specified, and they too have profiles, then it is easier to declare a generic Data Package where some Data Resources are of one type, and other of another.
The data in the JSON Pointer seems to reference the own data property in the Resource itself. After some thought, and if I understood the specs correctly this complete example would be
And yes, and no) That's the gotcha of JSON Pointers - it goes against newly introduced composability of specs (schema-resource-package).
Example from Data Resource
:
dataresource
descriptordatapackage
descriptorbecause those descriptors have different roots.
Comment: I'm finding this thread sort of tough to follow as we get more interleaving comments. What would people think of a hackmd doc (e.g. this blank one) where we could consolidate things?
The description for currency is really confusing: "A number that may include additional currency symbol". Do you need to provide an actual number? The currency symbol or name?
May be it means currency
is a boolean flag?
@roll let's try to seperate new display issues, from things like this currency issue which is simply wording from the old specs
@pwalsh (cc @amercader) prev1:
gyearmonth
A specific month in a specific year as per XMLSchema gYearMonth.
Usual lexical representation is: YYYY-MM. There are no format options.
v1:
Year Month Field
A calendar year month, being an integer with 1 or 2 digits. Equivalent to gYearMonth in XML Schema
upd. prev1 is correct
Table Schema datetime
default format:
default: An ISO8601 format string for datetime.
It should provide a concrete pattern like it was in pre-v1 - https://pre-v1.frictionlessdata.io/json-table-schema/#date. Just ISO8601 is a not concrete enough.
REQUEST: no more commenting in this issue as the comment thread is becoming unreadable.
Please post stuff in the hackmd https://hackmd.io/CwUwzAbAxgDAJjAtNOZHAGYYIyIBxgBGAhusHtlIQOxRwCcEcQA=
DUPLICATE. Info went elsewhere e.g. #420
This is a placeholder. All known bugs have already been fixed, but there are still possibly issues with readability, as the spec generation has changed significantly.
Some of the changes were very purposeful.
Example: each Table Schema field is self contained and repeats data, but this was to make some things very explicit per field, such as exactly which constraints each field supports, rather than in the old presentation, where much was ambiguous (evidenced by some questions in issues, and even in the implementations themselves).
However we obviously need ensure that things are clear and understandable.
@stevage @rufuspollock @roll @Stiivi @akariv
I'd really appreciate input from all of you, if you have time, by detailed comments on this thread.
Please do not hold back - if I lost too much readability in shifting to a generated spec rather than a strictly narrative one, then I also need to resolve this (and it is easy enough to resolve).