pwalsh commented 7 years ago

This is a placeholder. All known bugs have already been fixed, but there are still possibly issues with readability, as the spec generation has changed significantly.

Some of the changes were very purposeful.

Example: each Table Schema field is self contained and repeats data, but this was to make some things very explicit per field, such as exactly which constraints each field supports, rather than in the old presentation, where much was ambiguous (evidenced by some questions in issues, and even in the implementations themselves).

However, we obviously need to ensure that things are clear and understandable.

@stevage @rufuspollock @roll @Stiivi @akariv

I'd really appreciate input from all of you, if you have time, by detailed comments on this thread.

Please do not hold back - if I lost too much readability in shifting to a generated spec rather than a strictly narrative one, then I also need to resolve this (and it is easy enough to resolve).

CharlesNepote commented 7 years ago

Yes, the example you take is indeed very unclear. It took me some time to understand that the repetition was for each field: at least all schema fields sections should be clearly visually separated from each other.

akariv commented 7 years ago

Nevertheless, there are some fields which are mandatory regardless of type (e.g. name). From the current presentation it's nearly impossible to understand that.

pwalsh commented 7 years ago

I guess the major problem is Table Schema fields. I have some points on that, but I'll wait for more feedback, and see what else comes up in general.

roll commented 7 years ago

http://specs.frictionlessdata.io/table-schema/
- fields[].name required is missing from the descriptor section (it appears only in the overview)
- nested levels of properties are hard to recognize (e.g. fields[].constraints.xxx props look like fields[].xxx props)
- the same goes for the distinction between field types and attributes (everything is red with the same indentation)
- we have lost the ability to link to headers like https://pre-v1.frictionlessdata.io/data-packages/#table-of-contents (@danfowler added it in the previous version - see the # to the right of the header)

roll commented 7 years ago

{PATTERN}: The value can be parsed according to {PATTERN}, which MUST follow the date formatting syntax of C / Python strptime.
- This is kinda confusing - should it be format: {mm-dd-yyyy} or format: mm-dd-yyyy? It's more common to use <PATTERN> for values to be replaced by the user, but anyway I suppose some explicit note or example should be added to remove any possible confusion

pwalsh commented 7 years ago

format: mm-dd-yyyy
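
To make the brace question concrete, here is a minimal sketch in Python. It assumes the pattern is written with strptime directives (so `%m-%d-%Y` rather than the literal `mm-dd-yyyy`) and is stored in `format` without braces. The field name `published` and the helper `cast_date` are hypothetical and only illustrate the idea.

```python
from datetime import datetime

# Hypothetical field descriptor: the pattern is written directly, without braces.
field = {"name": "published", "type": "date", "format": "%m-%d-%Y"}


def cast_date(raw, field):
    """Parse a raw string using the strptime pattern stored in the field's format."""
    return datetime.strptime(raw, field["format"]).date()


parsed = cast_date("03-28-2017", field)
```

With braces kept in the pattern, strptime would treat them as literal characters and reject the same value.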

roll commented 7 years ago

Here it was decided to apply constraints only to cast values - https://github.com/frictionlessdata/specs/issues/296#issuecomment-268471008

It seems to be a kind of mechanical mistake that pattern constraints, intended only for strings, have leaked to other types in the current Table Schema v1.0.0rc1

pwalsh commented 7 years ago

@roll it was not decided there - it was a suggestion I made that was never confirmed by anyone else in that thread. It also needs to be field specific, as only applying on cast values cannot work for date/time fields at least. The solution in a call following that thread was to just be explicit on constraints for every single type.

roll commented 7 years ago

by @pwalsh

there is a bug here (about missingValues defaults) -https://github.com/frictionlessdata/specs/blob/master/sources/dictionary/tableschema.yml#L167

roll commented 7 years ago

From Table Schema:

Form
The descriptor MUST be valid JSON, as described in RFC 4627, and SHOULD be in one of the following forms:
- A file named tableschema.json.
- An object, either on its own or nested in another data structure.

Doesn't it sound too restrictive (ok, it's only SHOULD, not MUST, but anyway)? Because I suppose the most common way is to name your table schema after the name of the data file, like:

gdp.csv
gdp-schema.json

roll commented 7 years ago

http://specs.frictionlessdata.io/table-schema/
- initial examples contain only very basic data types and features. So as a {NEW COMER} I could be much more impressed and encouraged to read to the end if I saw more interesting use cases, e.g. inside the complex Table Schema example (dates/geo/string patterns/etc)

roll commented 7 years ago

http://specs.frictionlessdata.io/table-schema/
- the fields paragraph is really shy on explanations - it's mostly an example, with the concrete type paragraphs following further down. So here we could have:
  - that fields could have different types (and below is a list of these types)
  - an introduction to basic concepts: name, format, constraint etc

PS. We have basic concepts explanation (name, constraint etc) in the very beginning of the spec but there is a big distance in pages between it and fields paragraph.

roll commented 7 years ago

@pwalsh So what I think we miss in the current Table Schema spec is something like a Concepts section inside the Specification section, with a list of key spec concepts. It could solve the problem with clarity on applying constraints that we were discussing in Slack. So something like this:

Concepts

tabular data

There is a note about tabular data, but should we provide a quick introduction to what could be described by the spec?

data value

What is a data value that can be described by the field object? How is it related to the field type/format? If a data value conforms to a field, it means that it must conform to the type/format of the field. Should we describe the concept of raw and typed (cast, parsed) data values?

null value

As in SQL, the null value is an important concept of the spec. So we should clarify what it means if a value is inside missingValues. In SQL, null is not an implementation concept but a spec concept. The same goes for Table Schema, I suppose.
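
A minimal sketch of that idea in Python, assuming missingValues is a list of raw strings that are read as null before any casting happens, and assuming the empty string as the only default. The helper name `to_null` and the sample row are hypothetical.

```python
def to_null(raw, missing_values=("",)):
    """Return None when the raw cell matches one of the missingValues strings."""
    return None if raw in missing_values else raw


# Hypothetical schema-level setting and raw cells from one CSV row.
missing_values = ["", "NA", "-"]
row = ["Alice", "NA", "", "42"]
cleaned = [to_null(cell, missing_values) for cell in row]
```

Under these assumptions "NA" only becomes null when the schema lists it; with the default it stays an ordinary string.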

constraints

A field could have constraints, but what does that mean? What is a constraint value (in relation to a data value)? It means something like: a data value only conforms to a field if it satisfies all of the field's constraints. Satisfying a constraint means:

typed data value must conform to constraint rule using typed (when applicable) constraint value

I do understand there is a good chance of touching on some implementation details we don't want. But with good wording I suppose we could find a good balance between clarifying core concepts and not being implementation-specific.
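
As a rough illustration of "typed data value against typed constraint value", here is a sketch in Python for an integer field. It assumes the constraint value may arrive as a string and is cast the same way as the data value; the field `age` and the helper `satisfies_minimum` are hypothetical, not spec wording.

```python
def satisfies_minimum(raw_value, raw_minimum):
    """Cast both the data value and the constraint value, then compare them."""
    return int(raw_value) >= int(raw_minimum)


# Hypothetical integer field whose constraint value is given as a string.
field = {"name": "age", "type": "integer", "constraints": {"minimum": "10"}}
minimum = field["constraints"]["minimum"]

typed_check = satisfies_minimum("9", minimum)  # compares the integers 9 and 10
raw_check = "9" >= minimum                     # compares the strings "9" and "10"
```

The two checks disagree for the same input, which is the ambiguity a Concepts section would need to settle.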

CharlesNepote commented 7 years ago

In Table Schema spec the following example is wrong: { "name": "extra" "type": "object" }.

A comma is missing.
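
A quick way to confirm this, sketched with Python's standard json module. The `fixed` string is simply the same example with the comma added; the helper `is_valid_json` is a hypothetical name.

```python
import json

broken = '{ "name": "extra" "type": "object" }'
fixed = '{ "name": "extra", "type": "object" }'


def is_valid_json(text):
    """Return True when the text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True
```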

roll commented 7 years ago

moved to https://github.com/frictionlessdata/specs/issues/393

pwalsh commented 7 years ago

@roll the number docs are all old docs, nothing new or changed there, so maybe the above comment is a good candidate for a distinct issue (it is not something I'd want to address as part of fixing v1 + display issues).

roll commented 7 years ago

@pwalsh done!)

http://specs.frictionlessdata.io/table-schema/
- Based on specs language this sentence An identifier string. Lower case characters with ., _, - and / are allowed. seems better to be An identifier string. It MUST contain only lower case characters with ., _, - and /..
- description (markdown is encouraged) - plain text not encouraged?
- Boolean/Any types are missing the format property (other types with only a default option still have it)
- should types have some logical order in the text? Like array next to object etc.
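
For the first point above (the identifier string), a MUST-style rule could be checked roughly like this in Python. This is only a sketch: it assumes "lower case characters" covers a-z and digits, which the quoted sentence does not spell out, and `is_valid_name` is a hypothetical helper.

```python
import re

# Assumption: "lower case characters" means a-z plus digits; ., _, - and / are allowed.
NAME_PATTERN = re.compile(r"[a-z0-9._/-]+")


def is_valid_name(name):
    """Return True when the whole name uses only the allowed characters."""
    return NAME_PATTERN.fullmatch(name) is not None
```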

PS. Updated this comment so as not to spam people too much

CharlesNepote commented 7 years ago

author is mentioned in one of the data packages examples, but it doesn't seem to be specified.

CharlesNepote commented 7 years ago

In data package properties, role is not specified at all.

roll commented 7 years ago

Table Schema

Optional properties

A Table Schema descriptor SHOULD include the following properties.

It's optional. Shouldn't it be MAY?

primaryKey

Items

Each item in the array is a string. The property is required, and other defined properties are optional.

Not clear what the second sentence means.

foreignKeys

The whole section should be reviewed I suppose (just not finished).

roll commented 7 years ago

http://specs.frictionlessdata.io/table-schema/
- One important thing, I suppose, is that the fields of different types contain descriptor examples but don't contain any data examples. That's essential to understanding the types, I think.

CharlesNepote commented 7 years ago

HTML code is not valid, see:

amercader commented 7 years ago

@pwalsh Here are my 2 cents (or perhaps a bit more than 2) after a full read of the specs:

Table Schema

Formatting and readability

In general it is really hard to navigate visually from one section to another in the Properties section (the most important one). There is no clear separation between different Field types and constraints, especially as all fields share a lot of the same text and properties. Things that could help:

* Include more levels on the TOC:


Specification
Examples
Descriptor
Properties
    Required
        fields
            string
            number
            ...
    Optional
        primaryKey
        foreignKey
        missingValues



* Format field type titles differently, instead of `String Field` use either 

   1. `string` fields 
   2. String fields

   In any case, I would start every section with:

   > Fields of type `string` contain sequences of characters.
   >
   > Fields of type `geojson` contain a JSON object according to GeoJSON or TopoJSON
   >
   > ...

   This would make clearer that we are talking about different types of the same object (a Table Schema Field).

* Indent constraint properties to show that they are a level below field types

* Add anchors to the different sections, as this is really used on spec sites to link to specific items.

Specs language

Agree with @akariv that if name (or any other property) is mandatory that should be displayed on the properties list (I know this is mentioned on the "Specification" section but people will likely jump to the actual properties)
On name if "This is ideally a url-usable and human-readable name" then it's probably best to explicitly say "This SHOULD be a url-usable and human-readable name".
On enum it says "Each enum item MUST comply with the type and format of the property." but we probably need to add ", as well as any other specific constraint." (to ensure values adhere to minimum, maximum etc)
rdfType: This needs much more detail and ideally an example. What is an RDF type? an RDF class, an IRI?
The description for currency is really confusing: "A number that may include additional currency symbol". Do you need to provide an actual number? The currency symbol or name?
For Date, DateTime etc, it would be really good to show an example pattern and example ISO date so people don't need to click further links to get an idea of what's supported
minLength and maxLength don't apply to date fields, object fields, etc., and yet they are part of their properties.
I agree with @roll that for "Optional properties" we should say "A Table Schema descriptor MAY include the following properties."
The "Items" section on primaryKey is confusing:

Each item in the array is a string. The property is required, and other defined properties are optional.

As I understand it the property is not required, but if present it MUST contain at least one item right? And AFAICT there are no other defined properties for primaryKey. My suggestion:

Each item in the array is a string. If present, primaryKey MUST contain at least one item.
The foreignKey reference property is missing descriptions of its own properties (resource, fields)
missingValues: Wouldn't it make sense to use null as default value for string and non-string values? so '' and null for non-string fields and null for string fields.
description: "Markdown is encouraged". I think this is quite opinionated (and goes against the principle of simplicity). I personally much prefer metadata to be in plain text (especially as an integrator). At least we could say "Markdown is supported" or something similar.
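
To show what the suggested primaryKey wording above would mean in practice, here is a minimal sketch in Python. It only covers the array form quoted in this comment, and the helper `check_primary_key` is hypothetical.

```python
def check_primary_key(schema):
    """Return a list of problems found with the array form of primaryKey."""
    if "primaryKey" not in schema:
        return []  # the property itself is optional

    key = schema["primaryKey"]
    if not isinstance(key, list) or len(key) == 0:
        return ["If present, primaryKey MUST contain at least one item"]

    # Each item in the array is expected to be a string.
    return [
        f"primaryKey item {item!r} is not a string"
        for item in key
        if not isinstance(item, str)
    ]
```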

Data Resource

schema: "A schema for this resource." This is a pretty critical property, it should have more details. What form of schema? Can this be an arbitrary schema defined by yourself, a JSON schema, a Table Schema ?
homepage does not have the properties described (like source), ie name, uri.
data :

The dereferenced value of each referenced data source in the data array MUST be commensurate with a native, dereferenced representation of the data the resource describes

That is quite a mouthful. Perhaps the use of "dereferenced" is justified, but perhaps replace "commensurate" with "match" or something similar?

Also this example is really confusing:
```
{
  "data": [
      "#/data/my-data",
      "#/data/my-data2"
   ]
}
```
The data in the JSON Pointer seems to reference the own data property in the Resource itself. After some thought, and if I understood the specs correctly this complete example would be:
```
{
   "name": "my-data-package",
   "data": {
       "my-data": [{...}],
       "my-data2": [{...}]
   },
   "resources": [
       {
           "name": "my-data-resource",
           "data": [
              "#/data/my-data",
              "#/data/my-data2"
            ]           
       }
   ]
}
```
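
For illustration, this reading can be checked with a small JSON Pointer resolver in Python. It is a sketch that handles only same-document pointers (those starting with `#`), resolved against the package root; the helper `resolve_pointer` and the sample rows standing in for `[{...}]` are hypothetical.

```python
def resolve_pointer(document, pointer):
    """Resolve a same-document JSON Pointer fragment such as '#/data/my-data'."""
    if not pointer.startswith("#"):
        raise ValueError("this sketch only handles same-document pointers")
    target = document
    for token in pointer[1:].split("/")[1:]:
        # Unescape per RFC 6901: '~1' means '/', '~0' means '~'.
        token = token.replace("~1", "/").replace("~0", "~")
        target = target[int(token)] if isinstance(target, list) else target[token]
    return target


# Hypothetical package mirroring the example above, with sample rows filled in.
package = {
    "name": "my-data-package",
    "data": {"my-data": [{"id": 1}], "my-data2": [{"id": 2}]},
    "resources": [
        {"name": "my-data-resource", "data": ["#/data/my-data", "#/data/my-data2"]}
    ],
}

rows = [resolve_pointer(package, p) for p in package["resources"][0]["data"]]
```

The result depends entirely on which document is taken as the root.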
I wonder if it would make more sense to show a simplified version of this that includes the actual property referenced, eg
```
{
  "data": [
      "#/resources/0/records/my-records1",
      "#/resources/0/records/my-records2"
   ],
   "records": {
       "my-records1": [{...}],
       "my-records2": [{...}]       
   }
}
```

Tabular Data Resource

I think the inline example shown is wrong, as it adds the data into the data property, but according to the docs "data MUST be an array of valid URIs."
Example for the profile property shows { "profile": "tabular-data-package" }, which is confusing. I guess it should be { "profile": "tabular-data-resource" }.
dialect: Link to the CSV Dialect spec

Data Package

Same points as in Table Schema on the difficulty of navigating the properties hierarchically
homepage does not have the properties described (like source), ie name, uri.
contributors: role property lacks description.

Tabular Data Package

Specification: "Each resource MUST be a valid Tabular Data Resource". So if a Data Package has 3 resources, 2 CSVs and 1 PDF is not a Tabular Data Package? Or scripts, as suggested by the example a bit further down
On the examples it says:

A minimal Tabular Data Package on disk would be a directory containing a single file: datapackage.json

That would not be a valid Tabular Data Package according to the spec above.
profile shows the generic blob for profiles, but if I understood correctly this should always be { "profile": "tabular-data-resource" }

stevage commented 7 years ago

@amercader

So if a Data Package has 3 resources, 2 CSVs and 1 PDF is not a Tabular Data Package

Hmm, good question. My understanding was that it would have 2 resources (the csv files) defined in the datapackage.json, while the PDF file would be included in the bundle of files, but not be referred to in the JSON. But now that doesn't sound right to me.

pwalsh commented 7 years ago

@amercader @stevage

So if a Data Package has 3 resources, 2 CSVs and 1 PDF is not a Tabular Data Package

Correct, a Tabular Data Package requires that each resource is a Tabular Data Resource. Until v1, Data Resource was not a top-level concept, so, practically speaking, it would not have been possible to have non-tabular data as resources in a TDP. However, now that we have Data Resources specified, and they too have profiles, it is easier to declare a generic Data Package where some Data Resources are of one type, and others of another.
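
As a rough sketch of that rule in Python, assuming the declared profile string is enough to tell resource types apart (a real check would validate each resource fully) and assuming at least one resource is required. The helper `is_tabular_data_package` and the sample descriptors are hypothetical.

```python
def is_tabular_data_package(descriptor):
    """True when every resource declares the tabular-data-resource profile."""
    resources = descriptor.get("resources", [])
    # Assumption: a package with no resources does not qualify.
    return len(resources) > 0 and all(
        resource.get("profile") == "tabular-data-resource" for resource in resources
    )


# Hypothetical descriptors: two CSVs plus one PDF, versus two CSVs only.
mixed = {
    "resources": [
        {"name": "a", "path": "a.csv", "profile": "tabular-data-resource"},
        {"name": "b", "path": "b.csv", "profile": "tabular-data-resource"},
        {"name": "report", "path": "report.pdf", "profile": "data-resource"},
    ]
}
tabular_only = {"resources": mixed["resources"][:2]}
```

Under these assumptions the mixed descriptor would be declared as a generic Data Package rather than a Tabular Data Package.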

roll commented 7 years ago

The data in the JSON Pointer seems to reference the own data property in the Resource itself. After some thought, and if I understood the specs correctly this complete example would be

Yes and no) That's the gotcha of JSON Pointers - they go against the newly introduced composability of specs (schema-resource-package).

Example from Data Resource:

it's pretty correct for dataresource descriptor
it's not correct for datapackage descriptor

because those descriptors have different roots.

rufuspollock commented 7 years ago

Comment: I'm finding this thread sort of tough to follow as we get more interleaving comments. What would people think of a hackmd doc (e.g. this blank one) where we could consolidate things?

roll commented 7 years ago

The description for currency is really confusing: "A number that may include additional currency symbol". Do you need to provide an actual number? The currency symbol or name?

Maybe it means currency is a boolean flag?

pwalsh commented 7 years ago

@roll let's try to separate new display issues from things like this currency issue, which is simply wording from the old specs

roll commented 7 years ago

@pwalsh (cc @amercader) pre-v1:

gyearmonth
A specific month in a specific year as per XMLSchema gYearMonth.
Usual lexical representation is: YYYY-MM. There are no format options.

v1:

Year Month Field
A calendar year month, being an integer with 1 or 2 digits. Equivalent to gYearMonth in XML Schema

upd. pre-v1 is correct
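
A small sketch in Python of what the pre-v1 wording implies for a raw value. It covers only the usual YYYY-MM lexical form (XML Schema gYearMonth permits more than that), and the helper `cast_yearmonth` is hypothetical.

```python
import re

# Only the usual lexical representation, YYYY-MM, is handled in this sketch.
YEARMONTH = re.compile(r"(\d{4})-(\d{2})")


def cast_yearmonth(raw):
    """Return (year, month) for a value such as '2017-03', or raise ValueError."""
    match = YEARMONTH.fullmatch(raw)
    if match is None:
        raise ValueError(f"not a YYYY-MM value: {raw!r}")
    year, month = int(match.group(1)), int(match.group(2))
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {raw!r}")
    return year, month
```

So the value is a year plus a month, not "an integer with 1 or 2 digits".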

roll commented 7 years ago

Table Schema datetime default format:

default: An ISO8601 format string for datetime.

It should provide a concrete pattern like it did in pre-v1 - https://pre-v1.frictionlessdata.io/json-table-schema/#date. Just ISO8601 is not concrete enough.
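
For example, if the concrete pattern were `YYYY-MM-DDThh:mm:ssZ` in UTC (an assumption for this sketch, not something quoted in this thread), an implementation could be as small as the following Python. The names `DEFAULT_DATETIME` and `cast_datetime` are hypothetical.

```python
from datetime import datetime, timezone

# Assumed concrete default: YYYY-MM-DDThh:mm:ssZ, always in UTC.
DEFAULT_DATETIME = "%Y-%m-%dT%H:%M:%SZ"


def cast_datetime(raw):
    """Parse a raw datetime string in the assumed default format as UTC."""
    return datetime.strptime(raw, DEFAULT_DATETIME).replace(tzinfo=timezone.utc)


parsed = cast_datetime("2017-03-28T14:30:00Z")
```

Without such a pattern, values like "2017-03-28 14:30" or "20170328T1430" are all arguably ISO8601, and implementations may disagree on them.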

rufuspollock commented 7 years ago

REQUEST: no more commenting in this issue as the comment thread is becoming unreadable.

Please post stuff in the hackmd https://hackmd.io/CwUwzAbAxgDAJjAtNOZHAGYYIyIBxgBGAhusHtlIQOxRwCcEcQA=

pwalsh commented 7 years ago

DUPLICATE. Info went elsewhere e.g. #420

frictionlessdata / datapackage

Fixing display of v1 specs #385
