frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.

https://datapackage.org

The Unlicense

481 stars 109 forks source link

Clarify `physical/logical` representation in Table Schema #864

Open roll opened 6 months ago

roll commented 6 months ago

Overview

This paragraph - https://datapackage.org/specifications/table-schema/#physical-and-logical-representation

I think physical term might be confusing (see #621) as it seems to be really meaning lexical or textual while logical sounds easy to understand in my opinion while it might still need to be brainstormed

Subissues:

https://github.com/frictionlessdata/specs/issues/621

nichtich commented 6 months ago

The distinction between physical representation and logical representation is known under many names, e.g. lexical space vs. value space in XSD. Any name may be confusing without explanation. The current form is ok but it might be better to switch to other names. In this case I'd also change "representation" because "representation of data" is confusing as well. My current best suggestion is to use lexical value and logical value instead of physical representation and logical representation.

The current spec also uses "physical contents", this should be changed as well.

roll commented 6 months ago

Thanks @nichtich, I agree

I think, currently, confusion might occur because physical implies being textual:

The physical representation of data refers to the representation of data as text on disk

Although, in general, I guess for majority of people physical regarding data storage means something different

roll commented 6 months ago

BTW lexical is already actually used in the spec - http://localhost:8080/specifications/table-schema/#number

The lexical formatting follows that of decimal in XMLSchema

This sentence I think is very easy to understand so I guess lexical is a good choice

khusmann commented 5 months ago

Hmm, I think a danger of replacing physical with lexical or textual here is that a given logical value can have many different lexical / textual representations... The textual representation of a date is an easy example. What we're wanting to refer to here is specifically the particular lexical/textual form being stored in the actual source data file.

So I actually prefer the current term physical here for that reason, provided we repeatedly emphasize that physical here implies textual as @roll noted.

Although reading through the standards again I'm also now realizing that's not quite the case because we're allowing type info to be associated with JSON source data... so it's actually not purely textual/lexical in a strict sense, which complicates things. Does this mean we throw an error or warn if a numeric field finds numeric values as strings (e.g. "0", "1", "2") in JSON source data? What if a string field schema gets numeric values? etc.

It'd simplify these cases if all "raw" data was just guaranteed to be parsed by the field schema as pure lexical/textual/string, and field props referencing physical values always used strings. If we're including / allowing type info other than string to come from the underlying source data representation, I may reconsider my position on #621, because it makes a case for props referencing physical values be allowed to be any JSON type.

In the spirit of brainstorming to get more ideas flowing:

Other possible terms for physical, lexical, textual value: raw value, source value, underlying value...

Other possible terms for logical value: typed value, parsed value, conceptual value, ... (I actually like the term conceptual value quite a bit; logical has always sounded like a boolean to me...)

nichtich commented 5 months ago

Ok, the issue needs more. The whole section on Concepts needs to be rewritten to better clarify what is meant by "tabular data". Because we also have two levels of description:

the physical table, consisting where every cell is an untyped string
the logical table with typed cell values

There are "raw" tabular data formats (TSV/CSV) and there are tabular data formats with typed values (Excel, SQL, JSON, Parquet... limited to non-nested cells...). I'd say a Table Schema only refers to the former. A SQL Table can be converted to a raw table (just export as CSV) plus a Table Schema (inferred from the SQL Table definition) but SQL Tables are not directly described by Table Schema, nor is any JSON data as wrongly exeplified in the current specification.

khusmann commented 5 months ago

There are "raw" tabular data formats (TSV/CSV) and there are tabular data formats with typed values (Excel, SQL, JSON, Parquet... limited to non-nested cells...). I'd say a Table Schema only refers to the former.

Agreed!

Perhaps it would clear some of the confusion if we renamed "Table Schema" to "Textual Table Schema" or "Delimited Table Schema" to reflect that the schema definition is specifically designed for textual data.

It would also pave the way for future frictionless table schema standards for other types of physical data, e.g. "JSON Table schema", "Excel Table Schema", "SQL Table Schema", which would be designed around the particularities of the types found in those formats.

In that case, we'd have:

The physical values of Textual Table Schema are all strings The physical values of JSON Table Schemas are all JSON data types The physical values of Excel Table Schemas are all Excel data types etc.

As you say, it's much easier to think about conversions between formats, rather than type coercions if we try to use a textual table schema to parse an excel file, for example. The latter has a lot of potential complexity / ambiguity.

roll commented 5 months ago

Although reading through the standards again I'm also now realizing that's not quite the case because we're allowing type info to be associated with JSON source data... so it's actually not purely textual/lexical in a strict sense, which complicates things. Does this mean we throw an error or warn if a numeric field finds numeric values as strings (e.g. "0", "1", "2") in JSON source data? What if a string field schema gets numeric values? etc.

In frictionless-py:

strings in numeric fields will be parsed (no error)
numbers in string fields won't be coerced (error)

roll commented 5 months ago

The conversation is happening here so I'm adding @pwalsh's comment:

@nichtich @roll the original terminology seems pretty standard, eg

https://aws.amazon.com/compare/the-difference-between-logical-and-physical-data-model/

https://www.gooddata.com/blog/physical-vs-logical-data-model/

Whereas I have never come across using "lexical" to represent what is called "physical" in the current terminology.

I read https://github.com/frictionlessdata/specs/issues/864 but honestly physical vs logical seems the most common terminology for describing this and I am not sure I see a good reason to change it.

roll commented 5 months ago

First of all, probably I did not understand it correctly but I never thought about physical and logical in terms described here - https://www.gooddata.com/blog/physical-vs-logical-data-model/. I was thinking that in the case of Table Schema we're talking about basically a data source (like 1010101 on the disc or so-called text in csv) and data target (native programming types like in python and SQL).

So my understanding is that every tabular data resource has a physical data representation (in my understanding of this term). On current computers, it's always just a binary that can be decoded to text in the CSV case or just read "somehow" in case of a non-textual format e.g Parquet. For every format there is a corresponding reader that converts that physical representation to a logic representation (e.g. a pandas dataframe from a csv or parquet file).

I think here it's important to note that the Table Schema implementors never deal with any physical data representation (again based on my understanding of this term). Table Schema doesn't declare rules for csv parsers or parquet readers. In my opinion, Table Schema actually declared only post-processing rules for data that is already in its logical form (read by native readers).

Physical Data -> [ native reader ] -> Source Logical Data -> [ table schema processor ] -> Target Logical Data

For example, for this JSON cell 2000-01-01:

physical data -- binary
source logical data -- string
target logical data -- date (the point where Table Schema adds its value)

Another note, that from a implementor perspective, as said we only have access to Source Logical Data. It means that the only differentiable parameter for a data value is an source logical data type. For example, a Table Schema implementation can parse 2000-01-01 string for a date field because it knows an input logical type and a desired logical type. There is no access to underlying physical representation to have more information about this value. We only see that the input is string. For example, frictionless-py differentiates all the input values into two groups:

string -> process
others -> don't process (actually that's what #621 is about)

So for me it feels that Table Schema's level of abstraction is to provide rules for processing "not typed" string values (lexical representation) and that's basically the only thing this spec really can define while low-level reading can't be really covered. So my point is that physical is not a wrong term or whatever but that we really need to describe parsing lexical values e.g. for dates or missing values rather talking about physical.

cc @peterdesmet

akariv commented 5 months ago

I tend to agree that we actually have 3 states of data in the spec, as you write.

A few notes, though: 1 - you write "Table Schema doesn't declare rules for csv parsers". However, the data package spec does have a csv dialect section and a character encoding setting, which are precisely rules for csv parsers that interact with the physical layer. 2 - 'source logical data' and 'target logical data' are not great names imo as they impose some sort of order between the layers (source and target) which does not apply in many cases (e.g. when writing a data package).

So, I would suggest to follow your lead, and use

"physical layer" for the lower level binary data,
"native format layer" for the data that the native, file format specific drivers work with,
and "logical layer" for the table-schema typed data

On Thu, Jan 25, 2024 at 4:24 PM roll @.***> wrote:

First of all, probably I did not understand it correctly but I never thought about physical and logical in terms described here - https://www.gooddata.com/blog/physical-vs-logical-data-model/. I was thinking that in the case of Table Schema we're talking about basically a data source (like 1010101 on the disc or so-called text in csv) and data target (native programming types like in python and SQL).

So my understanding is that every tabular data resource has a physical data representation (in my understanding of this term). On current computers, it's always just a binary that can be decoded to text in the CSV case or just read "somehow" in case of a non-textual format e.g Parquet. For every format there is a corresponding reader that converts that physical representation to a logic representation (e.g. a pandas dataframe from a csv or parquet file).

I think here it's important to note that the Table Schema implementors never deal with any physical data representation (again based on my understanding of this term). Table Schema doesn't declare rules for csv parsers or parquet readers. In my opinion, Table Schema actually declared only post-processing rules for data that is already in its logical form (read by native readers).

Physical Data -> [ native reader ] -> Source Logical Data -> [ table schema processor ] -> Target Logical Data

For example, for this JSON cell 2000-01-01:

physical data -- binary

source logical data -- string

target logical data -- date (the point where Table Schema adds its value)

Another note, that from a implementor perspective, as said we only have access to Source Logical Data. It means that the only differentiable parameter for a data value is an input logical data type. For example, a Table Schema implementation can parse 2000-01-01 string for a date field because it knows an input logical type and a desired logical type. There is no access to underlying physical representation to have more information about this value. We only see that the input is string. For example, frictionless-py differentiates all the input values into two groups:

string -> process

others -> don't process

So for me it feels that Table Schema's level of abstraction is to provide rules for processing "not typed" string values (lexical representation) and that's basically the only thing this spec really can define while low-level reading can't be really covered

cc @peterdesmet https://github.com/peterdesmet

— Reply to this email directly, view it on GitHub https://github.com/frictionlessdata/specs/issues/864#issuecomment-1910316936, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACAY5NUKZH4VDEZHTGRV3TYQJTJ3AVCNFSM6AAAAABBLUSTP6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMJQGMYTMOJTGY . You are receiving this because you are subscribed to this thread.Message ID: @.***>

roll commented 4 months ago

Hi @nichtich,

Are you interested in working on the updated version of https://github.com/frictionlessdata/datapackage/pull/17 that incorporates comments from this issue?

After working closely with the specs last month and refreshing in my memory implementation details from frictionless-py I came to the conclusion that we actually don't have a very complex problem here.

For example, for a JSON data file like this:

[
  ["id", "date"],
  [1, "2012-01-01"]
]

We have:

[physical] UTF-8 encoded bytes representing the text above. We send these bytes to a JSON parser
[native: logical + lexical] As an output, we get a "data stream" of cells with native to the format types, BUT for Table Schema, some of them are still represented lexically and require additional processing. So for the example above, 1 is already a logical value, and 2012-01-01 is still a lexical value
[logical] After full Table Schema processing, we get a "data stream" of cells in fully logical form in Table Schema terms i.e. 1 and Date('2012-01-01')

I think this tiering is applicable to basically any input data source from csv to parquet or sql.

I guess we need to rename the section to something like Data Processing and mention this workflow. Although, we have 3 tiers I would personally focus the explanation on lexically represented cells because basically all Table Schema data type descriptions is about of how to parse lexically represented data e.g. date/times, objects, arrays, numbers (basically all the types).

nichtich commented 4 months ago

I guess we need to rename the section to something like Data Processing and mention this workflow.

Yes. I'd like to provide an update but I don't know when so it's also ok for me if you come up with an update. To quickly rephrase your words:

We have three levels of data processing:

The native format of tabular data, e.g. JSON, CSV with some specific CSV Dialect, Open Document Spreadsheet, SQL...
An abstract table of cells, each given as abstract value with data type from the underlying data format (e.g. plain strings for CSV, SQL types for SQL, JSON scalar types for JSON...)
A logical table of cells, each having a typed value.

Table Schema specification defines how to map from level 2 to level 3.

roll commented 4 months ago

Table Schema specification defines how to map from level 2 to level 3.

I think it's a good wording!

Yes. I'd like to provide an update but I don't know when so it's also ok for me if you come up with an update.

Of course, no hurry at all. Let's just self-assign ourselfes to this issue if one of us decide start working (currently, I also have other issue to deal with first)

akariv commented 4 months ago

I agree but I have an observation here -

In @roll's example, it's mentioned that '1' is already a logical value.

I would claim that it's still a native value - a JSON number with the value of 1. It might represent a table schema value of type integer, number, year, or even boolean (with trueValues=[1]). It might also be converted to None, e.g. in case missingValues=[1].

Therefore I would say that the distinction between native and logical is correct and that all values start out as native values and get processed, casted and validated into logical values - even if they come from a more developed file format such as JSON. Then, in each case we require a value to be present in the descriptor (e.g. in a max constraints, booleans trueValues of missingValues) we need to specify whether a native value or a logical value is expected there.

On Wed, Feb 21, 2024 at 1:46 PM roll @.***> wrote:

Table Schema specification defines how to map from level 2 to level 3.

I think it's a good wording!

Yes. I'd like to provide an update but I don't know when so it's also ok for me if you come up with an update.

Of course, no hurry at all. Let's just self-assign ourselfes to this issue if one of us decide start working (currently, I also have other issue to deal with first)

— Reply to this email directly, view it on GitHub https://github.com/frictionlessdata/specs/issues/864#issuecomment-1956477045, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACAY5POT5AGUWG4WD4M5BDYUXNCFAVCNFSM6AAAAABBLUSTP6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNJWGQ3TOMBUGU . You are receiving this because you commented.Message ID: @.***>

roll commented 4 months ago

It might also be converted to None, e.g. in case missingValues=[1].

Currently, it cannot because missingValues items in v1 have to be strings. So basically, I think we found the root cause and the real decision to make (related to #621 as well) what is our data model:

(1) physical/logical+lexical/logical -- Table Schema processes only strings
(2) physical/native/logical -- Table Schema processes all the native values

I guess (2) might be cleaner and easier to explain. In this case it will be something like this e.g. for datetime:

datetime: if on the native-data level a value is represented lexically than it MUST be in a form defined by XML Schema containing required date and time parts, followed by optional milliseconds and timezone parts

nichtich commented 4 months ago

Therefore I would say that the distinction between native and logical is correct and that all values start out as native values and get processed, casted and validated into logical values

Good to introduce "native" as description of values before the logical level. A native boolean false from JSON or SQL may end up as logical boolean value false, logical string "false", or logical missing value.

(2) physical/native/logical -- Table Schema processes all the native values.

All native values either have type that directly maps to a logical type (e.g. JSON Boolean and SQL BOOL both map to logical boolean value) or they are treated as strings.

datetime: if on the native-data level a value is represented lexically than it MUST be in a form defined by XML Schema containing required date and time parts, followed by optional milliseconds and timezone parts

Yes except replace "is represented lexically" with "is represented as string". If the native-data level already has a type compatible with datetype, no lexcial representation is involved at all.

I think we are all on the same track but use slightly different terminology for the same idea.

pwalsh commented 4 months ago

I like the direction :)

roll commented 4 months ago

If we lean towards 3 distinct layers (physical/native/logical) as an implementor I'm curious what will be the behaviour for this resource for example:

data:
  - [id]
  - [1]
  - [2]
  - [3]
schema:
  fields:
    name: id
    type: string

Will it be considered valid data, and values will be coerced to strings? Currently, frictionless-py will raise 3 validation errors as the number type is not compatible to the string type.

Also, I think it's important to check what dataframe parsers (readr/pandas/polars/etc) do in this case so we don't end up with non-implementable solution

khusmann commented 4 months ago

I like where this is going too, it's really clarifying the decision at hand:

a) do we parse fields with a 2-layer physical / logical distinction or

b) do we parse fields with a 3-layer physical / native / logical distinction

The spec is currently written / defined as (a) a 2-layer scheme. This is why missingValues is string[], and why trueValues/falseValues are string[]: everything should hit the TableSchema as a string physical value type, no matter its native origin type. I think this fits with TableSchema being billed as a description of textual data. (I realize the way the implementation currently handles JSON data is inconsistent with this; I'm referring to the broad intent of the spec in my reading here).

Supporting JSON in the data field throws a wrench into the works for the 2-layer approach, because it has its own native type definitions. Now, this could be resolved by just reading each JSON element as string and ignoring the native type info. But retaining JSON type info requires the 3-layer distinction.

An advantage of the 3-layer distinction is that in addition to JSON, it allows us to consider other intermediate typed sources (like SQL, ODF, etc), rather than being forced to convert all of the native types to string before reaching the TableSchema.

The disadvantage of the 3-layer distinction is that I think it opens a can of worms of complexity. With 2 layers, we only have to define our Fields parsers as mappings from string -> FieldType. But with 3 layers the TableSchema would need the capability to define mappings / validation rules from all possible JSONType -> FieldType, SQLType -> FieldType, ODFType -> FieldType etc., depending on the native type being used.

Furthermore, with 3 layers we also need a way to losslessly represent native values in the TableSchema. For JSON types, this is easy, because the spec is JSON. But if we're envisioning support for other native types, we'd need ways to represent their native values in JSON. As @akariv said:

in each case we require a value to be present in the descriptor (e.g. in a max constraints, booleans trueValues of missingValues) we need to specify whether a native value or a logical value is expected there.

This is also apparent in the issue @roll describes re: numeric data. A JSON number is not an exact type (like SQL's DECIMAL), but a string representation of a number is an exact decimal type. With 3-layer parsing, a numeric field parser has to think about validation for both exact and non-exact inputs, but with 2-layer parsing we can always handle the source as an exact decimal type because everything is being parsed from string.

In addition to the example @roll provided above, 3-layer parsing also creates ambiguity in situations like:

data:
  - ["id"]
  - ["1"]
  - ["2"]
  - ["3"]
schema:
  fields:
    name: id
    type: integer

data:
  - ["id"]
  - [0]
  - [true]
schema:
  fields:
    name: id
    type: integer

data:
  - ["id"]
  - ["1"]
  - [0]
  - [true]
  - ["true"]
schema:
  fields:
    name: id
    type: boolean

data:
  - ["id"]
  - ["0"]
  - ["1"]
  - [0]
  - [1]
schema:
  fields:
    name: id
    type: boolean
    trueValues: ["1"]
    falseValues: [0]

If we have 2-layer parsing, that is, where all JSON native cell values are received by the TableSchema parser as string types (ignoring the native JSON type info), the expected behavior is very straight-forward:

1) No validation errors, because the integer field type parses all the strings successfully.

2) One validation error, on boolean true, because it is passed as string "true" to the TableSchema parser (not as a native boolean!).

3) No validation errors, because each string cell value is in the default trueValues or falseValues arrays.

4) No validation errors, because the TableSchema only receives string "0" and "1" values. (edit: well actually technically a schema parse error if falseValues must be string[])

(I understand that our current implementation may slightly differ right now because it currently conflates the two- and three-layer parsing approaches)

By contrast, 3-layer parsing creates a lot of questions:

1) Should we parse the native string into a numeric type here, or error because it is not a native numeric type going into a numeric field type?

2) Can native boolean be silently coerced to a numeric field type?

3) Can native number be silently coerced to a boolean field type, even though default trueValues and falseValues are all strings?

4) Do native types have to match the type specified in trueValues and falseValues? If so, then we have errors on cells string "0" and numeric 1.

3-layer parsing also creates problems for schema-sharing:

data.csv:

myField
true
false
true

csvResource:

{
  "name": "csvResource",
  "format": "csv",
  "path": "data.csv",
  "schema": "schema.json"
}

jsonResource:

{
  "name": "jsonResource",
  "format": "json",
  "data": [["a"], [true], [false], [true]],
  "schema": "schema.json"
},

schema.json:

{
  "fields": [
    {
      "name": "myField",
      "type": "boolean",
      "trueValues": ["true"],
      "falseValues": ["false"]
    }
  ]
}

With 2-layer parsing, this isn't a problem; the JSON and CSV files are interpreted exactly the same (as textual values). With 3-layer parsing, however, this may fail because the native values true and false are not listed in trueValues and falseValues. This is why I like 2-layer parsing, because when everything is parsed by TableSchema as string (ignoring the native type info), the expected validation behavior is 1000x clearer.

...And this is just for JSON… we'd have to go through the same exercise for 3-layer parsing of SQL types, ODF types, etc and for those it'd be further complicated by a need to losslessly express their native values as JSON types. Much easier to to stick to the original 2-layer scope of frictionless being for textual tabular data, where by definition physical values are always string, and TableSchema fields define mappings from physical string -> FieldType (not NativeType -> FieldType)

I like the idea of 3-layer parsing, but I think to support native types properly in the spec, TableSchema would have to be rebuilt from the ground up with support for lossless representations of native values, or we'd need to create additional versions of TableSchema that to map the subtleties of different native values of a specific format to frictionless fields e.g. SQLSchema, ODFSchema ... So I'm against it for V2. Instead, I think we'd be better off making the implementation's JSON behavior consistent with the original scope of 2-layer textual parsing (read each JSON array cell as a string, and ignore the native type info).

roll commented 4 months ago

The spec is currently written / defined as (a) a 2-layer scheme. This is why missingValues is string[], and why trueValues/falseValues are string[]: everything should hit the TableSchema as a string physical value type, no matter its native origin type. I think this fits with TableSchema being billed as a description of textual data. (I realize the way the implementation currently handles JSON data is inconsistent with this; I'm referring to the broad intent of the spec in my reading here).

Note that it's not only about JSON; frictionless-py supports a dozen formats and in-memory data. It never worked like this at least in Python and JavaScript, the parsers get an input cell and forward it as it is if it's not a string and process if it's a string. So, currently, these implementations based on the (1) model from above

roll commented 4 months ago

I think it will be simple and correct to say that regarding the data model, Table Schema is no more than an extension of a native data format (all of them). This concept is quite simple, for example, we have JSON and there is SUPERJSON that adds support for date/time, regexp, etc. It's achieved via an additional layer of serialization and deserialization for lexical values. If we think about Table Schema that way than it's still the (1) data model and missing/false/true values need to stay strings only. But this model doesn't imply that all the input data need to be strings or it's only for textual data sources, not at all; it just means that Table Schema comes into play only when additional serialization/deserialization is needed.

PS. Thought a little bit more about it and I would say that on the Table Schema level, here is basically only two relevant concepts (while Data Resource and Table Dialect deal with physical representation):

data-format data model
Table Schema data model

akariv commented 4 months ago

That might be confusing though.

E.g. a JSON file with -1 denoting an empty value, we would say missingValues="-1". That's reasonable.

But what if 'n/a' is the empty value? Would we say missingValues="n/a" or "\"n/a\"" (as is the physical representation of the value)?

What if there is no natural string representation of the value (if the file format is not text based)?

On Thu, Feb 22, 2024, 10:49 roll @.***> wrote:

I think it will be simple and correct to say that regarding the data model, Table Schema is no more than an extension of a native data format (all of them). This concept is quite simple, for example, we have JSON and there is SUPERJSON https://github.com/blitz-js/superjson that adds support for date/time, regexp, etc. It's achieved via an additional layer of serialization and deserialization for lexical values. If we think about Table Schema that way than it's still the (1) data model and missing/false/true values need to stay strings only

— Reply to this email directly, view it on GitHub https://github.com/frictionlessdata/specs/issues/864#issuecomment-1958971407, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACAY5MOI4IZ2AUDBMQCE7TYU4BB5AVCNFSM6AAAAABBLUSTP6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNJYHE3TCNBQG4 . You are receiving this because you were mentioned.Message ID: @.***>

roll commented 4 months ago

But what if 'n/a' is the empty value? Would we say missingValues="n/a" or "\"n/a\"" (as is the physical representation of the value)?

I'm getting to thinking that we actually need to isolate Table Schema from any physical data representation and let it operate only on the logical level. On the logical level it's n/a no matter how it's stored

nichtich commented 4 months ago

It's 3 layers but we only have to think about two levels:

cells with native-typed values (in case of CSV all values have native type string) aka data-format data model
cells with logical values of types in the type system of Table Schema specification aka Table Schema data model

Furthermore, with 3 layers we also need a way to losslessly represent native values in the Table Schema.

We should aim to be able to represent common data types in the type system of Table Schema but we don't have to ensure lossless mappings of native type systems. We define a set of data types such as string, number types, boolean, n/a... and either types of native format X directly map to one of these Table Schema types or implementations must downgrade their values, e.g. by serialization to string type values.

P.S: Maybe this table of common native scalar data types helps to find out what is needed (also for #867).

roll commented 4 months ago

I bootstrapped a new specification called "Terminology" - https://datapackage.org/specifications/glossary/ - I think it will be great to define everything we need there and then refer it across the specs. Lately I encountered that e.g. physical and logical data are also needed to define descriptor serialization. And e.g. Tabular Data defined in the Table Schema spec is really needed in other places as well. Aslo, we often mention implementations, data publishers/producers, consumers etc so it will be good to define it

khusmann commented 4 months ago

It's 3 layers but we only have to think about two levels:

I agree. It's always technically (at least) 3 layer, in that the source format needs to be parsed to get at the value cells. What I'm trying to get at is how we define the type signature of our field parsers.

Right now the spec defines field / schema parsers as mappings from string -> FieldType.

If we promote this to NativeType -> FieldType, then we introduce a lot of validation ambiguity in the form of type coercion rules in field definitions.

We define a set of data types such as string, number types, boolean, n/a... and either types of native format X directly map to one of these Table Schema types or implementations must downgrade their values, e.g. by serialization to string type values.

I think I agree. As a textual format, the TableSchema should be defined (as it currently is) in terms of always be parsing serialized string values, no matter the source. In the special case where the native format directly maps, we can take a shortcut and directly import the data.

This way we keep missingValues: string[] (match missing value cells on serialized value strings)

and can avoid missingValues: (string | NativeType)[], (match missing value cells on serialized strings or their NativeType, and have to think about type precedence / coercion rules)

I'm getting to thinking that we actually need to isolate Table Schema from any physical data representation and let it operate only on the logical level. On the logical level it's n/a no matter how it's stored

This is another good approach worth exploring. The challenge will be to keep it backwards compatible...

roll commented 3 months ago

Dear all,

Here is a pull request based on @akariv's data model - https://github.com/frictionlessdata/datapackage/pull/49

I think this simple 3-layered model highly improves the quality of the building blocks on which Data Package stands and simplifies field types a lot conceptually. Initially, I was more in favour of thinking about Table Schmea as a string processor (serialize/deserialize) but having native data representation makes things way easier and more consistent internally.

An interesting fact is that after the separation of the native representation sections for field types, we can realize that field types basically don't have any description on a logical level—something to improve in the future, I guess, as currently, we mostly define only serialization/deserialization rules.

Please take a look!

akariv commented 3 months ago

Great work @roll! I reviewed the PR and left a few minor comments.