
dcmi / dctap

DC Tabular Application Profile

https://dcmi.github.io/dctap/


Options for a manifest #16

Open kcoyle opened 3 years ago

kcoyle commented 3 years ago

I edit this to keep it up to date with the ideas in the thread.

Some possible data for the manifest:

RE: entire table
- Administrative data (who created the profile, when, etc.)
- Link manifest to TAP
- Linking modular tables that have compatible/same columns (DEFER)
- Define prefixes for URIs
- Base model of the metadata being profiled
- Extra shape elements
- Additional value node types
- Escape character
- List item separator
RE: individual columns
- Overriding of boolean defaults from 1/0, true/false
- Alternate column titles (e.g. in different languages)
- Elements to be parsed as multiple items in a cell
Row-level data
- Defining order of values on a statement
- Overriding of boolean defaults for a property of data type Boolean

Tom's list of manifest items from #41 has been added to the above.

philbarker commented 3 years ago

++ alternative column headings

johnhuck commented 3 years ago

And URL prefixes?

bencomp commented 3 years ago

Would it help to base such a manifest on the CSV on the Web metadata JSON file? That covers alternative column headings.

kcoyle commented 3 years ago

I do think that we should have the CSV on the web JSON as one of the options. Can anyone here mock that up?

philbarker commented 3 years ago

@kcoyle will do.

kcoyle commented 3 years ago

@philbarker Thanks, Phil. Also, take a look at the example that Nishad did on #3. He based that on CSVW but used a table format for the data. Perhaps these two could be somewhat parallel to show both methods?

kcoyle commented 3 years ago

@bencomp Thanks. We are looking at CSV on the Web for this, among other solutions, but also want to develop a table-based option (since CSVW uses JSON). We figure that we will present multiple options as examples of how one might encode a manifest. Note that the only aspect of a manifest that seems to be absolutely necessary for TAP is the definition of prefixes for the namespaces used. This means that we'll probably have a range of examples beyond that single requirement, but may not specify a single TAP manifest.

bencomp commented 3 years ago

@kcoyle Thanks, I just arrived at this repository and had missed the extensive mentions of CSVW in #3. If I understand correctly, an [application] profile is a way of expressing what makes a resource description valid. The manifest aims to minimally prescribe the mappings for translating CURIEs to IRIs and (optionally) describe the profile.

If the resource description were expressed in RDF (whatever the serialisation), I would look at SHACL (or ShEx) to define the valid shapes. The SHACL would serve as the profile. To describe the profile, i.e. collection of SHACL shapes, I would again use RDF and it could go in the same file as the SHACL or it could be a separate resource. I think all RDF serialisations have ways of mapping CURIEs to IRIs, so that question would not be an issue.

If the profile were expressed in CSV and I wanted to interpret the values as RDF to validate RDF resource descriptions, I would want to convert the CSV to SHACL or ShEx. Several conversions exist, including CSV on the Web metadata (CSVM) and RML. Both CSVM and RML are expressed in RDF, so they could include the manifest. I think in a CSVM file you could specify a column's datatype to be { "@id": "http://www.w3.org/1999/xhtml/datatypes/CURIE" } to indicate its values are CURIEs, but I don't expect existing processors to understand what to do with that. TAP processors could be made aware and instructed to convert CURIEs to IRIs using, e.g., the mappings in the @context in the CSVM.
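The CURIE-to-IRI step mentioned above can be sketched roughly as follows. This is only an illustration, not something the thread specifies: the function name `expand_curie` and the `PREFIXES` map are hypothetical, and a real TAP processor might resolve prefixes differently.

```python
from typing import Dict


def expand_curie(value: str, prefixes: Dict[str, str]) -> str:
    """Expand a CURIE such as 'dct:title' to a full IRI using a prefix map.

    Values with an undeclared prefix, and values that already look like
    full IRIs ('scheme://...'), are returned unchanged.
    """
    prefix, sep, local = value.partition(":")
    if sep and prefix in prefixes and not local.startswith("//"):
        return prefixes[prefix] + local
    return value


# Hypothetical prefix map, as a manifest or a CSVW @context might declare it.
PREFIXES = {
    "dct": "http://purl.org/dc/terms/",
    "schema": "http://schema.org/",
}

print(expand_curie("dct:title", PREFIXES))
print(expand_curie("http://example.org/x", PREFIXES))
```

Leaving unknown prefixes untouched is a design choice for this sketch; a stricter processor could instead report them as errors.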

All I really wanted to say is: are you sure you're not reinventing the wheel? CSVM ticks all the boxes for the list at the top:

And more:

- detailed instructions for parsing CSVs (though I understand from the discussion in #26 that there may be a specific way to read CSV files)
- ways of connecting the manifest to the CSV (#17, mentioned in https://github.com/dcmi/dctap/issues/3#issuecomment-830608470)
- expressing that values in cells are ordered (#14)

If you definitely want to express the manifest as CSV to use as rules for validating the CSV profile, you're doing something that I haven't seen before. That doesn't mean it shouldn't be done, of course. You could reverse engineer the CSVM vocabulary to fit in CSV and provide a default mapping to RDF (I'd suggest to make it a CSVM file), along with instructions to derive your own CSVM file. A disadvantage of using CSV for elaborate schemas is that you could end up with

- a table with lots of well-defined columns and many empty cells,
- or the manifest split into several tables that need to reference one another,
- or one table with no empty cells, but very generic columns that start to look like subject,predicate,object so that it can hold any data.

I don't want to suggest that every issue has been solved already, but I do see some overlap with existing standards and initiatives and hope that these are considered when looking for solutions for TAPs.

kcoyle commented 3 years ago

@bencomp Thanks for your detailed comment. It inspired me to spend the weekend doing a close reading of the CSVW documents. There is, as you mention, the meta problem: CSVW is for data in tabular format, and TAP is a tabular format for a profile describing metadata choices for metadata in any format. It's the profile that is tabular, so CSVW would only apply to the profile itself. I believe that this means that there will be some features of CSVW that we might use, but others may not be appropriate. For example, the CSVW ordered values in cells will only apply to tabular data; the request in #14 was to designate the order of property/value pairs in the RDF metadata the TAP defines. Order of values in cells in a TAP is probably an edge case (I can't think of an example where this would be needed).

We should definitely look at CSVW to see what it can offer for some functions:

- admin data
- connecting a manifest to a TAP
- allowing multiple TAP modules to co-exist and be combined
- defining prefixes to URIs
- giving alternate titles for columns

(Note, I haven't found a way to provide alternate text for booleans in CSVW - would appreciate a tip for that.)

There may be other features as well. However, it looks to me that CSVW is quite a bit more complex than we want to embrace for the small number of needs we have. For example, it isn't clear to me if a CSVW description of the TAP vocabulary would be useful. It would define basic validation for created TAP columns but we've been leaving things pretty loose so I'm not sure how much validation of that type is needed. Each created TAP would need a different CSVW file to solve our needs, and that is more complicated than we've embraced so far with TAP. Of course, anyone who has the skills and wants to create a CSVW annotation for their TAP is welcome to do so.

Ideally, the manifest would be something very simple that people could encode in a spreadsheet, and that would be transformed to a more usable format by the program that ingests the CSV.

philbarker commented 3 years ago

@kcoyle wrote:

(Note, I haven't found a way to provide alternate text for booleans in CSVW - would appreciate a tip for that.)

It's here https://www.w3.org/TR/tabular-data-primer/#boolean-format
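For readers who want to see what that boolean format amounts to, here is a minimal sketch. It assumes the CSVW convention that a boolean `format` is written as the true value followed by the false value, separated by `|` (for example `Yes|No`), with `true`/`1` and `false`/`0` as the defaults. The function name `parse_boolean` is made up for this example.

```python
from typing import Optional


def parse_boolean(cell: str, fmt: Optional[str] = None) -> bool:
    """Interpret a cell as a boolean.

    fmt is "TRUE_VALUE|FALSE_VALUE", e.g. "Yes|No". Without fmt, the
    defaults true/1 and false/0 are accepted.
    """
    text = cell.strip()
    if fmt is not None:
        true_value, sep, false_value = fmt.partition("|")
        if not sep:
            raise ValueError(f"boolean format needs two values: {fmt!r}")
        if text == true_value:
            return True
        if text == false_value:
            return False
        raise ValueError(f"{text!r} does not match format {fmt!r}")
    if text in ("true", "1"):
        return True
    if text in ("false", "0"):
        return False
    raise ValueError(f"{text!r} is not a recognised boolean")
```

A single `format` string only holds one true/false pair, so supporting several languages at once would need something beyond this sketch.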

kcoyle commented 3 years ago

This is Nishad's suggestion, from #3

I liked the tabular way of expressing prefixes proposed by Karen, not so sure if we need to use a prefixed header for that. Probably the ontology is the best place to include any such mapping.

CSVW proposes a JSON metadata file relative to the path of the CSV, something like path/file.csv-metadata.json
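As a rough illustration of that naming convention: a processor could derive candidate metadata locations from the CSV URL. The sketch below assumes the two default locations described in the CSVW specifications (`<csv-url>-metadata.json` and a `csv-metadata.json` in the same directory); the function name is hypothetical.

```python
from typing import List
from urllib.parse import urljoin


def candidate_metadata_urls(csv_url: str) -> List[str]:
    """Return the default places to look for metadata about csv_url."""
    return [
        csv_url + "-metadata.json",             # file-specific metadata
        urljoin(csv_url, "csv-metadata.json"),  # directory-level metadata
    ]


for url in candidate_metadata_urls("http://example.org/path/foo-dctap.csv"):
    print(url)
```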

Beyond declaring prefixes, section 5.1 of the DCAP Guidelines [1] encourages declaring a set of metadata for application profiles.

Not as a proposal, but I am curious if both of these can be implemented in a tabular format, as we are primarily focusing our interests on a tabular representation of application profiles.

The following is not a proposal but exploring the possibilities for discussing such a metadata file in tabular format. Headers are fictional, used as an example.

A metadata file with prefixes for the DCTAP path/foo-dctap.csv could be path/foo-dctap.csv-metadata.csv, which might look like this:

MetadataID	MetadataType	Value	Notes
dct	Namespace	http://purl.org/dc/terms/	DublinCore Terms
schema	Namespace	http://schema.org/	Schema.org
dcat	Namespace	http://www.w3.org/ns/dcat#	DCAT Vocabulary
dct:title	LITERAL	Foo Application Profile
dct:creator	IRI	https://orcid.org/0000-0000-1111-0000
dct:creator	IRI	https://orcid.org/0000-0000-1111-1111
dct:contributor	IRI	https://orcid.org/0000-0000-1111-2222
dct:license	IRI	https://opendatacommons.org/licenses/odbl/1.0/
dct:description	LITERAL	A human readable description
dcat:downloadURL	IRI	https://zenodo.org/record/xxxx/files/xxx/data-v1.0.0.zip?download=1
dcat:distribution	IRI	https://doi.org/10.5281/zenodo.xxxx

[1] https://www.dublincore.org/specifications/dublin-core/application-profile-guidelines/
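To make the idea concrete, a tabular metadata file like the fictional one above could be read into a prefix map and a set of profile-level metadata along these lines. This is a sketch only: the column headers are the fictional ones from the example, the file is assumed to be tab-separated, and `read_manifest` is a made-up name.

```python
import csv
import io
from typing import Dict, List, Tuple

# A shortened copy of the fictional metadata table, tab-separated.
MANIFEST_TSV = (
    "MetadataID\tMetadataType\tValue\tNotes\n"
    "dct\tNamespace\thttp://purl.org/dc/terms/\tDublinCore Terms\n"
    "dct:title\tLITERAL\tFoo Application Profile\n"
    "dct:creator\tIRI\thttps://orcid.org/0000-0000-1111-0000\n"
    "dct:creator\tIRI\thttps://orcid.org/0000-0000-1111-1111\n"
)


def read_manifest(text: str) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Split manifest rows into namespace prefixes and other metadata."""
    prefixes: Dict[str, str] = {}
    metadata: Dict[str, List[str]] = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        if row["MetadataType"] == "Namespace":
            prefixes[row["MetadataID"]] = row["Value"]
        else:
            # Properties such as dct:creator may repeat, so keep a list.
            metadata.setdefault(row["MetadataID"], []).append(row["Value"])
    return prefixes, metadata


prefixes, metadata = read_manifest(MANIFEST_TSV)
print(prefixes)
print(metadata)
```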

kcoyle commented 3 years ago

I also tried to create a table and did a myriad of different versions, none of which seemed to work.

There seem to be 5 different kinds of data statements we need to make:

1. Link to the TAP file
2. URIs and their prefixes
3. admin data about the table
4. alternate column header displays
5. alternate values for Boolean columns

It isn't easy getting these all into one table. Nishad covered 1 and 2. My table below tries to cover them all. Note that in my table

- the URI is the ID and the prefix is a value on the row
- the link to the TAP file is the ID for the table admin data
- the ID to the column header is the TAP property
- I don't think the way Boolean options are shown here is very good

ID	property	datatype
http://purl.org/dc/terms/	dct
http://schema.org/	sdo
http://www.w3.org/2001/XMLSchema#	xsd
:MyTap123	dc:title	xsd:string
:MyTap123	dc:publisher	xsd:string
:MyTap123	dc:modified	xsd:date
tap:shapeID	en	entity
tap:shapeID	it	entità

kcoyle commented 3 years ago

Note: if we like having the URI in the ID column and the prefix in a later column in the row, we might want to also reverse these in our "prefixes only" solution, and also use the same column headers (whatever they turn out to be).

philbarker commented 3 years ago

Here's an example of how we could use the CSV on the Web JSON-LD format to fulfill this requirement. It's quite long so I'll put my comments first:

- I am happy that it does the things @kcoyle asked for in her comment above.
- It describes a set of CSVs, which are listed as tables at the end.
- Admin data is in the form of DC terms for the set as a whole (though I think you can also provide similar data for each table individually).
- There is a naming convention to help discover the JSON-LD metadata for a known CSV, though the reverse--i.e. discovering the CSV from a known JSON-LD doc--is more robust, so it kind of makes sense to make the JSON-LD metadata the primary resource that is distributed.
- I used a bit of SHACL to describe the prefixes and namespaces, as suggested by Karen a while back. An alternative would be to use the JSON-LD @context, but that seems like a slightly odd use of the @context.
- The tableSchema lets you redefine column header names as titles for humans in a choice of languages.
- The tableSchema lets you redefine boolean values, though I am not confident that multilingual options can be provided.
- You can do other things with tableSchema, like put constraints on the value type. I'll note that much of this additional information in the tableSchema would be the same whatever the TAP, i.e. we could define a default that people could tweak to their own needs.

So I think the only [potential] shortcoming relates to multilingual alternatives to the boolean defaults.

{
  "@context": {
    "@import": "http://www.w3.org/ns/csvw",
    "sh": "http://www.w3.org/ns/shacl#"
  },
  "dc:title": "Credential Engine Registry Application Profile",
  "dc:description": "Describes the minimum data policy for publishing to the Credential Engine Registry",
  "dc:creator": "https://credentialengineregistry.org/resources/ce-9bd8c615-9f3c-40e6-9c20-6d9f811844e6",
  "sh:declare": [
    {
      "sh:prefix": "rdf",
      "sh:namespace": "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    },
    {
      "sh:prefix": "rdfs",
      "sh:namespace": "http://www.w3.org/2000/01/rdf-schema#"
    },
    {
      "sh:prefix": "xsd",
      "sh:namespace": "http://www.w3.org/2001/XMLSchema#"
    },
    {
      "sh:prefix": "ceterms",
      "sh:namespace": "https://purl.org/ctdl/terms/"
    },
    {
      "sh:prefix": "agentSector",
      "sh:namespace": "https://purl.org/ctdl/vocabs/agentSector/"
    }
  ],
  "tableSchema": {
    "columns": [
      {
        "name": "shapeID",
        "titles": {
          "en": "Shape ID",
          "es": "ID de Forma"
        },
        "datatype": "anyURI"
      },
      {
        "name": "propertyID",
        "titles": {
          "en": "Property ID",
          "es": "ID de propiedad"
        },
        "datatype": "anyURI"
      },
      {
        "name": "propertyLabel",
        "titles": {
          "en": "Property Label",
          "es": "Etiqueta de propiedad"
        },
        "datatype": "string"
      },
      {
        "name": "mandatory",
        "titles": {
          "en": "Mandatory",
          "es": "Obligatoria"
        },
        "datatype": {
          "base": "boolean",
          "format": "Yes|No"
        }
      },
      {
        "name": "repeatable",
        "titles": {
          "en": "Repeatable",
          "es": "Repetible"
        },
        "datatype": {
          "base": "boolean",
          "format": "Yes|No"
        }
      },
      {
        "name": "valueNodeType",
        "titles": {
          "en": "Value node type",
          "es": "Tipo de nodo"
        },
        "datatype": {
          "base": "string",
          "format": "IRI|Literal|BNODE"
        }
      },
      {
        "name": "valueDataType",
        "titles": {
          "en": "Value data type",
          "es": "Tipo de datos"
        },
        "datatype": "anyURI"
      },
      {
        "name": "valueConstraint",
        "titles": {
          "en": "Value constraint",
          "es": "Restricción para valores"
        },
        "datatype": "string"
      },
      {
        "name": "valueConstraintType",
        "titles": {
          "en": "Value constraint type",
          "es": "Tipo de restricción"
        },
        "datatype": "string"
      },
      {
        "name": "valueShape",
        "titles": {
          "en": "Value shape",
          "es": "Forma para valores"
        },
        "datatype": "string"
      },
      {
        "name": "note",
        "titles": {
          "en": "Notes",
          "es": "Anotaciones"
        },
        "datatype": "string"
      }
    ]
  },
  "tables": [
    {
      "url": "http://example.org/CE_CredentialOrg_required.csv",
      "dc:title": "Required properties of a Credential Organization"
    },
    {
      "url": "http://example.org/CE_CredentialOrg_recommended.csv",
      "dc:title": "Recommended properties of a Credential Organization"
    },
    {
      "url": "http://example.org/CE_Credential_required.csv",
      "dc:title": "Required properties of a Credential"
    },
    {
      "url": "http://example.org/CE_Credential_recommended.csv",
      "dc:title": "Recommended properties of a Credential"
    }
  ]
}

nishad commented 3 years ago

Here is an almost equivalent YAML representation of @philbarker's JSON-LD manifest.

# DCTAP Manifest

prefixes:
  dc: 'http://purl.org/dc/elements/1.1/'
  rdf: 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
  rdfs: 'http://www.w3.org/2000/01/rdf-schema#'
  xsd: 'http://www.w3.org/2001/XMLSchema#'
  ceterms: 'https://purl.org/ctdl/terms/'
  agentSector: 'https://purl.org/ctdl/vocabs/agentSector/'

metadata:
  'dc:title': Credential Engine Registry Application Profile
  'dc:description': Describes the minimum data policy for publishing to the Credential Engine Registry
  'dc:creator': https://credentialengineregistry.org/resources/ce-9bd8c615-9f3c-40e6-9c20-6d9f811844e6

tableSchema:
  columns:
    - name: shapeID
      title:
        en: Shape ID
        es: ID de Forma
      datatype: anyURI
    - name: propertyID
      title:
        en: Property ID
        es: ID de propiedad
      datatype: anyURI
    - name: propertyLabel
      title:
        en: Property Label
        es: Etiqueta de propiedad
      datatype: string
    - name: mandatory
      title:
        en: Mandatory
        es: Obligatoria
      datatype:
        base: boolean
        format: Yes|No
    - name: repeatable
      title:
        en: Repeatable
        es: Repetible
      datatype:
        base: boolean
        format: Yes|No
    - name: valueNodeType
      title:
        en: Value node type
        es: Tipo de nodo
      datatype:
        base: string
        format: IRI|Literal|BNODE
    - name: valueDataType
      title:
        en: Value data type
        es: Tipo de datos
      datatype: anyURI
    - name: valueConstraint
      title:
        en: Value constraint
        es: Restricción para valores
      datatype: string
    - name: valueConstraintType
      title:
        en: Value constraint type
        es: Tipo de restricción
      datatype: string
    - name: valueShape
      title:
        en: Value shape
        es: Forma para valores
      datatype: string
    - name: note
      title:
        en: Notes
        es: Anotaciones
      datatype: string

tables:
  - url: 'http://example.org/CE_CredentialOrg_required.csv'
    'dc:title': Required properties of a Credential Organization
  - url: 'http://example.org/CE_CredentialOrg_recommended.csv'
    'dc:title': Recommended properties of a Credential Organization
  - url: 'http://example.org/CE_Credential_required.csv'
    'dc:title': Required properties of a Credential
  - url: 'http://example.org/CE_Credential_recommended.csv'
    'dc:title': Recommended properties of a Credential

---

philbarker commented 3 years ago

I have been using a Java CSVW-based validator to check how the CSVW JSON metadata above works in practice (and indeed whether it is valid) and whether we might use it to check our test cases. There's amended code below, but first some comments.

- That validator is a pain to use. Often it tells you there is an error but not where it is. Added to that is the uncertainty of whether it is the metadata that is at fault or the CSV instance you are validating. Plus there are one or two features that don't quite seem to work the way you think they might. So, it might be worth trying to find other CSVW-based validators.
- Every deviation from "@context": "http://www.w3.org/ns/csvw" that I tried for the context led to an error. I see that as a limitation of the validator, not the format, but it meant that I had to take out the SHACL block that declared the namespaces in order to get anything that would work with this tool.
- Every column listed in the metadata must be in the header file. You can declare whether a column is required, i.e. "required": false|"required": true, but this just means that a row need not have data in that column. It would be worth checking whether other validators interpret the spec in this way, but if so it would mean that even the simplest "list of properties" application profile would have to have all columns.
- Similarly, it seems the columns have to be in the order that they are listed in the CSVW metadata JSON file.
- The name of a column isn't used as one of the valid variants for its heading, so I had to add the unspaced column names to the English column titles.
- Trying to provide several multilingual variations on "Yes|No" as alternatives to boolean "True|False" did not work.

In conclusion, as a format for providing metadata I think this is still an option (but we need simple options too), but I had hoped to be able to create a generic file that could validate any TAP and that doesn't seem possible.

{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "tap4.csv",
  "dc:title": "Credential Engine Registry Application Profile",
  "dc:description": "Describes the minimum data policy for publishing to the Credential Engine Registry",
  "dc:creator": "https://credentialengineregistry.org/resources/ce-9bd8c615-9f3c-40e6-9c20-6d9f811844e6",
  "tableSchema": {
    "columns": [
      {
        "name": "shapeID",
        "titles": {
          "en": [
            "shapeID",
            "Shape ID"
          ],
          "es": "ID de Forma"
        },
        "datatype": "string",
        "required": false
      },
      {
        "name": "propertyID",
        "titles": {
          "en": [
            "propertyID",
            "Property ID"
          ],
          "es": "ID de propiedad"
        },
        "datatype": "anyURI",
        "required": true
      },
      {
        "name": "propertyLabel",
        "titles": {
          "en": [
            "propertyLabel",
            "Property Label"
          ],
          "es": "Etiqueta de propiedad"
        },
        "datatype": "string"
      },
      {
        "name": "mandatory",
        "titles": {
          "en": [
            "mandatory",
            "Mandatory"
          ],
          "es": "Obligatoria"
        },
        "datatype": {
          "base": "boolean",
          "format": "Yes|No"
        }
      },
      {
        "name": "repeatable",
        "titles": {
          "en": [
            "repeatable",
            "Repeatable"
          ],
          "es": "Repetible"
        },
        "datatype": {
          "base": "boolean",
          "format": "Yes|No"
        }
      },
      {
        "name": "valueNodeType",
        "titles": {
          "en": [
            "valueNodeType",
            "Value node type"
          ],
          "es": "Tipo de nodo"
        },
        "datatype": {
          "base": "string",
          "format": "IRI|BNODE|Literal|IRI BNODE|IRI Literal|BNODE Literal|IRI BNODE Literal"
        }
      },
      {
        "name": "valueDataType",
        "titles": {
          "en": [
            "valueDataType",
            "Value data type"
          ],
          "es": "Tipo de datos"
        },
        "datatype": "anyURI"
      },
      {
        "name": "valueConstraint",
        "titles": {
          "en": [
            "valueConstraint",
            "Value constraint"
          ],
          "es": "Restricción para valores"
        },
        "datatype": "string"
      },
      {
        "name": "valueConstraintType",
        "titles": {
          "en": [
            "valueConstraintType",
            "Value constraint type"
          ],
          "es": "Tipo de restricción"
        },
        "datatype": "string"
      },
      {
        "name": "valueShape",
        "titles": {
          "en": [
            "valueShape",
            "Value shape"
          ],
          "es": "Forma para valores"
        },
        "datatype": "string"
      },
      {
        "name": "note",
        "titles": {
          "en": [
            "note",
            "Notes"
          ],
          "es": "Anotaciones"
        },
        "datatype": "string"
      }
    ]
  }
}
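By way of contrast with the behaviour reported for that validator, a more lenient header check could accept either the column name or any of its titles, in any order, and simply flag headings it does not recognise. The sketch below is an assumption about what such a check might look like, not a description of any existing tool; `accepted_headings` and `match_header` are hypothetical names, and the column descriptions reuse the `name`/`titles` structure from the metadata above.

```python
from typing import Dict, List, Optional, Set


def accepted_headings(column: dict) -> Set[str]:
    """Collect the column name plus all of its titles, in any language."""
    headings = {column["name"]}
    titles = column.get("titles", [])
    if isinstance(titles, str):
        titles = [titles]
    values = titles.values() if isinstance(titles, dict) else titles
    for value in values:
        if isinstance(value, str):
            headings.add(value)
        else:
            headings.update(value)  # a list of alternative titles
    return headings


def match_header(header: List[str], columns: List[dict]) -> Dict[str, Optional[str]]:
    """Map each CSV heading to a column name; unknown headings map to None."""
    lookup: Dict[str, str] = {}
    for column in columns:
        for heading in accepted_headings(column):
            lookup[heading] = column["name"]
    return {heading: lookup.get(heading) for heading in header}


# Two columns in the style of the metadata above (shortened).
COLUMNS = [
    {"name": "propertyID", "titles": {"en": ["Property ID"], "es": "ID de propiedad"}},
    {"name": "note", "titles": {"en": "Notes"}},
]

print(match_header(["Property ID", "note", "foo"], COLUMNS))
```

Because the lookup ignores column order and missing columns, a short "list of properties" profile with only a few of the columns would pass this check.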

kcoyle commented 3 years ago

I had hoped to be able to create a generic file that could validate any TAP and that doesn't seem possible.

Thanks, Phil. This answers a question that I had as well - a generic TAP validator. I guess we'll have to do our own. (hint, hint)