GSS-Cogs / application-profile

CSV-W life-expectancy-by-region-sex-and-time.csv-metadata.json review #21

Open canwaf opened 1 year ago

canwaf commented 1 year ago

Metadata/catalogue data

Columns

General

This CSV-W should be a distribution; the metadata should be held by the catalogue service so it shouldn't contain information about the data set.

We like saying that the CSV-W is a distribution, where we can have an indeterminate data set with this CSV-W being a distribution of it. Unless it is known in advance, we shouldn't set the parent data set's ID. This improves portability, and doesn't supplant the work of the cataloguing service(s) -- two publishers, one distribution.

The use of fixed URIs for the @id of dcat:Dataset, dcat:Distribution and qb:DataSet should be confirmed as user-provided only; otherwise relative @ids can be used.

(Workflow idea: you go to the catalog service, coin the dataset, and download the template with these values already filled in for you.)

The spatial and temporal range information should be duplicated across the qb:DataSet and the dcat:Dataset: on the dcat:Dataset so it can be found in the catalogue, and on the qb:DataSet so the CSV-W can be interpreted independently.
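A minimal sketch of that duplication, assuming the coverage is expressed with dcterms:spatial and dcterms:temporal (the profile may prefer other properties). The URIs, the relative @ids and the dates are hypothetical placeholders, not values from the file under review.

```python
import json

# Hypothetical coverage values, defined once so both resources stay in sync.
coverage = {
    "dcterms:spatial": {"@id": "https://example.org/geography/uk"},
    "dcterms:temporal": {
        "@type": "dcterms:PeriodOfTime",
        "dcat:startDate": {"@value": "2001-01-01", "@type": "xsd:date"},
        "dcat:endDate": {"@value": "2020-12-31", "@type": "xsd:date"},
    },
}

# The catalogue-facing resource carries the coverage for discovery...
catalogue_record = {"@id": "#dataset", "@type": "dcat:Dataset", **coverage}
# ...and the cube carries the same coverage so the CSV-W stands alone.
cube = {"@id": "#cube", "@type": "qb:DataSet", **coverage}

print(json.dumps([catalogue_record, cube], indent=2))
```

Defining the values once and attaching them to both nodes keeps the two copies from drifting apart.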

From a convenience perspective we should have triples which use qb:componentProperty to link component specifications to the component properties (in addition to the qb:dimension/qb:attribute/qb:measure predicates already present).

Components stuff

rossbowen commented 1 year ago

Cheers for this, will write here about the bits where I have a different view!

@id user should be able to provide an absolute identifier

Think I agree (doesn't the example have this?) - will check when you're back!

dcterms:title (m), dcterms:description (o), rdfs:comment (o) should be part of the CSV-W distribution

I think the CSV(W) distribution has a dcterms:title and dcterms:description, but they are the title and description relating to that specific distribution (so may include the file type, for example). So I think I'm disagreeing with you - I think the resource which holds the "main" title and "main" description is the dcat:Dataset.

I could see us offering different distributions than just CSV (JSON in particular), so that's why I think we ought to respect the dcat:Dataset being the main resource.

I didn't include rdfs:label for DCAT resources and wrote a bit about my thinking in the profile here.

dcat:mediaType should be csvw not csv

I think the CSVW metadata file (xxx.csv-metadata.json) would have a MIME type of application/csvm+json but the CSV file itself would have a MIME type of text/csv. So disagreeing.

The spec mentions giving the metadata file that MIME type here.

And an example:

{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv",
  "dcat:mediaType": {
    "@id": "https://www.iana.org/assignments/media-types/text/csv"
  },
  "wdrs:describedby": {
    "@id": "http://data.gov.uk/series/greenhouse-gas-emissions/dataset/2018.csv-metadata.json",
    "dcat:mediaType": {
      "@id": "https://www.iana.org/assignments/media-types/application/csvm+json"
    }
  }
}
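The distinction in the example above can be sketched as a small helper that picks the media type from the file name. The function names are hypothetical and the suffix check is a simplifying assumption; the two media types and the IANA URL pattern are the ones used in the example.

```python
IANA_PREFIX = "https://www.iana.org/assignments/media-types/"

def media_type_for(filename):
    """Return the media type for a CSV file or its CSVW metadata file."""
    if filename.endswith("-metadata.json"):
        # The CSVW metadata document, e.g. xxx.csv-metadata.json.
        return "application/csvm+json"
    if filename.endswith(".csv"):
        # The tabular data file itself.
        return "text/csv"
    raise ValueError(f"unrecognised file name: {filename}")

def media_type_node(filename):
    """Build the dcat:mediaType value as it appears in the metadata."""
    return {"@id": IANA_PREFIX + media_type_for(filename)}

print(media_type_node("countries.csv"))
print(media_type_node("2018.csv-metadata.json"))
```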

columns is fine but urls should be relative until crystalised at a later time

I think whether things are absolute/relative is more of a csvcubed implementation concern. In the profile I'm trying to be explicit about what the underlying RDF should be and what the URIs should look like; if we can use relative URIs to do that, then great.

It would be helpful to have an option where we can set what the start of the URI should be if we wanted to create absolute URIs in csvcubed.
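A rough sketch of what such an option could do, using only the Python standard library. This is not csvcubed's API; the function name, the base URI and the sample metadata are hypothetical, and it assumes every "@id" value is either a relative reference or an already absolute URI.

```python
from urllib.parse import urljoin

def absolutise_ids(node, base):
    """Return a copy of a JSON-LD-like structure with each "@id" resolved against base."""
    if isinstance(node, dict):
        return {
            key: urljoin(base, value) if key == "@id" and isinstance(value, str)
            else absolutise_ids(value, base)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [absolutise_ids(item, base) for item in node]
    return node

# Hypothetical metadata with a relative @id and an already absolute one.
metadata = {
    "@id": "#distribution",
    "url": "life-expectancy.csv",
    "dcat:mediaType": {"@id": "https://www.iana.org/assignments/media-types/text/csv"},
}
base = "https://example.org/data/life-expectancy.csv-metadata.json"
resolved = absolutise_ids(metadata, base)
print(resolved["@id"])
```

Relative identifiers pick up the chosen start of the URI, while absolute ones pass through unchanged, so the same file could be published under different bases.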

the metadata should be held by the catalogue service so it shouldn't contain information about the data set.

I get we could have some separate workflow for this but don't really know why we'd start with that. I recognise it's verbose to deal with dcat:Datasets and their dcat:Distributions, but users will think of their CSV of data as the same thing as the dataset they're producing, so for now I reckon we just create the metadata for both at the same time.

Unless it is known in advance, we shouldn't set the parent data set's ID

I think, for our own use of this stuff, we'll know them in advance.

The use of fixed URIs for the @id of dcat:Dataset, dcat:Distribution and qb:DataSet should be confirmed as user-provided only; otherwise relative @ids can be used.

Yeah I think I agree, needs to be user provided.

convenience perspective we should have triples which use qb:componentProperty

Don't think I agree: this makes the CSVW more verbose, whereas I think the use case you're imagining is solved easily enough by writing slightly more verbose SPARQL.
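To illustrate the consumer-side alternative (in Python rather than SPARQL, purely as a sketch): in the RDF Data Cube vocabulary qb:dimension, qb:attribute and qb:measure are sub-properties of qb:componentProperty, so the generic link can be derived from the predicates already present. The component identifiers below are hypothetical.

```python
# The three specific predicates already present in the CSV-W.
SPECIFIC_PREDICATES = ("qb:dimension", "qb:attribute", "qb:measure")

def component_property(spec):
    """Return the component property that a component specification points at."""
    for predicate in SPECIFIC_PREDICATES:
        if predicate in spec:
            return spec[predicate]["@id"]
    raise ValueError("component specification has no dimension, attribute or measure")

# Hypothetical component specifications, as they might appear in the JSON-LD.
components = [
    {"@id": "#component/region", "qb:dimension": {"@id": "#dimension/region"}},
    {"@id": "#component/unit", "qb:attribute": {"@id": "#attribute/unit"}},
    {"@id": "#component/life-expectancy", "qb:measure": {"@id": "#measure/life-expectancy"}},
]

# Equivalent to the qb:componentProperty triples, derived rather than stored.
derived = {spec["@id"]: component_property(spec) for spec in components}
print(derived)
```

The trade-off is the one discussed above: the file stays smaller, and each consumer pays a small cost to check three predicates instead of one.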