dcmi / dcap

DC Tabular Application Profile - supporting materials

USE CASE: Small controlled taxonomies provided as tabular data #12

Open nichtich opened 5 years ago

nichtich commented 5 years ago

Creator: Jakob Voß

Problem statement

We collect simple controlled vocabularies (term lists, classification schemes etc.) from users such as libraries and museums. Vocabularies can be (mono-)hierarchic so let's call them taxonomies. Users often manage their taxonomies without a dedicated vocabulary management tool. The most popular data format is a spreadsheet. For simplicity, each term/concept/class consists of an identifier and a label, together with hierarchy information. We came up with two CSV formats:

hierarchy information via level:

level,id,label
0,A,beings
1,A.1,animals
2,A.1.1,mammals
2,A.1.2,insects
1,A.2,humans

hierarchy information via parents:

id,label,parent
A,beings,
A.1,animals,A
A.1.1,mammals,A.1
A.1.2,insects,A.1
A.2,humans,A
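
Both formats carry the same hierarchy. As an illustration only, here is a minimal Python sketch that derives the parent column from the level column. It assumes the rows are listed in depth-first order, so the parent of a row at level n is the most recent preceding row at level n - 1. The names `levels_to_parents` and `EXAMPLE` are hypothetical and not part of the use case.

```python
import csv
import io

# The level-based example table from above.
EXAMPLE = """\
level,id,label
0,A,beings
1,A.1,animals
2,A.1.1,mammals
2,A.1.2,insects
1,A.2,humans
"""


def levels_to_parents(level_csv: str) -> list[dict]:
    """Convert level-based rows into parent-based rows.

    Assumption: rows are in depth-first order, so the parent of a row
    at level n is the most recent preceding row at level n - 1.
    """
    rows = []
    last_at_level = {}  # level -> id of the most recent row at that level
    for row in csv.DictReader(io.StringIO(level_csv)):
        level = int(row["level"])
        if level > 0 and level - 1 not in last_at_level:
            raise ValueError(
                f"row {row['id']!r} has no preceding row at level {level - 1}"
            )
        parent = last_at_level[level - 1] if level > 0 else ""
        last_at_level[level] = row["id"]
        # Forget deeper levels: they belong to a subtree that has ended.
        for deeper in [n for n in last_at_level if n > level]:
            del last_at_level[deeper]
        rows.append({"id": row["id"], "label": row["label"], "parent": parent})
    return rows


for converted in levels_to_parents(EXAMPLE):
    print(converted["id"], converted["label"], converted["parent"], sep=",")
```

Going the other way (parent to level) would need a lookup of each parent's level, which only works row by row if parents are listed before their children.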

The question is how to communicate constraints of these formats.

Stakeholders

GLAM institutions that want to share simple taxonomies

Requirements

kcoyle commented 5 years ago

@nichtich I'm trying to make these into AP requirements - and some seem to be requirements on the vocabularies themselves. So there's a need to translate these into what an AP can do to support the taxonomies. Much of what is here looks very much like SKOS, of course, although it is expressed in csv rather than RDF. It looks to me that:

  1. "concept" here is an entity using our current terminology
  2. a concept would have statements: level, label and id (I'm assuming that parent/child would be derived from levels)
  3. it needs to be possible to validate language tags across the entire profile
  4. the ids may need to conform to a pattern that is determined by levels

I'm not sure what to say about "ids should be sorted" since that seems to be beyond the capacity of a profile itself as long as the sort element is identified. Also, it isn't clear to me what would be used for sorting: labels? levels?

In any case, does this capture your requirements?

nichtich commented 5 years ago

I moved some requirements to #21. In this use case a CSV table row corresponds to an entity with statements id, label, and either level (or parent). You wrote:

it needs to be possible to validate language tags across the entire profile

The data does not include language tags but natural language strings that should all be in the same language. The profile could specify a language by its language tag but validation is not possible automatically.

the ids may need to conform to a pattern that is determined by levels

yes.
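
To make "a pattern that is determined by levels" concrete, here is a hedged Python sketch. The pattern is an assumption read off the example ids only (a capital letter followed by one ".number" segment per level); a real profile would have to state its own pattern. The function names are hypothetical.

```python
import re


def id_pattern(level: int) -> re.Pattern:
    """Build the assumed id pattern for a given level.

    Assumption taken from the example data only: level 0 is a single
    capital letter, and each further level appends one ".<digits>" segment.
    """
    return re.compile(r"[A-Z](\.[0-9]+){%d}" % level)


def id_matches_level(identifier: str, level: int) -> bool:
    """True if the whole identifier matches the pattern for its level."""
    return id_pattern(level).fullmatch(identifier) is not None


print(id_matches_level("A", 0))      # level 0: just the letter
print(id_matches_level("A.1.2", 2))  # level 2: letter plus two segments
print(id_matches_level("A.1", 2))    # one segment too few for level 2
```

A stricter variant could also require each id to start with its parent's id followed by a dot.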

I'm not sure what to say about "ids should be sorted"

If the data contains a level column, the CSV rows must be sorted the same way as the hierarchy tree would be shown (left-to-right, depth-first pre-order traversal). A slightly simplified constraint would be: every parent entity must be given before its child entities.
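
Both ordering constraints can be checked mechanically. A minimal Python sketch, with hypothetical function names: for the level format it assumes the first row is at level 0 and that a level may rise by at most one from one row to the next; for the parent format it checks the simplified rule that every parent is given before its children.

```python
def check_level_order(levels: list[int]) -> list[str]:
    """Check a sequence of levels, given in row order.

    Assumption: the first row is at level 0, and a row may be at most
    one level deeper than the row before it.
    """
    problems = []
    previous = None
    for number, level in enumerate(levels, start=1):
        if previous is None:
            if level != 0:
                problems.append(f"row {number}: first row must be at level 0")
        elif level < 0 or level > previous + 1:
            problems.append(
                f"row {number}: level {level} is not reachable from level {previous}"
            )
        previous = level
    return problems


def check_parents_first(rows: list[tuple[str, str]]) -> list[str]:
    """Check (id, parent) pairs in row order; parent is '' for top-level rows."""
    problems = []
    seen = set()
    for number, (identifier, parent) in enumerate(rows, start=1):
        if parent and parent not in seen:
            problems.append(
                f"row {number}: parent {parent!r} of {identifier!r} is not given before it"
            )
        seen.add(identifier)
    return problems


print(check_level_order([0, 1, 2, 2, 1]))
print(check_parents_first([("A.1", "A"), ("A", "")]))
```

Neither check can be expressed by per-column constraints alone, since both depend on preceding rows.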