Problem statement

We collect simple controlled vocabularies (term lists, classification schemes etc.) from users such as libraries and museums. Vocabularies can be (mono-)hierarchic so let's call them taxonomies. Users often manage their taxonomies without a dedicated vocabulary management tool. The most popular data format is a spreadsheet. For simplicity, each term/concept/class consists of an identifier and a label, together with hierarchy information. We came up with two CSV formats:

hierarchy information via level:

level,id,label
0,A,beings
1,A.1,animals
2,A.1.1,mammals
2,A.1.2,insects
1,A.2,humans

hierarchy information via parents:

id,label,parent
A,beings,
A.1,animals,A
A.1.1,mammals,A.1
A.1.2,insects,A.1
A.2,humans,A
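
The two formats carry the same information as long as the rows of the level format are listed in tree order. A minimal sketch of the conversion from levels to parents, assuming the column names from the examples above (the function name `levels_to_parents` is made up for illustration):

```python
import csv
import io

LEVEL_CSV = """level,id,label
0,A,beings
1,A.1,animals
2,A.1.1,mammals
2,A.1.2,insects
1,A.2,humans
"""

def levels_to_parents(text):
    """Convert the level-based format into rows of the parent-based format.

    Assumes rows are listed in tree order (each parent before its children).
    """
    rows = []
    ancestors = []  # ancestors[n] is the id of the most recent row at level n
    for row in csv.DictReader(io.StringIO(text)):
        level = int(row["level"])
        if level > len(ancestors):
            raise ValueError(f"row {row['id']!r} skips a level")
        # Forget everything at this level or deeper; what remains ends in the parent.
        del ancestors[level:]
        parent = ancestors[-1] if ancestors else ""
        rows.append({"id": row["id"], "label": row["label"], "parent": parent})
        ancestors.append(row["id"])
    return rows

for row in levels_to_parents(LEVEL_CSV):
    print(row["id"], row["label"], row["parent"], sep=",")
```

The reverse direction (parents to levels) would additionally need a rule for ordering siblings, since the parent format does not fix the row order.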

The question is how to communicate constraints of these formats.

Stakeholders

GLAM institutions that want to share simple taxonomies

Requirements

every concept must have a unique id
ids must match a regular expression (optional constraint)
every concept must have a label (with optional uniqueness constraint)
natural language of labels must be the same for all concepts
ids of child concepts must start with the id of their parent concept (optional constraint)
the level/parent information must form a (mono-)hierarchy
- every parent must be the id of another concept
- cycles are not allowed
- ...
ids should be sorted
...
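
To make the listed constraints more concrete, here is a rough sketch of how they could be checked for the parent format. It is only an illustration under assumptions: the function name `validate` and its options are hypothetical, rows are plain dicts with the keys id, label and parent, and the same-language requirement is left out because it cannot be tested mechanically. Mono-hierarchy is implied by the format itself, since each row has at most one parent.

```python
import re

def validate(rows, id_pattern=None, unique_labels=False, prefix_rule=False):
    """Return a list of error messages; an empty list means no violation found."""
    errors = []
    seen = set()
    for row in rows:
        concept_id = row["id"]
        if concept_id in seen:
            errors.append(f"duplicate id: {concept_id}")
        seen.add(concept_id)
        # Optional constraint: ids must match a regular expression.
        if id_pattern and not re.fullmatch(id_pattern, concept_id):
            errors.append(f"id does not match pattern: {concept_id}")
        if not row["label"]:
            errors.append(f"missing label: {concept_id}")

    labels = [row["label"] for row in rows]
    if unique_labels and len(set(labels)) != len(labels):
        errors.append("labels are not unique")

    for row in rows:
        parent = row["parent"]
        if not parent:
            continue  # top-level concept
        if parent not in seen:
            errors.append(f"unknown parent {parent} in row {row['id']}")
        elif prefix_rule and not row["id"].startswith(parent):
            errors.append(f"id {row['id']} does not start with parent id {parent}")

    # Cycle check: walk up the parent chain from every concept.
    parent_of = {row["id"]: row["parent"] for row in rows}
    for start in parent_of:
        visited = set()
        node = start
        while node:
            if node in visited:
                errors.append(f"cycle reached from: {start}")
                break
            visited.add(node)
            node = parent_of.get(node, "")
    return errors

ROWS = [
    {"id": "A", "label": "beings", "parent": ""},
    {"id": "A.1", "label": "animals", "parent": "A"},
    {"id": "A.1.1", "label": "mammals", "parent": "A.1"},
    {"id": "A.1.2", "label": "insects", "parent": "A.1"},
    {"id": "A.2", "label": "humans", "parent": "A"},
]
print(validate(ROWS, id_pattern=r"[A-Z](\.\d+)*", unique_labels=True, prefix_rule=True))
```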

kcoyle commented 5 years ago

@nichtich I'm trying to make these into AP requirements - and some seem to be requirements on the vocabularies themselves. So there's a need to translate these into what an AP can do to support the taxonomies. Much of what is here looks very much like SKOS, of course, although it is expressed in csv rather than RDF. It looks to me that:

"concept" here is an entity using our current terminology
a concept would have statements: level, label and id (I'm assuming that parent/child would be derived from levels)
it needs to be possible to validate language tags across the entire profile
the ids may need to conform to a pattern that is determined by levels

I'm not sure what to say about "ids should be sorted" since that seems to be beyond the capacity of a profile itself as long as the sort element is identified. Also, it isn't clear to me what would be used for sorting: labels? levels?

In any case, does this capture your requirements?

nichtich commented 5 years ago

I moved some requirements to #21. In this use case, a CSV table row corresponds to an entity with the statements id, label, and either level or parent. You wrote:

it needs to be possible to validate language tags across the entire profile

The data does not include language tags, only natural language strings that should all be in the same language. The profile could specify that language by its language tag, but the constraint cannot be validated automatically.

the ids may need to conform to a pattern that is determined by levels

yes.
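
As an illustration of a level-dependent id pattern, the following sketch assumes the dotted scheme of the example data (one capital letter, then one numeric segment per level). That scheme is an assumption read off the examples, not a stated rule, and both function names are hypothetical.

```python
import re

def id_pattern_for_level(level):
    """Build the pattern for one level.

    Assumed scheme: a capital letter followed by exactly `level`
    segments of the form ".<number>", e.g. "A.1.2" at level 2.
    """
    return re.compile(r"[A-Z](\.[0-9]+){%d}" % level)

def id_matches_level(concept_id, level):
    """True if the id has the shape expected at the given level."""
    return id_pattern_for_level(level).fullmatch(concept_id) is not None

print(id_matches_level("A.1.2", 2))  # id and level agree
print(id_matches_level("A.1", 2))    # id is too short for level 2
```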

I'm not sure what to say about "ids should be sorted"

If the data contains a level column, the CSV rows must be sorted in the order in which the hierarchy tree would be displayed (left-to-right, depth-first traversal). A slightly simplified constraint would be: every parent entity must be given before its child entities.
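
Both ordering constraints can be tested mechanically. The sketch below is one possible reading, with hypothetical function names: for the level format, a row sequence is treated as a valid depth-first listing if it starts at level 0 and no row is more than one level deeper than the row before it; for the parent format, only the simplified "parent before child" rule is checked.

```python
def levels_are_depth_first(levels):
    """Check a sequence of level numbers taken from the rows in file order."""
    previous = -1  # forces the first row to be at level 0
    for level in levels:
        if level < 0 or level > previous + 1:
            return False
        previous = level
    return True

def parents_come_first(rows):
    """Check that every non-empty parent id appears in an earlier row."""
    seen = set()
    for row in rows:
        if row["parent"] and row["parent"] not in seen:
            return False
        seen.add(row["id"])
    return True

print(levels_are_depth_first([0, 1, 2, 2, 1]))  # levels of the example table
print(parents_come_first([
    {"id": "A.1", "parent": "A"},  # child listed before its parent
    {"id": "A", "parent": ""},
]))
```

In the level format the sibling order is defined by the row order itself, so "left-to-right" cannot be checked separately there.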

dcmi / dcap

USE CASE: Small controlled taxonomies provided as tabular data #12
