SAEON / odp-server

Source code for the SAEON Open Data Platform server components.
GNU Affero General Public License v3.0

Keyword API #30

Closed: marksparkza closed 1 month ago

marksparkza commented 2 months ago

This issue proposes a new top-level entity, keyword, to replace vocabulary and vocabulary_term. A keyword is inherently hierarchical. A vocabulary is simply a keyword that is the parent of a set of keywords. Any keyword in such a set may itself represent a sub-vocabulary with its own set of keywords, and so on.

Motivated by #22 and informed by the development of a package contributor tag representing an individual author or contributor associated with a given data package. Each author or contributor typically has one or more affiliations, either selected from a managed institution vocabulary or added inline as an unlisted affiliation pending addition to the vocabulary.

Keyword identification is path-based. For example, the Institution keyword represents the institution vocabulary, and the Institution/DFFE keyword represents the DFFE and includes relevant metadata (title, ROR, URL, etc.), while Institution/DFFE/OC could represent the Oceans & Coasts subdivision of the DFFE.

The keyword API is simple, intuitive and recursive.

marksparkza commented 1 month ago

Using path syntax for keyword identification might actually be disadvantageous. It requires overriding routing behaviour to enable a path to be interpreted as an identifier. While this is easy to do in frameworks like FastAPI and Flask, it still creates a dependency that might be difficult to remove at a later stage. This is not a far-fetched scenario: we cannot assume that / will never be a valid character in an individual keyword - it might be part of an institutional acronym, for example. This is an important consideration because the ODP allows keywords to be created by users (though subject to curator approval), not just predefined as in, for example, the ICD-10 medical vocabulary.
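To make the concern concrete, here is a minimal sketch of what goes wrong when the parent is inferred from a /-separated key. The function name infer_parent and the keyword "A/B Research" are hypothetical, made up purely for illustration; they are not part of the ODP.

```python
from typing import Optional


def infer_parent(key: str, sep: str = "/") -> Optional[str]:
    """Infer the parent key by dropping the last separator-delimited element.

    Returns None when the key contains no separator (a root keyword).
    """
    parent, found, _leaf = key.rpartition(sep)
    return parent if found else None


# Works as intended while no individual keyword contains the separator.
print(infer_parent("Institution/DFFE/OC"))

# Hypothetical keyword "A/B Research" directly under Institution: the
# inferred parent is "Institution/A", a keyword that was never defined,
# rather than the intended "Institution".
print(infer_parent("Institution/A/B Research"))
```

The second call shows that once / can appear inside a keyword, the key alone no longer determines where the parent ends and the child begins.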

A more future-proof approach would be to define a keyword hierarchy separator as a multi-character sequence; perhaps even to pad it with whitespace - because multi-level keywords should be easily readable by humans as well as machines. For example, ::, as in:

marksparkza commented 1 month ago

There is another way.

At the database level, the keyword hierarchy is managed in two ways: with a foreign key reference to itself (adjacency list pattern), and a check constraint that ensures that every parent key is a substring of all of its child keys. The uniqueness constraint on key ensures further that a parent key is always a proper substring of its child keys. The definition of a key as a concatenation of its complete hierarchy is important, because the same 'leaf' value may appear in multiple hierarchies. However, there is no intrinsic need for a parent-child separator, nor for any such separator to be standardised.
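A rough sketch of those three constraints, using an in-memory SQLite database as a stand-in. The actual ODP schema is not shown in this thread, so the table layout, the column names and the helper functions create_keyword_db and add_keyword are all assumptions for illustration only.

```python
import sqlite3
from typing import Optional


def create_keyword_db() -> sqlite3.Connection:
    """Create an in-memory keyword table (assumed layout, not the ODP schema)."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        """
        CREATE TABLE keyword (
            -- uniqueness constraint on key
            key TEXT PRIMARY KEY,
            -- adjacency list: self-referencing foreign key
            parent_key TEXT REFERENCES keyword (key),
            -- the parent key must occur somewhere within the child key
            CHECK (parent_key IS NULL OR instr(key, parent_key) > 0)
        )
        """
    )
    return conn


def add_keyword(conn: sqlite3.Connection, key: str,
                parent_key: Optional[str] = None) -> None:
    """Insert a keyword; raises sqlite3.IntegrityError if a constraint fails."""
    conn.execute(
        "INSERT INTO keyword (key, parent_key) VALUES (?, ?)",
        (key, parent_key),
    )


conn = create_keyword_db()
add_keyword(conn, "Institution")
add_keyword(conn, "Institution/DFFE", "Institution")
```

Note that nothing in the table definition mentions a separator: the check only asks that the parent key appears within the child key.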

The / separator was originally introduced with the idea of representing keywords consistently and intuitively in URLs. From an API perspective, this made parent keys implicit: a parent key could be inferred from any child simply by dropping the child path element (like inferring the containing directory from a file path). But if the API explicitly requires the parent key to be specified when creating a new key (and explicitly returns the parent key when the metadata for any key is requested), then there is no need for a standardised separator.

This gives us the freedom to define keywords and their sub-keywords in whatever way is most appropriate for the vocabulary. For example, we might use a simple dot-notation for SDGs, such as SDG:14, SDG:14.1.3.b, etc., while using / or > (or no separator at all) for other vocabularies.
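As an illustration only, the explicit-parent idea might look something like the following in-memory sketch. The Keyword and KeywordStore names, the method signatures, the SDG parent assignments and the "Theme > Marine" keyword are all hypothetical; this is not the ODP API.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Keyword:
    key: str
    parent_key: Optional[str]  # always returned explicitly, never inferred


class KeywordStore:
    """Hypothetical in-memory store; parents are stated, not parsed from keys."""

    def __init__(self) -> None:
        self._keywords: Dict[str, Keyword] = {}

    def create(self, key: str, parent_key: Optional[str] = None) -> Keyword:
        if key in self._keywords:
            raise ValueError(f"keyword already exists: {key!r}")
        if parent_key is not None:
            if parent_key not in self._keywords:
                raise ValueError(f"unknown parent: {parent_key!r}")
            # Mirrors the substring rule; no particular separator is assumed.
            if parent_key not in key:
                raise ValueError("parent key must be a substring of the key")
        keyword = Keyword(key, parent_key)
        self._keywords[key] = keyword
        return keyword

    def get(self, key: str) -> Keyword:
        return self._keywords[key]


store = KeywordStore()
store.create("SDG")
store.create("SDG:14", parent_key="SDG")
store.create("SDG:14.1.3.b", parent_key="SDG:14")
store.create("Theme")
store.create("Theme > Marine", parent_key="Theme")
print(store.get("SDG:14.1.3.b").parent_key)
```

Because the caller supplies the parent and the store returns it, each vocabulary can use its own notation and no code ever has to split a key.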

marksparkza commented 1 month ago

Incidentally, we have settled on not defining institutions hierarchically, due to the difficulties of ascertaining and maintaining such hierarchies - for both SA and foreign institutions.