[CWM] Add vocabs API route to the ts-rest contract

wu-lee commented 1 week ago

Description

The mykomap client will need to get information about pin attribute categories. Specifically:

What categories exist. For each:
- A unique category ID
- A list of term IDs
- For each language supported by the map (although this might be only one language at a time):
- A title for the category as a whole
- A title for each term

This information was provided by the vocabs.json format in older versions of Mykomap. It might be convenient to retain that format/structure here.

Examples of this can be visible on the data server. Some specific examples:

A cut-down example from the Mykomap documentation follows. This defines one vocabulary defining standard size definitions. The SKOS vocab URI for this is https://example.com/sizes/1.1/. An abbreviation prefix is defined for use in this context, sz. Within the rest of the JSON document, a sz: prefix represents that URI, and the URI for a term can be reconstructed by replacing the prefix and the colon with that URI.

Languages are identified by a two-character (upper case) code as defined by ISO 639 Set 1. There is only one language section here, for EN, with term labels in English. But there could be as many as required.

Both the prefixes and the language codes should be unique within this file. The same set of languages should be present for all vocabs. (When they are not, this might be an error, or it might be resolved by rules to fall-back to some other language, but that's outside the scope of this proposal.)

{
  "prefixes": {
    "https://example.com/sizes/1.1/": "sz"
  },
  "vocabs": {
    "sz:": {
      "EN": {
        "title": "Sizes",
        "terms": {
          "sz:large": "Large",
          "sz:medium": "Medium",
          "sz:small": "Small"
        }
      }
    }
  }
}

Use as language files

The older version of Mykomap was also starting to use these vocabs for language translations, which are functionally very similar.

Not all translations can (or should) be hardcoded into the application, either the old or the new one:

Specific text for custom layouts may be required.
Languages not included in the base application may need to be supported.

See the ui: vocab in the second example above for an example of a vocab used to define the labels for UI text not built into the Mykomap application.

Linked data interoperability

These vocab files were designed to be interoperable with linked-data SKOS vocabularies (hence the name "vocab" rather than "taxonomy" or "category")

They don't include everything a SKOS vocab can, just a subset. But a SKOS vocab can be used to create one of these JSON vocab files, and SKOS vocab can be constructed from a vocab file when some supplemental data is added.

There are scripts for converting these formats in the map-sse repository:

Converting SKOS vocabs to TSV "phrasetab" files and back without loss of information. This is to allow import to a spreadsheet for translation (potentially automatically)
Converting one or more SKOS vocabs (which are considered the "source of truth") to the vocabs.json format.
In principle SKOS vocabs can be converted to documentation files. However the tool to do this was never fully completed.

API Proposal

The endpoint could look like this (in sketch form):

https://example.com/api/vocab/:vocabId - get a single vocab by ID, with all supported languages. Invalid vocab IDs should be an error.
https://example.com/api/vocab/:vocabId1;vocabId2 - get multiple vocabs, likewise. Invalid vocab IDs should be an error.
https://example.com/api/vocab/:vocabId1/:languageId1;:languageId2 - get a vocab by ID, but just certain languages. Missing vocabs an error as before, but missing languages might not be an error? In which case the application should be smart and substitute a "default" language. This is what the old Mykomap app does with missing languages.

A successful response should returned JSON data in the body, and this should essentially look like the vocab JSON format examples above.

(A more formal proposal expressed as a ts-rest contract should be added as a preliminary implementation step.)

Caveats:

I wonder if the colons in the QName IDs for vocabs are spurious and ought to be removed. For instance, referring to the short vocab example, maybe sz should be the key instead of sz:
I also see duplication in the terms. Again, looking at the short example, sz: is repeated in all the terms, but could be inferred by the fact they are within the sz: vocab section. If implemented, both this and the previous point would, make the vobab schema diverge from the old format. However, it should be possible to make the old (or new) parser tolerant of the differences. In which case, this might be an opportunity for incremental improvements.
As such, perhaps a schema version field might be worth including, perhaps under meta where it is otherwise ignored.
The meta attribute is specifically meant for diagnosing problems with the old Mykomap vocab proxy service, which talks to a SPARQL endpoint and generates a vocab from that. It's probably no longer really useful.
The prefixes attribute might not strictly be needed in our case here. But saying that it is necessary for conversion of the QNames back to URIs, needed for linked-data compatibility, so perhaps it is worth preserving.

Back end

Note: at the time of writing the back-end is not fully implemented, instead it just regurgitates canned data files.

If that is still the case when this issue is implemented, use canned-file regurgitation as before. A full implementation will probably then need to be added by the relevant issue for implementing the back-end.

Acceptance Criteria

[x] ts-rest contract proposal
[x] ts-rest contract reviewed
[x] endpoint implemented
[ ] tests included
[ ] tests pass
[x] stub back-end implemented
[ ] back-end implemented fully

rogup commented 1 week ago

@wu-lee In the back-end design, the vocabs.json lives in the folder for a dataset, since it needs to match up with the fields in that dataset (and from my understanding is generated by the data factory for a specific dataset CSV)

So, I'm wondering why we would need to get one of multiple vocabs individually by their IDs, rather than just using a URL like https://example.com/api/:datasetId/vocab ?

Also, another thing I was thinking was that we still don't have a way of specifying which field to use in the 'directory' panel. Maybe we could specify this in the vocabs.json e.g. something like a primary_category in the meta section? Or maybe it makes more sense for it to live in the map's UI config, along with config such as which fields are filterable.

wu-lee commented 1 week ago

So, I'm wondering why we would need to get one of multiple vocabs individually by their IDs, rather than just using a URL like

Yes, I was imagining the front end would know the vocabs and languages to ask for from elsewhere, but you're right, perhaps it would only know the dataset ID.

Note re. "one of" - we will potentially need multiple vocabs (plural) for any given case (e.g. countries, activities, etc.).

Although presumably the client would know the languages it needs. Beyond limitations set by the languages available, the choice would I think depend on the client rather than the dataset - perhaps inferred from the browser's locale or some other parameter. And in most cases just one language - we could require a map reload to get a language switch.

So maybe we can change the multiple vocab IDs to a single dataset ID and put this endpoint under the dataset route. e.g.

https://example.com/api/dataset/:datasetId/vocab/:langId

I wonder if there might also need to be a way to get the vocabs by ID, but probably we could add that in a separate endpoint.

wu-lee commented 1 week ago

Also, another thing I was thinking was that we still don't have a way of specifying which field to use in the 'directory' panel.

True. Perhaps instead of purely a vocabs API endpoint, we need a "config" endpoint, which includes the vocabs in one attribute, and any other options. So

https://example.com/api/dataset/:datasetId/config/:langId

This broadens the scope of this issue somewhat. We could start with just a the vocabs though and add more things as we go?

rogup commented 1 week ago

So, I'm wondering why we would need to get one of multiple vocabs individually by their IDs, rather than just using a URL like

Yes, I was imagining the front end would know the vocabs and languages to ask for from elsewhere, but you're right, perhaps it would only know the dataset ID.

Note re. "one of" - we will potentially need multiple vocabs (plural) for any given case (e.g. countries, activities, etc.).

Although presumably the client would know the languages it needs. Beyond limitations set by the languages available, the choice would I think depend on the client rather than the dataset - perhaps inferred from the browser's locale or some other parameter. And in most cases just one language - we could require a map reload to get a language switch.

So maybe we can change the multiple vocab IDs to a single dataset ID and put this endpoint under the dataset route. e.g.
https://example.com/api/dataset/:datasetId/vocab/:langId
I wonder if there might also need to be a way to get the vocabs by ID, but probably we could add that in a separate endpoint.

If we're wanting to keep the logic simple (especially on the back-end so that we can just serve a static file), do we need the langId parameter at all? I can't see the harm in sending the whole file with all the languages, unless the file is very big, and then the front-end logic would be simpler too, rather than needing a reload if the language changes.

rogup commented 1 week ago

This broadens the scope of this issue somewhat. We could start with just a the vocabs though and add more things as we go?

Agreed, we can just hardcode some of it for now, until we work out the method of storing and retrieving config.

Long-term, I like the idea of having a https://example.com/api/dataset/:datasetId/config API. I think the config file could live in source control in something like a map-variants library, so maybe we would want to keep the vocabs separate.

We could potentially have just 1 back-end server and 1 build of the front-end, which is used by all maps. Which is customised by the URL parameters in order to dynamically load a specific set of data and UI config. This would be a lot simpler than having multiple builds of the front-end with built in UI layouts. But maybe it's limited if we need more customisation such as styling and images, which may be difficult to dynamically load in. This is all out of scope though

wu-lee commented 1 week ago

So, I'm wondering why we would need to get one of multiple vocabs individually by their IDs, rather than just using a URL like

Yes, I was imagining the front end would know the vocabs and languages to ask for from elsewhere, but you're right, perhaps it would only know the dataset ID. Note re. "one of" - we will potentially need multiple vocabs (plural) for any given case (e.g. countries, activities, etc.). Although presumably the client would know the languages it needs. Beyond limitations set by the languages available, the choice would I think depend on the client rather than the dataset - perhaps inferred from the browser's locale or some other parameter. And in most cases just one language - we could require a map reload to get a language switch. So maybe we can change the multiple vocab IDs to a single dataset ID and put this endpoint under the dataset route. e.g.
https://example.com/api/dataset/:datasetId/vocab/:langId
I wonder if there might also need to be a way to get the vocabs by ID, but probably we could add that in a separate endpoint.
If we're wanting to keep the logic simple (especially on the back-end so that we can just serve a static file), do we need the langId parameter at all? I can't see the harm in sending the whole file with all the languages, unless the file is very big, and then the front-end logic would be simpler too, rather than needing a reload if the language changes.

I guess we could cover all the bases by saying:

no langId for now
this gets all supported languages, i.e. you just get the whole static file
later, a langId can be added if we want to limit the languages

wu-lee commented 1 week ago

This broadens the scope of this issue somewhat. We could start with just a the vocabs though and add more things as we go?

Agreed, we can just hardcode some of it for now, until we work out the method of storing and retrieving config.

Potential misinterpretation here as I was actually meaning we could switch this to a config endpoint, but that it would omit any config options besides the vocabs. So, probably the same schema gets sent back as above (unless we prefer to nest that under an attribute, but then vocabs.vocabs.<whatever> seems to a potentially awkward consequence). And more attributes can get added later.

Long-term, I like the idea of having a https://example.com/api/dataset/:datasetId/config API. I think the config file could live in source control in something like a map-variants library, so maybe we would want to keep the vocabs separate.

In early implementations, we don't need to keep vocabs separate, cos there's only one case really, and even if not we don't really care much about duplication. In later implementations, any recombination could be done dynamically.

We could potentially have just 1 back-end server and 1 build of the front-end, which is used by all maps. Which is customised by the URL parameters in order to dynamically load a specific set of data and UI config. This would be a lot simpler than having multiple builds of the front-end with built in UI layouts.

Yes. I think the config would be implied by the datasetId, initially and/or by default. Later we could add a way to override the default config for a dataset?

But maybe it's limited if we need more customisation such as styling and images, which may be difficult to dynamically load in. This is all out of scope though

Yes

rogup commented 1 week ago

@wu-lee So are you suggesting that

the vocabs.json lives in the dataset folder, alongside the other data files for a dataset
we have a config API endpoint, and no vocabs endpoint
For this API route, the back-end pulls together the vocabs.json and any other dataset config in source control as one JSON object and sends this. But for now, the only bit of config it sends will be the vocabs.

Also a tangential question: do vocabs ever contain actual data? e.g. if the language is set to Hindi, where do the translations of the names and descriptions live, or do we not bother translating these? If they do contain translated data, config probably isn't the correct name for it

wu-lee commented 1 week ago

@wu-lee So are you suggesting that

the vocabs.json lives in the dataset folder, alongside the other data files for a dataset Maybe, this is one option - but I was thinking now it could just be stored in a verbatim config file.

we have a config API endpoint, and no vocabs endpoint Yes. Sent verbatim. Only parameter being datasetId for now. Later language, if only one is wanted.

Later, some combination of vocabs into the config could be done to avoid duplication when shared between datasets. But this is not entirely necessary now, if we wanted we could do that in the script which generates the stored config.json

* For this API route, the back-end pulls together the vocabs.json and any other dataset config in source control as one JSON object and sends this. But for now, the only bit of config it sends will be the vocabs.
This combination not necessary in the current version of the proposal. After all, this would will be repeating the same computation every request, so we may as well just do it once.

Also a tangential question: do vocabs ever contain actual data? e.g. if the language is set to Hindi, where do the translations of the names and descriptions live, or do we not bother translating these? If they do contain translated data, config probably isn't the correct name for it

The vocabs can contain whatever you want, in one sense. But I wouldn't consider them data, as they are indexes for interpreting the IDs, not instances of the IDs related to some organisation.

The translations are interpretations of IDs and live in the vocab. In this proposal these are stored in the config. Maybe config isn't the right name, but I'm not sure what is better?

rogup commented 1 week ago

Cool, thanks for the clarifications, this all makes sense to me now!

rogup commented 1 week ago

@wu-lee When you make these changes, I'm thinking we should also change the getDataset route to getLocations since it only returns the locations

wu-lee commented 5 days ago

@rogup I've got a contract with some tests for the validations in the branch now.

When you get a moment could you have a look and see if it looks sensible? This then satisfies the first and second checkboxes.

wu-lee commented 5 days ago

After our earlier discussion I've added

rogup commented 2 days ago

@wu-lee I've reviewed the contract and it looks fine, just a clarifying question as a comment in the above PR

wu-lee commented 1 hour ago

@rogup - I've revised the branch, and so it probably needs another quick look.

Mentioned this to Rohit elsewhere, but for context here, it wasn't finished, so wasn't intended to be merged yet. I've finished it now, adding tests for validating ConfigData and integrating some fixups and other minor changes.

I would like to re-merge it in the right place. Unfortunately there has been a commit to main since then. Fortunately I do think I can undo this, but

it needs a force-push to main
which needs a (temporary) removal of the branch protection on main

Since @rohit is away today, prior to actually doing that I've made a temporary new branch new-main which is my proposed change to main, and rebased the branch onto that. @ms0ur1s - just to be aware of this: I don't think your about to commit anything else to main but let me know if you want to.

DigitalCommons / mykomap-monolith