Vocabulary/taxonomy improvements

davidread commented 9 years ago

CKAN currently has a basic model for creating a taxonomy and loading terms into it via the API, and you can create a form field that accepts values from a taxonomy. Terms can be translated.

However, at DGU we want to categorize each dataset using a hierarchical taxonomy (we are thinking COFOG) that has been published online, and terms will be added/removed over time.

a term needs an ID (e.g. URI) attached
terms need to reference other terms in the vocabulary (e.g. for a tree hierarchy)
script that syncs terms from another source (e.g. an online RDF file containing hundreds of terms)
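
As a rough illustration of the three requirements above, here is a minimal sketch of a term that carries a URI and a parent reference, plus the diff a sync script would compute against an upstream source. The `Term` class, the `diff_terms` helper and the example URIs are assumptions made for illustration; none of them exist in CKAN.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class Term:
    """Hypothetical vocabulary term; not a CKAN model."""
    uri: str                          # stable ID for the term
    label: str
    parent_uri: Optional[str] = None  # None for a top-level term


def diff_terms(
    local: Dict[str, Term], remote: Dict[str, Term]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (added, removed, changed) URIs needed to make local match remote."""
    added = set(remote) - set(local)
    removed = set(local) - set(remote)
    changed = {u for u in set(local) & set(remote) if local[u] != remote[u]}
    return added, removed, changed


# Example URIs are made up for the sketch.
local = {
    "http://example.org/t/health": Term("http://example.org/t/health", "Health"),
    "http://example.org/t/old": Term("http://example.org/t/old", "Old term"),
}
remote = {
    "http://example.org/t/health": Term("http://example.org/t/health", "Health services"),
    "http://example.org/t/hospitals": Term(
        "http://example.org/t/hospitals", "Hospitals", "http://example.org/t/health"
    ),
}
added, removed, changed = diff_terms(local, remote)
```

Keying everything on the URI means a relabelled term shows up as a change rather than as a remove plus an add.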

As a related item, it would also be handy to be able to call a service that can auto-categorize a dataset using:

categorization in another vocabulary (e.g. ONS custom one, or Eurovoc)
text analysis and machine learning based on manually curated training data

We've found people are terrible at selecting categories for their datasets.

rossjones commented 9 years ago

tl;dr: Let's leave tags alone to act as free-form tags, and write this as an extension until it is mature enough to make it into core?

Issues

Extending tags to be hierarchical and to carry the extra required fields will introduce some problems.

Differentiating between the ‘name’ and the label. Currently tags are found by name, and are also displayed with that name - where the name is a slug (lowercase alpha, _ and -). Any addition of a label will need to make sure the label is used for display instead of the name.
Adding a URI field will work, but presumably that information will need to be exposed somewhere to be useful.
Having a hierarchy (via a parent tag), which would presumably be allowed to be NULL, raises a few questions:
1. Presumably the addition of tags via the UI will need to be tree-based.
2. How are child tags resolved? If you add a tag at one level down, do we auto-add all of the tags beneath that? Are they resolved at runtime or at creation?
3. Depending on how resolution is handled in 2, adding/removing tags becomes more complex.
Tags are currently unilingual and vocabularies are merely a flag (id and name), so support for other languages must be worked around at the vocab level.

Possible Solution

SKOS appears to be a linked-data format for defining controlled vocabularies which offers support for:

Hierarchies (through narrower and broader)
Language specific labels
Preferred and Alternative Labels
URIs
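
To make the four features above concrete, here is a small sketch of a SKOS-shaped concept in Python. The property names in the comments (`skos:prefLabel`, `skos:altLabel`, `skos:broader`, `skos:narrower`) come from SKOS; the `Concept` class and `narrower_index` helper are hypothetical, and limiting a concept to a single broader parent is a simplifying assumption (SKOS itself allows several).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    """Hypothetical in-memory concept modelled loosely on SKOS."""
    uri: str
    pref_label: Dict[str, str]  # skos:prefLabel, keyed by language code
    alt_labels: Dict[str, List[str]] = field(default_factory=dict)  # skos:altLabel
    broader: Optional[str] = None  # skos:broader, URI of the parent concept


def narrower_index(concepts: List[Concept]) -> Dict[str, List[str]]:
    """Derive skos:narrower (parent URI -> child URIs) from the broader links."""
    index: Dict[str, List[str]] = {}
    for concept in concepts:
        if concept.broader is not None:
            index.setdefault(concept.broader, []).append(concept.uri)
    return index


# Example data is invented for the sketch.
concepts = [
    Concept("http://example.org/c/health", {"en": "Health", "fr": "Santé"}),
    Concept(
        "http://example.org/c/hospitals",
        {"en": "Hospitals"},
        alt_labels={"en": ["Clinics"]},
        broader="http://example.org/c/health",
    ),
]
narrower = narrower_index(concepts)
```

Storing only the broader link and deriving narrower keeps the two directions from drifting apart.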

Perhaps a better approach to extending (and breaking) the current tag vocabs is to take an alternative approach, until such a time as it is usable enough to replace them in core?

We might instead build ckanext-taxonomy (maybe even as a core extension) to provide the functionality we require, modelled on SKOS. It would allow import from a SKOS file (as well as the obvious API to create a vocab), and then access to the hierarchy from its own logic layer. This would allow us to leave free-form tags (which are tied directly to packages) and the existing unidimensional vocabularies, and perhaps just deprecate them later?

I'm not sure this totally addresses the issue of when to resolve the actual values when an item is chosen halfway down the tree: are the narrower (child) items added at runtime, or at creation? Perhaps the best approach is to do it at runtime, and ensure that the 'narrower' definitions are also added at indexing time. This would require that an addition or deletion of a SKOS concept would mean re-indexing those items that currently use it. Perhaps this happens infrequently enough that it wouldn't be an issue.

The extension would have to provide UI for selecting items in a taxonomy, bearing in mind there may be multiple selectable values, and the hierarchical nature. These should be stored against the object (presumably package most of the time) as a list of taxonomy items, which can be used for display. The child items are likely only to be of value when faceting, so as long as they are indexed it should still be ok.
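
One way to picture the index-time option: a sketch, assuming the hierarchy is available as a plain mapping from a concept URI to its narrower URIs. The `expand_with_narrower` function and the short keys used in the example are hypothetical, not part of CKAN or any extension.

```python
from typing import Dict, Iterable, List, Set


def expand_with_narrower(
    selected: Iterable[str], narrower: Dict[str, List[str]]
) -> Set[str]:
    """Return the selected concepts plus every narrower concept beneath them."""
    result: Set[str] = set()
    stack = list(selected)
    while stack:
        uri = stack.pop()
        if uri in result:
            continue  # already visited; also guards against cycles
        result.add(uri)
        stack.extend(narrower.get(uri, []))
    return result


# Invented hierarchy: science -> geology -> caves, science -> physics
hierarchy = {"science": ["geology", "physics"], "geology": ["caves"]}
indexed_for_geology = expand_with_narrower(["geology"], hierarchy)
indexed_for_science = expand_with_narrower(["science"], hierarchy)
```

Only the selected values would be stored on the dataset; the expanded set would exist only in the search index, which is why a change to the hierarchy implies re-indexing the datasets that use the affected concept.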

Also needed would be:

Helpers for resolving labels.
Logic layer for CRUD.
Auth layer for permissions around managing taxonomies.
UI snippets for viewing/selecting taxonomy values.
Plugin for ensuring all child elements are indexed.
Validators for schema.
Plugin for managing facets (and possibly issues around hierarchical view of facets)

wardi commented 9 years ago

There is some overlap with work I'm doing on scheming: UI snippets for viewing/selecting taxonomy values, validators for schema and a method to store and share taxonomies and their translated values.

Let's write those parts in a way that allows integration with scheming.

davidread commented 9 years ago

@rossjones +1 to getting the model to work just like SKOS

TBH I'd much rather we just put this in core CKAN, rather than create an extension. Else it will never go into core. And a lot of what we need is already there. I quite like having tags joined to packages - why not? The only issue you mention that relates to building it into core is the question of searching for a tag by name vs label - and I think it is worth solving neatly, even if there is a slight API backward incompatibility introduced. We could focus on changing Tag to what we want it to be, and then plumb back in the existing 'free-form tag functionality' as a 'special case' of the new Tag.

Hierarchy - Yes, I think it is fine to demand that one creates the parent tag before its child. And yes, add in all the implied child tags at search index time - that makes sense to me.

Multilingual - Yes this needs sorting. I guess we just add a rule that if a tag is not translated into a particular language then it displays some sort of default, and the UI can show an asterisk or something to flag that. Doing a great job on multilingual is not the priority.
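
A minimal sketch of that fallback rule, assuming labels are held as a dict keyed by language code and that English is the default. The function names and the choice of default are illustrative assumptions, not existing CKAN helpers.

```python
from typing import Dict, Tuple


def display_label(
    labels: Dict[str, str], lang: str, default_lang: str = "en"
) -> Tuple[str, bool]:
    """Return (label, is_fallback); raises KeyError if the default is missing too."""
    if lang in labels:
        return labels[lang], False
    return labels[default_lang], True


def render_label(labels: Dict[str, str], lang: str) -> str:
    """Append an asterisk when the label shown is not in the requested language."""
    text, is_fallback = display_label(labels, lang)
    return text + "*" if is_fallback else text


translated = render_label({"en": "Health", "fr": "Santé"}, "fr")
untranslated = render_label({"en": "Health"}, "de")
```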

@wardi sounds good. Can you point us at your bits of code that do those three things you mention?

BTW Should tags go into SOLR now as themselves? Is the multilingual thing a good argument for that?

wardi commented 9 years ago

@davidread will do when the vocabulary/choice parts are a little more complete. For now there are the form and display snippets and use of IValidators for registering new validators.

I'm leery of building this feature into ckan's core tags. Would we be creating a new ITagForm interface to allow custom fields on tags? I don't think that applies to free-form tags at all, making the link between tags in vocabularies and free tags that much more tenuous.

Indexing could be a problem. When someone updates a tag description we need to re-index all the affected packages (this is the same problem we have with organization descriptions now)

rossjones commented 9 years ago

I'm less worried about the tag label changing, as long as the name doesn't - and I think in a lot of cases this will be imported anyway (at least for our use-case) so unlikely to change very often at all. More of a problem if something is inserted in the middle of the hierarchy as it'll mean adding another indexed value to all those datasets using the broader taxonomy value/tag.

davidread commented 9 years ago

@wardi Maybe you're right - free-form tags have no use for extra fields so don't need a schema or ITagForm. Free-form tags don't need a multilingual element either.

So you'd create a new sort of Tag for closed vocabularies and change Vocabulary to be a grouping of this new sort of Tag?

Do you need a CRUD interface or API for these vocabulary items? How about we just use a paster command to load and update them? At least in version 1 of this. That would solve the problem of how to load 1000 in one request, or reindex masses of packages during a 30s web request.

wardi commented 9 years ago

I'm planning to publish the taxonomies as a part of the scheming schemas I will be using. There's no interface for updating because the schemas are considered to be site configuration. The scheming schemas are published on the instance with action functions like scheming_datasets_schema_show.

In my case I expect to generate these taxonomies from another source (probably a spreadsheet but something like SKOS would be cool too) and then paster search-index rebuild each time there is an update.

wardi commented 9 years ago

data.gov uses groups for some taxonomies. The plus is groups can have extra fields to support multilingual representations and SKOS information. Some negatives are the group permission model would allow anyone to apply them (except in controlled ckan instance cases) and they can't be associated with a dataset or resource field, just a whole dataset.

davidread commented 9 years ago

At the meeting today we agreed it is a good idea to develop a new model for vocabularies separate from Tag.

Most people were keen for this to be a normal extension until it is proved. I was the one arguing for it to be core ckan, but on reflection let's just go with it as an extension for now.

I guess that means there can be no foreign key relation to a package, and it makes it non-simple to integrate into the schema, form, search facet etc.

We didn't quite decide whether the relation between a dataset and concept (vocab item) was:

each vocab simply has one fixed name for the field e.g. dataset['theme_vocab'] = ['http://subjects.com/geology', 'http://subjects.com/stalagtites']
or you have flexibility for multiple fields for a vocab e.g. dataset['main_theme_vocab'] = ['http://subjects.com/geology'] and dataset['sub_theme_vocab'] = ['http://subjects.com/stalagtites']
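
The two options can be compared side by side as plain dicts. This is only a sketch: the field names and URLs are the ones from the examples above, while the `concepts_for_vocab` helper is hypothetical and shows the extra bookkeeping the second option needs (knowing which fields belong to a vocab).

```python
from typing import Dict, List

# Option 1: one fixed field name per vocab.
option_1 = {
    "theme_vocab": [
        "http://subjects.com/geology",
        "http://subjects.com/stalagtites",
    ],
}

# Option 2: several fields can draw on the same vocab.
option_2 = {
    "main_theme_vocab": ["http://subjects.com/geology"],
    "sub_theme_vocab": ["http://subjects.com/stalagtites"],
}


def concepts_for_vocab(dataset: Dict[str, List[str]], fields: List[str]) -> List[str]:
    """Collect every concept URI a dataset holds across the given fields."""
    uris: List[str] = []
    for name in fields:
        uris.extend(dataset.get(name, []))
    return uris


from_option_1 = concepts_for_vocab(option_1, ["theme_vocab"])
from_option_2 = concepts_for_vocab(option_2, ["main_theme_vocab", "sub_theme_vocab"])
```

Either layout yields the same set of concepts for indexing; the difference is whether the field-to-vocab mapping is implicit (option 1) or has to be declared somewhere (option 2).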

And Ian mentioned maybe connecting resources (or even publishers) to vocab items. That should probably be left for the future, but might suggest option 2. Ross, I suggest you get stuck in and see how this pans out.

linforestzhang commented 4 years ago

Interesting Idea!

ckan / ideas

Vocabulary/taxonomy improvements #103
