vocabulary in SKOS (Turtle serialization) should be loaded even in case of lacking language tags

macsag commented 2 years ago

Hi! I've noticed some potentially inconsistent behaviour of the annif loadvoc command when loading a vocabulary in ttl SKOS without language tags.

We've recently switched our project from the simple tsv vocabulary format to SKOS. I assumed that since we do not provide any language information in the tsv file, we don't necessarily have to add language tags in SKOS either. But it turned out that loading a SKOS vocabulary without language tags prevents annif from creating the subject index (though the original ttl file is still copied and a gzipped file is dumped).

It seems that when annif converts tsv to SKOS, it adds language tags (based on the language configuration in projects.cfg), but when it loads a vocabulary directly from SKOS, it checks whether the language tag of each label matches the language code in projects.cfg. If it doesn't, or if there is no language tag at all, the label is ignored, and a concept left without any labels is skipped entirely:

def get_concept_labels(self, concept, label_types, language):
    return [str(label)
            for label_type in label_types
            for label in self.graph.objects(concept, label_type)
            if label.language == language]

@property
def subjects(self):
    for concept in self.concepts:
        labels = self.get_concept_labels(
            concept, [SKOS.prefLabel, RDFS.label], self.language)
        notation = self.graph.value(concept, SKOS.notation, None, any=True)
        if not labels:
            continue
        label = labels[0]
        if notation is not None:
            notation = str(notation)
        yield Subject(uri=str(concept), label=label, notation=notation,
                      text=None)

I think it would be safer to assume that when a label in the SKOS file has no language tag, its language is the one defined in projects.cfg, and to skip a concept (or a label of the concept) only when a language tag is present and differs from the language in projects.cfg.
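A minimal sketch of the proposed check, not Annif's actual code. To keep it self-contained, labels are modelled as plain (text, language tag) pairs rather than rdflib literals; in rdflib, the language attribute of an untagged literal is None. The names label_is_usable and usable_labels are hypothetical.

```python
from typing import Optional


def label_is_usable(label_language: Optional[str], project_language: str) -> bool:
    """Accept untagged labels; reject only labels tagged with another language."""
    if label_language is None:
        # No explicit tag: assume the label is in the project language.
        return True
    return label_language == project_language


def usable_labels(labels: list[tuple[str, Optional[str]]],
                  project_language: str) -> list[str]:
    """Return the label texts that pass the lenient language check."""
    return [text for text, lang in labels
            if label_is_usable(lang, project_language)]
```

With this check, a concept whose only label is untagged would still get a label, while a label explicitly tagged with a different language would still be ignored.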

If this change is not possible for some reason, it would be nice to document this behaviour in the annif wiki (it took me some time to find this "bug").

Maciej Sagata, National Library of Poland

osma commented 2 years ago

Thanks for the report @macsag! Sorry to hear that you had to spend a lot of time tracking down this problem. Your analysis is correct - Annif does skip concepts that don't have a label with the right language tag. Labels without a language tag are never considered.

There are actually two aspects here that could be questioned:

1. Only labels with the correct language tag are used; labels without a language tag are ignored.
2. A concept is skipped (not added to the vocabulary at all) if it doesn't have a label.

The first is based on the assumption that all well-formed SKOS vocabularies use language tags for labels. This is clearly not always the case - see for example this Skosmos issue.

The second is really a relic from earlier times (going back to the Annif prototype). There's no very good reason for it either these days - I think that including all concepts in the vocabulary, whether they have labels or not, would be better.

I suggest that both problems should be fixed in one go:

All skos:Concepts from the SKOS file should be included in the vocabulary, unless they are marked as deprecated (using owl:deprecated true). For each concept, a label is picked using the following method (first rule that matches):

1. If it has a preferred label with the correct language tag (matching the project language directly), use it.
2. If it has a preferred label that matches the project language indirectly via the lookup algorithm of BCP 47, use it. For example, if the project language is en-US, a label with the language tag en matches.
3. If it has a preferred label without a language tag, use it.
4. (optional) If it has a preferred label with any other language tag, use it (if more than one, pick one at random).
5. Construct an artificial label from the URI, shortening it to a CURIE if possible (e.g. yso:p12345 or lcsh:sh85061212).
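The rules above could be sketched roughly as follows. This is only an illustration under several assumptions, not how Annif implements it: labels are plain (text, language tag) pairs, the function names and the allow_other_languages switch are hypothetical, the BCP 47 lookup is reduced to the truncation step of RFC 4647, and the optional "any other language" rule picks the alphabetically first tag instead of a random one so that the result is reproducible.

```python
from typing import Optional


def fallback_tags(language: str) -> list[str]:
    """Return the tag followed by progressively shorter prefixes (RFC 4647 lookup)."""
    subtags = language.lower().split("-")
    tags = []
    while subtags:
        tags.append("-".join(subtags))
        subtags.pop()
        # RFC 4647: a single-character subtag left at the end is dropped as well.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return tags


def to_curie(uri: str, prefixes: dict[str, str]) -> str:
    """Shorten the URI to prefix:local form if a known namespace matches."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri


def pick_label(uri: str,
               pref_labels: list[tuple[str, Optional[str]]],
               language: str,
               prefixes: dict[str, str],
               allow_other_languages: bool = False) -> str:
    """Pick a label for a concept; the first matching rule wins."""
    by_lang: dict[Optional[str], str] = {}
    for text, tag in pref_labels:
        # Keep the first label seen for each (lowercased) tag; None means untagged.
        by_lang.setdefault(tag.lower() if tag else None, text)

    # Exact tag match first, then shorter prefixes such as "en" for "en-US".
    for tag in fallback_tags(language):
        if tag in by_lang:
            return by_lang[tag]
    # Next, a label without any language tag.
    if None in by_lang:
        return by_lang[None]
    # Optionally, a label in any other language (deterministic choice here).
    if allow_other_languages and by_lang:
        return by_lang[sorted(by_lang)[0]]
    # Last resort: an artificial label built from the URI.
    return to_curie(uri, prefixes)
```

For example, with the prefix yso mapped to http://www.yso.fi/onto/yso/ (namespace assumed for illustration), a concept http://www.yso.fi/onto/yso/p12345 that only has a Finnish label would get the label yso:p12345 in an English project unless allow_other_languages is enabled.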

I think this would solve your problem, and also improve multilingual aspects of Annif more generally. What do you think?

macsag commented 2 years ago

Thanks for the quick response!

Your solution looks great to me, but the fourth rule should really remain optional. Maybe it could be a CLI parameter of the loadvoc command?

Anyway, for now I've just added the language tags in the conversion process and it works fine. The problem with vocabularies that were originally stored in MARC21 is that MARC21 has no place for language information at all. Theoretically, our vocabulary (National Library of Poland Descriptors) is unilingual, but there are some librarians with a particular taste for "radical cataloguing" who clandestinely add alternative labels in other languages here and there - no simple way to recognise them automatically. But that's another story.

osma commented 2 years ago

I agree that 4. is a bit dangerous and probably it would be easiest to just skip it.

That problem is very familiar to me - as you say MARC21 isn't very multilingual, with no official place for language tags. There are some workarounds using custom subfields etc. but those are non-standard and rarely used in practice.

NatLibFi / Annif

vocabulary in SKOS (Turtle serialization) should be loaded even in case of lacking language tags #556