NatLibFi / Annif

Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums.
https://annif.org

Training fails (unknown subject label error + python errors) #692

Closed kdw2060 closed 1 year ago

kdw2060 commented 1 year ago

Hi, I'm trying to figure out what I'm doing wrong.

I have created a vocabulary and loaded it. I also have a training set of about 6000 .tsv and .txt files. When I run the command annif train my-model-name /myfoldername, I get a list of messages like:

warning: Unknown subject label "http://openvlacc.cultuurconnect.be/ZizoImages/Kleuter/KZDIE.gif"@nl

ending with:

Traceback (most recent call last):
  File "/Users/kris/annif-venv/bin/annif", line 8, in <module>
    sys.exit(cli())
  File "/Users/kris/annif-venv/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/kris/annif-venv/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/kris/annif-venv/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/kris/annif-venv/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/kris/annif-venv/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/kris/annif-venv/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/kris/annif-venv/lib/python3.10/site-packages/flask/cli.py", line 357, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "/Users/kris/annif-venv/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/kris/annif-venv/lib/python3.10/site-packages/annif/cli.py", line 317, in run_train
    proj.train(documents, backend_params, jobs)
  File "/Users/kris/annif-venv/lib/python3.10/site-packages/annif/project.py", line 225, in train
    self.backend.train(corpus, beparams, jobs)
  File "/Users/kris/annif-venv/lib/python3.10/site-packages/annif/backend/backend.py", line 67, in train
    return self._train(corpus, params=beparams, jobs=jobs)
  File "/Users/kris/annif-venv/lib/python3.10/site-packages/annif/backend/tfidf.py", line 112, in _train
    veccorpus = self.create_vectorizer(subjects)
  File "/Users/kris/annif-venv/lib/python3.10/site-packages/annif/backend/mixins.py", line 68, in create_vectorizer
    veccorpus = self.vectorizer.fit_transform(input)
  File "/Users/kris/annif-venv/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 2121, in fit_transform
    X = super().fit_transform(raw_documents)
  File "/Users/kris/annif-venv/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1377, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary_)
  File "/Users/kris/annif-venv/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1283, in _count_vocab
    raise ValueError(
ValueError: empty vocabulary; perhaps the documents only contain stop words

To help you debug, this is what an item in the .ttl vocabulary looks like:

<http://openvlacc.cultuurconnect.be/ZizoImages/Kleuter/KZDIE.gif> a skos:Concept ;
    skos:notation "KZDIE" ;
    skos:prefLabel "Natuur - Dieren"@nl .

A corresponding .tsv file inside the training set directory looks like this:

http://openvlacc.cultuurconnect.be/ZizoImages/Kleuter/KZDIE.gif Natuur - Dieren KZDIE

The project definition inside projects.cfg looks like this:

[kleuterzizo-tfidf-nl]
name=KleuterZIZO TF-IDF Dutch
language=nl
backend=tfidf
analyzer=snowball(dutch)
limit=100
vocab=kleuterzizo

Last piece of info: the training data resides in a different directory, on a different drive, than the Annif program, projects.cfg and the vocab files.

Thank you for helping me figure out where I made a mistake.

osma commented 1 year ago

Hello @kdw2060,

thank you for trying Annif and opening this issue.

Your SKOS vocabulary looks OK, though a URI that appears to refer to a GIF image is a bit unusual.

The .tsv file on the other hand is lacking the angle brackets needed to distinguish a URI from a label. This makes Annif interpret the first column as a (Dutch language) label, as can be seen from this error message:

warning: Unknown subject label "http://openvlacc.cultuurconnect.be/ZizoImages/Kleuter/KZDIE.gif"@nl

The solution is to use angle brackets within the TSV file, like this:

<http://openvlacc.cultuurconnect.be/ZizoImages/Kleuter/KZDIE.gif>   Natuur - Dieren KZDIE

For more details on the format, see Document corpus formats in the wiki.

(Note that the label and notation are actually optional in this case, as Annif will prefer the URI when it sees one and ignore the other columns, but it may still be good to keep them there for your own use.)
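With about 6000 files, adding the brackets by hand is not practical, so here is a minimal sketch of a script that could do it. It is not part of Annif: the function names bracket_uri and fix_tsv_files are made up for illustration, and it assumes the .tsv files are UTF-8 encoded, use a tab between columns, and have a bare http:// or https:// URI in the first column. It rewrites files in place, so try it on a copy of the training directory first.

```python
from pathlib import Path


def bracket_uri(line: str) -> str:
    """Wrap the first tab-separated column in <...> if it is a bare http(s) URI."""
    body = line.rstrip("\r\n")
    ending = line[len(body):]  # keep the original line ending untouched
    first, sep, rest = body.partition("\t")
    # Lines that already start with "<" (or hold a plain label) are left as-is.
    if first.startswith(("http://", "https://")):
        first = f"<{first}>"
    return first + sep + rest + ending


def fix_tsv_files(corpus_dir: str) -> int:
    """Rewrite every .tsv file in corpus_dir in place; return how many changed."""
    changed = 0
    for path in sorted(Path(corpus_dir).glob("*.tsv")):
        # newline="" preserves whatever line endings the files already use.
        with open(path, encoding="utf-8", newline="") as f:
            original = f.read()
        fixed = "".join(
            bracket_uri(line) for line in original.splitlines(keepends=True)
        )
        if fixed != original:
            with open(path, "w", encoding="utf-8", newline="") as f:
                f.write(fixed)
            changed += 1
    return changed
```

Calling fix_tsv_files("/myfoldername") would process the directory from the original command. Running it a second time should change nothing, because lines whose first column already starts with an angle bracket are skipped.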

If you have further problems, please use the annif-users group to ask instead of this issue tracker, unless you think that you've found a bug in Annif or you want to submit a feature request.

kdw2060 commented 1 year ago

Thank you for the rapid response, Osma. I'll make a script to add the brackets and will post follow-up questions in the users group. The odd URI is because we don't have 'official' URIs for our metadata labels, and I'm also not aware of the standards concerning this.