clarinsi / classla

CLASSLA Fork of the Official Stanford NLP Python Library for Many Human Languages
https://www.clarin.si/info/k-centre/

Problems with nonstandard Slovene processing #38

Closed by TomazErjavec 8 months ago

TomazErjavec commented 1 year ago

Annotating standard Slovene works for me, but now I tried to annotate a pre-tokenised Slovene text with the non-standard model using

pipeline = classla.Pipeline('sl', type='nonstandard',tokenize_pretokenized=True, processors='tokenize,pos,lemma,ner', pos_use_lexicon=True)

and I got the following error message:

Exception: You have to re-download Slovenian models. You can do this by using the following command: classla.download('sl')

First, this message gives wrong advice: I do have the standard Slovenian models, and they work, and if I run classla.download('sl'), it just says that the models exist (and then downloads them anyway). I figured out that I need to run the following, which seemed to work, although the output also seems to indicate that I already have the non-standard models as well:

>>> classla.download('sl', type='nonstandard')
Downloading https://raw.githubusercontent.com/clarinsi/classla-resources/main/resources_1.0.1.json: 10.3kB [00:00, 5.11MB/s]
2022-11-27 12:10:31 INFO: Downloading these customized packages for language: sl (Slovenian)...
===========================
| Processor | Package     |
---------------------------
| tokenize  | nonstandard |
| pos       | nonstandard |
| lemma     | nonstandard |
| depparse  | standard    |
| ner       | nonstandard |
| pretrain  | standard    |
===========================

2022-11-27 12:10:31 INFO: File exists: /home/tomaz/classla_resources/sl/pos/nonstandard.pt.
2022-11-27 12:10:32 INFO: File exists: /home/tomaz/classla_resources/sl/lemma/nonstandard.pt.
2022-11-27 12:10:32 INFO: File exists: /home/tomaz/classla_resources/sl/depparse/standard.pt.
2022-11-27 12:10:32 INFO: File exists: /home/tomaz/classla_resources/sl/ner/nonstandard.pt.
2022-11-27 12:10:33 INFO: File exists: /home/tomaz/classla_resources/sl/pretrain/standard.pt.
2022-11-27 12:10:33 INFO: Finished downloading models and saved to /home/tomaz/classla_resources.

The problem is that if I now run annotation with the non-standard models, I get exactly the same error as before:

python3 anno.py < Master/janes.norm.txt > Master/janes.norm.txt.nstd.tag
2022-11-27 12:10:52 INFO: Loading these models for language: sl (Slovenian):
===========================
| Processor | Package     |
---------------------------
| tokenize  | nonstandard |
| pos       | nonstandard |
| lemma     | nonstandard |
| ner       | nonstandard |
===========================

2022-11-27 12:10:52 INFO: Use device: cpu
2022-11-27 12:10:52 INFO: Loading: tokenize
2022-11-27 12:10:52 INFO: Loading: pos
Traceback (most recent call last):
  File "anno.py", line 15, in <module>
    pipeline = classla.Pipeline('sl', type='nonstandard',tokenize_pretokenized=True, processors='tokenize,pos,lemma,ner', pos_use_lexicon=True)
...
Exception: You have to re-download Slovenian models. You can do this by using the following command: classla.download('sl')

It might be that I am doing something wrong, but I don't know what, and the error message certainly does not help...

lkrsnik commented 1 year ago

As described in the documentation, only the Slovenian standard morphosyntactic tagging model supports the inflectional lexicon. Therefore, if you want to use the non-standard models, you will have to remove the pos_use_lexicon=True parameter.

That said, I agree that the error message is misleading, so it will be updated in the next release.
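
For anyone who hits the same exception, here is a minimal sketch of the workaround described above. The helpers build_pipeline_kwargs and make_pipeline are hypothetical and not part of classla; they only assemble the arguments already used in this thread and refuse the one combination (non-standard models plus pos_use_lexicon=True) that triggers the misleading message. The wording of the ValueError is an illustration, not what the library prints.

```python
def build_pipeline_kwargs(model_type="standard", use_lexicon=False):
    """Assemble keyword arguments for classla.Pipeline('sl', ...).

    Hypothetical helper for illustration; it is not part of classla.
    """
    kwargs = {
        "type": model_type,
        "tokenize_pretokenized": True,
        "processors": "tokenize,pos,lemma,ner",
    }
    if use_lexicon:
        # Per the reply above, only the standard Slovenian tagger
        # supports the inflectional lexicon.
        if model_type != "standard":
            raise ValueError(
                "pos_use_lexicon=True is only supported with the standard "
                "Slovenian models; remove it when type=%r" % model_type
            )
        kwargs["pos_use_lexicon"] = True
    return kwargs


def make_pipeline(model_type="standard", use_lexicon=False):
    """Create the pipeline; needs classla installed and models downloaded."""
    import classla  # imported lazily so the helper above works without it

    return classla.Pipeline("sl", **build_pipeline_kwargs(model_type, use_lexicon))
```

With this, make_pipeline('nonstandard') corresponds to the call from the original report with pos_use_lexicon=True removed, assuming the models were fetched beforehand with classla.download('sl', type='nonstandard').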

TomazErjavec commented 1 year ago

@lkrsnik, thanks for the explanation. All clear now, except, as you note, the error message.