Closed TomazErjavec closed 1 year ago
Since CLASSLA uses the Obeliks tokenizer for tokenization of standard Slovenian, this is an issue that primarily pertains to that tool. I opened a new issue on the Obeliks GitHub and used the above example to illustrate.
I also tested the output of the non-standard tokenizer (the reldi tokenizer) for Slovenian, and it does not seem to have this problem. If you initialize the pipeline with `classla.Pipeline("sl", type="nonstandard", processors="tokenize")`, you will get the following output:
```
>>> nlp = classla.Pipeline("sl", type="nonstandard", processors="tokenize")
2023-07-05 13:03:22 INFO: Loading these models for language: sl (Slovenian):
===========================
| Processor | Package     |
---------------------------
| tokenize  | nonstandard |
===========================
2023-07-05 13:03:22 INFO: Use device: cpu
2023-07-05 13:03:22 INFO: Loading: tokenize
2023-07-05 13:03:22 INFO: Done loading processors!
>>> doc = nlp("Kədar ne mačke doma, so mišə dobre volje.")
>>> print(doc.to_conll())
# newpar id = 1
# sent_id = 1.1
# text = Kədar ne mačke doma, so mišə dobre volje.
1	Kədar	_	_	_	_	_	_	_	_
2	ne	_	_	_	_	_	_	_	_
3	mačke	_	_	_	_	_	_	_	_
4	doma	_	_	_	_	_	_	_	SpaceAfter=No
5	,	_	_	_	_	_	_	_	_
6	so	_	_	_	_	_	_	_	_
7	mišə	_	_	_	_	_	_	_	_
8	dobre	_	_	_	_	_	_	_	_
9	volje	_	_	_	_	_	_	_	SpaceAfter=No
10	.	_	_	_	_	_	_	_	_
```
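As a quick sanity check on the output above, one can look at the FORM column and confirm that no word was split on the schwa, i.e. that "ə" never ends up as a token of its own (a minimal sketch in plain Python, independent of CLASSLA):

```python
# Tokens from the non-standard tokenizer output above (FORM column only)
conll_forms = [
    "Kədar", "ne", "mačke", "doma", ",",
    "so", "mišə", "dobre", "volje", ".",
]

# If the tokenizer had split on schwa, "ə" would appear as a standalone token;
# here we check that every schwa still sits inside a larger token.
schwa_intact = all(tok != "ə" for tok in conll_forms)
schwa_tokens = [tok for tok in conll_forms if "ə" in tok]

print(schwa_intact)   # True
print(schwa_tokens)   # ['Kədar', 'mišə']
```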
Thanks, but I don't want to use the non-standard tokenizer, because it splits tokens and sentences quite differently from the standard one, and (as vouchsafed by @nljubesi) it is suboptimal to use non-standard tokenisation and segmentation on standard text, which my example, despite the schwa, mostly is. So, reopening.
Ah, ok, I only now see your comment about opening a separate issue. So, closing this again, sorry.
I'm working on a new version of a corpus where the transcription includes the schwa character, i.e. "ə". In the previous run (December 2021) tokenisation of words including this character worked OK, but with the current version of CLASSLA as installed on new-tantra (I think & hope the latest!), tokens are split on schwa. E.g. if the input text is
the output, using
`pipeline = classla.Pipeline('sl', processors='tokenize')`
is:

This is strange behaviour, as schwa is classified as "Lowercase Letter".
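The Unicode classification can be verified directly with Python's standard-library `unicodedata` module: the schwa (U+0259) does carry the general category `Ll` (Lowercase Letter), so a tokenizer should treat it like any other letter:

```python
import unicodedata

schwa = "ə"  # U+0259
print(hex(ord(schwa)))              # 0x259
print(unicodedata.name(schwa))      # LATIN SMALL LETTER SCHWA
print(unicodedata.category(schwa))  # Ll, i.e. Lowercase Letter
```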