clarinsi / classla

CLASSLA Fork of the Official Stanford NLP Python Library for Many Human Languages
https://www.clarin.si/info/k-centre/

Tokenisation problems with schwa #39

Closed TomazErjavec closed 1 year ago

TomazErjavec commented 1 year ago

I'm working on a new version of a corpus where the transcription includes the schwa character, i.e. "ə". In the previous run (December 2021) tokenisation of words containing this character worked fine, but with the current version of CLASSLA as installed on new-tantra (I think & hope the latest!), tokens are split on the schwa. E.g., if the input text is

Kədar ne mačke doma, so mišə dobre volje.

the output, using pipeline = classla.Pipeline('sl', processors='tokenize'), is:

2023-06-29 13:12:29 INFO: Loading these models for language: sl (Slovenian):
========================
| Processor | Package  |
------------------------
| tokenize  | standard |
========================

2023-06-29 13:12:29 INFO: Use device: cpu
2023-06-29 13:12:29 INFO: Loading: tokenize
2023-06-29 13:12:29 INFO: Done loading processors!
# newpar id = 1
# sent_id = 1.1
# text = Kədar ne mačke doma, so mišə dobre volje.
1       K       _       _       _       _       _       _       _       SpaceAfter=No
2       ə       _       _       _       _       _       _       _       SpaceAfter=No
3       dar     _       _       _       _       _       _       _       _
4       ne      _       _       _       _       _       _       _       _
5       mačke   _       _       _       _       _       _       _       _
6       doma    _       _       _       _       _       _       _       SpaceAfter=No
7       ,       _       _       _       _       _       _       _       _
8       so      _       _       _       _       _       _       _       _
9       miš     _       _       _       _       _       _       _       SpaceAfter=No
10      ə       _       _       _       _       _       _       _       _
11      dobre   _       _       _       _       _       _       _       _
12      volje   _       _       _       _       _       _       _       SpaceAfter=No
13      .       _       _       _       _       _       _       _       _
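
For reference, a minimal self-contained script that reproduces the above (assuming the standard Slovenian models are already downloaded; the download call is commented out for that reason):

import classla

# classla.download('sl')  # uncomment on first use to fetch the models
nlp = classla.Pipeline('sl', processors='tokenize')
doc = nlp('Kədar ne mačke doma, so mišə dobre volje.')
print(doc.to_conll())  # "Kədar" and "mišə" come out split at the schwa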

This is strange behaviour, as schwa (U+0259) is classified in Unicode as a Lowercase Letter (category Ll).
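
The classification is easy to verify in Python, so a tokenizer keyed on Unicode letter categories should not split here:

import unicodedata

ch = '\u0259'  # ə
print(unicodedata.name(ch))      # LATIN SMALL LETTER SCHWA
print(unicodedata.category(ch))  # Ll, i.e. Letter, lowercase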

lukatercon commented 1 year ago

Since CLASSLA uses the Obeliks tokenizer for standard Slovenian, this issue primarily pertains to that tool. I have opened a new issue on the Obeliks GitHub, using the above example to illustrate the problem.

I also tested the nonstandard (reldi) tokenizer for Slovenian, and it does not seem to have this problem. If you initialize the pipeline with classla.Pipeline("sl", type="nonstandard", processors="tokenize"), you get the following output:

>>> nlp = classla.Pipeline("sl", type="nonstandard", processors="tokenize")
2023-07-05 13:03:22 INFO: Loading these models for language: sl (Slovenian):
===========================
| Processor | Package     |
---------------------------
| tokenize  | nonstandard |
===========================

2023-07-05 13:03:22 INFO: Use device: cpu
2023-07-05 13:03:22 INFO: Loading: tokenize
2023-07-05 13:03:22 INFO: Done loading processors!
>>> doc = nlp("Kədar ne mačke doma, so mišə dobre volje.")
>>> print(doc.to_conll())
# newpar id = 1
# sent_id = 1.1
# text = Kədar ne mačke doma, so mišə dobre volje.
1   Kədar   _   _   _   _   _   _   _   _
2   ne  _   _   _   _   _   _   _   _
3   mačke   _   _   _   _   _   _   _   _
4   doma    _   _   _   _   _   _   _   SpaceAfter=No
5   ,   _   _   _   _   _   _   _   _
6   so  _   _   _   _   _   _   _   _
7   mišə    _   _   _   _   _   _   _   _
8   dobre   _   _   _   _   _   _   _   _
9   volje   _   _   _   _   _   _   _   SpaceAfter=No
10  .   _   _   _   _   _   _   _   _

TomazErjavec commented 1 year ago

Thanks, but I don't want to use the non-standard tokenizer, because it splits tokens and sentences quite differently from the standard one, and (as confirmed by @nljubesi) it is suboptimal to use non-standard tokenisation and segmentation on standard text, which my example, despite the schwa, mostly is. So, reopening.
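
Until the Obeliks issue is fixed upstream, one possible interim workaround (a sketch, not an official classla feature; the helper name and the placeholder choice are hypothetical) is to hide the schwa behind a single placeholder letter before running the standard pipeline, then restore it in the output:

import classla

PLACEHOLDER = 'q'  # assumed not to occur in the input; adjust for your data

def tokenize_with_schwa(nlp, text):
    # Swap each "ə" for a same-length placeholder so Obeliks sees an
    # ordinary letter, tokenize, then restore the schwa in the CoNLL-U
    # output. Same-length substitution keeps token boundaries and
    # SpaceAfter annotations intact.
    assert PLACEHOLDER not in text, 'placeholder clashes with input text'
    doc = nlp(text.replace('ə', PLACEHOLDER))
    return doc.to_conll().replace(PLACEHOLDER, 'ə')

nlp = classla.Pipeline('sl', processors='tokenize')
print(tokenize_with_schwa(nlp, 'Kədar ne mačke doma, so mišə dobre volje.'))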

TomazErjavec commented 1 year ago

Ah, OK, I only now saw your comment about opening a separate issue. So, closing this again, sorry.