clarinsi / obeliks

Sentence splitting & tokenization.

Tokenization of extended IPA characters #2

Open lukatercon opened 1 year ago

lukatercon commented 1 year ago

An issue was raised on the CLASSLA-Stanza GitHub page about the tokenization of the schwa character "ə". It seems that Obeliks splits tokens on the schwa character.

For instance, with the following input string

Kədar ne mačke doma, so mišə dobre volje.

Obeliks produces the following output:

luka:~$ obeliks "Kədar ne mačke doma, so mišə dobre volje."
1.1.1.1-1   K
1.1.2.2-2   ə
1.1.3.3-5   dar
1.1.4.7-8   ne
1.1.5.10-14 mačke
1.1.6.16-19 doma
1.1.7.20-20 ,
1.1.8.22-23 so
1.1.9.25-27 miš
1.1.10.28-28    ə
1.1.11.30-34    dobre
1.1.12.36-40    volje
1.1.13.41-41    .

I looked inside the code and it seems Obeliks treats all extended IPA characters as punctuation, which is why they get tokenized separately.

These characters fall within the Unicode "Lowercase Letter" category, so there is probably no reason for them to split tokens?
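This is easy to verify with Python's standard unicodedata module; a quick check, independent of Obeliks:

import unicodedata

# U+0259 LATIN SMALL LETTER SCHWA
print(unicodedata.name("ə"))       # LATIN SMALL LETTER SCHWA
print(unicodedata.category("ə"))   # 'Ll' (Letter, lowercase), not a punctuation category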

TomazErjavec commented 1 year ago

As a matter of interest: who will fix this? Because it just might stay open forever....

nljubesi commented 1 year ago

Once @simonkrek assures @lukatercon that the proposed modification is OK, @lukatercon will take care of it. This is currently possible because @lukatercon is working on the classla-stanza package; in the long term, someone else should be available for resolving Obeliks issues.

simonkrek commented 1 year ago

Yeah, I'm not against proclaiming schwa a letter within the scope of the "standard" alphabet. In the end, it usually does represent a phoneme, as opposed to other symbols. We just have to see where to put it in the code, I guess together with @lukatercon?

lukatercon commented 1 year ago

From what I can gather, two minor changes would have to be made in the TokRulesPart1 and TokRulesPart3 files, where the Unicode character codes for punctuation characters are listed. I did a quick local test by removing the character code for the schwa, and it seems to fix the token-splitting behavior: the schwa is no longer treated as punctuation.

luka:~$ obeliks "Kədar ne mačke doma, so mišə dobre volje."
1.1.1.1-5   Kədar
1.1.2.7-8   ne
1.1.3.10-14 mačke
1.1.4.16-19 doma
1.1.5.20-20 ,
1.1.6.22-23 so
1.1.7.25-28 mišə
1.1.8.30-34 dobre
1.1.9.36-40 volje
1.1.10.41-41    .

luka:~$ obeliks "Kədar ne mačke doma, so mišə dobre volje." -c
# newpar id = 1
# sent_id = 1.1
# text = Kədar ne mačke doma, so mišə dobre volje.
1   Kədar   _   _   _   _   _   _   _   _
2   ne  _   _   _   _   _   _   _   _
3   mačke   _   _   _   _   _   _   _   _
4   doma    _   _   _   _   _   _   _   SpaceAfter=No
5   ,   ,   PUNCT   Z   _   _   _   _   _
6   so  _   _   _   _   _   _   _   _
7   mišə    _   _   _   _   _   _   _   _
8   dobre   _   _   _   _   _   _   _   _
9   volje   _   _   _   _   _   _   _   SpaceAfter=No
10  .   .   PUNCT   Z   _   _   _   _   _

I am wondering whether something similar should be done with the other IPA extensions (i.e. this block)? They should of course not be considered part of the standard alphabet, but they should really not be treated as punctuation characters when they appear within a word.
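For reference, the whole IPA Extensions block can be checked the same way with unicodedata; a quick sketch, not Obeliks code:

import unicodedata

# IPA Extensions block: U+0250 through U+02AF
categories = {unicodedata.category(chr(cp)) for cp in range(0x0250, 0x02B0)}
print(categories)  # {'Ll'}: every character in the block is a lowercase letter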

TomazErjavec commented 1 year ago

Thanks @lukatercon for investigating this. Personally, I think listing characters or character blocks is wrong, because there are too many Unicode characters, while blocks (I think) can contain a mixture of letter / punctuation characters.

Would it not be possible to use Character categories instead? With those you know whether something is a letter, punctuation or space character, which are the important bits.

But if this is for some reason impossible then, yes, the IPA extensions are all letters.

lukatercon commented 1 year ago

> Thanks @lukatercon for investigating this. Personally, I think listing characters or character blocks is wrong, because there are too many Unicode characters, while blocks (I think) can contain a mixture of letter / punctuation characters.

> Would it not be possible to use Character categories instead? With those you know whether something is a letter, punctuation or space character, which are the important bits.

> But if this is for some reason impossible then, yes, the IPA extensions are all letters.

This is possible as well, although it will require a bit more work, since character categories are not simple ranges of Unicode characters the way blocks are (for instance, the IPA Extensions block is simply U+0250–U+02AF). But I can obtain the list of lowercase letter character codes from here and then exclude those from being handled as punctuation in the Obeliks code.
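A sketch of how such a list could be generated with the standard unicodedata module; the actual format of the rules files is not shown here, so the range output is purely illustrative:

import sys
import unicodedata

# All codepoints whose general category is Ll (Lowercase Letter)
lowercase = [cp for cp in range(sys.maxunicode + 1)
             if unicodedata.category(chr(cp)) == "Ll"]

# Compress the flat list into contiguous codepoint ranges
ranges = []
start = prev = lowercase[0]
for cp in lowercase[1:]:
    if cp != prev + 1:
        ranges.append((start, prev))
        start = cp
    prev = cp
ranges.append((start, prev))

print(f"{len(lowercase)} codepoints in {len(ranges)} ranges")
print(["U+%04X-U+%04X" % r for r in ranges[:3]])  # first few ranges, for inspection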

If @simonkrek agrees with this plan, I will make the necessary changes.

simonkrek commented 1 year ago

Basically, we have three categories: letters/digits, symbols and punctuation. So far, the second category has been missing from the conversation. I'm not sure we can resolve this via GitHub issues; for me, a meeting would be preferable.

TomazErjavec commented 1 year ago

> This is possible as well, although it will require a bit more work, since character categories are not simple ranges of Unicode characters the way blocks are

I might be missing something, but the whole idea was to make it simpler, not more complicated, by using Character categories directly and not messing around with individual characters or ranges. All regex-aware languages that I am aware of let you use Character categories directly, e.g. you write /^\p{L}+$/ for a string consisting of letters only, or /^[\p{L}\p{N}]+$/ for an alphanumeric string, etc.
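For example, in Python this works with the third-party regex package (the standard re module does not support \p{...} classes); a minimal sketch:

import regex  # third-party 'regex' package, a drop-in superset of the stdlib 're'

print(bool(regex.fullmatch(r"\p{L}+", "mišə")))          # True: letters only, schwa included
print(bool(regex.fullmatch(r"[\p{L}\p{N}]+", "abc123"))) # True: alphanumeric string
print(bool(regex.fullmatch(r"\p{L}+", "miš,")))          # False: ',' is punctuation (Po)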

> But I can obtain the list of lowercase letter character codes from here and then exclude those from being handled as punctuation in the Obeliks code

Hm, why would a letter ever be considered punctuation?

TomazErjavec commented 1 year ago

Ah, crossing comments. So:

> Basically, we have three categories: letters/digits, symbols and punctuation.

Don't forget spacing characters, you also need those, of course. And all these classes are covered by the Unicode character categories: L, N, S, P (plus Z for separators). There might be some fiddling if the Unicode notion of symbols does not correspond to the Obeliks idea of symbols, and some punctuation is tricky, as it might not split a word, but apart from that it seems rather straightforward. But I didn't look at the code.
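A toy sketch of tokenizing purely by these categories (hypothetical helper names; this makes no claim to match the actual Obeliks rules):

import unicodedata

def coarse_class(ch):
    # Map a character to the coarse classes discussed above.
    cat = unicodedata.category(ch)
    if cat[0] in "LN":
        return "word"    # letters and digits stick together
    if cat[0] == "Z" or ch.isspace():
        return "space"   # spacing characters always split
    if cat[0] == "P":
        return "punct"
    return "symbol"      # S* and everything else

def toy_tokenize(text):
    tokens, cur, cur_cls = [], "", None
    for ch in text:
        cls = coarse_class(ch)
        if cls == "space":
            if cur:
                tokens.append(cur)
            cur, cur_cls = "", None
        elif cls == "word" and cur_cls == "word":
            cur += ch            # extend the current word token
        else:
            if cur:
                tokens.append(cur)
            cur, cur_cls = ch, cls  # punctuation/symbols start a fresh token
    if cur:
        tokens.append(cur)
    return tokens

print(toy_tokenize("Kədar ne mačke doma, so mišə dobre volje."))
# ['Kədar', 'ne', 'mačke', 'doma', ',', 'so', 'mišə', 'dobre', 'volje', '.']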

simonkrek commented 1 year ago

Well, yes - the idea was to have the possibility of looking at a string of characters and knowing exactly how it would be split and tokenized. Spacing characters always split the string. We can revisit tokenization, but: (1) I wouldn't change the basics of the system, and (2) I wouldn't complicate the system even more.

lukatercon commented 1 year ago

> All regex-aware languages that I am aware of let you use Character categories directly, e.g. you write /^\p{L}+$/ for a string consisting of letters only, or /^[\p{L}\p{N}]+$/ for an alphanumeric string, etc.

I was not aware of this possibility. That simplifies the implementation quite a bit.

> Hm, why would a letter ever be considered punctuation?

I am not sure, but that is how the IPA Extensions block is currently defined in Obeliks. I agree that a meeting would be best to work everything out.