lukatercon opened 1 year ago
As a matter of interest: who will fix this? Because it just might stay open forever....
Once @simonkrek assures @lukatercon that the proposed modification is OK, @lukatercon will take care of it. This is currently possible because @lukatercon is working on the classla-stanza package; in the long term, someone else should be available for resolving obeliks issues.
Yeah, I'm not against proclaiming the schwa a letter within the scope of the "standard" alphabet. In the end it usually does represent a phoneme, as opposed to other symbols. We just have to see where to put it in the code, I guess together with @lukatercon?
From what I can gather, two minor changes would have to be implemented in the TokRulesPart1 and TokRulesPart3 files, where the Unicode character codes for punctuation characters are listed. I did a quick test locally by removing the character code for the schwa, and it seems to have fixed the token-splitting behavior: the schwa is no longer treated as punctuation.
luka:~$ obeliks "Kədar ne mačke doma, so mišə dobre volje."
1.1.1.1-5 Kədar
1.1.2.7-8 ne
1.1.3.10-14 mačke
1.1.4.16-19 doma
1.1.5.20-20 ,
1.1.6.22-23 so
1.1.7.25-28 mišə
1.1.8.30-34 dobre
1.1.9.36-40 volje
1.1.10.41-41 .
luka:~$ obeliks "Kədar ne mačke doma, so mišə dobre volje." -c
# newpar id = 1
# sent_id = 1.1
# text = Kədar ne mačke doma, so mišə dobre volje.
1 Kədar _ _ _ _ _ _ _ _
2 ne _ _ _ _ _ _ _ _
3 mačke _ _ _ _ _ _ _ _
4 doma _ _ _ _ _ _ _ SpaceAfter=No
5 , , PUNCT Z _ _ _ _ _
6 so _ _ _ _ _ _ _ _
7 mišə _ _ _ _ _ _ _ _
8 dobre _ _ _ _ _ _ _ _
9 volje _ _ _ _ _ _ _ SpaceAfter=No
10 . . PUNCT Z _ _ _ _ _
I am wondering whether something similar should be done with the other IPA extensions (i.e. this block)? Of course, not to consider them as part of the standard alphabet, but they should really not be treated as punctuation characters when they appear within a word.
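As a quick sanity check (not part of the Obeliks code), the claim that the IPA Extensions block contains only letters can be verified with Python's standard `unicodedata` module; the block spans U+0250 through U+02AF:

```python
import unicodedata

# Collect the first letter of the Unicode general category for every
# code point in the IPA Extensions block (U+0250..U+02AF).
# 'L' means letter; 'P' would mean punctuation.
categories = {unicodedata.category(chr(cp))[0] for cp in range(0x0250, 0x02B0)}
print(categories)  # {'L'} – the whole block consists of letters
```

So none of these characters should ever match a punctuation rule.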
Thanks @lukatercon for investigating this. Personally, I think listing characters or character blocks is wrong, because there are too many Unicode characters, while blocks (I think) can contain a mixture of letter and punctuation characters.
Would it not be possible to use character categories instead? There you know whether something is a letter, punctuation or space character, which are the important bits.
But if this is for some reason impossible then, yes, the IPA Extensions are all letters.
This is possible as well, although it will require a bit more work, since character categories are not simple ranges of Unicode characters like blocks are (for instance the IPA block is simply U+0250–U+02AF). But I can obtain the list of lowercase letter character codes from here and then exclude those from being handled as punctuation in the Obeliks code.
If @simonkrek agrees with this plan, I will make the necessary changes.
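The plan above could be sketched roughly as follows; this is a hypothetical illustration, not the actual Obeliks change, and the `lowercase` list is an assumed name:

```python
import sys
import unicodedata

# Enumerate every code point whose general category is Ll
# ("Letter, lowercase"), i.e. the set that should be excluded from
# being handled as punctuation.
lowercase = [cp for cp in range(sys.maxunicode + 1)
             if unicodedata.category(chr(cp)) == 'Ll']

print(len(lowercase))          # a couple of thousand, depending on Unicode version
print(0x0259 in lowercase)     # True – the schwa is among them
```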
Basically, we have three categories: letters/digits, symbols and punctuation. So far, the second category is missing from the conversation. I'm not sure we can resolve this via GitHub issues. For me, a meeting would be preferable.
This is possible as well, although it will require a bit more work, since character categories are not simple ranges of Unicode characters like blocks are.
I might be missing something, but the whole idea was to make it simpler, not more complicated, by using character categories directly, rather than messing around with individual characters or ranges. All regex-aware languages that I am aware of let you use character categories directly, e.g. you write /^\p{L}+$/
for a string consisting of letters only, or /^[\p{L}\p{N}]+$/
for an alphanumeric string, etc.
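For reference, Python's standard `re` module does not support `\p{…}` (the third-party `regex` module does), but the same checks can be expressed with `unicodedata`; the function names below are illustrative only:

```python
import unicodedata

def matches_letters_only(s):
    # Equivalent of /^\p{L}+$/: every category starts with 'L'
    # (Lu, Ll, Lt, Lm, Lo).
    return bool(s) and all(unicodedata.category(c).startswith('L') for c in s)

def matches_alphanumeric(s):
    # Equivalent of /^[\p{L}\p{N}]+$/: letters or numbers only.
    return bool(s) and all(unicodedata.category(c)[0] in 'LN' for c in s)

print(matches_letters_only('mišə'))    # True – the schwa is a letter
print(matches_letters_only('volje.'))  # False – '.' is punctuation
print(matches_alphanumeric('abc123'))  # True
```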
But I can obtain the list of lowercase letter character codes from here and then exclude those from being handled as punctuation in the Obeliks code
Hm, why would a letter ever be considered as punctuation?
Ah, crossing comments. So:
Basically, we have three categories: letters/digits, symbols and punctuation.
Don't forget spacing characters, you also need those of course. And all these classes are covered by Unicode character categories: L, N, S, P (plus Z for spaces). There might be some fiddling if you don't consider Unicode symbols to correspond to the Obeliks idea of symbols, and some punctuation is tricky, as it might not split a word, but apart from that it seems rather straightforward. But I didn't look at the code.
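The classes under discussion can be read off the first letter of each character's Unicode general category; a small sketch with a few sample characters:

```python
import unicodedata

# L = letter, N = number, S = symbol, P = punctuation, Z = separator/space.
# Note '-' is Pd, an example of punctuation that need not split a word.
for ch in ['ə', '7', '€', ',', '-', ' ']:
    print(repr(ch), unicodedata.category(ch))
# 'ə' Ll, '7' Nd, '€' Sc, ',' Po, '-' Pd, ' ' Zs
```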
Well, yes - the idea was to have the possibility to look at the string of characters and know exactly how it would be split and tokenized. Spacing characters always split the string. We can revisit tokenization but: (1) I wouldn't change the basics of the system, and (2) I wouldn't complicate the system even more.
All regex-aware languages that I am aware of let you use character categories directly, e.g. you write
/^\p{L}+$/
for a string consisting of letters only, or
/^[\p{L}\p{N}]+$/
for an alphanumeric string, etc.
I was not aware of this possibility. Then this simplifies the implementation quite a bit.
Hm, why would a letter ever be considered as punctuation?
I am not sure, but that is how the IPA extensions block is currently defined in Obeliks. I agree that a meeting would be best to work everything out.
An issue was raised on the CLASSLA-Stanza GitHub page about the tokenization of the schwa character "ə". It seems that Obeliks splits tokens on the schwa character.
For instance, with the following input string, Obeliks produces the following output:
I looked inside the code and it seems Obeliks treats all extended IPA characters as punctuation, which is why they get tokenized separately.
These characters fall within the "Lowercase Letter" category, so there is probably no reason for them to be splitting tokens?
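The category claim is easy to confirm with Python's stdlib (a quick check, independent of the Obeliks code):

```python
import unicodedata

# The schwa, U+0259, is a lowercase letter, not punctuation.
print(unicodedata.name('ə'))      # LATIN SMALL LETTER SCHWA
print(unicodedata.category('ə'))  # Ll ("Letter, lowercase")
```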