Closed ChameleonRed closed 10 months ago
Hi!
Here’s what I get:
import pyphen
dic = pyphen.Pyphen(lang='pl_PL')
dic.inserted('Maryśce')
# 'Ma-ry-ś-ce'
There’s no extra hyphen at the end.
This generates an error, since 4,5 = 'ś' and that is not a valid syllable.
That’s something you should talk about with the dictionary creators, they know much more about Polish hyphenation than I do 😁.
If you have some "copy-pastable" code that shows the extra hyphen error, we can try to fix the bug (your script doesn’t really work). Otherwise, we can probably close this issue.
Here is code that generates these errors. Note that both libraries generate the same "errors".
Strictly speaking they are not errors: in Polish the syllable cut is ambiguous, so Ma-ry-śce is good and Ma-ryś-ce is good. However, you cannot apply both cuts and generate Ma-ry-ś-ce, since that makes no sense: 'ś' has no vowel.
Can you show which patterns are used for this? The suffix 'śce' is very common, so why is it not matched, given that it is longer than 'ce'? And what matches 'ś', since that is surely not a pattern? Can you explain this? I do not know the details of the Liang algorithm you are using, but I think something is wrong here, since pieces without a vowel are common (I found more such examples).
import hyphen
import pyphen

words = [
    'postanowiliśmy',
    'kurlandzki',
    'Maryśce'
]

for word in words:
    print('PyHypen')
    h = hyphen.Hyphenator('pl_PL', 0, 0, 0, 0)
    print(f"{h.pairs(word)}=")
    print(f"{h.syllables(word)}=")
    print()
    print('pyphen')
    p = pyphen.Pyphen(lang='pl_PL', left=0, right=0)
    print(f"{list(p.iterate(word))}=")
    print(f"{list(p.inserted(word).split('-'))}=")
    print()
I also found that I cannot specify 'COMPOUNDLEFTHYPHENMIN' and 'COMPOUNDRIGHTHYPHENMIN', but I can specify 'LEFTHYPHENMIN' and 'RIGHTHYPHENMIN'. It is probably not related; I do not know yet.
The main pattern-searching code looks like this, so I can study how it works later:
def positions(self, word):
    """Get a list of positions where the word can be hyphenated.

    :param word: unicode string of the word to hyphenate

    E.g. for the dutch word 'lettergrepen' this method returns ``[3, 6,
    9]``.

    Each position is a ``DataInt`` with a data attribute.

    If the data attribute is not ``None``, it contains a tuple with
    information about nonstandard hyphenation at that point: ``(change,
    index, cut)``.

    change
      a string like ``'ff=f'``, that describes how hyphenation should
      take place.

    index
      where to substitute the change, counting from the current point

    cut
      how many characters to remove while substituting the nonstandard
      hyphenation

    """
    word = word.lower()
    points = self.cache.get(word)
    if points is None:
        pointed_word = '.%s.' % word
        references = [0] * (len(pointed_word) + 1)

        for i in range(len(pointed_word) - 1):
            for j in range(
                    i + 1, min(i + self.maxlen, len(pointed_word)) + 1):
                pattern = self.patterns.get(pointed_word[i:j])
                if pattern:
                    offset, values = pattern
                    slice_ = slice(i + offset, i + offset + len(values))
                    references[slice_] = map(
                        max, values, references[slice_])

        points = [
            DataInt(i - 1, reference=reference)
            for i, reference in enumerate(references) if reference % 2]
        self.cache[word] = points
    return points
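For readers who do not know Liang's method, the following is a minimal standalone sketch of the same idea: every pattern that occurs in the dotted word contributes digits to the gaps between letters, the maximum digit wins at each gap, and a gap is a break point when the winning digit is odd. The three toy patterns below are hypothetical and are not taken from the pl_PL dictionary; they only illustrate how independent short patterns can each mark a gap and leave a lone 'ś', and how a longer pattern with an even digit would suppress one of the cuts.

```python
import re


def parse_pattern(pattern):
    """Split a Liang pattern such as 'y1ś' into ('yś', (0, 1, 0))."""
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    pos = 0  # number of letters seen so far = index of the current gap
    for ch in pattern:
        if ch.isdigit():
            values[pos] = int(ch)
        else:
            pos += 1
    return letters, tuple(values)


def hyphen_positions(word, patterns):
    """Return the word indices where a break is allowed (odd winning digit)."""
    table = dict(parse_pattern(p) for p in patterns)
    pointed = "." + word.lower() + "."
    # scores[n] is the value of the gap just before pointed[n]
    scores = [0] * (len(pointed) + 1)
    for i in range(len(pointed)):
        for j in range(i + 1, len(pointed) + 1):
            values = table.get(pointed[i:j])
            if values:
                for k, value in enumerate(values):
                    scores[i + k] = max(scores[i + k], value)
    # gap n sits after (n - 1) letters of the word; skip the word edges
    return [n - 1 for n, s in enumerate(scores)
            if s % 2 and 0 < n - 1 < len(word)]


def insert_hyphens(word, positions, hyphen="-"):
    """Insert hyphens from right to left so earlier indices stay valid."""
    for p in reversed(positions):
        word = word[:p] + hyphen + word[p:]
    return word


# Hypothetical toy patterns, NOT the real pl_PL patterns.
TOY_PATTERNS = ["a1r", "y1ś", "ś1c"]

cuts = hyphen_positions("Maryśce", TOY_PATTERNS)
print(insert_hyphens("Maryśce", cuts))  # Ma-ry-ś-ce

# A longer pattern with an even digit between 'ś' and 'c' inhibits that cut.
cuts = hyphen_positions("Maryśce", TOY_PATTERNS + ["y1ś2c"])
print(insert_hyphens("Maryśce", cuts))  # Ma-ry-śce
```

In this model nothing requires a piece to contain a vowel: each gap is decided on its own, so two adjacent odd gaps are possible whenever no pattern inhibits one of them.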
Here is code that generates these errors. Note that both libraries generate the same "errors".
Then you can talk about this with the dictionary creators. Pyphen just includes this dictionary, but we’re not responsible for its content.
I also found that I cannot specify 'COMPOUNDLEFTHYPHENMIN' and 'COMPOUNDRIGHTHYPHENMIN', but I can specify 'LEFTHYPHENMIN' and 'RIGHTHYPHENMIN'. It is probably not related; I do not know yet.
These values are not used by Pyphen, but you can override them using the left and right parameters:
It is not a problem with the dictionary but with the algorithm. In about 10,000 tests I found problems only in old Polish, where slightly different letters are used for 'j', such as 'y'. Why is it a problem with the algorithm? You use a hyphenation dictionary, so ambiguous cuts are correct, and it is meant for breaking a word at the end of a line. You cannot use it for syllables, because a syllable needs a vowel. So the cuts in 'Ma-ry-ś-ce' are good, but 'ś' is not a syllable.
Another problem is that left=0 or right=0 probably does not work; I suspect so but have not tested it. It generates pieces that contain multiple syllables. The hyphenation patterns probably assume some minimum length, but that is speculation. For example 'U-kra-iny': the cuts are good, but the syllables are 'U-kra-i-ny'. All vowels in this word are full vowels, with no semivowels as in old Polish.
So a syllable algorithm is not a hyphenation algorithm. In English it can be different, but in English (and similar languages) what you write differs from what you say; in Polish they are almost the same, with some rare exceptions.
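The "a syllable needs a vowel" rule can be checked mechanically on any hyphenated string. The sketch below is only an illustration: the helper name `vowelless_pieces` and the vowel set are assumptions of this example and not part of Pyphen or PyHyphen. It deliberately does not try to count vowels per piece, because in Polish an 'i' before another vowel (as in 'wie' or 'dzie') usually marks softening and is not a separate syllable.

```python
# Polish vowel letters; this set is an assumption of the sketch.
POLISH_VOWELS = frozenset("aąeęioóuy")


def vowelless_pieces(hyphenated, sep="-"):
    """Return the pieces of a hyphenated word that contain no vowel letter."""
    return [piece for piece in hyphenated.split(sep)
            if piece and POLISH_VOWELS.isdisjoint(piece.lower())]


print(vowelless_pieces("Ma-ry-ś-ce"))   # ['ś']
print(vowelless_pieces("Ma-ry-śce"))    # []
```

A non-empty result means the break points may be fine for line breaking but cannot be read as a syllable split.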
So a syllable algorithm is not a hyphenation algorithm.
As far as I know, Hunspell dictionaries (and thus Pyphen) give positions where hyphenation is possible. It gives 'Ma-ry-ś-ce', and according to what you say that’s accurate. The fact that 'ś' is not a syllable is a problem, but it’s not Pyphen’s problem, is it?
If it is only hyphenation, that is OK, but there is still a bug ...
generates: 'Ma-ry-ś-ce-'
should: 'Ma-ry-ś-ce'
It doesn’t for me:
import pyphen
dic = pyphen.Pyphen(lang='pl_PL')
dic.inserted('Maryśce')
# 'Ma-ry-ś-ce'
@ChameleonRed Did you find why you get an extra hyphen? Maybe there’s an invisible character (such as zero-width space) at the end of your string.
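One way to rule an invisible character in or out is to list every character of the string whose Unicode category is a control, format or separator class. This is a generic sketch using only the standard library; the helper name `invisible_characters` is an assumption of this example.

```python
import unicodedata


def invisible_characters(text):
    """Return (index, code point, name) for control, format and separator characters."""
    hits = []
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        # C* covers control/format characters (zero-width space is Cf),
        # Z* covers spaces and line/paragraph separators.
        if category[0] in "CZ":
            name = unicodedata.name(ch, "UNNAMED")
            hits.append((index, f"U+{ord(ch):04X}", name))
    return hits


print(invisible_characters("Maryśce"))        # []
print(invisible_characters("Maryśce\u200b"))  # [(7, 'U+200B', 'ZERO WIDTH SPACE')]
```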
There is no invisible character, since I typed the word by hand.
See comparison:
PyHypen
[['po', 'stanowiliśmy'], ['posta', 'nowiliśmy'], ['postano', 'wiliśmy'], ['postanowi', 'liśmy'], ['postanowili', 'śmy']]=
['po', 'sta', 'no', 'wi', 'li', 'śmy']=
pyphen
[('postanowiliśmy', ''), ('postanowili', 'śmy'), ('postanowi', 'liśmy'), ('postano', 'wiliśmy'), ('posta', 'nowiliśmy'), ('po', 'stanowiliśmy')]=
['po', 'sta', 'no', 'wi', 'li', 'śmy', '']=
PyHypen
[['kur', 'landzki'], ['kurlandz', 'ki']]=
['kur', 'landz', 'ki']=
pyphen
[('kurlandzki', ''), ('kurlandz', 'ki'), ('kur', 'landzki')]=
['kur', 'landz', 'ki', '']=
PyHypen
[['Ma', 'ryśce'], ['Mary', 'śce'], ['Maryś', 'ce']]=
['Ma', 'ry', 'ś', 'ce']=
pyphen
[('Maryśce', ''), ('Maryś', 'ce'), ('Mary', 'śce'), ('Ma', 'ryśce')]=
['Ma', 'ry', 'ś', 'ce', '']=
PyHypen
[['Ukra', 'iny']]=
['Ukra', 'iny']=
pyphen
[('Ukrainy', ''), ('Ukra', 'iny'), ('U', 'krainy')]=
['U', 'kra', 'iny', '']=
PyHypen
[['pi', 'sma']]=
['pi', 'sma']=
pyphen
[('pisma', ''), ('pi', 'sma')]=
['pi', 'sma', '']=
Code for debug:
import hyphen
import pyphen

words = [
    'postanowiliśmy',
    'kurlandzki',
    'Maryśce',
    'Ukrainy',
    'pisma',
]

for word in words:
    print('PyHypen')
    h = hyphen.Hyphenator('pl_PL', 0, 0, 0, 0)
    print(f"{h.pairs(word)}=")
    print(f"{h.syllables(word)}=")
    print()
    print('pyphen')
    p = pyphen.Pyphen(lang='pl_PL', left=0, right=0)
    print(f"{list(p.iterate(word))}=")
    print(f"{list(p.inserted(word).split('-'))}=")
    print()
Thanks for this code.
The problem is here: pyphen.Pyphen(lang='pl_PL', left=0, right=0). left and right are the "minimum number of characters of the first/last syllable" according to the documentation. You can’t put 0, you need at least 1.
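The effect of right=0 can be reproduced without the library. The sketch below assumes the margins act as a simple range filter over candidate break positions; that is a reading of the observed behaviour and not a quote from Pyphen's source. The candidate positions [2, 4, 5, 7] for 'Maryśce' are read off the iterate() output posted in this thread; the function names are hypothetical.

```python
def filter_positions(raw_positions, word, left, right):
    """Keep positions with at least `left` characters before and `right` after."""
    last_allowed = len(word) - right
    return [p for p in raw_positions if left <= p <= last_allowed]


def insert_hyphens(word, positions, hyphen="-"):
    """Insert hyphens from right to left so earlier indices stay valid."""
    for p in reversed(positions):
        word = word[:p] + hyphen + word[p:]
    return word


word = "Maryśce"
raw = [2, 4, 5, 7]  # 7 == len(word): a "break" after the last letter

print(insert_hyphens(word, filter_positions(raw, word, left=0, right=0)))
# Ma-ry-ś-ce-
print(insert_hyphens(word, filter_positions(raw, word, left=1, right=1)))
# Ma-ry-ś-ce
```

With right=0 the position at the very end of the word passes the filter, which yields the trailing hyphen and the empty last element after split('-').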
I think it should raise ValueError or be corrected automatically, since it is called only once at the start. The documentation is fine, but the values are not validated. After the change it works.
However, I still see some anomalies, for example pieces with two vowels separated by a consonant, that is, two syllables. I tested over 280,000 non-unique words; some examples are below. Maybe it is a problem with the Polish patterns, but I am not sure, since patterns are matched from the back, which is not natural: the back is the suffix. For Polish the natural matching order is first the word core, then the prefix, then the rest. So the suffix comes third.
['obo', 'wią', 'zy', 'wał']= should be o-bo-wią-zy-wał
['ozdo', 'bić']= should be oz-do-bić
It is rare, but something looks wrong; maybe right or left does not work.
I think it should raise ValueError or be corrected automatically, since it is called only once at the start. The documentation is fine, but the values are not validated. After the change it works.
We generally don’t check the types and the limits of the parameters in our libraries (and that’s quite common in the Python world).
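Callers who want the check can do it themselves before constructing the hyphenator. The helper below is a hypothetical sketch and not part of Pyphen; it only validates the two margins and returns them as keyword arguments.

```python
def checked_margins(left=2, right=2):
    """Validate hyphenation margins and return them as keyword arguments."""
    for name, value in (("left", left), ("right", right)):
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{name} must be an int, got {type(value).__name__}")
        if value < 1:
            raise ValueError(f"{name} must be at least 1, got {value}")
    return {"left": left, "right": right}


print(checked_margins(1, 1))  # {'left': 1, 'right': 1}
```

The result can then be passed on, for example as pyphen.Pyphen(lang='pl_PL', **checked_margins(1, 1)), so that a 0 fails loudly instead of producing a trailing hyphen.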
It is rare, but something looks wrong; maybe right or left does not work.
You should ask the dictionary creators, they’ll be able to tell you whether there’s a good reason for that, or if there’s a limitation or a bug in the dictionary.
It is rare, but something looks wrong; maybe right or left does not work.
left and right prevent the first/last syllables from being shorter than the given value. Unless your code gives a first/last syllable that’s shorter than the given values, there’s no bug about this in Pyphen.
We generally don’t check the types and the limits of the parameters in our libraries (and that’s quite common in the Python world).
The Python world is the same programming world, with no different rules, but as you wish :)
It is rare, but something looks wrong; maybe right or left does not work.
You should ask the dictionary creators, they’ll be able to tell you whether there’s a good reason for that, or if there’s a limitation or a bug in the dictionary.
I think it is not related to the dictionary but to pattern selection.
Take a simple word meaning roughly 'given': 'na-da-wa-ne'. Pattern matching produces 'nada-wa-ne'. The odd part is that 'na-da' comes out as 'nada'; maybe that is a pattern? Maybe it is because 'na' and 'da' are common syllables. A strange problem. Maybe I will check what happens. left == 1, but it behaves as if it were 4. 'na' is a prefix here, so a cut after it would be very good.
I wrote some code to work around these problems; here is a slightly better-working version. The code is very dirty, and the end is probably dead code. It still needs rules to split wrong syllables like 'nada', but it generates more valid syllables. In any case, I have since written something better than the current patterns, and that code is not included here.
# PL_DICTIONARY, RE_OLD_Y_PREFIX, VOWELS_SET and CONSONANTS_SET are
# module-level names that are not shown in this post.
def fixed_hunspell_syllables(word: str):
    syllables = PL_DICTIONARY.inserted(word, ' ').split()
    # syllables = PL_HYPHENATOR.syllables(word)

    # old polish CyV
    index = 0
    max_index = len(syllables)
    fixed_syllables = []
    while index < max_index:
        syllable = syllables[index]
        if index + 1 == max_index:
            fixed_syllables.append(syllable)
            break
        if RE_OLD_Y_PREFIX.match(syllable):
            next_syllable = syllables[index + 1]
            if next_syllable[0] in VOWELS_SET:
                # only consonants
                if VOWELS_SET.isdisjoint(next_syllable[1:]):
                    fixed_syllables.append(syllable + next_syllable)
                # some vowel
                else:
                    fixed_syllables.append(syllable + next_syllable[0])
                    fixed_syllables.append(next_syllable[1:])
                index += 2
                continue
        fixed_syllables.append(syllable)
        index += 1

    syllables = fixed_syllables
    fixed_syllables = []
    letters = []
    for syllable in syllables:
        # try join orphan consonants
        if not VOWELS_SET.isdisjoint(syllable):
            # no orphan
            if not letters:
                # naive syllable split prefix
                if (not CONSONANTS_SET.isdisjoint(syllable)
                        and len(VOWELS_SET.intersection(syllable)) >= 2
                        and syllable[0] in VOWELS_SET
                        and syllable[1] in CONSONANTS_SET):
                    fixed_syllables.append(syllable[0])
                    fixed_syllables.append(syllable[1:])
                # consonants or syllable
                else:
                    fixed_syllables.append(syllable)
            # orphan consonants
            else:
                fixed_syllables.append(''.join(letters) + syllable)
                letters.clear()
        # no vowel orphan
        else:
            letters.append(syllable)
    # orphan consonants
    if letters:
        fixed_syllables.append(''.join(letters))

    # join orphan consonants at end
    while len(fixed_syllables) >= 2:
        if VOWELS_SET.isdisjoint(fixed_syllables[-1]):
            fixed_syllables[-2] += fixed_syllables[-1]
            del fixed_syllables[-1]
        else:
            break

    # (?:c|dr|d|r|st|s|tr|t|z)y followed by a vowel-initial last piece:
    # merge the two (same behavior as the original chain of elif branches)
    if (len(fixed_syllables) >= 2
            and fixed_syllables[-2] in {
                'cy', 'ty', 'ry', 'dy', 'sy', 'sty', 'dry', 'zy', 'try'}
            and fixed_syllables[-1][0] in VOWELS_SET):
        fixed_syllables[-2] += fixed_syllables[-1]
        del fixed_syllables[-1]
    return fixed_syllables
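The orphan-joining step in the middle of that function can be isolated into a few lines. The sketch below is a simplified standalone version and not the author's code: the vowel set and the names `has_vowel` and `merge_orphans` are assumptions of this example. A piece without a vowel is attached to the next piece that has one, and any leftover at the end is attached to the previous piece.

```python
# Polish vowel letters; this set is an assumption of the sketch.
POLISH_VOWELS = frozenset("aąeęioóuy")


def has_vowel(piece):
    """True if the piece contains at least one vowel letter."""
    return not POLISH_VOWELS.isdisjoint(piece.lower())


def merge_orphans(pieces):
    """Attach vowelless pieces to the following piece (or the last one)."""
    merged = []
    pending = ""  # consonant-only pieces waiting for a vowel-bearing piece
    for piece in pieces:
        if has_vowel(piece):
            merged.append(pending + piece)
            pending = ""
        else:
            pending += piece
    if pending:
        if merged:
            merged[-1] += pending
        else:
            merged.append(pending)
    return merged


print(merge_orphans("Ma-ry-ś-ce".split("-")))  # ['Ma', 'ry', 'śce']
print(merge_orphans("ni-g-dzie".split("-")))   # ['ni', 'gdzie']
```

This always picks the "attach forward" reading (Ma-ry-śce and not Ma-ryś-ce); the thread treats both as acceptable, so the choice here is arbitrary.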
Hi there @liZe, @ChameleonRed,
Stumbled upon your discussion while facing a similar issue, though in a much simpler use case. I just wanted to get the raw syllables, but it fails for the Polish language. Using default settings btw.
import pyphen
dic = pyphen.Pyphen(lang='pl_PL')
print(dic.inserted('zaledwie'))
print(dic.inserted('nigdzie'))
print(dic.inserted('ostrożnie'))
print(dic.inserted('ostatnio'))
za-le-d-wie # 'd' is not a syllable
ni-g-dzie # 'g' is not a syllable
ostroż-nie # 'o' is a syllable
ostat-nio # 'o' is a syllable
How would you guys suggest to proceed for such a toy usage? Did you maybe succeed in somehow overcoming the issue, @ChameleonRed? Do I get it correctly, @liZe, that you suggest we write to the authors of pl-PL dictionary?
Thank you in advance
Do I get it correctly, @liZe, that you suggest we write to the authors of pl-PL dictionary?
Yes. I doubt that there’s something wrong in Pyphen’s code for this example (but I may be wrong!) You’ll get more detailed explanation from the authors of the dictionary. If there’s something wrong in the code according to what the authors say, please open a new issue here, we’ll investigate.
You can also try this site that uses Hunspell too. For 3 of your 4 words, it gives the same result as Pyphen.
Looking at your code examples, it looks like a bug. Syllables do not work; this word is based on a very popular name, "Mary" in English.
Only Ma-ry-śce or Ma-ryś-ce is valid.