Kozea / Pyphen

Hy-phen-ation made easy
https://courtbouillon.org/pyphen

Polish syllables do not work because positions() is used incorrectly to build syllables: the data is valid, but the result is not. #47

Closed ChameleonRed closed 10 months ago

ChameleonRed commented 1 year ago

Judging by your code examples, this looks like a bug. Syllable splitting does not work. The word below is a form of a very popular name ("Mary" in English).

import pyphen
p = inserted('Maryśce')
# CVCVCCV (C = consonant, V = vowel)

p.inserted('Maryśce', ' ')
# 'Ma-ry-ś-ce-' ? The syllables are wrong. Why is there a cut at the end?

list(p.iterate('Maryśce'))
# [('Maryśce', ''), ('Maryś', 'ce'), ('Mary', 'śce'), ('Ma', 'ryśce')]
# Why ('Maryśce', '')? Why is there a cut at the end?
# The rest is fine. The two alternative cuts are both valid: Ma-ry-śce (preferable) or Ma-ryś-ce.

# ś is a consonant in Polish, so it is not a valid syllable. Every syllable must contain one vowel.
p.positions('Maryśce')
# [2, 4, 5, 7]
# This is wrong because positions 4 and 5 isolate 'ś', which is not a valid syllable. I have no idea where 7 comes from.
# 4 and 5 are alternatives; it is better to choose 4 and skip 5.

Only Ma-ry-śce or Ma-ryś-ce is valid.
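
To make the complaint concrete, here is a minimal sketch (not Pyphen's code) that cuts the word at each reported position and flags the fragments that contain no vowel. The helper names `split_at` and `vowelless` and the vowel set are assumptions made for illustration; the positions are the ones reported above.

```python
# Polish vowel letters (assumption for this sketch; digraphs and the
# softening "i" are ignored).
POLISH_VOWELS = set("aąeęioóuy")


def split_at(word, positions):
    """Cut `word` at each index in `positions` and return the fragments."""
    fragments = []
    start = 0
    for position in sorted(positions):
        fragments.append(word[start:position])
        start = position
    fragments.append(word[start:])
    return fragments


def vowelless(fragments):
    """Return the fragments with no vowel letter (empty ones included)."""
    return [f for f in fragments if POLISH_VOWELS.isdisjoint(f.lower())]


fragments = split_at("Maryśce", [2, 4, 5, 7])
print(fragments)             # ['Ma', 'ry', 'ś', 'ce', '']
print(vowelless(fragments))  # ['ś', '']
```

Position 7 equals `len('Maryśce')`, so the last cut leaves an empty fragment after the word, which is what appears as the trailing hyphen.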

liZe commented 1 year ago

Hi!

Here’s what I get:

import pyphen
dic = pyphen.Pyphen(lang='pl_PL')
dic.inserted('Maryśce')
# 'Ma-ry-ś-ce'

There’s no extra hyphen at the end.

This is wrong because positions 4 and 5 isolate 'ś', which is not a valid syllable.

That’s something you should talk about with the dictionary creators, they know much more about Polish hyphenation than I do 😁.

If you have some "copy-pastable" code that shows the extra hyphen error, we can try to fix the bug (your script doesn’t really work). Otherwise, we can probably close this issue.

ChameleonRed commented 1 year ago

Here is code that reproduces these errors; note that both libraries produce the same "errors". As hyphenation they are not really errors: syllable division in Polish is ambiguous, so Ma-ry-śce is fine and Ma-ryś-ce is fine, but you cannot apply both cuts and produce Ma-ry-ś-ce, because that makes no sense: 'ś' has no vowel.

Can you show which patterns are used for this word? The suffix 'śce' is very common, so why is it not matched, given that it is longer than 'ce'? And what matches 'ś', which is surely not a pattern on its own? Can you explain this? I do not know the details of the Liang algorithm you are using, but I think something is wrong here, because fragments without a vowel are common (I found more such examples).

import hyphen
import pyphen

words = [
    'postanowiliśmy',
    'kurlandzki',
    'Maryśce'
]

for word in words:
    print('PyHypen')
    h = hyphen.Hyphenator('pl_PL', 0, 0, 0, 0)
    print(f"{h.pairs(word)}=")
    print(f"{h.syllables(word)}=")
    print()

    print('pyphen')
    p = pyphen.Pyphen(lang='pl_PL', left=0, right=0)
    print(f"{list(p.iterate(word))}=")
    print(f"{list(p.inserted(word).split('-'))}=")
    print()
ChameleonRed commented 1 year ago

I also found that I cannot specify 'COMPOUNDLEFTHYPHENMIN' and 'COMPOUNDRIGHTHYPHENMIN', but I can specify 'LEFTHYPHENMIN' and 'RIGHTHYPHENMIN'. It is probably unrelated; I do not know yet.

The main pattern-matching code looks like this, so I can look into how it works later:

    def positions(self, word):
        """Get a list of positions where the word can be hyphenated.

        :param word: unicode string of the word to hyphenate

        E.g. for the dutch word 'lettergrepen' this method returns ``[3, 6,
        9]``.

        Each position is a ``DataInt`` with a data attribute.

        If the data attribute is not ``None``, it contains a tuple with
        information about nonstandard hyphenation at that point: ``(change,
        index, cut)``.

        change
          a string like ``'ff=f'``, that describes how hyphenation should
          take place.

        index
          where to substitute the change, counting from the current point

        cut
          how many characters to remove while substituting the nonstandard
          hyphenation

        """
        word = word.lower()
        points = self.cache.get(word)
        if points is None:
            pointed_word = '.%s.' % word
            references = [0] * (len(pointed_word) + 1)

            for i in range(len(pointed_word) - 1):
                for j in range(
                        i + 1, min(i + self.maxlen, len(pointed_word)) + 1):
                    pattern = self.patterns.get(pointed_word[i:j])
                    if pattern:
                        offset, values = pattern
                        slice_ = slice(i + offset, i + offset + len(values))
                        references[slice_] = map(
                            max, values, references[slice_])

            points = [
                DataInt(i - 1, reference=reference)
                for i, reference in enumerate(references) if reference % 2]
            self.cache[word] = points
        return points
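
The loop above is easier to follow on a toy example. The following is a simplified sketch of the same idea, not Pyphen's implementation: each pattern assigns a number to the gaps around its characters, the highest number per gap wins, and odd numbers mark allowed cuts. The function `toy_positions`, the pattern format (one value per gap, no offset) and the patterns themselves are invented for illustration.

```python
def toy_positions(word, patterns):
    """Simplified version of the matching loop above.

    `patterns` maps a substring of '.word.' to one value per gap around its
    characters (hypothetical toy format, not Pyphen's internal one).
    """
    pointed_word = f".{word.lower()}."
    references = [0] * (len(pointed_word) + 1)
    maxlen = max(len(key) for key in patterns)

    for i in range(len(pointed_word) - 1):
        for j in range(i + 1, min(i + maxlen, len(pointed_word)) + 1):
            values = patterns.get(pointed_word[i:j])
            if values:
                for k, value in enumerate(values):
                    # Keep the highest value seen for each gap.
                    references[i + k] = max(references[i + k], value)

    # Odd values allow a cut, even values forbid it; -1 undoes the leading '.'.
    return [i - 1 for i, value in enumerate(references) if value % 2]


# Invented patterns, for illustration only.
print(toy_positions("abcd", {"bc": (0, 1, 0)}))                       # [2]
print(toy_positions("abcd", {"bc": (0, 1, 0), "abc": (0, 0, 2, 0)}))  # []
print(toy_positions("abcd", {"bc": (0, 1, 0), "d.": (0, 1, 0)}))      # [2, 4]
```

In the last call the result contains 4, which equals `len('abcd')`: a pattern anchored on the closing '.' can mark the gap after the final letter. That may be how a position such as 7 arises for 'Maryśce', although nothing in this thread confirms which pattern is responsible.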
liZe commented 1 year ago

Here is code that reproduces these errors; note that both libraries produce the same "errors".

Then you can talk about this with the dictionary creators. Pyphen just includes this dictionary, but we’re not responsible for its content.

I also found that I cannot specify 'COMPOUNDLEFTHYPHENMIN' and 'COMPOUNDRIGHTHYPHENMIN', but I can specify 'LEFTHYPHENMIN' and 'RIGHTHYPHENMIN'. It is probably unrelated; I do not know yet.

These values are not used by Pyphen, but you can override these values using the left and right parameters:

https://github.com/Kozea/Pyphen/blob/ebc37d13b83a77e2376d6a72e92ac5356e07883d/pyphen/__init__.py#L209

ChameleonRed commented 1 year ago

It is not a problem with the dictionary but with the algorithm. In about 10,000 tests, the only dictionary problems I found were in old Polish, where a slightly different letter, such as 'y', is used for 'j'. Why is it a problem with the algorithm? You use a hyphenation dictionary, in which ambiguous cuts are all correct because they are used to break a word at the end of a line. You cannot use it for syllables, because a syllable needs one vowel. So the cuts in 'Ma-ry-ś-ce' are good, but 'ś' is not a syllable.

Another problem is that left=0 or right=0 probably does not work; I suspect this but have not tested it. Some fragments also contain several syllables. The hyphenation patterns probably assume some minimum length, but that is speculation. For example, in 'U-kra-iny' the cuts are good, but the syllables are 'U-kra-i-ny': all vowels in this word are full vowels, with no semivowels as in old Polish.

So a syllable algorithm is not a hyphenation algorithm. In English it may be different, but in English (and similar languages) what you write differs from what you say, whereas in Polish the two are almost the same, with some rare exceptions.
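
The "one vowel per syllable" rule can be applied as a post-processing step on top of the hyphenation output. The sketch below is one possible approach, not part of Pyphen: it attaches consonant-only fragments to the following fragment, which matches the preference for Ma-ry-śce stated earlier. The name `merge_vowelless` and the vowel set are assumptions, and the sketch does not split fragments that hold more than one syllable (such as 'iny').

```python
POLISH_VOWELS = set("aąeęioóuy")  # assumption: vowel letters only


def merge_vowelless(fragments):
    """Attach fragments without a vowel to a neighbouring fragment."""
    merged = []
    pending = ""  # consonant-only fragments waiting for the next syllable
    for fragment in fragments:
        if POLISH_VOWELS.isdisjoint(fragment.lower()):
            pending += fragment
        else:
            merged.append(pending + fragment)
            pending = ""
    if pending:
        # Trailing consonants have no following syllable: attach them backwards.
        if merged:
            merged[-1] += pending
        else:
            merged.append(pending)
    return merged


print(merge_vowelless(["Ma", "ry", "ś", "ce"]))  # ['Ma', 'ry', 'śce']
```

Empty fragments are treated as vowelless and disappear in the merge, so a stray cut at the end of the word does no harm here.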

liZe commented 1 year ago

So a syllable algorithm is not a hyphenation algorithm.

As far as I know, Hunspell dictionaries (and thus Pyphen) give positions where hyphenation is possible. It gives 'Ma-ry-ś-ce', and according to what you say that’s accurate. The fact that 'ś' is not a syllable is a problem, but it’s not Pyphen’s problem, is it?

ChameleonRed commented 1 year ago

If it is only hyphenation, that is OK, but there is still a bug...

generates: 'Ma-ry-ś-ce-'

should be: 'Ma-ry-ś-ce'

liZe commented 1 year ago

It doesn’t for me:

import pyphen
dic = pyphen.Pyphen(lang='pl_PL')
dic.inserted('Maryśce')
# 'Ma-ry-ś-ce'
liZe commented 1 year ago

@ChameleonRed Did you find why you get an extra hyphen? Maybe there’s an invisible character (such as zero-width space) at the end of your string.

ChameleonRed commented 1 year ago

There is no invisible character; I typed the word by hand.

ChameleonRed commented 1 year ago

See comparison:

PyHypen
[['po', 'stanowiliśmy'], ['posta', 'nowiliśmy'], ['postano', 'wiliśmy'], ['postanowi', 'liśmy'], ['postanowili', 'śmy']]=
['po', 'sta', 'no', 'wi', 'li', 'śmy']=

pyphen
[('postanowiliśmy', ''), ('postanowili', 'śmy'), ('postanowi', 'liśmy'), ('postano', 'wiliśmy'), ('posta', 'nowiliśmy'), ('po', 'stanowiliśmy')]=
['po', 'sta', 'no', 'wi', 'li', 'śmy', '']=

PyHypen
[['kur', 'landzki'], ['kurlandz', 'ki']]=
['kur', 'landz', 'ki']=

pyphen
[('kurlandzki', ''), ('kurlandz', 'ki'), ('kur', 'landzki')]=
['kur', 'landz', 'ki', '']=

PyHypen
[['Ma', 'ryśce'], ['Mary', 'śce'], ['Maryś', 'ce']]=
['Ma', 'ry', 'ś', 'ce']=

pyphen
[('Maryśce', ''), ('Maryś', 'ce'), ('Mary', 'śce'), ('Ma', 'ryśce')]=
['Ma', 'ry', 'ś', 'ce', '']=

PyHypen
[['Ukra', 'iny']]=
['Ukra', 'iny']=

pyphen
[('Ukrainy', ''), ('Ukra', 'iny'), ('U', 'krainy')]=
['U', 'kra', 'iny', '']=

PyHypen
[['pi', 'sma']]=
['pi', 'sma']=

pyphen
[('pisma', ''), ('pi', 'sma')]=
['pi', 'sma', '']=
ChameleonRed commented 1 year ago

Code for debugging:

import hyphen
import pyphen

words = [
    'postanowiliśmy',
    'kurlandzki',
    'Maryśce',
    'Ukrainy',
    'pisma',
]

for word in words:
    print('PyHypen')
    h = hyphen.Hyphenator('pl_PL', 0, 0, 0, 0)
    print(f"{h.pairs(word)}=")
    print(f"{h.syllables(word)}=")
    print()

    print('pyphen')
    p = pyphen.Pyphen(lang='pl_PL', left=0, right=0)
    print(f"{list(p.iterate(word))}=")
    print(f"{list(p.inserted(word).split('-'))}=")
    print()
liZe commented 1 year ago

Thanks for this code.

The problem is here: pyphen.Pyphen(lang='pl_PL', left=0, right=0). According to the documentation, left and right are the "minimum number of characters of the first/last syllable". You can’t use 0; you need at least 1.
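
A small sketch shows why right=0 leads to the trailing hyphen. It assumes that candidate positions are kept when `left <= position <= len(word) - right`; that condition is an assumption about how the limits are applied, and `filter_positions` and `insert_hyphens` are hypothetical helpers, not Pyphen functions. The raw positions are the ones reported earlier in this thread.

```python
def filter_positions(raw_positions, word, left, right):
    """Keep positions leaving at least `left` / `right` characters on each side.

    Assumption: the condition is left <= position <= len(word) - right.
    """
    return [p for p in raw_positions if left <= p <= len(word) - right]


def insert_hyphens(word, positions, hyphen="-"):
    """Insert `hyphen` at each position, working from the end of the word."""
    for position in sorted(positions, reverse=True):
        word = word[:position] + hyphen + word[position:]
    return word


word = "Maryśce"
raw = [2, 4, 5, 7]  # positions reported earlier in this thread

print(insert_hyphens(word, filter_positions(raw, word, 0, 0)))  # Ma-ry-ś-ce-
print(insert_hyphens(word, filter_positions(raw, word, 1, 1)))  # Ma-ry-ś-ce
```

With right=0 the position equal to the word length passes the filter, so a hyphen is inserted after the last letter; with right=1 it is dropped.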

ChameleonRed commented 1 year ago

I think it should raise ValueError or fix the values automatically, since the check would run only once, at construction. The documentation is good, but the values are not validated. After the change it works.

However, I still see some anomalies, for example fragments that contain two vowels separated by a consonant, that is, two syllables. I tested over 280,000 non-unique words; some examples are below. Maybe it is a problem with the Polish patterns, but I am not sure, because patterns seem to be matched from the back of the word, which is not natural: the back is the suffix. For Polish, the natural matching order is the word stem first, then the prefix, then the rest, so the suffix comes third.
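
Such a check can also live in calling code. This is only a sketch of the idea: `checked_hyphen_min` is a hypothetical helper, and the defaults of 2 are an assumption about Pyphen's own defaults.

```python
def checked_hyphen_min(left=2, right=2):
    """Return (left, right), or raise ValueError if either is below 1."""
    for name, value in (("left", left), ("right", right)):
        if not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be an integer >= 1, got {value!r}")
    return left, right


left, right = checked_hyphen_min(1, 1)
# The checked values could then be passed on, for example:
# pyphen.Pyphen(lang='pl_PL', left=left, right=right)
```

Calling `checked_hyphen_min(0, 0)` raises ValueError instead of silently producing a trailing hyphen later.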

['obo', 'wią', 'zy', 'wał']= should be o-bo-wią-zy-wał

['ozdo', 'bić']= should be oz-do-bić

It is rare, but something looks wrong; maybe right or left does not work.

liZe commented 1 year ago

I think it should raise ValueError or fix the values automatically, since the check would run only once, at construction. The documentation is good, but the values are not validated. After the change it works.

We generally don’t check the types and the limits of the parameters in our libraries (and that’s quite common in the Python world).

It is rare, but something looks wrong; maybe right or left does not work.

You should ask the dictionary creators, they’ll be able to tell you whether there’s a good reason for that, or if there’s a limitation or a bug in the dictionary.

It is rare, but something looks wrong; maybe right or left does not work.

left and right prevent the first/last syllables from being shorter than the given values. Unless your code gives a first/last syllable that’s shorter than the given values, there’s no bug about this in Pyphen.

ChameleonRed commented 1 year ago

We generally don’t check the types and the limits of the parameters in our libraries (and that’s quite common in the Python world).

The Python world is the same programming world, with no different rules, but as you wish :)

It is rare, but something looks wrong; maybe right or left does not work.

You should ask the dictionary creators, they’ll be able to tell you whether there’s a good reason for that, or if there’s a limitation or a bug in the dictionary.

I think it is not related to the dictionary but to pattern selection.

Take a simple word that means roughly 'given': 'na-da-wa-ne'. Pattern matching produces 'nada-wa-ne'. Oddly, 'na-da' comes out as 'nada'; maybe that is a pattern? Maybe it is because 'na' and 'da' are common syllables. A strange problem. Maybe I will check what happens. I set left == 1, but it behaves as if it were 4. 'na' is a prefix here, so a cut after it would be very good.

I wrote some code to work around these problems; here is a version that works a little better. The code is very dirty, and the end is probably dead code. It still needs rules to split wrong syllables like 'nada', but it generates more valid syllables. In any case, I have written something better than the current patterns, and that code is not included here.

def fixed_hunspell_syllables(word: str):
    syllables = PL_DICTIONARY.inserted(word, ' ').split()
    # syllables = PL_HYPHENATOR.syllables(word)

    # old polish CyV
    index = 0
    max_index = len(syllables)
    fixed_syllables = []

    while index < max_index:
        syllable = syllables[index]
        if index + 1 == max_index:
            fixed_syllables.append(syllable)
            break
        if RE_OLD_Y_PREFIX.match(syllable):
            next_syllable = syllables[index + 1]
            if next_syllable[0] in VOWELS_SET:
                # only consonants
                if VOWELS_SET.isdisjoint(next_syllable[1:]):
                    fixed_syllables.append(syllable + next_syllable)
                # some vowel
                else:
                    fixed_syllables.append(syllable + next_syllable[0])
                    fixed_syllables.append(next_syllable[1:])
                index += 2
                continue
        fixed_syllables.append(syllable)
        index += 1

    syllables = fixed_syllables

    fixed_syllables = []
    letters = []
    for syllable in syllables:
        # try join orphan consonants
        if not VOWELS_SET.isdisjoint(syllable):
            # no orphan
            if not letters:
                # naive syllable split prefix
                if (not CONSONANTS_SET.isdisjoint(syllable)
                        and len(VOWELS_SET.intersection(syllable)) >= 2
                        and syllable[0] in VOWELS_SET and syllable[1] in CONSONANTS_SET):
                    fixed_syllables.append(syllable[0])
                    fixed_syllables.append(syllable[1:])
                # consonants or syllable
                else:
                    fixed_syllables.append(syllable)
            # orphan consonants
            else:
                fixed_syllables.append(''.join(letters) + syllable)
                letters.clear()
        # no vowel orphan
        else:
            letters.append(syllable)
    # orphan consonants
    if letters:
        fixed_syllables.append(''.join(letters))

    # join orphan consonants at end
    while len(fixed_syllables) >= 2:
        if VOWELS_SET.isdisjoint(fixed_syllables[-1]):
            fixed_syllables[-2] += fixed_syllables[-1]
            del fixed_syllables[-1]
        else:
            break

    # (?:c|dr|d|r|st|s|tr|t|z)y

    old_y_syllables = {'cy', 'ty', 'ry', 'dy', 'sy', 'sty', 'dry', 'zy', 'try'}
    if (len(fixed_syllables) >= 2
            and fixed_syllables[-2] in old_y_syllables
            and fixed_syllables[-1][0] in VOWELS_SET):
        fixed_syllables[-2] += fixed_syllables[-1]
        del fixed_syllables[-1]

    return fixed_syllables
JStrebeyko commented 10 months ago

Hi there @liZe, @ChameleonRed,

I stumbled upon your discussion while facing a similar issue, though in a much simpler use case. I just wanted to get the raw syllables, but it fails for Polish. I am using the default settings, by the way.

import pyphen

dic = pyphen.Pyphen(lang='pl_PL')

print(dic.inserted('zaledwie'))   # za-le-d-wie: 'd' is not a syllable
print(dic.inserted('nigdzie'))    # ni-g-dzie: 'g' is not a syllable
print(dic.inserted('ostrożnie'))  # ostroż-nie: 'o' should be a separate syllable
print(dic.inserted('ostatnio'))   # ostat-nio: 'o' should be a separate syllable

How would you suggest proceeding for such a toy use case? Did you manage to overcome the issue, @ChameleonRed? Do I understand correctly, @liZe, that you suggest we write to the authors of the pl-PL dictionary?

Thank you in advance

liZe commented 10 months ago

Do I understand correctly, @liZe, that you suggest we write to the authors of the pl-PL dictionary?

Yes. I doubt that there’s something wrong in Pyphen’s code for this example (but I may be wrong!). You’ll get a more detailed explanation from the authors of the dictionary. If there’s something wrong in the code according to what the authors say, please open a new issue here, and we’ll investigate.

You can also try this site that uses Hunspell too. For 3 of your 4 words, it gives the same result as Pyphen.