NHunspell subdir is missing

imarosi commented 1 month ago

The solution file refers to NHunspell.csproj in the NHunspell subdir, but that folder and all its source files are missing here. I could find them on sourceforge, but that looks like an old version.

tmaierhofer commented 1 month ago

There is now a fully managed version of Hunspell for .NET available. Did you consider using it? https://github.com/aarondandy/WeCantSpell.Hunspell/

imarosi commented 4 weeks ago

Yes, I'm doing that, but there seems to be a problem with the contents of the HyphenationPoints, HyphenationPositions, etc. arrays when non-standard hyphenation is used (as when hyphenating the Hungarian "asszony"). I wanted to look at how these arrays are created in the source.

imarosi commented 3 weeks ago

Sorry, I did not read your answer correctly; I thought you were referring to your NuGet package. I'll have a look at WeCantSpell, thanks for the link.

imarosi commented 3 weeks ago

Hmm, I'm afraid WeCantSpell doesn't support hyphenation, and hyphenation is all I need.

I don't know if you still support NHunspell... anyway, the problem is like this:

(My C test code was created by modifying example.c from the original hunspell/hyphen source set. Functionally equivalent C# code was also written on top of NHunspell.)

Below is the output of the C code when hyphenating the Hungarian words "kisasszony" (little lady) and "vénasszony" (old lady). The "ssz" needs non-standard hyphenation, and "é" is a multi-byte UTF-8 character:

imarosi@559T8H2:~/hyph$ ./hyph hu kisasszony
Input:      k i s a s s z o n y
HyphVector:  0 2 1 0 5 2 0 0 0
Details:
       Pos:  0 0 0 0 1 0 0 0 0
       Cut:  0 0 0 0 1 0 0 0 0
       Rep:  - - - - "sz=" - - - -
Syllables:  kis=asz=szony
Hyphenations:
 - kis=asszony
 - kisasz=szony
imarosi@559T8H2:~/hyph$ ./hyph hu vénasszony
Input:      v é n a s s z o n y
HyphVector:  0 2 1 0 5 2 0 0 0
Details:
       Pos:  0 0 0 0 1 0 0 0 0
       Cut:  0 0 0 0 1 0 0 0 0
       Rep:  - - - - "sz=" - - - -
Syllables:  vén=asz=szony
Hyphenations:
 - vén=asszony
 - vénasz=szony

The above is correct. "Syllables" is the word returned by the library itself, while "Hyphenations" is the list of possible hyphenations generated by using the pos, cut and rep arrays.
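To make the role of the pos, cut and rep arrays concrete, here is a minimal sketch of how a single entry of the "Hyphenations" list could be built. It assumes the convention I recall from example.c in the hyphen sources: for a break at character index i, the change starts at character i - pos + 1, removes cut characters, and inserts rep (which already contains the "="). The function names hyphenate_at and utf8_offset are hypothetical helpers, not part of libhyphen or NHunspell.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Byte offset of the char_index-th UTF-8 character in s.
   Continuation bytes have the bit pattern 10xxxxxx. */
static size_t utf8_offset(const char *s, size_t char_index)
{
    size_t off = 0;
    while (s[off] != '\0' && char_index > 0) {
        off++;
        while (((unsigned char)s[off] >> 6) == 2)
            off++;                      /* skip continuation bytes */
        char_index--;
    }
    return off;
}

/* Write into out the word hyphenated at the single break point i
   (a character index). rep is NULL for a standard break.
   Assumed convention: the change starts at character i - pos + 1. */
static void hyphenate_at(const char *word, int i, const char *rep,
                         int pos, int cut, char *out, size_t out_size)
{
    assert(word != NULL && out != NULL && i >= 0);
    if (rep == NULL) {
        /* Standard break: insert '=' after character i. */
        size_t k = utf8_offset(word, (size_t)i + 1);
        snprintf(out, out_size, "%.*s=%s", (int)k, word, word + k);
    } else {
        /* Non-standard break: drop cut characters, insert rep instead. */
        assert(i - pos + 1 >= 0 && cut >= 0);
        size_t start = utf8_offset(word, (size_t)(i - pos + 1));
        size_t end = utf8_offset(word, (size_t)(i - pos + 1 + cut));
        snprintf(out, out_size, "%.*s%s%s", (int)start, word, rep, word + end);
    }
}
```

With the values from the C output above (break at index 4, rep "sz=", pos 1, cut 1) this produces "kisasz=szony"; a standard break at index 2 produces "kis=asszony".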

The output with my test code using NHunspell is like this:

C:\Work\NHunspell\nhyph\bin\Debug>nhyph.exe hu_HU kisasszony
Input:      k i s a s s z o n y
HyphVector:  0 2 1 0 5 2 0 0 0
Details:
       Pos:  0 0 0 1 0 0 0 0 0
       Cut:  0 0 0 1 0 0 0 0 0
       Rep:  - - - "sz=" - - - - -
Syllables:  kis=asz=szony
Hyphenations:
 - kis=asszony
 - kisas=szony

C:\Work\NHunspell\nhyph\bin\Debug>nhyph.exe hu_HU vénasszony
Input:      v é n a s s z o n y
HyphVector:  0 1 0 5 2 0 0 0 0
Details:
       Pos:  0 0 1 0 0 0 0 0 0
       Cut:  0 0 1 0 0 0 0 0 0
       Rep:  - - "sz=" - - - - - -
Syllables:  vén=asz=szony
Hyphenations:
 - vé=nasszony
 - véna=sszony

Note that Syllables is good in both cases, but the generated individual hyphenations are not.

The hyphen vector (i.e. HyphenationPoints) is correct when there are no accented characters, but it is off by one after the "é". It looks like the value '2' is dropped from the vector.

The pos, cut and rep arrays are off by one even when there are no accented characters, and they are shifted even further when such multi-byte characters are present.
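The extra shift is consistent with byte indices being used where character indices are expected. A small sketch of the difference, using only standard C; utf8_length and utf8_char_index are hypothetical helper names, and the word is written with hex escapes so the example does not depend on the source file encoding:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of UTF-8 characters in s: count every byte that is not a
   continuation byte (bit pattern 10xxxxxx). */
static size_t utf8_length(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s >> 6) != 2)
            n++;
    return n;
}

/* Index of the character that the byte at byte_index belongs to. */
static size_t utf8_char_index(const char *s, size_t byte_index)
{
    assert(s != NULL && byte_index < strlen(s));
    size_t ci = 0;
    for (size_t b = 1; b <= byte_index; b++)
        if (((unsigned char)s[b] >> 6) != 2)
            ci++;                       /* a new character starts here */
    return ci;
}
```

For "vénasszony" strlen() reports 11 bytes while utf8_length() reports 10 characters, and the first "s" sits at byte 5 but character 4. Any array indexed by byte after the "é" therefore lands one slot away from the character-indexed result.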

I'd be happy to help you find these bugs. The latest HyphenExportFunctions.cpp source is probably the place to look, and I would be happy to check it. You can send that code directly to me, too, if that's faster than uploading all files to GitHub. My email is on gmail.com, with imarosi before the at.

imarosi commented 3 weeks ago

I checked the sourceforge version of HyphenExportFunctions.cpp and I think I've found the bug in HyphenHyphenate().

for( int multByteIndex = 1; ... should start at 0, not 1. The next line seemingly compensates for this with hyphenPoints[multByteIndex -1], but the other array accesses don't subtract 1, and in fact the index itself is wrong when there are multi-byte characters.
(*posPtr)[wideCharIndex] = pos[multByteIndex] and similar statements should be (*posPtr)[wideCharIndex] = pos[wideCharIndex], because all the arrays returned by hnj_hyphen_hyphenate2 (hyphenPoints, pos, cut, rep) must be treated as indexed by characters, not bytes.

In fact the loop can be simplified like this:

for (int i = 0; i < wordChars - 1; i++) {
    // hyphenPoints holds ASCII digits; convert them to numeric values
    (*hyphenationPointsPtr)[i] = hyphenPoints[i] - '0';
    if (rep && rep[i]) {
        // populate repTextBuffer like the current code does
        (*repPtr)[i] = repTextBuffer;
        (*posPtr)[i] = pos[i];
        (*cutPtr)[i] = cut[i];
    } else {
        (*repPtr)[i] = 0;
    }
}
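For anyone who wants to try the idea outside NHunspell, here is a self-contained sketch of the same loop on plain arrays. It is not the actual HyphenExportFunctions.cpp code: copy_hyphen_results and its parameters are hypothetical, the rep strings are passed through unchanged instead of being copied into a repTextBuffer, and it relies on the assumption stated above that every input array is indexed by character.

```c
#include <assert.h>
#include <stddef.h>

/* Copy the per-character hyphenation results into caller-provided
   arrays holding word_chars - 1 entries each. hyphen_points holds
   ASCII digits; rep may be NULL when only standard breaks exist. */
static void copy_hyphen_results(const char *hyphen_points,
                                char *const *rep, const int *pos,
                                const int *cut, int word_chars,
                                int *out_points, const char **out_rep,
                                int *out_pos, int *out_cut)
{
    assert(hyphen_points != NULL && out_points != NULL);
    assert(out_rep != NULL && out_pos != NULL && out_cut != NULL);
    for (int i = 0; i < word_chars - 1; i++) {
        /* One index for every array: no byte/character mixing. */
        out_points[i] = hyphen_points[i] - '0';
        if (rep != NULL && rep[i] != NULL) {
            out_rep[i] = rep[i];
            out_pos[i] = pos[i];
            out_cut[i] = cut[i];
        } else {
            out_rep[i] = NULL;
            out_pos[i] = 0;
            out_cut[i] = 0;
        }
    }
}
```

Fed with the "kisasszony" values from the C output (points "021052000", rep "sz=", pos 1 and cut 1 at index 4), the output arrays keep the non-standard entry at index 4, the position the C test program reports.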
