Fails in wordbreak and linebreak algorithms

roever commented 7 years ago

When implementing the grapheme break algorithm I wrote a little test program that uses the unicode test files to check the output of my implementation.

see http://www.unicode.org/Public/9.0.0/ucd/auxiliary/

I did adapt the program for the linebreak and wordbreak algorithms.

according to this the wordbreak implementation fails 2 tests: line 1361 and 1363 and the linebreak fails 532 out of roughly than 7300 tests.

Is this intentional? Are you interested in the programs to see for yourself? They are pretty simple...

roever commented 7 years ago

Oh well... I did use the 9.0.0 testsuite, maybe that one is not the right one to use?

tasn commented 7 years ago

9.0.0 is the correct version, we should be up to date. Thanks for reporting this, and great timing, as my plan for this week was to add a test suite to libunibreak. :)

Could you please create a pull request with the suite, or alternatively just send me the code and I'll integrate it with the build system?

Thank you very much, this is highly needed. :)

roever commented 7 years ago

Let me clean them up a bit and I will append them here in a while. They are very similar to the test program that is included in my pull-request, just adapted to use the other funktion calls and defines for the other algorithms and of course to use the other test specs.

tasn commented 7 years ago

Great, thanks!

tasn commented 7 years ago

@roever, btw, I have some free time in the next few days if you want, I'd happily create this based on the code from your PR.

roever commented 7 years ago

OK, I have update the pull request. It includes the latest grapheme test program. It includes your comments but also some additions. For example it now also converts the string to utf8 and checks with that encoding. I'm more fluent with C++ so some of the stuff is a bit clunky... I really hate C string handling... anyway you might want to use that as a starting point for your test programs. Maybe even have a common test program to test all 3 interfaces... whatever. Conversion is straighforward:

exchange test file name
change function names for the interface to test
change default values written into the desired_breaks array

tasn commented 7 years ago

@roever, thanks, I added a test suite based on your code.

I also checked the issues with the wordbreak algorithm and to me they look like an issue with the test data, and not the implementation. I sent an email to the Unicode mailing list, will update as I get a reply.

@adah1972, please take a look at the linebreak issues. Just run "make check" (you'll may need to disable the wordbreak test in the makefile because it too is failing at the moment).

adah1972 commented 7 years ago

Thank you both! A test suite is very useful.

I see the errors, and I look into them deeply at weekend.

By the way, do you mind if I reformat the test code? It seems a strange mix of GNU style and our current style. I can also check in my clang-format configuration then. We may then only discuss the configuration items when we have code style issues (well, the parenthesis one not included).

adah1972 commented 7 years ago

I added my clang-format configuration first. Even though it does not use the latest clang-format options and we do not have complete consensus about everything, I think it may serve as a good starting point. We can change it later.

tasn commented 7 years ago

Yes, please do. That's why I asked for it, I knew I was doing it wrong.

On 22 Nov 2016 2:18 p.m., "Wu Yongwei" notifications@github.com wrote:

Thank you both! A test suite is very useful.

I see the errors, and I look into them deeply at weekend.

By the way, do you mind if I reformat the test code? It seems a strange mix of GNU style and our current style. I can also check in my clang-format configuration then. We may then only discuss the configuration items when we have code style issues (well, the parenthesis one not included).

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/adah1972/libunibreak/issues/19#issuecomment-262251822, or mute the thread https://github.com/notifications/unsubscribe-auth/AAGofjdgmiuzB-YPLWeJ-aansSuKY-r2ks5rAvmmgaJpZM4K3oxm .

adah1972 commented 7 years ago

OK, I made some quick changes. The function is also a bit long to my taste, but I did not change this for the moment.

tasn commented 7 years ago

For reference, here's my mail in the unicode mailing list archive: http://www.unicode.org/mail-arch/unicode-ml/y2016-m11/0146.html

tasn commented 7 years ago

Update: wordbreak now passes the test suite.

tasn commented 7 years ago

Btw, @adah1972, we should probably make a libunibreak release once you finish fixing the issues. It looks like this release will include many bug fixes and a new feature (grapheme break), so I'd love to have it in production.

adah1972 commented 7 years ago

Yup. Unicode 9.0 support, bug fixes, and a new feature. :-)

adah1972 commented 7 years ago

There are quite a few issues with line breaking. I am not sure whether I can make all the test suite pass quickly, but I am in the process of fixing it.

There is one specific thing that I want your input. This is what the UAX#14 says about CJ (Conditional Japanese Starter):

This character class contains Japanese small hiragana and katakana. Characters of this class may be treated as either NS or ID.

CSS Text Level 3 (which supports Japanese line layout) defines three distinct values for its line-break behavior:

strict, typically used for long lines

normal (CSS default), the behavior typically used for books and documents

loose, typically used for short lines such as in newspapers

These have different sets of “kinsoku” characters which cannot be at the beginning or end of a line; strict has the largest set, while loose has the smallest. The motivation for the smaller number of kinsoku characters is to avoid triggering justification that puts characters off the grid position.

Treating characters of class CJ as class NS will give CSS strict line breaking; treating them as class ID will give CSS normal breaking.

The test suite implies it uses the ‘strict’ mode, while I use ‘normal’. I still think ‘normal’ is better as the default. So I think these are the options:

Do not fix it, and let the test fail.
Change the default behaviour, and the behavioural change may surprise people.
Add a function like set_linebreak_mode(enum LineBreakMode); keep the current default, but set the strict mode for testing.

I think the last option is the best, don’t you agree? Also, do you see such a mode useful in word breaking and grapheme breaking? (I guess not.)

roever commented 7 years ago

The disadvantage of a set_linebreak_mode is that you can no longer use it in a multi-threaded environment. As far as I can see now.. right now no function has internal state and can potentially be called in different threads. With set_mode you'd destroy that property

roever commented 7 years ago

what about "on the fly" adaptation of the tests. Whenever you encounter a character that requires the strict linebreaks replace it with one that has the same character class for normal linebreak.

adah1972 commented 7 years ago

Right, I thought about the multi-threading problem, and the right solution is to create an object and do the rest with the object. It is a big interface change, and I am not sure it is worthwhile.

On-the-fly adaptation will make the test pass, but it is a test-only solution. It is a little bit ugly. I am wondering whether it is worse than introducing a global variable...

roever commented 7 years ago

there are other "ugly" possibilities: 1) add a new parameter to the linebreak function with the linebreak mode, the current function calles the new function with the normal linebreak modus. User who require a differen mode use the new function. No global state or object required, no interface change for existing applications, but it doesn't scale in case other parameters are added

adah1972 commented 7 years ago

Exactly the reason why Windows had DoSomething and then DoSomethingEx. Some people said their company even had ...ExEx. ;-)

No, if I were to add a new API, I would definitely make it an object. It would be more future-proof. It is now more about how to make a reasonable change without breaking the existing API.

tasn commented 7 years ago

I'd add a new API with an object. It's the only reasonable way going forward if you ask me.

With that being said, I do have a fairly clean hacky solution in mind: set the mode in the language parameter. That is accept "jp-normal", "jp-strict" as potential values. Though I'd rather you add the object API discussed above.

adah1972 commented 7 years ago

I think the key question is: is the change worth while? and how about word break and grapheme break?

Personally, I do not feel the reason for change is strong enough: it arises only from a test instead of real-world cases. This said, if you both agree, we can release a version without fixing this test case, and then work on a new version with an object-oriented API.

tasn commented 7 years ago

Although I initially called it a hack, maybe the solution I proposed above is a good one. Setting the language to "jp-strict" and "jp-normal" to control this behaviour. What do you think?

If you disagree with the above, I think it's a good idea to release a version without fixing this test case, because as we said, it's not broken, just a different mode.

roever commented 7 years ago

Yes, I think the "jp-strict" solution sounds very appealing to me. maybe use "jp-mode_strict" so that the parameter is always within the string, so that when another parameter must be given it can be done by "jp-mode_normal-parameterx_normal" that way normal can be used for all upcomming additional parameters

Regarding the tests. If it is only one test that fails then ok. But if there are multiple tests that fail due to it, I'd ask, if it is possible to automatically detect these cases (maybe by the usage of a certain character) and not check them or maybe even automatically fix them within the test tool, (by replacing those characters).

adah1972 commented 7 years ago

So far, the "strict" language option seems the best. I'll do it. I'll make "ja-jp_strict" or "ja_strict" trigger the strict mode.

The tests have more failures, but this is the only one that is related to strict vs. normal rendering. The next headache is the Object Replacement Character. It seems I need to add an additional row/column absent in table 2 to implement fully LB20. The reference code in UAX#14 does not implement this either.

tasn commented 7 years ago

I don't like the "mode_strict" suggestion. I don't think encoding parameters in the string is a good idea. It just happens to work for mode, because it's essentially a variation of the Japanese language, which is what lang is for.

Object replacement character: I don't know if you remember, but that was one of my first contributions (if not the first) to the project, liblinebreak was triggering an assert with this char. This char is actually quite common, at least in my usecase, so let's make it work. :P

adah1972 commented 7 years ago

One gotcha, I cannot even use "ja...strict" in the test, as the "ja" language changes the line breaking behaviour and makes more tests fail. It seems I can only check for the "_strict" suffix.

tasn commented 7 years ago

We have language specific behaviour in the linebreak algo? I didn't know that. I guess yes, just check for the suffix. Btw, I prefer using a - over _ so it can't be mistaken as a location. So: "ja-strict" "ja_JP-strict" over "ja_strict" "ja_JP_strict"

adah1972 commented 7 years ago

Oh, you are right. I was confused by "_" and "-". I should check for "-strict". The complication is that we use "zh_CN" in shell LANG, but "zh-CN" in HTML lang attribute. :-(

tasn commented 7 years ago

I may be mistaken, but if memory serves, zh_CN is the standard way of doing it, regardless of what there may be in html.

adah1972 commented 7 years ago

http://stackoverflow.com/questions/11318961/what-is-the-difference-between-html-lang-en-and-html-lang-en-us :-)

adah1972 commented 7 years ago

OK. One small improvement committed and pushed. Now test suite completeness is 97%.

adah1972 commented 7 years ago

I fixed the U+FFFC issue. However, there are still quite a few failures, and I would like you to check the failed tests and give some comments.

These words are in LineBreakTest-9.0.0.txt:

Note: The Line Break tests use tailoring of numbers described in Example 7 of Section 8.2 Examples of Customization. They also differ from the results produced by a pair table implementation in sequences like: ZW SP CL.

I can see the following categories about failures:

The number-related cases as mentioned above.
Similar cases where UAX#14 treats as indirect break opportunity but the test treats as direct break opportunity, as in "%(".
The current implementation does not break after leading spaces, but the test does.
Treatment of pairing RI characters (cannot be covered by the pair table).

Personally, I consider only the last category as bugs and would like to fix them. This would lead to impossibility to achieve 100% test pass.

What are your opinions?

adah1972 commented 7 years ago

RI treatment is done now.

roever commented 7 years ago

Generally I do agree with you observations. Two questions though: what difference makes the tayloring in the example... it doesn't state that in the text of the example as far as I can see.

2nd have you ever though about a good way that users of the library can do the changes for this kind of tayloring without requiring to change the library? I am just curious here...

adah1972 commented 7 years ago

It seems to me Example 7 of Section 8.2 gives tailoring that is better suited for a regexp-based implementation. One specific example is that the current implementation does not allow any breaks in either "12,345" or "word,3", but Example 7 and the tailoring would allow breaking after the comma in the latter case. It is arguably better, but requires more complicated processing, and the benefit is not that big in my eyes.

tasn commented 7 years ago

I'm OK with ignoring the offending lines. I think the best solution would be to add a mechanism to ignore specific lines of the test, so this way our suite would still pass indicating that everything is as expected, but we would still be able to use the unicode test data. If you are OK with this, I can implement it. I'll just ignore all of the tests currently failing.

adah1972 commented 7 years ago

I am curious how you wanted to do that. I thought about it, but did not come up with a good solution. E.g. if I skip tests with numbers, I may skip more tests than the failing ones. . . .

tasn commented 7 years ago

Skip tests based on line numbers. Why more tests than the failing ones? We know exactly which are failing, and I'll only skip those. Next time, when we update to Unicode 10, we will have to review the tests we skip again, but that's not much of a big deal, and better than the alternatives.

adah1972 commented 7 years ago

OK, that will work (I was thinking along the line of regex). The line numbers in the test probably are not that stable, but I really do not have a better solution.

tasn commented 7 years ago

I was considering regex, but as you said, it's too dangerous.

Edit: I think they should be stable-enough to not bother us that much, but anyway, the test data shouldn't change too often.

tasn commented 7 years ago

@adah1972, please make a new release.

adah1972 / libunibreak

Fails in wordbreak and linebreak algorithms #19