adah1972 / libunibreak

The libunibreak library
zlib License

Line breaks RTL case have wrong results #11

Closed dairt closed 8 years ago

dairt commented 8 years ago

Hello,

I have a problem with getting correct line breaks with the following case:

aaaa  bbbb

vs.

אאאא  בבבב

When using the test tool, I get (respectively):

aaaa |
 |
bbbb

vs.

אאאא |
 בבבב

I used the object replacement character, but I am not sure whether the problem is specific to it.

For comparison, added results from http://unicode.org/cldr/utility/breaks.jsp: aaaa | |bbbb vs. אאאא | |בבבב

Can you check if there might be a problem? Thanks in advance.

adah1972 commented 8 years ago

Hi herdz,

This is due to rule 21a in the Unicode line breaking algorithm:

http://www.unicode.org/reports/tr14/#LB21a

So using "-" or "|" can achieve the same result.

If you know Hebrew, maybe you can tell why the rule is here and what the best way is to handle the case.

I believe the JSP page (you missed two dots in the URL) has not implemented rule 21a.


Wu Yongwei URL: http://wyw.dcweb.cn/

dairt commented 8 years ago

Thank you for the quick reply. Just to be clear, I didn't use any "-" or "|" characters in my examples. The output with those characters is what the test tool (tools/linebreak_test.c) prints.

The issue concerns plain alphabetic characters combined with the object replacement character, and the result should be the same for both AL and HL.

Also, I checked the online line break reference (again: http://unicode.org/cldr/utility/breaks.jsp): Version 3.7; ICU version: 56.0.1.0; Unicode version: 8.0.0.0

tasn commented 8 years ago

I suspect it has something to do with the latest update to 8.0.0. @herdz, could you please try with commit 5cae14c7d3a50acd275bf90692e645a6ecdb1540?

I also speak Hebrew, and @herdz is right: it should be the same. 21a has nothing to do with it; look at his text, there are no hyphens.

dairt commented 8 years ago

Same result. Thanks for looking into this. After reading a bit of http://unicode.org/reports/tr14/#CB, I am not sure the libunibreak implementation has a rule to mark ALLOW_BREAK after the OBJ (maybe it worked before for AL for other reasons).
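For reference, the UAX #14 rule for an unresolved CB is LB20, which allows a break both before and after the character. A minimal sketch of that rule in isolation follows; the enum and the function name are made up for illustration and are not part of libunibreak, and every other rule is ignored.

```c
#include <stdbool.h>

/* Hypothetical, heavily reduced set of line breaking classes. */
enum toy_class { TOY_AL, TOY_HL, TOY_CB };

/* LB20 only: a break is allowed before and after an unresolved CB.
   Returns true when LB20 by itself permits a break between
   `before` and `after`. */
bool lb20_allows_break(enum toy_class before, enum toy_class after)
{
    return before == TOY_CB || after == TOY_CB;
}
```

Under this rule alone, `lb20_allows_break(TOY_CB, TOY_HL)` and `lb20_allows_break(TOY_CB, TOY_AL)` give the same answer, which is why AL and HL would be expected to behave alike after the OBJ.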

tasn commented 8 years ago

I remember there used to be issues with OBJ and I fixed them a while back, maybe there are more.

adah1972 commented 8 years ago

Sorry I was not clear enough last time. Let me be thorough this time.

  1. U+FFFC has the category CB. Currently it is resolved to BA.
  2. Rule 21a says no break after HL followed by HY or BA.
  3. Thus the current behaviour.

@tasn Since you said you speak Hebrew, can you explain to me why there should be no break after HY and BA when they follow Hebrew letters, even when there are spaces in between? I am especially puzzled by BA, as some BA characters are not hyphens (the class includes hyphen, "|", soft hyphen, Armenian hyphen, Devanagari double danda, etc.). Is UAX #14 just trying to be simple?

Independent of this, an easy fix may be to treat CB as B2 instead of BA. I have tried this, and it seems to solve the problem.
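To make the three steps and the proposed change concrete, here is a rough sketch of rule 21a on its own, with CB resolved to a configurable stand-in class. This is not the libunibreak code: the enum, `resolve_cb`, and `lb21a_forbids_break_after` are hypothetical names, spaces are left out, and no other UAX #14 rule is modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced class set named after UAX #14 abbreviations. */
enum lbc { C_AL, C_HL, C_HY, C_BA, C_B2, C_CB };

/* Resolve CB to the stand-in class `cb_as`; other classes are unchanged. */
enum lbc resolve_cb(enum lbc c, enum lbc cb_as)
{
    return c == C_CB ? cb_as : c;
}

/* LB21a only: HL (HY | BA) x
   Returns true when the rule forbids a break after classes[i],
   i.e. classes[i] resolves to HY or BA and is preceded by HL. */
bool lb21a_forbids_break_after(const enum lbc *classes, size_t n,
                               size_t i, enum lbc cb_as)
{
    if (i == 0 || i >= n)
        return false;
    enum lbc prev = resolve_cb(classes[i - 1], cb_as);
    enum lbc cur = resolve_cb(classes[i], cb_as);
    return prev == C_HL && (cur == C_HY || cur == C_BA);
}
```

For the sequence HL, CB, HL, calling the function at index 1 with `cb_as` set to `C_BA` reports that the break after the OBJ is forbidden, while `C_B2` leaves rule 21a out of play. The same call on AL, CB, AL never triggers the rule, which matches the difference between the Latin and Hebrew results reported above.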

Any comments? I will check in this change if there are no objections.

adah1972 commented 8 years ago

Fix committed. I will wait a day before closing, in case you find any problems.

dairt commented 8 years ago

Looks like the proper fix. Thanks for the quick reply! :)

tasn commented 8 years ago

Looking good.