Closed dairt closed 8 years ago
Hi herdz,
This is due to rule 21a in the Unicode line breaking algorithm:
http://www.unicode.org/reports/tr14/#LB21a
So using "-" or "|" can achieve the same result.
If you know Hebrew, maybe you can tell why the rule is here and what the best way is to handle the case.
I believe the JSP page (you missed two dots in the URL) has not implemented rule 21a.
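For readers unfamiliar with rule LB21a mentioned above: UAX #14 states it as "HL (HY | BA) ×", meaning no break is allowed after a hyphen or break-after character that directly follows a Hebrew letter. The following is a minimal sketch of that single rule only. The enum and function names are hypothetical and are not part of libunibreak.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced set of line breaking classes. */
enum lb21a_class { C_AL, C_HL, C_HY, C_BA, C_SP };

/*
 * Returns true if LB21a ("HL (HY | BA) x") forbids a break after cls[i],
 * i.e. cls[i] is HY or BA and the class directly before it is HL.
 * No other UAX #14 rule is modelled here.
 */
static bool lb21a_forbids_break_after(const enum lb21a_class *cls, size_t i)
{
    assert(cls != NULL);
    if (i == 0)
        return false; /* nothing precedes the first character */
    return cls[i - 1] == C_HL && (cls[i] == C_HY || cls[i] == C_BA);
}
```

For the class sequence HL, HY, HL the function returns true at index 1, while for AL, HY, AL it returns false, which is the asymmetry between Hebrew and other alphabetic text that the rule introduces.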
On 16 December 2015 at 23:59, herdz notifications@github.com wrote:
Hello,
I have a problem with getting correct line breaks with the following case:
aaaa  bbbb
vs
אאאא  בבבב
When using the test tool, I get (respectively):
aaaa |  | bbbb
vs
אאאא |  בבבב
I have used the object replacement character, but I am not sure whether the problem is specific to it.
For comparison, added results from http://unicodeorg/cldr/utility/breaksjsp: aaaa | |bbbb vs אאאא | |בבבב
Can you check if there might be a problem? Thanks in advance
Wu Yongwei URL: http://wyw.dcweb.cn/
Thank you for the quick reply.
Just to be clear, I didn't use any "-" or "|" characters in my examples. The output with those characters is what the test tool (tools/linebreak_test.c) prints.
The issue is for the simple alphabetic characters combined with the object replacement character, and should be equivalent for both AL and HL.
Also, I checked the online line break reference (again: http://unicode.org/cldr/utility/breaks.jsp):
Version 3.7; ICU version: 56.0.1.0; Unicode version: 8.0.0.0
I suspect it has something to do with the latest update to 8.0.0. @herdz, could you please try with commit 5cae14c7d3a50acd275bf90692e645a6ecdb1540?
I also speak Hebrew, and @herdz is right, it should be the same. 21a has nothing to do with it: look at his text, there are no hyphens.
Same result. Thanks for looking into this. Having read a bit of http://unicode.org/reports/tr14/#CB, I am not sure the libunibreak implementation has a rule to mark ALLOW_BREAK after the OBJ (maybe it worked before for AL for other reasons).
I remember there used to be issues with OBJ and I fixed them a while back, maybe there are more.
Sorry I was not clear enough last time. Let me be thorough this time.
@tasn Since you said you speak Hebrew, can you explain to me why HY and BA should not have a break after them when they follow Hebrew letters, even when there are spaces in between? I am especially puzzled by BA, as some BA characters are not hyphens (the class includes hyphen, "|", soft hyphen, Armenian hyphen, Devanagari double danda, etc.). Is UAX #14 just trying to be simple?
Independent of this, maybe an easy fix is to treat CB as B2 instead of BA. I have tried this, and it seems to solve the problem.
Any comments? I will check in this change if there are no objections.
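To illustrate why the stand-in class for CB matters, here is a toy model, not libunibreak's actual code. It assumes the input is a Hebrew letter directly followed by the object replacement character (class CB), and it models only an LB21a-style check applied after CB has been replaced by a stand-in class. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced set of line breaking classes. */
enum toy_class { TOY_AL, TOY_HL, TOY_BA, TOY_B2, TOY_CB };

/* Replace CB by the chosen stand-in class; leave other classes alone. */
static enum toy_class toy_resolve(enum toy_class c, enum toy_class cb_stand_in)
{
    return c == TOY_CB ? cb_stand_in : c;
}

/*
 * Returns true if an LB21a-style check ("HL BA x") would forbid a break
 * after cls[i] once every CB has been replaced by cb_stand_in.
 * Assumption: this is the only rule modelled in this sketch.
 */
static bool toy_no_break_after(const enum toy_class *cls, size_t i,
                               enum toy_class cb_stand_in)
{
    assert(cls != NULL);
    if (i == 0)
        return false;
    return toy_resolve(cls[i - 1], cb_stand_in) == TOY_HL
        && toy_resolve(cls[i], cb_stand_in) == TOY_BA;
}
```

In this model the sequence HL, CB, HL loses the break after CB when CB stands in as BA, but not when it stands in as B2, while AL, CB, AL is unaffected either way. That is consistent with the Hebrew-only difference reported in this issue.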
Fix committed. I will close in a day, in case you find any problems.
Looks like the proper fix. Thanks for the quick reply! :)
Looking good.