adah1972 / libunibreak

The libunibreak library
zlib License

Unicode 11 #21

Closed doublex closed 6 years ago

doublex commented 6 years ago

This file: https://github.com/adah1972/libunibreak/blob/master/src/wordbreak.c
Unicode 11: http://www.unicode.org/reports/tr29/tr29-32.html

Marked as obsolete:

E_Base
E_Modifier
Glue_After_Zwj
E_Base_GAZ

Added:

WSegSpace

adah1972 commented 6 years ago

@tasn Tom, comments?

tasn commented 6 years ago

@adah1972, sorry, but I have no time to take a look at the moment, and I don't think I'll have any in the immediate future. :( Ping me again in a few weeks if you haven't managed to fix it by then?

doublex commented 6 years ago

Are there any plans to upgrade this great library to unicode 11? Best wishes!

adah1972 commented 6 years ago

@doublex I made some quick fixes. Please test and check.

adah1972 commented 6 years ago

@roever Do you have time to check whether any updates are necessary in grapheme breaking?

doublex commented 6 years ago

@adah1972 Great library! Thanks a lot!

roever commented 6 years ago

Just had a look at the grapheme part.

I can easily update most of it. But the emoji stuff is a bit more complicated. To implement rule 11 (GB11) properly, I would need to add an additional table: the Extended_Pictographic one at the bottom of https://www.unicode.org/Public/emoji//11.0/emoji-data.txt. Without it, some of the emoji break tests fail.

Do we want that? I am not keen on doing that, but I think it would be the right thing to do anyways.
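
As a rough illustration of why the extra table is needed: in UAX #29 for Unicode 11, rule GB11 forbids a break before an Extended_Pictographic character when it is preceded by a ZWJ that in turn follows an Extended_Pictographic character and any number of Extend characters. The sketch below checks just that one rule on an array of already-classified code points. The enum and function names are hypothetical and simplified for illustration; they are not libunibreak's actual types or API, and the classification step (which is where the Extended_Pictographic table comes in) is assumed to have happened elsewhere.

```c
#include <stddef.h>

/* Hypothetical, simplified property classes (NOT libunibreak's own enum).
 * A real implementation needs the Extended_Pictographic table from
 * emoji-data.txt to assign GB_EXT_PICT to code points. */
enum gb_class { GB_OTHER, GB_EXTEND, GB_ZWJ, GB_EXT_PICT };

/* Returns 1 if GB11 forbids a break between cls[pos - 1] and cls[pos]:
 *   Extended_Pictographic Extend* ZWJ  x  Extended_Pictographic
 * Returns 0 if GB11 does not apply at this position. */
static int gb11_no_break(const enum gb_class *cls, size_t len, size_t pos)
{
    size_t i;

    /* Need at least "ExtPict ZWJ" before pos, and pos must be in range. */
    if (pos < 2 || pos >= len)
        return 0;
    if (cls[pos] != GB_EXT_PICT || cls[pos - 1] != GB_ZWJ)
        return 0;

    /* Walk back from the ZWJ over any run of Extend characters. */
    i = pos - 1;
    while (i > 0 && cls[i - 1] == GB_EXTEND)
        i--;

    /* The character before that run must be Extended_Pictographic. */
    return i > 0 && cls[i - 1] == GB_EXT_PICT;
}
```

Without the table, no code point can be classified as Extended_Pictographic, so a check like this can never succeed and ZWJ emoji sequences get split.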

adah1972 commented 6 years ago

How about separating the work and doing the easy part first? We can go step by step, and every improvement is a good one. I do not see it as a problem if incorporating the emoji data table takes extra time and cannot be done right away.

roever commented 6 years ago

I'll do it. I don't think it will take that long...

adah1972 commented 6 years ago

@doublex I think we have fully updated the library for Unicode 11. I have uploaded a test 4.1 release here (NEWS, configure.ac, and src/Makefile.am have uncommitted changes):

http://wyw.dcweb.cn/libunibreak-4.1.tar.gz

If you have time, please take a look. I will make a new release in about a week.

doublex commented 6 years ago

@adah1972 Best lightweight alternative to "ICU"

adah1972 commented 6 years ago

Regrettably, the last "RC" failed many test cases (kudos to Andreas for making all grapheme breaking tests pass). I have fixed all regression issues, and also updated the line breaking code to reduce the number of skipped/failed tests significantly.

The RC download link remains the same.

adah1972 commented 6 years ago

I have made release 4.1, and am closing this issue. If there are other problems, please open a new issue.