adah1972 / libunibreak

The libunibreak library
zlib License
173 stars 38 forks source link

Public version of the get_char_lb_class_lang() function #39

Closed virxkane closed 1 year ago

virxkane commented 1 year ago

Is it possible to make the get_char_lb_class_lang() function to determine the class of a unicode character public? Here is a patch that does something similar:

diff --git a/src/linebreak.c b/src/linebreak.c
index c2ccdda..45afba2 100644
--- a/src/linebreak.c
+++ b/src/linebreak.c
@@ -751,6 +751,12 @@ int lb_process_next_char(

     return brk;
 }
+enum LineBreakClass lb_get_char_class(
+        struct LineBreakContext *lbpCtx,
+        utf32_t ch)
+{
+    return get_char_lb_class_lang(ch, lbpCtx->lbpLang);
+}

 /**
  * Sets the line breaking information for a generic input string.
diff --git a/src/linebreakdef.h b/src/linebreakdef.h
index 5bb8838..8db80a6 100644
--- a/src/linebreakdef.h
+++ b/src/linebreakdef.h
@@ -169,3 +169,6 @@ void set_linebreaks(
         const char *lang,
         char *brks,
         get_next_char_t get_next_char);
+enum LineBreakClass lb_get_char_class(
+        struct LineBreakContext *lbpCtx,
+        utf32_t ch);
adah1972 commented 1 year ago

Do you have a strong reason to use the data external to the library? I am a bit curious.

virxkane commented 1 year ago

koreader/crengine (a fork of CoolReader) added some code using libunibreak with this patch. Unfortunately, I cannot say exactly what this function is for, since I am not the author of this code. As far as I understood, it's required for more correct linebreaks around em-dash.

https://github.com/koreader/crengine/pull/365

Perhaps the authors of this code can explain - @Jellby. Perhaps @poire-z will also be able to explain something.

I just want to use this code without applying additional patches.

https://github.com/koreader/crengine/blob/a4f7509a475ba233ddd92b19c1a5868f098c4655/crengine/src/textlang.cpp#L494

https://github.com/koreader/crengine/blob/a4f7509a475ba233ddd92b19c1a5868f098c4655/crengine/src/textlang.cpp#L670

adah1972 commented 1 year ago

Interesting....

@jellby @poire-z Can you think of a solution that work directly inside libunibreak? This can be, but not limited to, adding more custom data to linebreakdef.c.

Jellby commented 1 year ago

I hope @poire-z can provide a more useful answer. To be honest, my knowledge of crengine, libunibreak and the UAX stuff at the time is pretty much contained in the PR description and comments, and I've forgotten most of it since :D

poire-z commented 1 year ago

(It's been a while since we integrated libunibreak into crengine, so the following is from what I can gather having a quick look again at our code.)

Can you think of a solution that work directly inside libunibreak? This can be, but not limited to, adding more custom data to linebreakdef.c.

I'm not sure I want to :) I mean, it's not really only custom data that we would need to add, but custom code. We have a module where we have put all the language specific tweaks: https://github.com/koreader/crengine/blob/master/crengine/src/textlang.cpp that, among other things, provide some tweaks and callbacks to our main use of libunibreak: https://github.com/koreader/crengine/blob/a4f7509a475ba233ddd92b19c1a5868f098c4655/crengine/src/lvtextfm.cpp#L1471-L1484

For some stuff (quotes & al.), we use your lb_props codepoint-to-class mapping (your linebreakdef.c is a bit limited, and we're happy being able to tweak it on our side without having to push PRs to your side). But for other stuff, we may need a bit more context (ie, depending on what's before or next to the current char) to decide how to tweak its class (depending on language or CSS properties). And I'm happy being able to do that and have full control on our side, and not having to feed libunibreak with callbacks (and not having to plug new ones on text language change - the alternative would be to have a single callback, but libunibreak would need to call it on each char, which would drop performances a lot).

I'm not sure you would want all this language specific tweaks in libunibreak: you do UAX14 well, dunno if you want to do more. UAX14 allows for language specific tweaks, but doesn't specify them. Dunno what kind of higher level hooks you could provide to allow your users to plug them, but I think such low level hooks like the above patch (and this litlle trick that you allowed us to use) are good enough (at least, for us :)

Also, all our languages tweaks in there were mostly added by KOReader users, may be looking at docs, but also possibly just from the personnal taste of the user reading that language that dared doing something to make his reading more pleasant :) None of these have been validated by scholars or academic organisation - so bringing our choices for a EPUB reading software to libunibreak, that may be used for more serious stuff, could lead to hot discussions among language communities, that I'd rather not be part of :)

adah1972 commented 1 year ago

@virxkane As this seems the smallest change that does not break anything, feel free to submit a patch.

@Jellby @poire-z If you have new ideas how to extend libunibreak, please let me know.

virxkane commented 1 year ago

feel free to submit a patch.

Done.