Closed virxkane closed 1 year ago
Do you have a strong reason to use the data external to the library? I am a bit curious.
koreader/crengine (a fork of CoolReader) added some code using libunibreak with this patch. Unfortunately, I cannot say exactly what this function is for, since I am not the author of this code. As far as I understood, it's required for more correct linebreaks around em-dash.
https://github.com/koreader/crengine/pull/365
Perhaps the authors of this code can explain - @Jellby. Perhaps @poire-z will also be able to explain something.
I just want to use this code without applying additional patches.
Interesting....
@jellby @poire-z Can you think of a solution that work directly inside libunibreak? This can be, but not limited to, adding more custom data to linebreakdef.c.
I hope @poire-z can provide a more useful answer. To be honest, my knowledge of crengine, libunibreak and the UAX stuff at the time is pretty much contained in the PR description and comments, and I've forgotten most of it since :D
(It's been a while since we integrated libunibreak into crengine, so the following is from what I can gather having a quick look again at our code.)
Can you think of a solution that work directly inside libunibreak? This can be, but not limited to, adding more custom data to linebreakdef.c.
I'm not sure I want to :) I mean, it's not really only custom data that we would need to add, but custom code. We have a module where we have put all the language specific tweaks: https://github.com/koreader/crengine/blob/master/crengine/src/textlang.cpp that, among other things, provide some tweaks and callbacks to our main use of libunibreak: https://github.com/koreader/crengine/blob/a4f7509a475ba233ddd92b19c1a5868f098c4655/crengine/src/lvtextfm.cpp#L1471-L1484
For some stuff (quotes & al.), we use your lb_props codepoint-to-class mapping (your linebreakdef.c is a bit limited, and we're happy being able to tweak it on our side without having to push PRs to your side). But for other stuff, we may need a bit more context (ie, depending on what's before or next to the current char) to decide how to tweak its class (depending on language or CSS properties). And I'm happy being able to do that and have full control on our side, and not having to feed libunibreak with callbacks (and not having to plug new ones on text language change - the alternative would be to have a single callback, but libunibreak would need to call it on each char, which would drop performances a lot).
I'm not sure you would want all this language specific tweaks in libunibreak: you do UAX14 well, dunno if you want to do more. UAX14 allows for language specific tweaks, but doesn't specify them. Dunno what kind of higher level hooks you could provide to allow your users to plug them, but I think such low level hooks like the above patch (and this litlle trick that you allowed us to use) are good enough (at least, for us :)
Also, all our languages tweaks in there were mostly added by KOReader users, may be looking at docs, but also possibly just from the personnal taste of the user reading that language that dared doing something to make his reading more pleasant :) None of these have been validated by scholars or academic organisation - so bringing our choices for a EPUB reading software to libunibreak, that may be used for more serious stuff, could lead to hot discussions among language communities, that I'd rather not be part of :)
@virxkane As this seems the smallest change that does not break anything, feel free to submit a patch.
@Jellby @poire-z If you have new ideas how to extend libunibreak, please let me know.
feel free to submit a patch.
Done.
Is it possible to make the
get_char_lb_class_lang()
function to determine the class of a unicode character public? Here is a patch that does something similar: