Closed djowel closed 4 years ago
Breaks are always between characters. So here, it means a break after ‘x’. Libunibreak started using exactly the same algorithm as described in UAX#14-19, which always added a line break at the end.
While it might be debatable whether the last break is useful, it is hardly harmful. Changing the behaviour at this moment is too late, as it may potentially break existing applications.
> Breaks are always between characters. So here, it means a break after ‘x’.
I see. OK.
> While it might be debatable whether the last break is useful, it is hardly harmful. Changing the behaviour at this moment is too late, as it may potentially break existing applications.
Is it not possible to provide the behavior using a configuration switch (e.g. a #define)?
Nothing is impossible, but I think it is much simpler for you to change the result as you wish. You can do something as simple as:
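std::string input{…};
std::vector<char> breaks(input.size());
// utf8_t is libunibreak's byte type, so the char pointer needs a cast
set_linebreaks_utf8(reinterpret_cast<const utf8_t*>(input.c_str()), input.size(), lang, breaks.data());
if (!breaks.empty()) // guard against empty input before touching the last slot
    breaks.back() = LINEBREAK_ALLOWBREAK; // Or whatever value you want to change to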
One line of code can get all the flexibility, and it looks to me much better than adding a macro.
What if there's a real hard break there? For example, instead of "x", you get "\n"?
Nice catch. My last answer was not really thoughtful.
The real catch is, if your input does not end, there is no good value to fit here. If you do not like LINEBREAK_MUSTBREAK, LINEBREAK_ALLOWBREAK and LINEBREAK_NOBREAK do not fit any better. Either your input ends and LINEBREAK_MUSTBREAK works (kind of, like how Vim treats a text file that does not end with LF), or your input does not end and the value is actually indeterminate. The current implementation assumes that the input must end.
If I understand correctly (maybe not), you are saying that at the "end of text", there's no good way to know what kind of break/no-break there is at that point, because the break algorithm (in general) needs two codepoints (before and after) to determine what needs to be done, correct? In some cases, you do know, as in the example of the "\n". You do know that you want a LINEBREAK_MUSTBREAK there.
Perhaps you need another enum value for the indeterminate case, where it can't decide whether it is LINEBREAK_ALLOWBREAK or LINEBREAK_NOBREAK. Perhaps LINEBREAK_INDETERMINATE for that indeterminate state? For continuous text streams that are chunked, you can pass in the LAST codepoint from a previous chunk as the first codepoint of the succeeding chunk, so you won't lose anything. But for real end of input, you get it correct with the first run without having to guess. The caller (me) knows if it is the real EOF and I have all the information I need: it's either a LINEBREAK_MUSTBREAK OR something else... LINEBREAK_INDETERMINATE will be equivalent to LINEBREAK_NOBREAK if you are sure it is the REAL EOF.
Come to think of it, perhaps just set the last codepoint as LINEBREAK_NOBREAK OR LINEBREAK_MUSTBREAK... Then the caller (me) will have the information to decide. If I know that it is the real EOF, good; if not, I'll just pass in the last codepoint from a previous chunk as the first codepoint of the succeeding chunk as mentioned above and continue there.
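The carry-over idea for chunked streams could be sketched roughly as follows. This is only an illustration, not part of libunibreak: `last_codepoint_start` and `with_carry` are hypothetical helper names, the input is assumed to be valid UTF-8, and chunks are assumed to be split on codepoint boundaries. Note also that a single codepoint of context may not be enough for every UAX #14 rule, since some rules look further back than one character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the byte offset at which the last UTF-8
// codepoint of `s` starts, or 0 for an empty string.
// UTF-8 continuation bytes have the bit pattern 10xxxxxx.
std::size_t last_codepoint_start(const std::string& s) {
    if (s.empty()) return 0;
    std::size_t i = s.size() - 1;
    while (i > 0 && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) --i;
    return i;
}

// Hypothetical helper: builds the buffer to analyse for the next chunk,
// i.e. the last codepoint of the previous chunk followed by the next chunk.
std::string with_carry(const std::string& prev, const std::string& next) {
    return prev.substr(last_codepoint_start(prev)) + next;
}
```

The caller would then run the break analysis on the result of `with_carry` and discard the break slots that belong to the carried codepoint, since those were already emitted (apart from the final one) when the previous chunk was processed.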
> kind of, like how Vim treats a text file that does not end with LF
BTW, for complete (non-chunked) blocks of text, my hack here is to add an LF at the end of a block of text, so I get to know the REAL case at the end of the text block (one before the LF). It is not optimal though because inserting the LF requires memory allocation (I can't use the raw input as-is).
Interesting that this issue was created just recently. I've always tweaked the libunibreak code to fill the last slot with ALLOWBREAK. I was upgrading to 4.3 and was about to open an issue about the same question (to avoid having to make the edit every time). For my use cases, ALLOWBREAK is a better indeterminate state.
Even you two disagree on the last state, so I introduced LINEBREAK_INDETERMINATE. Take a look at commit aadc2f9.
> Even you two disagree on the last state, so I introduced LINEBREAK_INDETERMINATE. Take a look at commit aadc2f9.
Yes, because LINEBREAK_INDETERMINATE is the right choice :-) and reinforces the disagreement in the first place. It is indeterminate and can go either way.
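With an indeterminate final slot, each caller can apply its own policy after the analysis. A minimal sketch of that step is below. The constants are local stand-ins that are assumed to mirror the values in libunibreak's `linebreak.h` (real code should include the header and use the `LINEBREAK_*` macros), and `resolve_final_break` is a hypothetical helper name, not a library function.

```cpp
#include <vector>

// Local stand-ins, assumed to match libunibreak's LINEBREAK_* values.
constexpr char kMustBreak = 0;
constexpr char kAllowBreak = 1;
constexpr char kNoBreak = 2;
constexpr char kIndeterminate = 4;

// Hypothetical helper: replaces an indeterminate final slot with the value
// the caller prefers. A definite result (for example a must-break after a
// trailing "\n") is left untouched.
void resolve_final_break(std::vector<char>& breaks, char replacement) {
    if (!breaks.empty() && breaks.back() == kIndeterminate) {
        breaks.back() = replacement;
    }
}
```

A caller that knows it has reached the real end of input might pass `kNoBreak`, while one that prefers the older patched behaviour might pass `kAllowBreak`. Either way, a hard break at the end survives.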
Way to go, @adah1972 :-) Thank you very much.
Works like a charm! 👍 I just tested it. No more hacks in my code.
Works for me :)
Why does libunibreak always place a hard break at the end of text even if there's no real break? So, I had to hack around that, but consider this:
I get the odd result brks: "\0". But 'x' is not a break. What am I missing?