adah1972 / libunibreak

The libunibreak library
zlib License

hard break at the end of text #29

Closed djowel closed 4 years ago

djowel commented 4 years ago

Why does libunibreak always place a hard break at the end of text even if there's no real break? So, I had to hack around that, but consider this:

std::string brks(1, 0);
set_linebreaks_utf8((utf8_t const*) "x", 1, lang, brks.data());

I get the odd result brks: "\0"

But 'x' is not a break. What am I missing?

adah1972 commented 4 years ago

Breaks are always between characters. So here, it means a break after ‘x’. Libunibreak started out using exactly the algorithm described in UAX#14-19, which always adds a line break at the end of the text.

While it might be debatable whether the last break is useful, it is hardly harmful. Changing the behaviour at this moment is too late, as it may potentially break existing applications.
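To make the "break after the character" reading concrete, here is a minimal sketch of how a caller might consume the output array. It does not call libunibreak; `split_at_breaks`, `kMustBreak` and `kAllowBreak` are hypothetical names, and the two constants are assumed to mirror `LINEBREAK_MUSTBREAK` and `LINEBREAK_ALLOWBREAK` from `<linebreak.h>`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Local stand-ins, assumed to match the LINEBREAK_* macros in <linebreak.h>.
constexpr char kMustBreak = 0;   // LINEBREAK_MUSTBREAK
constexpr char kAllowBreak = 1;  // LINEBREAK_ALLOWBREAK

// Cut `text` after every byte whose slot requires or permits a break.
// Slot i describes the position *after* byte i, not the byte itself.
std::vector<std::string> split_at_breaks(const std::string& text,
                                         const std::vector<char>& breaks) {
    assert(text.size() == breaks.size());
    std::vector<std::string> pieces;
    std::size_t start = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (breaks[i] == kMustBreak || breaks[i] == kAllowBreak) {
            pieces.push_back(text.substr(start, i + 1 - start));
            start = i + 1;
        }
    }
    if (start < text.size()) {
        pieces.push_back(text.substr(start));  // tail with no break after it
    }
    return pieces;
}
```

Read this way, the trailing break on "x" only closes the last piece, which is why it rarely causes harm to a consumer written like this one.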

djowel commented 4 years ago

> Breaks are always between characters. So here, it means a break after ‘x’.

I see. OK.

> While it might be debatable whether the last break is useful, it is hardly harmful. Changing the behaviour at this moment is too late, as it may potentially break existing applications.

Is it not possible to provide the behavior using a configuration switch (e.g. a #define)?

adah1972 commented 4 years ago

Nothing is impossible, but I think it is much simpler for you to change the result as you wish. You can do something as simple as:

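std::string input{…};
std::vector<char> breaks(input.size());
// set_linebreaks_utf8 takes const utf8_t* (unsigned char), so a cast is needed in C++
set_linebreaks_utf8(reinterpret_cast<const utf8_t*>(input.data()), input.size(), lang, breaks.data());
if (!breaks.empty()) // avoid indexing past the start when the input is empty
    breaks.back() = LINEBREAK_ALLOWBREAK; // Or whatever value you want to change to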

One line of code can get all the flexibility, and it looks to me much better than adding a macro.

djowel commented 4 years ago

What if there's a real hard break there? For example, instead of "x", you get "\n"?

adah1972 commented 4 years ago

Nice catch. My last answer was not really thoughtful.

The real catch is, if your input does not end, there is no good value to fit here. If you do not like LINEBREAK_MUSTBREAK, LINEBREAK_ALLOWBREAK and LINEBREAK_NOBREAK do not fit better. Either your input ends and LINEBREAK_MUSTBREAK works (kind of, like how Vim treats a text file that does not end with LF), or your input does not end and the value is actually indeterminate. The current implementation assumes that the input must end.
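For callers who do want to override the final slot without clobbering a genuine hard break, one option is to inspect the text itself. The sketch below is an assumption-laden illustration, not part of libunibreak: `ends_with_hard_break` and `set_final_slot` are hypothetical helpers, and the character list is the set that UAX #14 assigns to the mandatory-break classes BK, CR, LF and NL.

```cpp
#include <cassert>
#include <string_view>
#include <vector>

// True if the UTF-8 text ends with a character that UAX #14 treats as a
// mandatory break: LF, CR, VT, FF, NEL (U+0085), LS (U+2028) or PS (U+2029).
bool ends_with_hard_break(std::string_view text) {
    if (text.empty()) return false;
    switch (text.back()) {
        case '\n': case '\r': case '\v': case '\f':
            return true;
        default:
            break;
    }
    auto ends_with = [&](std::string_view suffix) {
        return text.size() >= suffix.size() &&
               text.substr(text.size() - suffix.size()) == suffix;
    };
    return ends_with("\xC2\x85") ||      // U+0085 NEL
           ends_with("\xE2\x80\xA8") ||  // U+2028 LINE SEPARATOR
           ends_with("\xE2\x80\xA9");    // U+2029 PARAGRAPH SEPARATOR
}

// Overwrite the final slot with `value` unless the text really ends
// with a hard break, in which case the existing value is left alone.
void set_final_slot(std::string_view text, std::vector<char>& breaks, char value) {
    assert(text.size() == breaks.size());
    if (!breaks.empty() && !ends_with_hard_break(text)) {
        breaks.back() = value;
    }
}
```

This keeps the override to a single call after `set_linebreaks_utf8`, at the cost of duplicating a small piece of the Unicode data in caller code.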

djowel commented 4 years ago

If I understand correctly (maybe not), you are saying that at the "end of text", there's no good way to know what kind of break/no-break there is at that point because the break algorithm (in general) needs two codepoints (before and after) to determine what needs to be done, correct? In some cases, you do know, like in the example of the "\n". You do know that you want a LINEBREAK_MUSTBREAK there.

Perhaps, you need another enum for setting the indeterminate case where it can't decide if it is LINEBREAK_ALLOWBREAK or LINEBREAK_NOBREAK, perhaps LINEBREAK_INDETERMINATE for that indeterminate state? For continuous text streams that are chunked, you can pass in the LAST codepoint from a previous chunk as the first codepoint of the succeeding chunk, so you won't lose anything. But for real end of input, you get it correct with the first run without having to guess. The caller (me) knows if it is the real EOF and I have all the information I need: it's either a LINEBREAK_MUSTBREAK OR something else... LINEBREAK_INDETERMINATE will be equivalent to LINEBREAK_NOBREAK if you are sure it is the REAL EOF.
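The chunking idea above (carry the last code point of one chunk into the next) can be sketched as follows. This is only an illustration under stated assumptions: `last_codepoint_offset` and `with_carry` are hypothetical helpers, and each chunk is assumed to be well-formed UTF-8 that does not end in the middle of a sequence.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Byte offset at which the last code point of `chunk` begins.
// Returns 0 for an empty chunk.
std::size_t last_codepoint_offset(std::string_view chunk) {
    std::size_t i = chunk.size();
    while (i > 0) {
        --i;
        // Continuation bytes look like 10xxxxxx; any other byte starts a code point.
        if ((static_cast<unsigned char>(chunk[i]) & 0xC0) != 0x80) {
            return i;
        }
    }
    return 0;
}

// Build the input for the next run: the last code point of the previous
// chunk, followed by the new chunk.
std::string with_carry(std::string_view previous, std::string_view next) {
    std::string input(previous.substr(last_codepoint_offset(previous)));
    input.append(next);
    return input;
}
```

In the second run, the slots for the carried bytes would replace the tentative final slot from the first run. Note that some UAX #14 rules look further back than one code point (for example, runs of spaces or regional indicator pairs), so a single code point of context is an approximation rather than a guarantee.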

djowel commented 4 years ago

Come to think of it, perhaps just set the last codepoint as LINEBREAK_NOBREAK OR LINEBREAK_MUSTBREAK... Then the caller (me) will have the information to decide. If I know that it is the real EOF, good, if not, I'll just pass in the last codepoint from a previous chunk as the first codepoint of the succeeding chunk as mentioned above and continue there.

> kind of, like how Vim treats a text file that does not end with LF

BTW, for complete (non-chunked) blocks of text, my hack here is to add an LF at the end of a block of text, so I get to know the REAL case at the end of the text block (one before the LF). It is not optimal though because inserting the LF requires memory allocation (I can't use the raw input as-is).

mbechard commented 4 years ago

Interesting that this issue was created just recently. I've always tweaked the libunibreak code to fill the last slot with ALLOWBREAK. I was upgrading to 4.3 and was about to open an issue with the same question (to avoid having to make the edit every time). For my use cases, ALLOWBREAK is a better indeterminate state.

adah1972 commented 4 years ago

Even you two disagree on the last state, so I introduced LINEBREAK_INDETERMINATE. Take a look at commit aadc2f9.
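With an indeterminate marker in the final slot, each caller can apply its own policy. A minimal sketch, assuming the final slot holds `LINEBREAK_INDETERMINATE` only when the text does not end with a mandatory break; `resolve_final_slot` is a hypothetical helper, and `kIndeterminate` is a local stand-in whose value is assumed to match the macro in `<linebreak.h>`.

```cpp
#include <vector>

// Local stand-in, assumed to match LINEBREAK_INDETERMINATE in <linebreak.h>.
constexpr char kIndeterminate = 4;

// Once the caller knows this is the real end of input, replace an
// indeterminate final slot with the caller's preferred value.
// A final slot holding any other value (e.g. a mandatory break) is kept.
void resolve_final_slot(std::vector<char>& breaks, bool at_end_of_input,
                        char value_at_eof) {
    if (at_end_of_input && !breaks.empty() && breaks.back() == kIndeterminate) {
        breaks.back() = value_at_eof;
    }
}
```

A caller who prefers no break at EOF would pass `LINEBREAK_NOBREAK` as `value_at_eof`, while one who prefers an allowed break would pass `LINEBREAK_ALLOWBREAK`; a caller processing a non-final chunk passes `false` and leaves the slot untouched.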

djowel commented 4 years ago

> Even you two disagree on the last state, so I introduced LINEBREAK_INDETERMINATE. Take a look at commit aadc2f9.

Yes, because LINEBREAK_INDETERMINATE is the right choice :-) and reinforces the disagreement in the first place. It is indeterminate and can go either way.

Way to go, @adah1972 :-) Thank you very much.

djowel commented 4 years ago

Works like a charm! 👍 I just tested it. No more hacks in my code.

mbechard commented 4 years ago

Works for me :)