bakkeby / dmenu-flexipatch

A dmenu build with preprocessor directives to decide which patches to include during build time
MIT License
183 stars 78 forks source link

fuzzyhighlight: Invalid UTF-8 passed to drw_text #24

Closed speedie1337 closed 1 month ago

speedie1337 commented 1 year ago

If the fuzzy highlight patch is used along with Pango (maybe an issue with Xft too?) and emojis or non-English characters are highlighted, invalid UTF-8 is passed to the drw_text function. Looking at the function though I'm pretty sure this is because it doesn't properly handle Unicode made up of multiple characters when drawing each character. Not only does this result in the text not drawing properly, Pango will output many warnings about it.

I don't really have the time to fix this (and I'm really not familiar with encoding) but I thought I'd report the issue anyway.

N-R-K commented 2 months ago

Can you give some concrete examples, and maybe screenshots?

speedie1337 commented 1 month ago

Apologies for the late response, I'm not on GitHub very much anymore.

  1. Enable Pango, fuzzy matching and fuzzy highligting
  2. printf "%s\n" "åäö" | dmenu
  3. Call some function that redraws the menu (e.g. press tab)

You get similar results with any non-ASCII characters. I think the problem from glancing over the fuzzy matching code is we're not necessarily iterating over each UTF-8 character, so perhaps an incomplete (and thus invalid) character is passed to drw_text().

Although not fit for this project whatsoever for obvious reasons, I dealt with a similar problem using icu in a C++ project. You could probably deal with the problem in a more suckless fashion, though.

image

N-R-K commented 1 month ago

I was actually looking into this but then it went off my mind. Basically the way dmenu (or libsl rather) deals with invalid utf8 doesn't make much sense.

It does it's calculation on an imaginary U+FFFD but then hands off the invalid utf8 to xft. Which on my system just cuts the text off at the first invalid byte.

$ printf "0\xef1234567\ntest" | dmenu

pic0

It'd probably be better if it actually rendered invalid sequence as U+FFFD instead of using it only for width calculations.

pic1

I've sent a mail to the dev mailing list about this: https://lists.suckless.org/dev/2407/35646.html. I'll try to fix it upstream depending on what the expected behavior is.

bakkeby commented 1 month ago

I had a look at this and the particular issue that speedie reports here is isolated to the fuzzyhighlight patch.

The underlying issue is that the patch assumes that each letter takes up one char / byte and thus sends broken multi-byte characters to drw_text.

I played around with an example using libgrapheme to get the correct length for multi-byte UTF-8 characters and got fuzzy highlights working.

image

diff --git a/patch/fuzzyhighlight.c b/patch/fuzzyhighlight.c
index d31a150..5b90c2d 100644
--- a/patch/fuzzyhighlight.c
+++ b/patch/fuzzyhighlight.c
@@ -6,6 +6,7 @@ drawhighlights(struct item *item, int x, int y, int maxw)
 #endif // EMOJI_HIGHLIGHT_PATCH
 {
        int i, indent;
+       int utf8charlen;
        char *highlight;
        char c;

@@ -33,8 +34,9 @@ drawhighlights(struct item *item, int x, int y, int maxw)
                           ? SchemeSelHighlight
                           : SchemeNormHighlight]);
        for (i = 0, highlight = itemtext; *highlight && text[i];) {
+               utf8charlen = grapheme_next_character_break_utf8(highlight, SIZE_MAX);
                #if FUZZYMATCH_PATCH
-               if (!fstrncmp(&(*highlight), &text[i], 1))
+               if (!fstrncmp(&(*highlight), &text[i], utf8charlen))
                #else
                if (*highlight == text[i])
                #endif // FUZZYMATCH_PATCH
@@ -46,8 +48,8 @@ drawhighlights(struct item *item, int x, int y, int maxw)
                        *highlight = c;

                        /* highlight character */
-                       c = highlight[1];
-                       highlight[1] = '\0';
+                       c = highlight[utf8charlen];
+                       highlight[utf8charlen] = '\0';
                        drw_text(
                                drw,
                                x + indent + (lrpad / 2),
@@ -58,8 +60,8 @@ drawhighlights(struct item *item, int x, int y, int maxw)
                                , True
                                #endif // PANGO_PATCH
                        );
-                       highlight[1] = c;
-                       i++;
+                       highlight[utf8charlen] = c;
+                       i += utf8charlen;
                }
                highlight++;
        }

As for the invalid UTF-8 issue that would be unrelated to the Pango patch, but using libgrapheme one could sanity check that the character is valid UTF-8 and bail if not, e.g.

            utf8charlen = grapheme_next_character_break_utf8(text, SIZE_MAX);
            if (!grapheme_decode_utf8(text, utf8charlen, &utf8codepoint)) {
                /* Bail in the event the string ended unexpectedly in the middle of a multi-byte
                 * sequence. */
                overflow = 1;
                break;
            }

The above is from some previous experimentation that I have done integrating the grapheme library in dwm in place of the existing implementation.

speedie1337 commented 1 month ago

libgrapheme seems like the right library to use for this, given it's also suckless in nature. Maybe if you're a wizard you could figure out how to do this without a library, but I think this seems like a good solution. I think it's very likely the original author just assumed that since it can draw ASCII characters correctly, it should draw other characters too when that's not necessarily the case.

bakkeby commented 1 month ago

You can most likely also use utf8decode. It would be part of the code that the pango patch deletes, but one could undo that.

When using pango markup I'd expect the highlighting to break things, like if you have bold text then the highlighted text would not be bold for example. Probably not worth the effort of trying to fix that.

bakkeby commented 1 month ago

Closing this based on the above commit.