Genivia / ugrep

NEW ugrep 6.5: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more
https://ugrep.com
BSD 3-Clause "New" or "Revised" License
2.56k stars 109 forks source link

background coloring artifact due to scrolling when the first character of a continuation line is matched #366

Closed vinc17fr closed 5 months ago

vinc17fr commented 5 months ago

A coloring bug occurs in ugrep 5.0.0 under the following conditions:

For instance, consider GREP_COLORS="mt=41" (i.e. matches will be in white over a red background) and a 80-column terminal with a file containing the following line:

..............................................................................abcdef

(78 periods followed by "abcdef", so that cdef appears on the next line).

Go to the bottom of the terminal and type

ugrep c file

The c is colored as wanted, but the spaces after the end of the text also have a red background.

Note that this issue does not occur when matching another letter, only the letter c (the one that appears at the beginning of the continuation line).

This issue is reproducible under Debian/unstable with xterm and under Android/Termux.

genivia-inc commented 5 months ago

Thank you for your feedback.

This visualization is related to the use (or not) of ANSI control codes to clear a line (EL). There are pro/cons to using this to clear lines. Clearing colors does not clear the background color (match background) to display the terminal background in this scenario, but rather retains it on most terminals.

When I looked into this issue months ago when I saw this with ugrep, I started looking for info to figure out what the best approach would be. I could not find a consensus.

What I found for example, is that GNU grep mentions this problem in their code (ugrep does not share GNU grep code at all, in fact 0% of third-party open source is used in ugrep since it is written from scratch, I also never borrowed any code from it or anywhere else (also no co-pilot, than you), because I know I can write better code than that):

/* Select Graphic Rendition (SGR, "\33[...m") strings.  */
/* Also Erase in Line (EL) to Right ("\33[K") by default.  */
/*    Why have EL to Right after SGR?
         -- The behavior of line-wrapping when at the bottom of the
            terminal screen and at the end of the current line is often
            such that a new line is introduced, entirely cleared with
            the current background color which may be different from the
            default one (see the boolean back_color_erase terminfo(5)
            capability), thus scrolling the display by one line.
            The end of this new line will stay in this background color
            even after reverting to the default background color with
            "\33[m', unless it is explicitly cleared again with "\33[K"
            (which is the behavior the user would instinctively expect
            from the whole thing).  There may be some unavoidable
            background-color flicker at the end of this new line because
            of this (when timing with the monitor's redraw is just right).

[...]

         -- Even more possible background color flicker (when timing
            with the monitor's redraw is just right), even when not at the
            bottom of the screen.
      There are no additional disadvantages specific to doing it after
      SGR END.

      It would be impractical for GNU grep to become a full-fledged
      terminal program linked against ncurses or the like, so it will
      not detect terminfo(5) capabilities.  */

In some cases I've found that EL occasionally causes other visualization weirdness. That's why I am uncertain how to effectively address this terminal background coloring manifestation. Perhaps output EL and take it for granted?

EDIT: "clearing colors" causes this, not when using EL.

vinc17fr commented 5 months ago

Hmm... This is actually the same issue as with GNU grep when ne is used in GREP_COLORS. The GNU grep man page says:

              ne     Boolean value that prevents clearing to the end of line
                     using Erase in Line (EL) to Right (\33[K) each  time  a
                     colorized  item  ends.   This is needed on terminals on
                     which EL is not supported.  It is otherwise  useful  on
                     terminals  for which the back_color_erase (bce) boolean
                     terminfo capability does not  apply,  when  the  chosen
                     highlight  colors do not affect the background, or when
                     EL is too slow or causes too much flicker.  The default
                     is false (i.e., the capability is omitted).

Now, concerning the bugs:

In particular, concerning the testcase in the Debian bug in xterm:

echo "                                                                               1234" | grep --color=auto '[1-9]'

(note that contrary to what is said, there are 79 spaces, not 80), the characters 1234 are not displayed at all if ne is not used, e.g. with GREP_COLORS="mt=41", which is a much more serious issue. And if I add ne (my usual setting), i.e. use GREP_COLORS="mt=41:ne", then the characters 1234 are displayed and colored as expected, but the following spaces also get the red background, which is exactly the same issue as with ugrep.

I didn't noticed that GNU grep was affected, because I always use it with a wrapper, which has the effect to make this bug disappear, AFAIK without any drawback!

BTW, this wrapper also makes the bug disappear in ugrep, so that perhaps this would be a good starting point. Basically, this is equivalent to

ugrep --color=always c file | less -+c -FRX

Note that in this case, due to the -F option, less quits immediately, i.e. the use of less is invisible (except that it avoids the background issue).

vinc17fr commented 5 months ago

It seems that what less does to avoid the issue is to output a space followed by a backspace (there are also some escape sequences) before handling the c.

genivia-inc commented 5 months ago

Yes, that is what I was referring to. That's what I already know. It's not that complicated. What is not trivial is to decide what's best, having to emit an EL for "each time a colored item ends" seems an overkill and causes flicker. Some more experimentation may be needed to decide on a better solution than what has been tried and proposed before (that caused other problems when adopted, caveat emptor also applies to free advice).

The ugrep TUI handles color output differently, to improve the terminal output and avoid such artifacts. It outputs EL for each line ending (and does a lot more than that, to properly display Unicode wide characters and colors etc).

The normal ugrep output (not in the TUI) is not controllable, just like GNU grep source code says it can't emulate a terminal and use ncurses, since normal output can be sent to tty or piped to less.

vinc17fr commented 5 months ago

The less solution does not emit any EL.

genivia-inc commented 5 months ago

Got it. But less uses the terminal properties (rows, columns, etc) to "fit" text, so it "knows" when to use the trick, or perhaps I am misunderstanding when less does the space/backspace trick.

Grep streams the output. It is also not 100% guaranteed to work if placement decisions to use such tricks are based on column positions, because Unicode double width glyphs (e.g. emojis) take two columns and can cause miscalculations (e.g. early line wrap). The ugrep TUI checks which glyph ranges are double width using some tricks that I came up with. Streaming colored text is a different ballgame. Perhaps emitting some correcting ESC codes at the end of a line can render the background color correctly. Whatever might work this way should also be carefully tested to make sure it works when piped to less and more.

vinc17fr commented 5 months ago

Got it. But less uses the terminal properties (rows, columns, etc) to "fit" text, so it "knows" when to use the trick, or perhaps I am misunderstanding when less does the space/backspace trick.

Yes, after some tests, it appears that less uses the space/backspace trick only at the end of the physical line, and it is able to take double-width Unicode characters into account for that. This trick is not used when no escape sequences have been sent in the line (i.e. no changes of colors). This makes sense.

[...] Whatever might work this way should also be carefully tested to make sure it works when piped to less and more.

Well, if the output is piped or redirected to a file, I don't think that commands should try to do any terminal-related correction, because it is not possible for the command to know how the output will be used. Said otherwise, this terminal related issue should be considered only by commands that send output directly to a tty (i.e. they should first test whether the standard output is attached to a tty).

genivia-inc commented 5 months ago

Some observations:

A) with EL before \n

image

and also

image

without EL we get the coloring artifact (current ugrep versions, here I've used ne to override in my test version):

image

B) with EL before \n makes the matching c disappear!

image

without EL (current ugrep versions, here I've used ne to override in my test version):

image
genivia-inc commented 5 months ago

C) with space+backspace (and with/without EL, no difference) before \n

image

without space+backspace+EL

image

D) with space+backspace only after emitting 80 columns, which means EL is placed here without space+backspace, making the matching c disappear:

image
genivia-inc commented 5 months ago

Yes, after some tests, it appears that less uses the space/backspace trick only at the end of the physical line, and it is able to take double-width Unicode characters into account for that. This trick is not used when no escape sequences have been sent in the line (i.e. no changes of colors). This makes sense.

It looks like space+backspace hack may cause line skips/scrolls when streaming text to the terminal. This happens unless we don't emit \n and we know for sure we are at the end of a line. This condition cannot be 100% guaranteed for several reasons:

  1. the text output may end up appended after some other text sent to the terminal (it is possible, but not likely)
  2. not emitting a \n means that terminal screen width rescaling will no longer retain the screen content correctly (lines appended)
  3. determining the column position on terminals is difficult to always get correct, because Unicode characters sets are represented differently (single width or double width) on different terminals. We also should take into account remote terminals. Some use single width, other use double width. The ugrep TUI tests this by emitting some Unicode characters then asks the terminal for the column position. This tells the TUI how various Unicode characters are displayed, which sets are single column and double column.

The less tool does not concern 1) and may have the problem 2), unless it emits \n before space+backspace but that does not remove the color artifact.

vinc17fr commented 5 months ago
  1. the text output may end up appended after some other text sent to the terminal (it is possible, but not likely)

Indeed, but in such a case, I would say that this additional text will be more a problem than the fix potentially not working in such a case.

  1. not emitting a \n means that terminal screen width rescaling will no longer retain the screen content correctly (lines appended)

I don't understand what you mean. If you mean "increase the terminal width" (i.e. with the effect to add columns), there is does not seem to be any issue.

  1. determining the column position on terminals is difficult to always get correct, because Unicode characters sets are represented differently (single width or double width) on different terminals.

AFAIK, this is standard (information provided by the C library). Terminals that do not honor this standard will break curses applications. See for instance Debian bug 1027733 (GNU Screen uses obsolete hardcoded widths instead of the wcwidth function, and this has eventually been fixed in Debian).

We also should take into account remote terminals. Some use single width, other use double width.

This means that some version of the software is not up-to-date (possible in practice), or that the terminal is buggy. Not much an issue here.

The less tool does not concern 1)

It does, and it is even buggy at the point that this can corrupt the terminal status in practice, because external text (typically stderr) can be sent to the terminal in the middle of an escape sequence. See Debian bug 949139, which I reported 4 years ago.

and may have the problem 2), unless it emits \n before space+backspace but that does not remove the color artifact.

You do not need this \n.

For instance, in xterm, I haven't seen any issue with

printf foo ; for i in `seq 3` ; do printf "..............................................................................ab \bc\n" ; done | ugrep --color=always 'c'

i.e. I have added just a space+backspace before "c". The appended text "foo" doesn't break the fix (while it is an issue with less).

Note that this fix could be optional.

genivia-inc commented 5 months ago

I disagree. I ran various tests with various possible ways to address this, with and without \n and EL and so on. I am not going to report all the things I tried. I just broke it down in a few points. I am not happy with the results. If there was a perfect solution then I would have done this already. If it was that easy, the GNU grep devs would have done that already too, no? Instead, there's the buggy use of EL.

IMHO trying to work around a terminal bug should not be a hack that requires a much logic, especially when a pager already does this.

Just to be clear. The less utility controls its full screen layout, like ugrep does in the TUI, which make sense to use wcwidth_l and other logic. It's not an issue in the TUI and it is not an issue when using a pager.

Without TUI and pager, the text streams to the tty whatever the tty is. To calculate column position with wcwidth_l for every character converted from UTF-8 to UCS-4 is out of the question. It's too slow for streaming a lot of text to a tty and will be noticeable. Also tab character and control codes like \r and backspace must be taken into account, as well as a possible start from a column other than column one. It's easy when you ignore all the details.

vinc17fr commented 5 months ago

I ran various tests with various possible ways to address this, with and without \n and EL and so on. I am not going to report all the things I tried.

You haven't tried the following: a space+backspace before the coloring escape sequence (without \n and EL) appears to work most of the time, and all the time if the end of line can be guessed correctly. This needs to split coloring into two parts if the colored part appears on 2 lines. At worst, the background problem would still occur when the end-of-line cannot be correctly detected. But at least this would solve the most common cases without any drawback.

Here's a testcase: out-80col.txt, which gives in xterm: out-80col

With this testcase, each colored "cd" gets split escape sequences with an associated space+backspace, but the goal is to show that there is nothing wrong when the space+backspace occurs at a "bad" place. In practice, a split and a space+backspace would be used only when needed.

Note: using one escape sequence per colored character would even eliminate the need to detect the end of line, and this would still be very efficient. When using grep --color . or ugrep --color ., each colored character gets its own escape sequence, and you can see that this is still quite fast.

If it was that easy, the GNU grep devs would have done that already too, no?

Perhaps because the solution has never been suggested?

IMHO trying to work around a terminal bug should not be a hack that requires a much logic, especially when a pager already does this.

This is not a terminal bug, but the poor way VT100 was designed in the past.

Without TUI and pager, the text streams to the tty whatever the tty is. To calculate column position with wcwidth_l for every character converted from UTF-8 to UCS-4 is out of the question. It's too slow for streaming a lot of text to a tty and will be noticeable.

Except that you don't output a lot of text without using a pager. I recall that the trick should be used only when the (u)grep output is a tty. That's why I think that using one escape sequence per colored character would be the best workaround.

genivia-inc commented 5 months ago

...appears to work most of the time...

"most of the time" is not acceptable. It's not giving POLA assurance wrt. to this problem.

This is not a terminal bug, but the poor way VT100 was designed in the past.

Most folks find this behavior counter-intuitive and inconsistent across terminals, hence a "bug". Not all terminals behave this way. Like aqua xterm. Therefore, it's not something that can be consistently counted on as a "feature".

Another possible workaround is to define the selected line sl= color to include EL to erase the line with the given background for it, like --colors=mt=yB:sl=KE to draw a black K background with E to erase it, where E is a pseudo-color that emits EL (\033[K) which is something ugrep does not do, ignoring E. Adding this feature to ugrep allows an E to be specified when and where people want it in a color code:

image

But this is still a hack to define a black background to the selected lines (or whatever color you want as background). It's not perfect. For example, when the line ends with the match, then the background is not erased:

image

And if we're unlucky, it deletes a matching character:

image

I'm unhappy with all and any such hacks so far.

genivia-inc commented 1 month ago

I just thought of this, but another way to show background colors without the artifact is to invert the foreground color and leave the background color alone. For example

$ ug --colors='mt=hyi' PATTERN FILE

shows matches in black on bright yellow (assuming the terminal background is black).