Column wrapping may break ANSI escape codes

devdanzin commented 7 months ago

When creating a table with maxcolwidths, ANSI escape codes sometimes get wrongly split.

Here's an example, increasing the length of the "0123..." sequence to show the issue:

>>> print(tabulate.tabulate(tabular_data=(('01234 (\x1b[32mabcdefghij\x1b[0m)', 'XX'),), maxcolwidths=11, tablefmt="grid"))
+-------------+----+
| 01234 (abcd | XX |  # Correctly broken up, colors work on both lines
| efghij)     |    |
+-------------+----+
>>> print(tabulate.tabulate(tabular_data=(('012345 (\x1b[32mabcdefghij\x1b[0m)', 'XX'),), maxcolwidths=11, tablefmt="grid"))
+-------------+----+
| 012345 ( XX | 
| 2mabcdefghi |    |
| j)          |    |
+-------------+----+
>>> print(tabulate.tabulate(tabular_data=(('0123456 (\x1b[32mabcdefghij\x1b[0m)', 'XX'),), maxcolwidths=11, tablefmt="grid"))
+-------------+----+
| 0123456 ( XX |
| 32mabcdefgh |    |
| ij)         |    |
+-------------+----+
>>> print(tabulate.tabulate(tabular_data=(('01234567 (\x1b[32mabcdefghij\x1b[0m)', 'XX'),), maxcolwidths=11, tablefmt="grid"))
+-------------+----+
| 01234567 ( | XX |
| [32mabcdefg |    |
| hij)        |    |
+-------------+----+
>>> print(tabulate.tabulate(tabular_data=(('012345678 (\x1b[32mabcdefghij\x1b[0m)', 'XX'),), maxcolwidths=11, tablefmt="grid"))
+-------------+----+
| 012345678 ( | XX |  # Correctly broken up, colors work on both lines
| abcdefghij) |    |
+-------------+----+

The repr of the output lines shows how the ANSI escape code gets broken. For the "0123456" case, the first two rows are '| 0123456 (\x1b[ | XX |' and '| 32mabcdefgh |    |': the line break falls between '\x1b[' and '32m'.
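
A quick way to flag such rows, as a minimal sketch: remove every complete escape code and check whether an ESC byte is left over. The names `CSI` and `has_broken_escape` are made up for this example and are not part of tabulate. The pattern is an assumption too: it is deliberately narrow and only accepts digits and semicolons as parameters and a letter as the final byte.

```python
import re

# Assumption: a simplified pattern for complete colour codes like "\x1b[32m".
# It is not tabulate's own pattern.
CSI = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")


def has_broken_escape(line: str) -> bool:
    """Return True if an ESC byte remains after removing complete codes."""
    return "\x1b" in CSI.sub("", line)


# The two rows from the "0123456" case above.
rows = ["| 0123456 (\x1b[ | XX |", "| 32mabcdefgh |    |"]
for row in rows:
    print(repr(row), has_broken_escape(row))
```

Only the row that holds the dangling '\x1b[' is flagged. The leftover '32m' on the next row is plain text as far as this check is concerned, so it goes unnoticed.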

Tested on Windows Terminal with both Powershell and WSL, on Python 3.11 and 3.12.

Thank you for this wonderful library!

devdanzin commented 7 months ago

The issue is that _CustomTextWrap._handle_long_word doesn't take ANSI escape codes into account when breaking up words.

There is a simple but incomplete fix: add len(_ansi_codes.search(chunk).group()) to i, so that cur_line.append(chunk[: i - 1]) copies the whole prefix, normal characters and escape codes included.

It is incomplete because the escape code may appear after the split point, in which case we'd include spurious normal characters in the line. I'm working on creating tests for these cases and a proper fix.
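
To illustrate the behaviour a complete fix needs, here is a rough sketch of a split that counts only visible characters and never cuts inside an escape code. It is not the code of the actual fix. `split_visible` and `CSI` are hypothetical names, and the pattern is a simplified assumption that covers colour codes only.

```python
import re
from typing import Tuple

# Assumption: simplified pattern for complete colour codes like "\x1b[32m".
CSI = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")


def split_visible(chunk: str, space_left: int) -> Tuple[str, str]:
    """Split chunk so the head has at most space_left visible characters.

    Escape codes are copied whole and do not count towards the width.
    """
    pos = 0      # index into chunk
    visible = 0  # visible characters placed in the head so far
    while pos < len(chunk):
        match = CSI.match(chunk, pos)
        if match:
            pos = match.end()  # zero-width: take the whole code
            continue
        if visible == space_left:
            break
        visible += 1
        pos += 1
    return chunk[:pos], chunk[pos:]


# The "012345" case: 4 columns are left on the first line.
head, tail = split_visible("(\x1b[32mabcdefghij\x1b[0m)", 4)
print(repr(head), repr(tail))
```

An escape code sitting exactly at the split point ends up in the head here. That is a choice made for the sketch, so a trailing reset code stays on the line it closes.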

devdanzin commented 7 months ago

You can create a file with:

from typing import List

import tabulate

strip_ansi = tabulate._strip_ansi  # type: ignore
ansi_codes = tabulate._ansi_codes  # type: ignore

def handle_long_word(
    self, reversed_chunks: List[str], cur_line: List[str], cur_len: int, width: int
):
    """
    Handle a chunk of text that is too long to fit in any line.
    Fixed version of tabulate._CustomTextWrap._handle_long_word that avoids a
    wrapping bug (https://github.com/astanin/python-tabulate/issues/307) where
    ANSI escape codes would be broken up in the middle.
    """
    # Figure out when indent is larger than the specified width, and make
    # sure at least one character is stripped off on every pass
    if width < 1:
        space_left = 1
    else:
        space_left = width - cur_len

    # If we're allowed to break long words, then do so: put as much
    # of the next chunk onto the current line as will fit.
    if self.break_long_words:
        # Tabulate Custom: Build the string up piece-by-piece in order to
        # take each character's width into account
        chunk = reversed_chunks[-1]
        i = 1
        # Only count printable characters, so strip_ansi first, index later.
        while len(strip_ansi(chunk)[:i]) <= space_left:
            i = i + 1
        # Consider escape codes when breaking words up
        total_escape_len = 0
        last_group = 0
        if ansi_codes.search(chunk) is not None:
            for group, _, _, _ in ansi_codes.findall(chunk):
                escape_len = len(group)
                if group in chunk[last_group : i + total_escape_len + escape_len - 1]:
                    total_escape_len += escape_len
                    found = ansi_codes.search(chunk[last_group:])
                    last_group += found.end()
        cur_line.append(chunk[: i + total_escape_len - 1])
        reversed_chunks[-1] = chunk[i + total_escape_len - 1 :]

    # Otherwise, we have to preserve the long word intact.  Only add
    # it to the current line if there's nothing already there --
    # that minimizes how much we violate the width constraint.
    elif not cur_line:
        cur_line.append(reversed_chunks.pop())

    # If we're not allowed to break long words, and there's already
    # text on the current line, do nothing.  Next time through the
    # main loop of _wrap_chunks(), we'll wind up here again, but
    # cur_len will be zero, so the next line will be entirely
    # devoted to the long word that we can't handle right now.

Then import handle_long_word from that file and monkeypatch tabulate after importing it, but before using it:

from some_file import handle_long_word
import tabulate

tabulate._CustomTextWrap._handle_long_word = handle_long_word 

# Use tabulate.tabulate() here and it should be fixed.

Hope this helps, please let me know if it doesn't work or you find any new issues.
