astanin / python-tabulate

Pretty-print tabular data in Python, a library and a command-line utility. Repository migrated from bitbucket.org/astanin/python-tabulate.
https://pypi.org/project/tabulate/
MIT License
2.08k stars 162 forks source link

Bug fix for issue #318 | `intfmt`: error for strings which contain integer values #319

Open airvzxf opened 4 months ago

airvzxf commented 4 months ago

Bug fix for issue #318 | intfmt: error for strings which contain integer values

Description.

Resolved the problem with the intfmt argument.

Added tests for these changes. Test results: 308 passed, 2 skipped in 1.10s.

devdanzin commented 4 months ago

Thank you for identifying the issue and proposing a bug fix!

I was wondering that maybe we should rather convert val to int instead, similarly to what happens when valtype is float. That way, an int in a string would still get formatted, like what happens for a float. What do you think?

Also, I see you added a nice test for colored ints. If we choose to format ints passed as string, I think it would be nice to format them when the string has color too, same as happens for float. I'd be willing to provide a new PR for that if we decide it's an improvement worth having.

airvzxf commented 4 months ago

I was wondering that maybe we should rather convert val to int instead, similarly to what happens when valtype is float. That way, an int in a string would still get formatted, like what happens for a float. What do you think?

Also, I see you added a nice test for colored ints. If we choose to format ints passed as string, I think it would be nice to format them when the string has color too, same as happens for float. I'd be willing to provide a new PR for that if we decide it's an improvement worth having.

In fact, it was one of my first solutions, but this breaks some unit tests. I made several changes before arriving at the proposed solution, but the most important thing is that after all these experiments, I discovered that tabulate may need deeper refactoring.

Unit test

Here are the results of the unit tests with this change:

elif valtype is int:
    # if isinstance(val, str):
    #     intfmt = ""
    return format(int(val), intfmt)

Unit tests results:

==================================================================== FAILURES ====================================================================
_______________________________________________________ test_intfmt_with_string_as_integer _______________________________________________________

    def test_intfmt_with_string_as_integer():                                                                                                     
        "Output: integer format"                                                                                                                  
        result = tabulate([[82642], ["1500"], [2463]], intfmt=",", tablefmt="plain")                                                              
        expected = "82,642\n  1500\n 2,463"                                                                                                       
>       assert_equal(expected, result)                                                                                                            

test/test_output.py:2646:                                                                                                                         
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

expected = '82,642\n  1500\n 2,463', result = '82,642\n 1,500\n 2,463'                                                          10:14:37 [140/612]

    def assert_equal(expected, result):                                                                                                           
        print("Expected:\n%s\n" % expected)                                                                                                       
        print("Got:\n%s\n" % result)                                                                                                              
>       assert expected == result                                                                                                                 
E       AssertionError                                                                                                                            

test/common.py:8: AssertionError                                                                                                                  
-------------------------------------------------------------- Captured stdout call --------------------------------------------------------------
Expected:                                                                                                                                         
82,642                                                                                                                                            
  1500                                                                                                                                            
 2,463                                                                                                                                            

Got:                                                                                                                                              
82,642                                                                                                                                            
 1,500                                                                                                                                            
 2,463                                                                                                                                            

____________________________________________________________ test_intfmt_with_colors _____________________________________________________________

    def test_intfmt_with_colors():                                                                                                                
        "Regression: Align ANSI-colored values as if they were colorless."                                                                        
        colortable = [                                                                                                                            
            ("abcd", 42, "\x1b[31m42\x1b[0m"),                                                                                                    
            ("elfy", 1010, "\x1b[32m1010\x1b[0m"),                                                                                                
        ]                                                                                                                                         
        colorheaders = ("test", "\x1b[34mtest\x1b[0m", "test")                                                                                    
>       formatted = tabulate(colortable, colorheaders, "grid", intfmt=",")                                                                        

test/test_output.py:2664:                                                                                                                         
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tabulate/__init__.py:2198: in tabulate                                                                                                            
    cols = [                                                                                                                                      
tabulate/__init__.py:2199: in <listcomp>                                                                                        10:14:37 [105/612]
    [_format(v, ct, fl_fmt, int_fmt, miss_v, has_invisible) for v in c]                                                                           
tabulate/__init__.py:2199: in <listcomp>                                                                                                          
    [_format(v, ct, fl_fmt, int_fmt, miss_v, has_invisible) for v in c]                                                                           
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

val = '\x1b[31m42\x1b[0m', valtype = <class 'int'>, floatfmt = 'g', intfmt = ',', missingval = '', has_invisible = True                           

    def _format(val, valtype, floatfmt, intfmt, missingval="", has_invisible=True):                                                               
        """Format a value according to its type.                                                                                                  

        Unicode is supported:                                                                                                                     

        >>> hrow = ['\u0431\u0443\u043a\u0432\u0430', '\u0446\u0438\u0444\u0440\u0430'] ; \                                                       
            tbl = [['\u0430\u0437', 2], ['\u0431\u0443\u043a\u0438', 4]] ; \                                                                      
            good_result = '\\u0431\\u0443\\u043a\\u0432\\u0430      \\u0446\\u0438\\u0444\\u0440\\u0430\\n-------  -------\\n\\u0430\\u0437       
      2\\n\\u0431\\u0443\\u043a\\u0438           4' ; \                                                                                           
            tabulate(tbl, headers=hrow) == good_result                                                                                            
        True                                                                                                                                      

        """  # noqa                                                                                                                               
        if val is None:                                                                                                                           
            return missingval                                                                                                                     

        if valtype is str:                                                                                                                        
            return f"{val}"                                                                                                                       
        elif valtype is int:                                                                                                                      
            # if isinstance(val, str):                                                                                                            
            #     intfmt = ""                                                                                                                     
>           return format(int(val), intfmt)                                                                                                       
E           ValueError: invalid literal for int() with base 10: '\x1b[31m42\x1b[0m'                                                               

tabulate/__init__.py:1234: ValueError                                                                                                             
________________________________________________________ test_alignment_of_colored_cells _________________________________________________________

    def test_alignment_of_colored_cells():                                                                                       10:14:37 [70/612]
        "Regression: Align ANSI-colored values as if they were colorless."                                                                        
        colortable = [                                                                                                                            
            ("test", 42, "\x1b[31m42\x1b[0m"),                                                                                                    
            ("test", 101, "\x1b[32m101\x1b[0m"),                                                                                                  
        ]                                                                                                                                         
        colorheaders = ("test", "\x1b[34mtest\x1b[0m", "test")                                                                                    
>       formatted = tabulate(colortable, colorheaders, "grid")                                                                                    

test/test_regression.py:30:                                                                                                                       
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tabulate/__init__.py:2198: in tabulate                                                                                                            
    cols = [                                                                                                                                      
tabulate/__init__.py:2199: in <listcomp>                                                                                                          
    [_format(v, ct, fl_fmt, int_fmt, miss_v, has_invisible) for v in c]                                                                           
tabulate/__init__.py:2199: in <listcomp>                                                                                                          
    [_format(v, ct, fl_fmt, int_fmt, miss_v, has_invisible) for v in c]                                                                           
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

val = '\x1b[31m42\x1b[0m', valtype = <class 'int'>, floatfmt = 'g', intfmt = '', missingval = '', has_invisible = True                            

    def _format(val, valtype, floatfmt, intfmt, missingval="", has_invisible=True):                                                               
        """Format a value according to its type.                                                                                                  

        Unicode is supported:                                                                                                                     

        >>> hrow = ['\u0431\u0443\u043a\u0432\u0430', '\u0446\u0438\u0444\u0440\u0430'] ; \                                                       
            tbl = [['\u0430\u0437', 2], ['\u0431\u0443\u043a\u0438', 4]] ; \                                                                      
            good_result = '\\u0431\\u0443\\u043a\\u0432\\u0430      \\u0446\\u0438\\u0444\\u0440\\u0430\\n-------  -------\\n\\u0430\\u0437       
      2\\n\\u0431\\u0443\\u043a\\u0438           4' ; \                                                                                           
            tabulate(tbl, headers=hrow) == good_result                                                                                            
        True                                                                                                                                      

        """  # noqa                                                                                                                               
        if val is None:                                                                                                                           
            return missingval                                                                                                                     

        if valtype is str:                                                                                                                        
            return f"{val}"                                                                                                                       
        elif valtype is int:                                                                                                                      
            # if isinstance(val, str):                                                                                                            
            #     intfmt = ""                                                                                                                     
>           return format(int(val), intfmt)                                                                                                       
E           ValueError: invalid literal for int() with base 10: '\x1b[31m42\x1b[0m'                                                               

tabulate/__init__.py:1234: ValueError                                                                                                             
________________________________________________________ test_ansi_color_bold_and_fgcolor ________________________________________________________

    def test_ansi_color_bold_and_fgcolor():                                                                                                       
        "Regression: set ANSI color and bold face together (issue #65)"                                                                           
        table = [["1", "2", "3"], ["4", "\x1b[1;31m5\x1b[1;m", "6"], ["7", "8", "9"]]                                                             
>       result = tabulate(table, tablefmt="grid")                                                                                                 

test/test_regression.py:374:                                                                                                                      
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tabulate/__init__.py:2198: in tabulate                                                                                                            
    cols = [                                                                                                                                      
tabulate/__init__.py:2199: in <listcomp>                                                                                                          
    [_format(v, ct, fl_fmt, int_fmt, miss_v, has_invisible) for v in c]                                                                           
tabulate/__init__.py:2199: in <listcomp>                                                                                                          
    [_format(v, ct, fl_fmt, int_fmt, miss_v, has_invisible) for v in c]                                                                           
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

val = '\x1b[1;31m5\x1b[1;m', valtype = <class 'int'>, floatfmt = 'g', intfmt = '', missingval = '', has_invisible = True                          

    def _format(val, valtype, floatfmt, intfmt, missingval="", has_invisible=True):                                                               
        """Format a value according to its type.                                                                                                  

        Unicode is supported:                                                                                                                     

        >>> hrow = ['\u0431\u0443\u043a\u0432\u0430', '\u0446\u0438\u0444\u0440\u0430'] ; \
            tbl = [['\u0430\u0437', 2], ['\u0431\u0443\u043a\u0438', 4]] ; \
            good_result = '\\u0431\\u0443\\u043a\\u0432\\u0430      \\u0446\\u0438\\u0444\\u0440\\u0430\\n-------  -------\\n\\u0430\\u0437             2\\n\\u0431\\u0443\\u043a\\u0438           4' ; \
            tabulate(tbl, headers=hrow) == good_result
        True

        """  # noqa
        if val is None:
            return missingval

        if valtype is str:
            return f"{val}"
        elif valtype is int:
            # if isinstance(val, str):
            #     intfmt = ""
>           return format(int(val), intfmt)
E           ValueError: invalid literal for int() with base 10: '\x1b[1;31m5\x1b[1;m'

tabulate/__init__.py:1234: ValueError
============================================================ short test summary info =============================================================
FAILED test/test_output.py::test_intfmt_with_string_as_integer - AssertionError
FAILED test/test_output.py::test_intfmt_with_colors - ValueError: invalid literal for int() with base 10: '\x1b[31m42\x1b[0m'
FAILED test/test_regression.py::test_alignment_of_colored_cells - ValueError: invalid literal for int() with base 10: '\x1b[31m42\x1b[0m'
FAILED test/test_regression.py::test_ansi_color_bold_and_fgcolor - ValueError: invalid literal for int() with base 10: '\x1b[1;31m5\x1b[1;m'
==================================================== 4 failed, 304 passed, 2 skipped in 2.06s ====================================================

Conclusions

Description

If we made this change, the strings which contain integers will be formatted. It is not bad, but I was thinking that if I want some number as a string is because maybe I would rather not format as an integer. It will be a hard decision for the owners and maintainers of this Python package.

In the test test_intfmt_with_string_as_integer, the data [[82642], ["1500"], [2463]] is transformed to 82,642 | 1,500 | 2,463.


We need to explore the way to solve the colored numbers. I'll research in the code if there is something that I can reuse or create a new functionality to detect the colored number and format the integer.


Taking advantage of this talk, I would like to recommend that each data value be treated independently, since apparently if it encounters at least one floating number, it takes all numbers as floats, even if they are not.

@pytest.mark.skip(reason="It detects all values as floats but there are strings and integers.")
def test_intfmt_with_string_with_floats():
    "Output: integer format"
    result = tabulate([[82000.38], ["1500.47"], ["2463"], [92165]], intfmt=",", tablefmt="plain")
    expected = "82000.4\n 1500.47\n 2,463\n92,165"
    assert_equal(expected, result)

For this, data [[82000.38], ["1500.47"], ["2463"], [92165]] all the numbers were assigned as a float.

val: 82000.38                                                                                                                                     
    val: <class 'float'>                                                                                                                          
valtype: <class 'float'>                                                                                                                          

val: 1500.47                                                                                                                                      
    val: <class 'str'>                                                                                                                            
valtype: <class 'float'>

val: 2463
    val: <class 'str'>
valtype: <class 'float'>

val: 92165
    val: <class 'int'>
valtype: <class 'float'>

Expected:
82000.4
 1500.47
 2,463
92,165

Got:
82000.4
 1500.47
 2463
92165

Furthermore, I noticed that the float 82000.38 is rounded to 82000.4, but not the float 1500.47. Is this an expected behavior?


test_intfmt_with_string_as_integer

tabulate([[82642], ["1500"], [2463]], intfmt=",", tablefmt="plain")

Expected: 82,642 1500 2,463

Got: 82,642 1,500 2,463


test_intfmt_with_colors

tabulate(colortable, colorheaders, "grid", intfmt=",")

ValueError: invalid literal for int() with base 10: \x1b[31m42\x1b[0m

--

test_alignment_of_colored_cells

colortable = [
    ("test", 42, "\x1b[31m42\x1b[0m"),
    ("test", 101, "\x1b[32m101\x1b[0m"),
]
colorheaders = ("test", "\x1b[34mtest\x1b[0m", "test")
formatted = tabulate(colortable, colorheaders, "grid")

ValueError: invalid literal for int() with base 10: \x1b[31m42\x1b[0m


test_ansi_color_bold_and_fgcolor

ValueError: invalid literal for int() with base 10: \x1b[1;31m5\x1b[1;m

Final note

I look forward to your reply or guide to start working on solutions. Otherwise, I could make code, but I really don't know if it aligns with where you want to go.

airvzxf commented 4 months ago

It would be the solution for the colored numbers, at least at Regular Expression level.

Regular expression: (\\x[0-9a-fA-F]+\[\d+m)([0-9\.]+)(\\.*)

For this example

\x1b[31m42\x1b[0m
\x1b[32m1010\x1b[0m

Match for these examples:

\x1b[31m   42   \x1b[0m
\x1b[32m   1010   \x1b[0m
airvzxf commented 4 months ago

In the above commit: https://github.com/astanin/python-tabulate/pull/319/commits/e45cac16676f0955b2fd05389d97b3e6f71fd66c; I added the validation and formatting when the number is colored.

airvzxf commented 4 months ago

Here other option to format the strings which contain integers. For example, in this data [[82642], ["1500"], [2463]] the result should be 82,642 | 1,500 | 2,463 rather than 82,642 | 1500 | 2,463.

The only change that needs to be performed is adding some elif:

            elif val.isdigit():
                val = format(int(val), intfmt)

In this code:

    if valtype is str:
        return f"{val}"
    elif valtype is int:
        if isinstance(val, str):
            val_striped = val.encode('unicode_escape').decode('utf-8')
            colored = re.search(r'(\\[xX]+[0-9a-fA-F]+\[\d+[mM]+)([0-9.]+)(\\.*)$', val_striped)
            if colored:
                total_groups = len(colored.groups())
                if total_groups == 3:
                    digits = colored.group(2)
                    if digits.isdigit():
                        val_new = colored.group(1) + format(int(digits), intfmt) + colored.group(3)
                        val = val_new.encode('utf-8').decode('unicode_escape')
            elif val.isdigit():
                val = format(int(val), intfmt)
            intfmt = ""
        return format(val, intfmt)

This is the result of the test test_intfmt_with_string_as_integer.

result = tabulate([[82642], ["1500"], [2463]], intfmt=",", tablefmt="plain")
expected = "82,642\n  1500\n 2,463"
Expected:
82,642
  1500
 2,463

Got:
82,642
 1,500
 2,463

Note:

If you think this change is important and relevant to your package, please let me know, so I can add this change to the pull requests.