Bug: Many serious TOML parsing/loading bugs caused by underlying toml library, consider switching?

pirate commented 1 week ago

Thanks so much for building python-benedict, it's really awesome!

So unfortunatley the underlying TOML library https://github.com/uiri/toml used by benedict has a bunch of long-time outstanding bugs that break parsing/loading and generally make it unsafe to load->dump->load the same string.

For example, you cannot dump any dict containing escape sequences without it throwing an exception:

>>> benedict({
    "reset": "\033[00;00m",
    "lightblue": "\033[01;30m",
}).to_toml()

...
File ~/test/.venv/lib/python3.11/site-packages/toml/encoder.py:113, in _dump_str(v)
    111     else:
    112         joiner = "u00"
--> 113     v = [v[0] + joiner + v[1]] + v[2:]
    114 return unicode('"' + v[0] + '"')

IndexError: list index out of range

But thats not it, there are many other fairly major string escaping, quoting, and parse/dump cycle inconsistency bugs that have bitten other projects using uiri/toml:

Almost all of these are 2yr+ old, indicating they're probably not going to all get fixed anytime soon without significant increase in velocity. No harm no foul, it's open source we can't demand they go faster and they don't owe us anything, but maybe benedict could consider a different library with fewer major outstanding parser consistency issues?

Is benedict open to switching to a library without these issues? Maybe one of these:

I'll chip in $20 towards the work to make the switch if it's an option ⬇️

Upvote & Fund

We're using Polar.sh so you can upvote and help fund this issue.
We receive the funding once the issue is completed & confirmed by you.
Thank you in advance for helping prioritize & fund our backlog.

pirate commented 1 week ago

After looking more into the other options unfortunately they are worse in other ways, for example they can barely handle a fraction of the types that uiri/toml supports :/

I guess we are stuck writing a custom TomlEncoder class to work around the bugs for now.

import re
import toml

def better_toml_dump_str(val):
    try:
        return toml.encoder._dump_str(val)
    except Exception:
        # if we hit any of toml's numerous encoding bugs,
        # fall back to using json representation of string
        return json.dumps(str(val))

class CustomTOMLEncoder(toml.encoder.TomlEncoder):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

        # override the dumper funcs for any types you need to support here:
        self.dump_funcs[str]            = better_toml_dump_str
        self.dump_funcs[re.RegexFlag]   = better_toml_dump_str
        ...

# this works:
>>> test_dict = {
    "string_with_escape_sequences": "\033[00;00m",
    "example_weird_type": re.IGNORECASE | re.UNICODE | re.MULTILINE,
    # etc. try more weird types here
}
>>> print(benedict(test_dict).to_toml(encoder=CustomTOMLEncoder()))

string_with_escape_sequences = "\u001b[00;00m"
example_weird_type = "re.IGNORECASE|re.UNICODE|re.MULTILINE"

You're welcome to close this issue 🤷, up to you.

fabiocaccamo commented 1 week ago

@pirate thank you for reporting this problem.

Actually, for decoding toml on Python >= 3.11, the tomlib standard lib is used: https://github.com/fabiocaccamo/python-benedict/blob/main/benedict/serializers/toml.py#L10

I think it would be cool to improve TOML support, would you like to submit a PR with a list of failing test cases (that should succeed with the perfect library)? So it would be much easier to test other TOML libraries or doing fixes in a different way.

fabiocaccamo / python-benedict

Bug: Many serious TOML parsing/loading bugs caused by underlying toml library, consider switching? #439

Upvote & Fund