dflook / python-minifier

Transform Python source code into its most compact representation
MIT License

Add an option to disable conversion of unicode chars #100

Open fmmarzoa opened 3 months ago

fmmarzoa commented 3 months ago

Hello!

I have a dictionary that looks like this:

translit_map = { u"\u0027": "", u"\u00C0": "A", u"\u00C1": "A", u"\u00C2": "A", u"\u00C3": "A", u"\u00C4": ["A", "AE"], u"\u00C5": ["A", "AA"], u"\u00C6": "AE", u"\u00C7": "C", ...

When this is minified, the Unicode escapes are converted to the literal characters they represent, as in:

 Ac={"'":D,'À':H,'Á':H,'Â':H,'Ã':H,'Ä':[H,'AE'],'Å':[H,'AA'],'Æ':'AE','Ç':O

So it'd be nice to have an option to disable this behaviour so it keeps the keys like u"\u0027", because there are some cases where it could be needed. (In my case I have to upload this code to a server through a web form, and those non-ASCII chars get converted into '?'. I have reported it to the server admin too, but it would still be great if you could choose not to convert these into literal UTF-8 chars.)

Thanks! Fran

dflook commented 3 months ago

Hello @fmmarzoa.

You can get close to what you are looking for by using code like this:

from python_minifier import minify

with open('snippet.py', 'rb') as f:
    source = f.read()

minified = minify(source)

with open('minified.py', 'w', encoding='ascii', errors='backslashreplace') as f:
    f.write(minified)

which will output:

translit_map={"'":'','\xc0':'A','\xc1':'A','\xc2':'A','\xc3':'A','\xc4':['A','AE'],'\xc5':['A','AA'],'\xc6':'AE','\xc7':'C'}

But this will break any program that uses non-ascii unicode names, e.g.

def Á():pass
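
To make the difference concrete, here is a small sketch of what errors='backslashreplace' does to a string literal versus an identifier. The two sample sources and the helper name ascii_escape are made up for illustration, and python_minifier itself is not involved; the helper just applies the same error handler in memory.

```python
# Illustrative inputs only; they stand in for minified output.
literal_src = "translit_map={'À':'A'}"
identifier_src = "def Á():pass"

def ascii_escape(text):
    # Hypothetical helper: same error handler as the
    # open(..., errors='backslashreplace') call above, applied in memory.
    return text.encode('ascii', errors='backslashreplace').decode('ascii')

escaped_literal = ascii_escape(literal_src)
namespace = {}
# Still valid Python: the escape sits inside a string literal.
exec(escaped_literal, namespace)

escaped_identifier = ascii_escape(identifier_src)
try:
    compile(escaped_identifier, '<escaped>', 'exec')
    identifier_ok = True
except SyntaxError:
    # A backslash escape is not allowed outside a string literal.
    identifier_ok = False

print(escaped_literal)      # translit_map={'\xc0':'A'}
print(escaped_identifier)   # def \xc1():pass
print(identifier_ok)        # False
```

The dictionary key still evaluates to 'À' after escaping, while the escaped function definition no longer compiles.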

fmmarzoa commented 1 month ago

Hi dflook,

Thanks for that workaround suggestion, I missed the notification.

What I did myself was run a script after minifying that turns those literal characters back into escape sequences, using this:

import re

def unicode_to_escape(input_str):
    """
    Convert non-ASCII Unicode characters in the input string to their escape sequences.

    Args:
        input_str (str): The input string containing Unicode characters.

    Returns:
        str: The modified string with non-ASCII Unicode characters converted to escape sequences.
    """
    def replace_unicode(match):
        # 'unicode-escape' yields \xNN, \uNNNN or \UNNNNNNNN depending on the code point
        return match.group(0).encode('unicode-escape').decode('ascii')

    # Match every character outside the ASCII range, including those above U+FFFF
    return re.sub(r"[^\x00-\x7f]", replace_unicode, input_str)
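
Since a text-level replacement like this touches the whole source and not only ordinary string literals, it may be worth checking that the escaped file still means the same thing. Below is a minimal sketch of such a check, assuming it is enough to compare the parsed ASTs of the two versions. The names same_program and ascii_escape are hypothetical, and ascii_escape is only a stand-in for whichever escaping step is actually used.

```python
import ast

def same_program(original, processed):
    """Return True if both sources parse to the same AST."""
    try:
        return ast.dump(ast.parse(original)) == ast.dump(ast.parse(processed))
    except SyntaxError:
        # Either version failed to parse, e.g. an escaped identifier.
        return False

def ascii_escape(text):
    # Stand-in for the post-processing step (assumption, not the author's code).
    return text.encode('ascii', errors='backslashreplace').decode('ascii')

plain = "k='À'"    # ordinary string literal
raw = "k=r'À'"     # raw string literal: escapes are not interpreted

print(same_program(plain, ascii_escape(plain)))  # True
print(same_program(raw, ascii_escape(raw)))      # False
```

The raw-string case shows a second way escaping can go wrong besides identifiers: inside r'...' the sequence \xc0 stays as four literal characters, so the value changes even though the file still parses.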