Keep magic comments in the minify output for shebang and encoding

dflook / python-minifier

Transform Python source code into its most compact representation

MIT License

558 stars 41 forks source link

Keep magic comments in the minify output for shebang and encoding #8

Open thinkdoggie opened 4 years ago

thinkdoggie commented 4 years ago

Shebang and encoding declare are two types of magic comments in the beginning lines of Python source code. In general, the minifier had better keep these lines as the origin source.

For example,

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
...

dflook commented 4 years ago

The output from python-minifier is always UTF-8, so the coding comment is not needed. An option to preserve the shebang would be sensible.

thinkdoggie commented 4 years ago

Hi @dflook, thanks for your reviewing. Even the encoding for the output has been UTF-8, Python 2 interpreter still need this magic comment to load code correctly if non-latin characters used in the content. Because,

Python 2 uses ASCII as its default encoding for its strings which cannot store Chinese characters. On the other hand, Python 3 uses Unicode encoding for its strings by default which can store Chinese characters.

Here're some reference links about this topic:

thinkdoggie commented 4 years ago

Hi @dflook, do you think shebang or encoding hint should be preserved by default?

If not, the minified program will not be fully equivalent to the original source. For example,

Without shebang, the script could not be executable from shell;
Without encoding comment, the source containing non-lantin characters cannot be loaded by Python 2.

On the other hand, I also suggest to provide the following options in the command-line to suppress this behavior: --remove-shebang and --remove-encoding-comment. Thanks!

dflook commented 4 years ago

Hi, thanks for continuing to work on this!

The encoding comment is an instruction for the parser. Since we work on the AST produced by the parser, the coding comment in the source is no longer relevant. In Python 3 we can safely output UTF-8 and rely on the parser to produce the same string.

The sequence of bytes might be different between input and output if the encoding is different, but the parser should produce the same string. In this case copying the coding comment from the input to output is wrong, as it will probably not match the output encoding. If using the pyminify command the output is UTF-8. When using the minify() function you get a python str that you can encode how you like, but you will need to add a coding comment yourself to ensure it parses correctly.

I'm not too sure of how this works in Python 2 to be honest, but I'm reluctant to make any changes without a test case. Do you have an example where the current behaviour is wrong?

dflook commented 4 years ago

For the shebang I can't quite decide if it should be removed or preserved by default, but --remove-shebang is along the right lines.

thinkdoggie commented 4 years ago

Hi @dflook , I totally agree that test cases are always necessary for any code changes. Nevertheless, it looks no existing test scaffold for the minify() function. If I understand correctly, I am willing to add more test code for this new enhancement.

The magic comment for encoding is just a hit for Python 2 interpreter rather than the file encoding itself. There is a very old spec PEP263 to address this topic. I will show more examples and negative cases to explain how it works and why it is necessary for Python 2.

blikjeham commented 2 years ago

How is it going with this Pull Request? I ran into the issue that my shebang gets removed from the output file (https://github.com/dflook/python-minifier/issues/34). This pull request has been stale for a year and a half. What is the status?