cmhughes / latexindent.pl

Perl script to add indentation (leading horizontal space) to LaTeX files. It can modify line breaks before, during and after code blocks; it can perform text wrapping and paragraph line break removal. It can also perform string-based and regex-based substitutions/replacements. The script is customisable through its YAML interface.
GNU General Public License v3.0
864 stars 84 forks source link

Decode a command line argument according to the encoding contained in the command line argument #510

Closed fengzyf closed 5 months ago

fengzyf commented 7 months ago

what is this pull request about?

This pull request decodes a command line argument according to the encoding contained in the command line argument

does this relate to an existing issue?

Here is the link: https://github.com/cmhughes/latexindent.pl/issues/505

does this change any existing behaviour?

yes

what does this add?

specific details of what this pull request adds:

how do I test this?

In the Simplified Chinese version of Windows, run one of the following commands from the command line: latexindent 新建.tex -w --encoding GB2312 -g issue-505.log or latexindent 新建.tex -w --encoding cp936 -g issue-505.log

anything else?

Two links about the code page: https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/chcp

cmhughes commented 7 months ago

Thanks for this, I think it looks really good.

I'm testing this using github actions; you'll see the

Can you take a look at these files and let me know what you think and what needs to be done to get this test to pass? Thanks so much!

fengzyf commented 7 months ago

The default language of my system is Simplified Chinese, and the Active Code Page is 936. That is, the key value of "ACP" in the registry "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage" is 936.

According to ikegami in https://stackoverflow.com/a/63868721 answer:

Every Windows system call that deals with strings comes in two varieties: An "A"NSI version that uses the Active Code Page (aka ANSI Code Page), and a "W"ide version that uses UTF-16le.[1] Perl uses the A version of all system calls. That includes the call to get the command line.

The ACP is hard-coded. chcp changes the console's CP, but not the encoding used by the A system calls. 

The choice of console's CP (as set by chcp) has no effect on how Perl receives the command line. Because Perl uses the A version of the system calls, the command line will be encoded using the ACP regardless of the console's CP and the OEM CP.

I think the code page in windows provided by github action is not cp936, we set it to EncodeA and may not support Chinese,(see https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers), so when the command line uses the following code:

./latexindent.exe -w --encoding "GB2312" test-cases/back-up-tests/新建.tex -y 'onlyOneBackUp:1' 

"新建" is encoded with EncodeA and decoded with cp936 to become "??".

The commit https://github.com/cmhughes/latexindent.pl/commit/30d0ebeb2345a2219d2d04cfa1b5b8e8d7e44a61 use

my $encoding_sys = 'cp' . Win32::GetACP(); 

to retrieve the Active Code Page of Windows. This eliminates the need for users to specify an encoding, but it also means that only characters supported by the Active Code Page are allowed.

The commit https://github.com/cmhughes/latexindent.pl/commit/00250a164804137861a0f3c5dbce80ba2ef0846e utilizes a method provided by ikegami in https://stackoverflow.com/a/44489228. It uses the Win32::API module to invoke Windows API functions in order to retrieve command-line arguments. Additionally, the FileOperation.pm file utilizes the Win32::Unicode::File module to handle Unicode character file names. Therefore, with this commit, https://github.com/cmhughes/latexindent.pl/commit/00250a164804137861a0f3c5dbce80ba2ef0846e, UTF-8 characters can be used in the Windows command prompt.

cmhughes commented 7 months ago

When I try and run this on my Ubuntu machine I'm getting all kinds of errors

Can't locate Win32/Unicode/File.pm in @INC (you may need to install the Win32::Unicode::File module) (@INC contains: /home/cmhughes/projects/latexindent /home/cmhughes/perl5/perlbrew/perls/perl-5.36.0/lib/site_perl/5.36.0/x86_64-linux /home/cmhughes/perl5/perlbrew/perls/perl-5.36.0/lib/site_perl/5.36.0 /home/cmhughes/perl5/perlbrew/perls/perl-5.36.0/lib/5.36.0/x86_64-linux /home/cmhughes/perl5/perlbrew/perls/perl-5.36.0/lib/5.36.0) at /home/cmhughes/projects/latexindent/LatexIndent/FileOperation.pm line 13.
BEGIN failed--compilation aborted at /home/cmhughes/projects/latexindent/LatexIndent/FileOperation.pm line 13.
Compilation failed in require at /home/cmhughes/projects/latexindent/LatexIndent/LogFile.pm line 33.
BEGIN failed--compilation aborted at /home/cmhughes/projects/latexindent/LatexIndent/LogFile.pm line 33.
Compilation failed in require at /home/cmhughes/projects/latexindent/LatexIndent/Document.pm line 28.
BEGIN failed--compilation aborted at /home/cmhughes/projects/latexindent/LatexIndent/Document.pm line 28.
Compilation failed in require at /home/cmhughes/projects/latexindent/latexindent.pl line 28.
BEGIN failed--compilation aborted at /home/cmhughes/projects/latexindent/latexindent.pl line 28.

I don't understand why the script latexindent.pl needs to change.

cmhughes commented 7 months ago

I think this needs a switch (as you had previously) and then the new modules are only conditionally loaded (like the GCString).

cmhughes commented 6 months ago

Any update on this?

fengzyf commented 6 months ago

Based on the commit https://github.com/cmhughes/latexindent.pl/pull/510/commits/abca906bc1171c67785eb3b327c78f0681089c9a: For the Perl script latexindent.pl, command line parameters support UTF-8 characters.

perl latexindent.pl test-cases/back-up-tests/新建äö.tex -o=+新建 -y 'onlyOneBackUp:1'

After packaging with PAR:: Packer, the generated latexindent.exe only supports characters within the range contained in the system's ACP. For example, when ACP is 1252,

latexindent.exe test-cases/back-up-tests/äöao.tex -g="äöao.log"

ANSI code page 1252: ANSI Latin 1; Western European. ANSI code page 936: ANSI/OEM Simplified Chinese (PRC, Singapore); Chinese Simplified (GB2312) .

cmhughes commented 6 months ago

Thanks for your continued work on this.

Looks like github actions is still failing with this, see https://github.com/cmhughes/latexindent.pl/actions/runs/7716600261

fengzyf commented 6 months ago

Please check the commit https://github.com/cmhughes/latexindent.pl/pull/510/commits/fdd429c4d7ad114b4c336edc465745ad4bb62dea. Script programs packaged by PAR-Packer-1.061 in Windows rely on the current system's code page when passing command line parameters. Before a new release is released, the problem can be solved by rebuilding and installing the PAR Packer "ci" branch, the script program packaged by PAR Packer no longer depends on the current system's code page, command line parameters support UTF-8 characters.

cmhughes commented 6 months ago

I'm still getting errors on this https://github.com/cmhughes/latexindent.pl/actions/runs/7776000560/job/21202618547

fengzyf commented 6 months ago

Sorry, I forgot to upload PAR-Packer-ci.zip. Now you can retest.

cmhughes commented 6 months ago

Why is this closed?

fengzyf commented 6 months ago

Why is this closed?

I think the latest commit has solved this problem.

I reopened the pull request if that's what you want.

cmhughes commented 6 months ago

Thanks this, is looking good. It's looking like github actions is passing now, which is great:

https://github.com/cmhughes/latexindent.pl/actions/runs/7943469295/job/21687880748

When I run the test cases, I get output as in the following:

image

This shouldn't be printed to the terminal, can you fix it, please? Additionally, can you ensure that subroutines are named with lower case names?

Thanks so much!

cmhughes commented 6 months ago

Thanks this is looking good.

Why does PAR-Packer-ci/PAR-Packer-ci.zip need to be part of the repository? If it's only needed for latexindent.exe which is produced by github actions then can it be retrieved using wget at the appropriate point?

Additionally, I see that you've added two new .pm files. Can they combined into one file?

fengzyf commented 6 months ago

You're right, you can make changes as you think it's appropriate.

cmhughes commented 6 months ago

Yes, please remove par packer from the repo.

On Sat, 24 Feb 2024, 12:53 fengzyf, @.***> wrote:

You're right, you can make changes as you think it's appropriate.

— Reply to this email directly, view it on GitHub https://github.com/cmhughes/latexindent.pl/pull/510#issuecomment-1962356725, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAQ7CYDXNKAWYQPYAFDM34TYVHPD3AVCNFSM6AAAAABCCEPWLWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNRSGM2TMNZSGU . You are receiving this because you commented.Message ID: @.***>

cmhughes commented 5 months ago

Thanks for this.

I've merged this manually as of https://github.com/cmhughes/latexindent.pl/commit/5abb1a8586881e8cebbba6cb63a5deeba2f76e83

You'll see that I have

I hope this sounds reasonable, and makes sense. Let me know if you have any feedback on this.

Thanks for your work on this!

fengzyf commented 5 months ago

I think the last two changes you made are very reasonable.

The command line arguments received by latexindent.exe now support characters in the UTF-8 range, not just CJK characters. Therefore, I think the change of renamed CmdLineArgsFileOperation.pm to be CJKCmdLineArgsFileOperation.pm is unreasonable. Perhaps you can use the name UTF8CmdLineArgsFileOperation. pm

cmhughes commented 5 months ago

Many thanks, that's very helpful.

I've renamed CJKCmdLineArgsFileOperation.pm to UTF8CmdLineArgsFileOperation.pm as you suggest.

Thanks again!