utf-8 output PoC - GitHub issues

gh0stwizard commented 2 months ago

This is a proof of concept. In the file bzrunM.bat, note the two chcp commands below:

                Call :LuaEndedOkREMOVE
chcp 65001 1>NUL 2>NUL
                if exist LoadAndExecuteModScript_DEV.lua (
                    %_mLUA% LoadAndExecuteModScript_DEV.lua
                ) else (
                    %_mLUA% LoadAndExecuteModScript.lua
                )
                set exitCode=%ERRORLEVEL%
chcp %_CodePage% 1>NUL 2>NUL
                REM echo exitCode = %exitCode%
                Call :LuaEndedOk

Now, instead of the hardcoded 65001 above, use an option variable such as %_OutputCodePage%. Make it configurable in BUILDMOD{_AUTO}.bat like any other option, for instance -UseLuaScriptInPak ASK. The %_CodePage% value above can be set however you wish: hardcoded to cp850 or cp437, or read from the system. In the end, the user sees utf-8 output on screen and the rest of the code keeps working as expected.

HolterPhylo commented 2 months ago

When a Windows user has the "Beta: Use Unicode UTF-8..." check box checked, it causes problems.

So, I am not going to change the requirement that the Windows "Beta: Use Unicode UTF-8..." check box be unchecked.

Curious, what different output do you see/get when you run it with chcp 65001 as above then without? Do you have the "Beta: Use Unicode UTF-8..." check box checked?

gh0stwizard commented 2 months ago

1. I don't use "Beta: Use Unicode UTF-8...".
2. I need to see the Unicode characters that lua prints in your project when AMUMSS makes EXML modifications.
3. chcp 65001 does exactly the same thing as checking the "Beta: Use Unicode UTF-8..." check box.
4. Because utf-8 has never been popular on Windows, this check box was added for experienced users only. A user will check it if and only if they have read about it somewhere or learned about it from a friend. By default this check box is unchecked.
5. Continuing point 4: very few batch scripts on Windows were written with utf-8 support. Your project is one of those exceptions. In 99% of other cases, the local charset is enough.

In the end, I have been using AMUMSS since version 4.2.1.4. I added the command chcp 65001 to the top of BUILDMOD_AUTO.bat. To this very moment I have had ZERO issues with utf-8 on my Windows 10. And I see no real technical reason why utf-8 can't be used to print Unicode characters on the screen in AMUMSS.

I am not forcing you to enable utf-8 output by default. I am asking you to stop blocking it, as was done in AMUMSS v4.5.6.0W, because anyone who wants to see Unicode characters must now edit bzrunM.bat and comment out the block if [%_CodePage%]==[65001] (... pause exit).

Again, the logic is simple. Concept code:

rem AT THE TOP OF THIS FILE, FORCE A SAFE CHARSET
chcp 850

rem HERE GOES OUR CODE, WHICH IS AFFECTED BY THE CHARSET
rem ...

rem HERE WE RUN AN EXTERNAL PROGRAM (LUA) WHICH MAY PRINT UTF-8 CHARACTERS
rem ENABLE UNICODE IN THE CONSOLE JUST FOR THIS MOMENT
chcp 65001
lua.exe somescript.lua

rem SET BACK THE SAFE ENCODING
chcp 850

rem THE REST OF THE PROGRAM CONTINUES WORKING WITH THE SAFE CHARSET
rem ...

rem END OF FILE
exit

HolterPhylo commented 2 months ago

Still want to see:

Curious, what different output do you see/get when you run it with chcp 65001 as above then without?

When I add chcp 65001 like you suggest, I do not see any difference. So, for me, it only adds a problem for the users who cannot use AMUMSS when 65001 is active, with no obvious benefit otherwise...

gh0stwizard commented 2 months ago

Okay. One day you will learn a simple thing: it's better to provide opportunities than to block existing possibilities. Close the ticket and forget about it.

HolterPhylo commented 2 months ago

Why not answer my question? "Curious, what different output do you see/get when you run it with chcp 65001 as above then without?"

gh0stwizard commented 2 months ago

Holy molly. Guys, learn a bit about utf-8, please.

Find any Unicode characters. Here is a regexp from a quick Google search: [^\x00-\x7F]+. Use it in a VS Code search over the game's EXML files. Here is an example match from LANGUAGE\NMS_LOC1_ENGLISH.EXML:

<Property name="Id" value="BUI_ATLAS" />
<Property name="English" value="At1αs" />
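
For anyone without VS Code at hand, the same search can be sketched in a few lines of Python. This is only an illustration, not part of AMUMSS: the helper name find_non_ascii is made up here, and the two sample lines are copied from the excerpt above and assumed to be already decoded as utf-8.

```python
import re

# Same pattern as in the comment: one or more characters outside 7-bit ASCII.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")

# Sample lines copied from the EXML excerpt (assumed decoded as utf-8).
sample = [
    '<Property name="Id" value="BUI_ATLAS" />',
    '<Property name="English" value="At1αs" />',
]


def find_non_ascii(lines):
    """Return (line_number, matched_text) pairs for every non-ASCII run."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for match in NON_ASCII.finditer(line):
            hits.append((number, match.group()))
    return hits


print(find_non_ascii(sample))  # [(2, 'α')]
```

Only the second line matches, because "α" is the only character outside the ASCII range.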

Try to print the Unicode string above, "At1αs". Compare the results under chcp 850 and under chcp 65001.

Plus, you have to configure your terminal/console (cmd.exe) to use a Unicode-friendly font, for instance Consolas. Otherwise you will not see a difference.

Curious, what different output do you see/get when you run it with chcp 65001 as above then without?

I see non-Unicode characters. Awkward ASCII characters. This is normal behavior when the output charset does not match the input one, e.g. your lua program prints out utf-8 characters, but cmd.exe expects as input local charset (850 or any legacy single-byte charset, 1250, etc).

HolterPhylo commented 2 months ago

You do not need to bring molly into this, we are talking...

My cmd.exe IS configured to use Consolas AND I still do not see a difference whether I insert chcp 65001 or not. I do not care about VS Code, I use Notepad++.

In Notepad++, with encoding ANSI, it shows each byte as a separate character:

<Property name="Id" value="BUI_ATLAS" />
<Property name="English" value="At1Î±s" />

With encoding utf-8, it looks like what I use in Notepad++ and your example:

<Property name="English" value="At1αs" />

BUT, THIS is coming from the output of MBINCompiler.exe itself, not something AMUMSS does to the EXML or the MBIN file. Look at those files in a Hex Editor of your choice... The bytes are "41 74 31 CE B1 73" (which shows as "A t 1 Î ± s"). This byte sequence is in both the EXML AND the MBIN files. It is as HG created it! Maybe you could take it up with them?

Anyway, the real point is this: If 65001 were allowed in a normal AMUMSS installation, then some users (that I had to help figure the problem out) would not be able to use AMUMSS. So AMUMSS needs to flag it and request a correction (as it does right now) to allow everyone to use it.
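
The two Notepad++ views can be reproduced from those six bytes alone. A minimal Python sketch, assuming that the "ANSI" view on this machine is Windows-1252 (the thread does not say which ANSI code page is in use; the variable names are illustrative):

```python
# Byte sequence quoted from the hex editor.
raw = bytes.fromhex("41 74 31 CE B1 73")

# Read as utf-8, CE B1 is a single two-byte character.
as_utf8 = raw.decode("utf-8")

# Read as Windows-1252 (assumed "ANSI" code page), every byte is its own character.
as_ansi = raw.decode("cp1252")

print(as_utf8, len(as_utf8))  # At1αs 5
print(as_ansi, len(as_ansi))  # At1Î±s 6
```

The bytes are identical in both cases; only the charset used to interpret them changes what appears on screen.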

If you really want to use 65001, just go ahead and add "chcp 65001" where you want it. The source code is there for you.

And like you said: cmd.exe expects as input local charset (850 or any legacy single-byte charset, 1250, etc).

Hope this helps a bit.

gh0stwizard commented 2 months ago

cmd.exe is too smart for testing Unicode by typing directly into its console. Here is a working example. Create a file utf8test.bat; it must be encoded as UTF-8 (not ANSI):

echo off
echo ---------------------
chcp 850
echo CAFÉ
echo ---------------------
chcp 65001
echo CAFÉ
echo ---------------------

Then run it in cmd:

C:\Users\SECRET>utf8test.bat

C:\Users\SECRET>echo off
---------------------
Active code page: 850
CAF├ë
---------------------
Active code page: 65001
CAFÉ
---------------------

C:\Users\SECRET>
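
The same result can be checked outside cmd.exe. The Python sketch below is an approximation of what the console does, not its actual implementation: it takes the utf-8 bytes of the echoed text and decodes them with each code page. The helper name shown_under is made up for this example.

```python
# Bytes that a utf-8 encoded utf8test.bat holds for the text after "echo ".
payload = "CAFÉ".encode("utf-8")  # b'CAF\xc3\x89'


def shown_under(code_page: str, data: bytes) -> str:
    """Decode data the way a console set to code_page would interpret it."""
    return data.decode(code_page)


print(shown_under("cp850", payload))  # CAF├ë
print(shown_under("utf-8", payload))  # CAFÉ
```

Under code page 850 the two bytes C3 89 are read as two separate characters, which matches the "CAF├ë" line in the console output above.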

I will not contact HG, because the question is trivial. As I said, look at the very first line of any EXML:

<?xml version="1.0" encoding="utf-8"?>

And this is not a joke. XML is a very strict format. When the declaration says encoding="utf-8", an XML parser will and must use utf-8 to read the file, and values must be saved into the file as utf-8.

I hope you have found this out already. Otherwise, I can only describe this issue as:
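
A small Python sketch of that rule, using the standard xml.etree.ElementTree parser. The Data root element and the one-property document are a made-up stand-in for a real EXML file; only the declaration line and the Property line come from this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini document; stored as utf-8 bytes, as the declaration says.
document = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<Data><Property name="English" value="At1αs" /></Data>'
).encode("utf-8")

# The parser reads the declaration and decodes the bytes as utf-8.
root = ET.fromstring(document)
value = root.find("Property").get("value")

print(value)  # At1αs
```

The parser hands back "At1αs" with no extra configuration, because it follows the encoding named in the declaration.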

A NEW GALAXY DISCOVERED
       UTF-8

:) Happy coding!

HolterPhylo commented 2 months ago

You too!

HolterPhylo / AMUMSS

utf-8 output PoC #37