KhronosGroup / KTX-Software

KTX (Khronos Texture) Library and Tools

Add UTF-8 filename support on Windows #788

Closed aqnuep closed 10 months ago

aqnuep commented 11 months ago

Fixes https://github.com/KhronosGroup/KTX-Software/issues/781.

aqnuep commented 11 months ago

Related test changes in PR https://github.com/KhronosGroup/KTX-Software-CTS/pull/8.

MarkCallow commented 10 months ago

Does the ktx suite never call, e.g., ktxTexture_CreateFromNamedFile? If it does, how is that working?

aqnuep commented 10 months ago

Note: as the CTS for PR #785 is already merged, but the code in PR #785 is not, this will require one more rebase.

(we really are complicating our work with the current process of not using simple fast-forwarding with the CTS, but well...)

MarkCallow commented 10 months ago

We still have to resolve how to fix the library functions ktxPrint*ForNamedFile, ktxTexture*_CreateFromNamedFile and ktxTexture*WriteToNamedFile. Choices I see are:

  1. Require UTF-8 input, which means everyone writing a Windows app has to do the same things being done in this PR.
  2. Don't support non-ANSI filenames via these APIs. Tell people to use the StdioStream or Stream variants if they want to use such files.
  3. Provide 'W' versions of the functions which accept wchar_t strings, together with macros that select the narrow or wide version depending on whether UNICODE is defined.

What do you think, @aqnuep?

aqnuep commented 10 months ago

I don't know. I'd certainly prefer that to be handled separately as I wouldn't like to increase the scope of this change.

Using _UNICODE is not an option, as that would make all strings UTF-16 (i.e. wide strings) by default, which has all sorts of implications. But that does not prevent code from using wide strings explicitly (e.g. std::wstring), so transcoding any UTF-8 input to UTF-16 for WinAPI call purposes could be an option.

I think making the filename-based APIs part of the KTX API was somewhat of a bad design decision, but if anything I'd suggest supporting only UTF-8, or adding "W"-suffixed versions for wide-string input, like the WinAPIs do.

MarkCallow commented 10 months ago

I don't know. I'd certainly prefer that to be handled separately as I wouldn't like to increase the scope of this change.

Yes. Let's do a separate commit.

Using _UNICODE is not an option, as that would make all strings UTF-16 (i.e. wide strings) by default, which has all sorts of implications.

Doesn't this happen only if the application is using the generic text macros (_TCHAR, etc.)?

But that does not prevent code from using wide strings explicitly (e.g. std::wstring), so transcoding any UTF-8 input to UTF-16 for WinAPI call purposes could be an option.

I think making the filename-based APIs part of the KTX API was somewhat of a bad design decision, but if anything I'd suggest supporting only UTF-8, or adding "W"-suffixed versions for wide-string input, like the WinAPIs do.

Too late to change it now.

I am leaning towards making the current functions accept UTF-8, which they will transcode to UTF-16 on Windows, and adding "W"-suffixed versions on Windows only. The functions are very small so this is not a problem. We should probably add "_t"-prefixed macros as well for Windows users.

aqnuep commented 10 months ago

Doesn't this happen only if the application is using the generic text macros (_TCHAR, etc.)?

For the most part, yes, but there could be other implications. If you'd prefer to use UTF-16 on Windows and try to get things built with UNICODE, that's an option too, but this PR (which hopefully will pass CI now) is trying to use UTF-8 instead.

I am leaning towards making the current functions accept UTF-8, which they will transcode to UTF-16 on Windows, and adding "W"-suffixed versions on Windows only. The functions are very small so this is not a problem. We should probably add "_t"-prefixed macros as well for Windows users.

Adding support for UTF-8 file names passed through the const char* parameters of the current *ForNamedFile interfaces could be done in the same fashion as it's done in this PR for the CLI interfaces.

MarkCallow commented 10 months ago

I am looking at this code again with a view to using it in the legacy tools. It uses _TCHAR, _tmain and _UNICODE. Is there any point to this? I don't think we'll ever want to compile it with _UNICODE defined.

inline std::string DecodeUTF8Path(std::string path)

(line 41 in platform_utils.h) would break things. It needs to use _tstring because in the _UNICODE case it would be receiving and needing to return wstring.

Note that when compiling for the Windows Store, _UNICODE is automatically defined. Not that we have any need to do that, but people connected with vcpkg have tried to do it, and filed a bug, without realizing that command line tools are pointless for the Windows Store.

MarkCallow commented 10 months ago

Another question. In

inline std::wstring DecodeUTF8Path(std::string path) {
    std::wstring result;
...
    return result;
}

Shouldn't we use return std::move(result); to avoid copying the string's data on the return? Or is the data on the stack so it will be lost on return even with std::move?

aqnuep commented 10 months ago

Shouldn't we use return std::move(result); to avoid copying the string's data on the return? Or is the data on the stack so it will be lost on return even with std::move?

This is a common question amongst people new to C++ move semantics. For such return statements std::move is unnecessary, and can even pessimize by suppressing NRVO (see e.g.: https://stackoverflow.com/questions/14856344/when-should-stdmove-be-used-on-a-function-return-value).

MarkCallow commented 10 months ago

This is a common question amongst people new to C++ move semantics

Thanks for the education.

Please respond to my question regarding _UNICODE and generic text functions.

I see you removed the call to SetConsoleOutputCP(CP_UTF8) in the final PR. Why did you do that? Without it the filename displayed in error messages is gibberish, at least in the case I'm testing right now, which is not using {fmt}.

aqnuep commented 10 months ago

Thanks for the education.

I apologize if my response sounded scoffing, that was certainly not my intent. When I first started using C++ move semantics I had that very same question.

Please respond to my question regarding _UNICODE and generic text functions.

Sorry, I missed that. So in its current form, with the UTF-8-based approach, the use of _TCHAR and friends is unnecessary, as the compile-time-selected character type is indeed only needed if we want to build with _UNICODE.

I did not want to remove that code, though, in case someone wants to build e.g. the library only with _UNICODE at some point (if that is ever going to work).

Note that when compiling for the Windows Store, _UNICODE is automatically defined. Not that we have any need to do that, but people connected with vcpkg have tried to do it, and filed a bug, without realizing that command line tools are pointless for the Windows Store.

I'm not sure what part of the repo they are building for that purpose, so we can revisit this at some point to also support a UTF-16-based path (i.e. _UNICODE). But enabling proper support for that would have a wide impact on the entire stack, probably requiring lots of modifications to things that weren't written with _UNICODE in mind, so that's really off the table for now, and it would require its own test set (besides the code changes to support it).

I see you removed the call to SetConsoleOutputCP(CP_UTF8) in the final PR. Why did you do that? Without it the filename displayed in error messages is gibberish, at least in the case I'm testing right now, which is not using {fmt}.

After some more elaborate testing, I noticed that changing the code page would leave you with a command prompt using that code page afterwards, and even if we reset it at exit, you could see temporary font/output changes that would be awkward.

So I thought it best to just leave the code page as the user has it. If the user uses a code page that shows the Unicode characters properly as they type them, then the output will also show them correctly. So the final solution produces output consistent with the input and does not mess (temporarily or otherwise) with the code page users have set for themselves.

I hope that makes sense.

MarkCallow commented 10 months ago

Thanks for the education.

I apologize if my response sounded scoffing,

Not at all.

I did not want to remove that code though in case someone wants to build e.g. the library only with _UNICODE at some point (if that ever going to work).

Currently _UNICODE has no effect on library compilation. My current plan for supporting international file names via the *{From,To}NamedFile functions in the library is to make them support UTF-8, in the same way as you've just done for the tools, and document that people with _UNICODE applications must do their own file opens and then use the {From,To}{Stdio,}Stream functions. So _UNICODE will continue to have no effect on library compilation.

I can't see any reason to support building the application stack with _UNICODE defined. It is a lot of work to make _UNICODE compiles work and, after the work you've just done, there is no benefit as far as I can see.

Since neither the library nor the apps will need _UNICODE compilation, we should remove the code to avoid confusion.

Note that when compiling for the Windows Store, _UNICODE is automatically defined. ...

I'm not sure what part of the repo they are building for that purpose, so we can revisit this at some point to also support a UTF-16-based path (i.e. _UNICODE). But enabling proper support for that would have a wide impact on the entire stack, probably requiring lots of modifications to things that weren't written with _UNICODE in mind, so that's really off the table for now, and it would require its own test set (besides the code changes to support it).

They were building the legacy tools in a 4.2.x release. I would hope what you just implemented will work for a Windows Store app. I think only the library and the loadtest apps are relevant to the Windows Store. The latter are not currently included in release builds and are subject to the availability of GLon12 and VulkanOn12 for Store apps.

I see you removed the call to SetConsoleOutputCP(CP_UTF8) in the final PR. Why did you do that? Without it the filename displayed in error messages is gibberish, at least in the case I'm testing right now, which is not using {fmt}.

After some more elaborate testing, I noticed that changing the code page would leave you with a command prompt using that code page afterwards, and even if we reset it at exit, you could see temporary font/output changes that would be awkward.

I thought it might be something like that.

So I thought it best to just leave the code page as the user has it. If the user uses a code page that shows the Unicode characters properly as they type them, then the output will also show them correctly. So the final solution produces output consistent with the input and does not mess (temporarily or otherwise) with the code page users have set for themselves.

There is something strange going on in the case of the legacy tool I am converting now. Without SetConsoleOutputCP the file name in error messages appears as gibberish. With it, it is fine. The name is shown correctly when typing the input. With, e.g., ktx info, both input and output are fine. I'm using PowerShell and its associated terminal. The legacy app is using cerr to write the error message while the new one is using {fmt}. Is {fmt} converting messages to UTF-16? Perhaps there is some setting I can apply to the cerr output stream to get it to convert a UTF-8 string.

After exploring the settings of the PowerShell terminal app I could not find any way to set the code page myself. Maybe it has to be done through some PowerShell command or environment setting. Do you know? On the other hand, given the experience with ktx info it should not be necessary.

aqnuep commented 10 months ago

Windows shells are really weird with their code page + font selection combo. I, for example, couldn't get Japanese characters to show up if I used the UTF-8 code page with the default font, only if I explicitly changed the code page to a Japanese one. FYI, code page selection can be performed using the chcp command.

MarkCallow commented 10 months ago

Curiouser and curiouser. I manually ran chcp 65001, where 65001 is the value of CP_UTF8. A message saying "Active code page is 65001" was printed. The output error message is still gibberish. I re-enabled SetConsoleOutputCP and then the message is fine. After running the program in the console and then running chcp, it reports the active code page is 65001. I have no idea why it only works with SetConsoleOutputCP.