AgentD / squashfs-tools-ng

A new set of tools and libraries for working with SquashFS images

support unicode characters in command line arguments for windows #96

Closed forworldm closed 2 years ago

forworldm commented 2 years ago

On Windows, argv is encoded in the ANSI codepage. Your code seems to assume it is UTF-8 and converts it to wide characters when calling system functions.

forworldm commented 2 years ago

https://github.com/AgentD/squashfs-tools-ng/blob/ab98505847c50b05decc0b2db6dd9396794f1722/lib/compat/chdir.c#L19
https://github.com/AgentD/squashfs-tools-ng/blob/ab98505847c50b05decc0b2db6dd9396794f1722/lib/sqfs/win32/io_file.c#L184

your code seems to be mixing these two encodings?

AgentD commented 2 years ago

It took a little longer than expected, but I finally got around to looking into this (and got stuck on another bug along the way during testing). I hope this will be fixed soon-ish in a new release containing primarily Windows fixes. There are now commits on master and fixes-1.1.0 that try to address this issue, but I'm afraid it will require a little more research, review and testing.

A wrapper for the main() function was added that obtains the actual UTF-16 command line and converts it to UTF-8 before running the real main() function. The libsquashfs Windows port has been modified to automatically convert the filename argument from UTF-8 to UTF-16 internally, and use the wide-char API. A feature flag is used to retain the existing code-page-random behavior, if desired. The libfstream code (primarily used for processing tar files with transparent decompression) has also been fixed. The directory scanning code already uses the wide-char API.
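To illustrate the conversion step such a wrapper performs on each argument, here is a minimal, portable sketch. It is not the code from the commits. On Windows one would presumably obtain the wide arguments with GetCommandLineW and CommandLineToArgvW and could let WideCharToMultiByte with CP_UTF8 do the conversion. The function name utf16_to_utf8 and its buffer convention are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a NUL-terminated UTF-16 string to UTF-8.
 * Stores at most cap bytes (including the terminating NUL) in out and
 * returns the number of bytes the complete result needs, excluding the NUL.
 * Unpaired surrogates are replaced with U+FFFD.
 */
size_t utf16_to_utf8(const uint16_t *in, char *out, size_t cap)
{
    size_t need = 0; /* bytes the complete result requires */
    size_t used = 0; /* bytes actually stored in out */

    assert(in != NULL);

    while (*in != 0) {
        uint32_t cp = *in++;
        unsigned char buf[4];
        size_t i, n;

        if (cp >= 0xD800 && cp <= 0xDBFF && *in >= 0xDC00 && *in <= 0xDFFF) {
            /* high surrogate followed by low surrogate: combine them */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(*in++ - 0xDC00);
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; /* unpaired surrogate */
        }

        if (cp < 0x80) {
            buf[0] = (unsigned char)cp;
            n = 1;
        } else if (cp < 0x800) {
            buf[0] = (unsigned char)(0xC0 | (cp >> 6));
            buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
            n = 2;
        } else if (cp < 0x10000) {
            buf[0] = (unsigned char)(0xE0 | (cp >> 12));
            buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
            n = 3;
        } else {
            buf[0] = (unsigned char)(0xF0 | (cp >> 18));
            buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
            n = 4;
        }

        /* Store only complete sequences, so truncation never splits a character. */
        if (out != NULL && need == used && used + n < cap) {
            for (i = 0; i < n; ++i)
                out[used++] = (char)buf[i];
        }
        need += n;
    }

    if (out != NULL && cap > 0)
        out[used] = '\0';

    return need;
}
```

Calling it once with out set to NULL yields the required size, so a wrapper could allocate each converted argument exactly before handing the new argv to the real main().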

This was sufficient that I could use the command line tools for accessing files/archives with German and Chinese names when running some quick tests.

Input files (i.e. the gensquashfs pack file) are interpreted as being UTF-8 encoded. This might be a problem, since plain text files on Windows could easily be code-page-random or UTF-16. Furthermore, the strings in an archive could in theory be anything, not necessarily UTF-8, which might also have to be addressed.

forworldm commented 2 years ago

Thanks for your work. I can create an archive file with a non-ASCII directory name now. However, the tool prints garbled text if a file name contains non-ASCII characters. One possible solution is to call SetConsoleOutputCP(CP_UTF8) in the main function.

AgentD commented 2 years ago

Hi,

first of all, sorry for the long delay. While I was preoccupied with work/personal issues for much longer than I had initially hoped, I did occasionally find some time to look into this and test several approaches on a Windows 7 VM.

Sadly, the suggested drop-in solution doesn't seem to work. Using SetConsoleOutputCP still causes individual code units to be sent to the console. Apparently printf/fputs internally use the ANSI version of the underlying API and simply interpret the UTF-8 multi-byte sequences as Latin-1 (I guess?), so each byte ends up converted to UTF-8 on its own.
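If that guess is right, the effect can be reproduced with a small portable sketch that treats every byte of a UTF-8 string as a separate Latin-1 character and encodes it again. This is only an illustration of the suspected mechanism, not what the C runtime actually executes. The real ANSI codepage may also differ from Latin-1, and the name latin1_to_utf8 is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: interpret each input byte as one Latin-1 character
 * and encode it as UTF-8. out must have room for 2 * len + 1 bytes.
 * Returns the number of bytes written, excluding the terminating NUL.
 */
size_t latin1_to_utf8(const char *in, size_t len, char *out)
{
    size_t i, n = 0;

    assert(in != NULL && out != NULL);

    for (i = 0; i < len; ++i) {
        unsigned char c = (unsigned char)in[i];

        if (c < 0x80) {
            out[n++] = (char)c; /* ASCII passes through unchanged */
        } else {
            /* U+0080..U+00FF always take two bytes in UTF-8 */
            out[n++] = (char)(0xC0 | (c >> 6));
            out[n++] = (char)(0x80 | (c & 0x3F));
        }
    }

    out[n] = '\0';
    return n;
}
```

Feeding it the two UTF-8 bytes of "ä" (0xC3 0xA4) produces the four bytes 0xC3 0x83 0xC2 0xA4, which a UTF-8 terminal shows as "Ã¤". That is the typical look of the garbled output described above.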

Trying to do _setmode(_fileno(stdout), _O_U8TEXT); causes printf and friends to trigger an assert. As the MSDN page says, they do not support output to a "Unicode stream".

Another approach I tried was to use preprocessor magic to redirect the stdio functions to Windows-specific custom implementations (which generate a finished string for the printf variants), then convert that string to UTF-16 and use the wide-char versions. Strangely, this worked for German umlauts, but Chinese text magically disappeared. Also, had it worked, it would have produced UTF-16 files when redirecting the output to a file or a pipe. In particular, rdsquashfs -d is supposed to generate output that gensquashfs can use as a manifest file.

I modified this approach and instead added a hacky check: if the target stream is stdout or stderr, get the handle directly using GetStdHandle, check whether it is a console using GetFileType, then convert to UTF-16 and use WriteConsoleW. If it isn't a console, the original (presumed) UTF-8 is kept, so redirecting to a file or pipe leaves the output unmodified. This is what ultimately ended up in commit 6447b19.
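A minimal sketch of that check, assuming a hypothetical helper named put_utf8 (the actual code in the commit may be structured quite differently). It uses the Win32 calls GetStdHandle, GetFileType, MultiByteToWideChar and WriteConsoleW. On other platforms, or when the stream is not a console, it falls back to a plain fputs.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _WIN32
#include <stdlib.h>
#include <windows.h>
#endif

/*
 * Hypothetical helper: write a UTF-8 string to a stdio stream.
 * Returns 0 on success, -1 on failure.
 */
int put_utf8(FILE *fp, const char *str)
{
    assert(fp != NULL && str != NULL);
#ifdef _WIN32
    {
        HANDLE hnd = INVALID_HANDLE_VALUE;

        if (fp == stdout)
            hnd = GetStdHandle(STD_OUTPUT_HANDLE);
        else if (fp == stderr)
            hnd = GetStdHandle(STD_ERROR_HANDLE);

        /*
         * FILE_TYPE_CHAR also matches other character devices such as NUL;
         * additionally checking GetConsoleMode would be a possible refinement.
         */
        if (hnd != INVALID_HANDLE_VALUE && hnd != NULL &&
            GetFileType(hnd) == FILE_TYPE_CHAR) {
            /* With a length of -1 the returned count includes the NUL. */
            int wlen = MultiByteToWideChar(CP_UTF8, 0, str, -1, NULL, 0);
            wchar_t *wstr;
            DWORD written;
            BOOL ok;

            if (wlen <= 0)
                return -1;

            wstr = malloc((size_t)wlen * sizeof(*wstr));
            if (wstr == NULL)
                return -1;

            MultiByteToWideChar(CP_UTF8, 0, str, -1, wstr, wlen);

            /* Flush buffered stdio output first to keep the ordering intact. */
            fflush(fp);
            ok = WriteConsoleW(hnd, wstr, (DWORD)(wlen - 1), &written, NULL);
            free(wstr);
            return ok ? 0 : -1;
        }
    }
#endif
    /* Not a console: pass the UTF-8 bytes through unmodified. */
    return fputs(str, fp) == EOF ? -1 : 0;
}
```

Because the fallback path never touches the bytes, output redirected to a file or pipe stays UTF-8, which matters for the rdsquashfs -d use case mentioned earlier.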

Alternatively, I tried changing the codepage to UTF-8, not converting the strings at all, and using WriteConsoleA instead. This worked for both German and Chinese text, but broke line wrapping on the console for some reason.

The approach in 6447b19 has worked the most reliably so far, but is still not perfect. In the Windows 7 VM, printing Chinese text causes a weird indentation to be added in front of every printed line (I guess this is caused by the console switching to a different font?). Also, when manually setting the codepage to UTF-8 (by running chcp 65001), I can see continuation characters again, but mapped somewhere into the CJK range. I guess at the end of the day, Windows is just not an ideal platform to write CLI programs for.