Cyan4973 / xxHash

Extremely fast non-cryptographic hash algorithm
http://www.xxhash.com/
Other
8.97k stars 769 forks source link

Does not accept unicode file path #293

Closed KonstantinMogilev closed 4 years ago

KonstantinMogilev commented 4 years ago

Does not accept unicode file path, for example:

xxhsum.exe "еÐÌ.pdf"
Error: Could not open '?╡?I.pdf': Invalid argument.
Cyan4973 commented 4 years ago

Unfortunately, the code handling file names does nothing special, and relies entirely on standard C fopen().

I have no idea how fopen() behaves with unicode characters. This may be system dependent.

I presume the issue happens on Windows ?

KonstantinMogilev commented 4 years ago

Yes, the problem is in Windows, checked on 2 systems: Windows Server 2008 R2 Windows 7 Maybe fopen () has options for working correctly with unicode, or use another function that can correctly handle such names?

Cyan4973 commented 4 years ago

How was the binary produced ? Did you compiled it using Visual Studio ?

KonstantinMogilev commented 4 years ago

I used your latest compiled version from here: https://github.com/Cyan4973/xxHash/releases/download/v0.7.2/xxhsum_win64_0_7_2_avx2.zip

easyaspi314 commented 4 years ago

Oof, this will be fun.

fopen on Windows uses ANSI encoding, while files are stored in UTF-16.

I might take a look into this when I get a chance. It might be as simple as _wfopen with a mbtowc, but maybe not.

t-mat commented 4 years ago

Hi, I've made small code which introduces UTF-8 compatibility for Windows+MSVC build. @easyaspi314 since I don't test this code much, I don't send PR so far. But hope it helps.

t-mat commented 4 years ago

Here's some notes.

(1) We can safely use UTF-8 to represent Unicode filename for modern systems (except Windows).

I have no idea how fopen() behaves with unicode characters. This may be system dependent.

Nowadays as for modern Linux, *NIX and OS X systems, basically we can say all possible filenames are sequence of chars. So in these systems, fopen() is fine for Unicode filenames. Some OS X file system/API has Unicode normalization issues and it may cause issue of interoperability of xxhsum -c. But I think we can address this problem later.

(2) I believe "UTF-8 Everywhere" is good approach. For xxhsum, I mean that almost all part of program should think about and be written for UTF-8. And only input/output functions are taking care about other encodings.

(3) I'd like to avoid to use low level WIN32 API such as GetCommandLineW() or CommandLineToArgvW(). For example, Microsoft's ucrt has wildcard expansion in its startup routine (see argv_parsing.cpp in Windows Kits). To support these functions, it'd be nice to avoid to use low level APIs.

(4) New entry point should be wmain(). Since mapping between Unicode and MBCS isn't perfect, MBCS may introduce other problem. So, I strongly recommend to use Unicode directly. One possible idea is that redefine main as main_utf8 by macro. wmain() receives UTF-16 argv[], convert them to UTF-8 and passes UTF-8 argv[] to main_utf8().

(5) For stdout and stderr, _setmode(x, _O_U8TEXT) looks promising. But according to the document, there're many restrictions. One of the biggest restriction is:

Caution Unicode mode is for wide print functions (for example, wprintf) and is not supported for narrow print functions. Use of a narrow print function on a Unicode mode stream triggers an assert.

Here, "Unicode mode" includes UTF-8 (_O_U8TEXT). Therefore, we must use wprintf() family which only accepts UTF-16 string. So, we must convert UTF-8 to UTF-16 to output UTF-8!

Other restriction is it doesn't allow to use narrow print functions such as printf(). If we use narrow print function, program crashes immediately. For xxhsum it means we should keep using DISPLAY() macro family.

(6) Windows Powershell has problematic default encoding method for redirection. You can't redirect UTF-8 encoding stdout/stderr with default settings of the Windows PowerShell. Please read this Q&A at StackOverflow: Displaying Unicode in Powershell. As this answer shows, you can use UTF-8 encoding in redirection by the following command:

$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding
sergeevabc commented 4 years ago

Windows 7, CHCP set to 65001. 0.7.0 outputs привет.txt, whereas 0.7.1+ outputs ������.txt

easyaspi314 commented 4 years ago

0.7.0 doesn't seem to work, at least not in MinGW64:

$ ./xxhsum "ファイル.txt"
Error: Could not open '????.txt': Invalid argument.
easyaspi314 commented 4 years ago

wmain causes a linker error on MinGW, but I think I found a solution.

There has to be a better way to do this, though.

static int XXH_main(int argc, char **argv)
{
   ...
}

int main(int argc, char **argv)
{
    int result;
    /* We need to get the real argv on Windows for Unicode. */
#ifdef _WIN32
    const int oldStdoutMode = _setmode(_fileno(stdout), _O_U8TEXT);
    const int oldStderrMode = _setmode(_fileno(stderr), _O_U8TEXT);
    wchar_t **utf16_argv = CommandLineToArgvW(GetCommandLineW(), &argc);
    argv = createUtf8Argv(argc, utf16_argv);
#endif /* _WIN32 */

    result = XXH_main(argc, argv);

    /* Cleanup */
#ifdef _WIN32
    freeUtf8Argv(argc, argv);
    fflush(stdout); _setmode(_fileno(stdout), oldStdoutMode);
    fflush(stderr); _setmode(_fileno(stderr), oldStderrMode);
#endif /* _WIN32 */
    return result;
}
easyaspi314 commented 4 years ago

This should be fixed by #304.

sergeevabc commented 4 years ago

@easyaspi314, err… where is a Windows binary to verify the fix?

easyaspi314 commented 4 years ago

Here are a few binaries for the people who are too lazy to compile their own code (jk):

xxhsum-windows-binaries-eba72be9.zip

easyaspi314 commented 4 years ago

At least on my PC, Windows 10, they all pass the Unicode test I made, even if the terminal/font cannot display it:

Testing filename provided on command line...
xxhsum.exe "ユニコード.unicode"
2d7f1808da1fa63c  ユニコード.unicode
Testing a checksum file...
xxhsum.exe -c unicode_test.xxh64
ユニコード.unicode: OK
easyaspi314 commented 4 years ago

@KonstantinMogilev @t-mat @sergeevabc When you get a chance, can you check to make sure this works on your machines?

sergeevabc commented 4 years ago

@easyaspi314, that version is still better as its output is consistent no matter what code page is, i.e.

chcp 65001 
xxhsum070sse2.exe привет.txt
a2a5260369b4da80  привет.txt
chcp 1251
xxhsum070sse2.exe привет.txt
a2a5260369b4da80  привет.txt

whereas your approach outputs

chcp 65001 
xxhsum-msvc-x64.exe привет.txt
a2a5260369b4da80    привет.txt
chcp 1251
xxhsum-msvc-x64.exe привет.txt
a2a5260369b4da80    привет.txt
easyaspi314 commented 4 years ago

I honestly don't understand why 0.7.0 works — AFAICT, nothing changed that would cause it to work, and also, I am pretty sure I tried compiling 0.7.0 and it did not work.

easyaspi314 commented 4 years ago

OK, I think I see the difference.

I analyzed the binary, and it looks like the 0.7.0 binary was compiled by MSYS2 instead of MinGW or MSVC.

MSYS2 and Cygwin both use a wrapper which handles everything UTF-8 automagically.

Cyan4973 commented 4 years ago

Well, strangely, I thought that all Windows binaries provided in the release section were compiled using mingw64 from msys2 environment. Maybe I missed or confused something...

easyaspi314 commented 4 years ago

The difference between MinGW and MSYS2/Cygwin is that MinGW basically compiles entirely against the MSVC CRT and acts like Windows, but MSYS2/Cygwin try to emulate a POSIX system, which requires many wrappers. It basically hijacks MSVC CRT and replaces it with its own Newlib libc.

Command line conversion is done here: https://github.com/msys2/Cygwin/blob/bede85ba6d90b9383f3c83a6e99152284ca90f6a/winsup/cygwin/dcrt0.cc#L947

Cyan4973 commented 4 years ago

Btw, I don't see a v0.7.0 Windows binary in the release section ...

easyaspi314 commented 4 years ago

Weird. Maybe it was a test binary.

Cyan4973 commented 4 years ago

Anyway, is there a preferred environment for the production of next version of Windows binaries ?

easyaspi314 commented 4 years ago

xxhsum-i686-mingw32-test.zip

New test based more off of @t-mat's code

Compiled on classic MinGW-w32, which links directly to msvcrt.dll which is system specific.

@sergeevabc can you please test this on your Windows 7 machine?

Also, if anyone has XP installed for some reason, I'd like to know if it works too, out of curiosity.

easyaspi314 commented 4 years ago

@Cyan4973 @KonstantinMogilev This should be completely solved by #323.

easyaspi314 commented 4 years ago

Anyway, is there a preferred environment for the production of next version of Windows binaries ?

@Cyan4973 Uh, well MSVC sucks at XXH3, so I'm guessing either clang-cl or mingw-w64.

sergeevabc commented 4 years ago

@easyaspi314, about XP… it works here, but in a peculiar way.

easyaspi314 commented 4 years ago

Thanks!

That is actually a crash. On most Windows terminals, when a console program crashes, the last line is blanked out and the program exits silently. If you watch in slow motion, the first bit of the hash is displayed: Screenshot_20200305-005043

I'd kill for the helpful Segmentation fault or Aborted messages on Linux, or at least an error message box.

This blankout doesn't happen when redirecting the output to a file, which is why you see that.

We don't need to fully support XP. We didn't need to since 2014.

But the important thing is that ASCII works.

I would be more concerned if the program crashed immediately.

Plus, it might actually work if we link against something Microsoft wanted us to link against.

easyaspi314 commented 4 years ago

Untested possible solution:

int oldstdoutmode, oldstderrmode;
/* Try to switch to UTF-8 */
oldstdoutmode = _setmode(_fileno(stdout), _O_U8TEXT);
/* On error (likely with XP), switch back to normal text */
if (oldstdoutmode == -1)
    oldstdoutmode = _setmode(_fileno(stdout), _O_TEXT);
/* Repeat for stderr */
oldstderrmode = _setmode(_fileno(stderr), _O_U8TEXT);
if (oldstderrmode == -1)
    oldstderrmode = _setmode(_fileno(stderr), _O_TEXT);

In this case, what would probably happen is UTF-8 appears as mojibake.

easyaspi314 commented 4 years ago

Nevermind, I'm a massive idiot…if _setmode errors out, it wouldn't change from _O_TEXT...

I don't immediately see the solution, but I don't think this should delay v0.7.3.

XP support is good enough for now, and there are higher priorities.

easyaspi314 commented 4 years ago

I think we can safely close this.

While this patch may crash XP on Unicode text with the default msvcrt.dll, that is not much worse than it was before the patch where Unicode also didn't work. And XP is 19 years old and EOL for 6 years so yeah.

sergeevabc commented 4 years ago

@easyaspi314, I wholeheartedly disagree with that EOL reasoning. It means something for relatively wealthy people in industrially developed areas, but what about the rest of mankind who are being happy on little and have limited means to upgrade at once by an order of Silicon Valley hegemon? Otherwise, let’s appeal to extremes and abandon x86 in favour of x64, because, you know, “the progress calls”.

Firstly, XP is still patched by enthusiasts like Simplix. Secondly, there are projects such as DOSBox and ScummVM (accepted for Google Summer of Code, for that matter), which give a second life to pre-Win apps with an amazing benefit-to-every-byte ratio. Thirdly, the stats about XP machines exist, at least Internet-connected ones (not always the case), which show they are in no hurry to leave top 10 OS list.

That’s why I urge you to see this issue as an intellectual challenge like demoscene craftsmen have with 64K intros, like visual artists have with pixel paintings, like algorithm researchers have with code golfing, like Tom Cruise has with “Mission impossible”. Pull it together, you can do it.

easyaspi314 commented 4 years ago

memed

Well the only thing I have left to try is mucking with Win32 API. :nauseated_face: Apparently, Git does this.

I might try to set up an XP VM.

easyaspi314 commented 4 years ago

Holy shit, WriteConsoleW works. unicode progress

Mostly.

easyaspi314 commented 4 years ago

unicode fixed

Yatta! @sergeevabc The new fix is up in #335

easyaspi314 commented 4 years ago

I don't want to see a wchar_t ever again.