Parchive / par2cmdline

Official repo for par2cmdline and libpar2
http://parchive.sourceforge.net
GNU General Public License v2.0
720 stars 75 forks source link

Properly handle UTF8 and long-paths on Windows #189

Open Safihre opened 1 year ago

Safihre commented 1 year ago

As requested by @animetosho, reporting the original bug here 💯

On Windows, par2cmdline can basically only handle ASCII filenames. This was fine when par2cmdline was created long ago, but the world is UTF8 now. Simply try to open a file named 你好世界.par2, will fail.

par2 file must not have a wildcard in it.
failed to set the main par file

This was fixed long ago in par2-tbb, but that fork has been unmaintained for a long long time. It also displays any UTF8-charahters as question marks in the console output.

Additionally, par2cmdline does not seem to really support long-path notation on Windows. Which is needed to open files where the path exceeds 255 characters. For example \\\\?\\C:\\test_win_unicode\\test_win_unicode.par2 results also in:

par2 file must not have a wildcard in it.
failed to set the main par file

We use that by default so we can avoid any problems on longer paths and having to check path-lengths every time.

animetosho/par2cmdline-turbo/issues/13 animetosho/par2cmdline-turbo/issues/17 sabnzbd/sabnzbd/pull/2674

Safihre commented 11 months ago

@animetosho any chance we can get you interested in fixing this one? Then we could use par2cmdline-turbo also on Windows and don't need Multipar anymore.

animetosho commented 11 months ago

If someone submits a pull request to par2cmdline, I can consider merging it into par2cmdline-turbo.
Otherwise, I have enough on my plate for now.

mnightingale commented 4 months ago

I intend to submit a PR for this; I've got UTF8 working including in the console it just needs tidying up.

I haven't really looked at the long paths issue yet, but from a quick glance I think there are numerous issues;

Safihre commented 2 months ago

@mnightingale Any updates on this work?

mnightingale commented 2 months ago

Sorry I have been busy with other things, the Unicode problem is much involved than expected, needs a lot of MultiByteToWideChar and convert the other way for outputting to the code page in use the console.

I have some changes that work, but it is very messy and would break non-Windows builds.

The long path problem appears a simpler fix at least for verify, create needs more work on how it splits a path into components and the MAX_PATH it used for buffers.

Safihre commented 2 months ago

Is there anything we can learn from nzbget handling of this? https://github.com/nzbgetcom/nzbget/blob/develop/daemon/util/FileSystem.cpp

mnightingale commented 2 months ago

Looks like their might be interest from their to fix it (mentioned issue), they certainly have more experienced C++ programmers than me 😄

But yes nzbget handles some of what needs implementing here, see helpers UtfPathToWidePath and WidePathToUtfPath along with handling long paths as required it looks like there are several string helpers to make the manipulation easier.

How MultiPar handles it may also be useful to someone else looking https://github.com/Yutaka-Sawada/MultiPar/blob/d733ada21ae81405de468dd2cd458bcbf09ab9ea/source/par2j/common2.c#L66-L204 specifically the parts for converting to the code page used by the console and handling the possible errors (try again with dwFlags = 0).

edit: just to mention the other part which nzbget doesn't have to worry about is every filename including from par packets, needs preparing for the console when printing log/debug lines, so you don't get "?" or broken characters.

mnightingale commented 2 months ago

@Safihre I've cleaned up what I'd started working on https://github.com/mnightingale/par2cmdline/tree/unicode_filenames it won't compile outside of Windows currently and it only attempts to addresses the problems with UTF8 filenames.

I haven't made a draft PR for now as I don't want to put someone else having a go.

edit: main thing is figuring out how to remove TCHAR, at the time just wanted to get it to compile.

mnightingale commented 2 months ago

I think it's about there now, I've added support for \\?\ paths, an improvement could be made to always use such paths even if the path passed in isn't but I wasn't going to bother with that.

utf8::compatible tries to deal with normalisation of filenames in packets on Windows. I'm not sure how to do similar for other platforms, may need a separate library or just not bother so both are broken.

Changes can be applied applied with minimal changes to par2cmdline-turbo

Safihre commented 2 months ago

@mnightingale We only have to care for Windows, on the other platforms there are no problems. The SABnzbd unit-tests already show that :) So all things can be wrapped in macros to only build on Windows platforms!

mnightingale commented 2 months ago

utf8::compatible was specifically to deal with test_win_unicode https://github.com/sabnzbd/sabnzbd/issues/1633 which as far as I can tell fails on all platforms.

Test file has:

00000070  84 00 00 00 00 00 00 00  66 72 e8 6e 63 68 5f 67  |........fr.nch_g|
00000080  65 72 6d 61 6e 5f 64 65  6d f6 2e 72 61 72 00 00  |erman_dem..rar..|

But generating new recovery volumns with par2cmdline uses unicode, against the spec but I don't think it's the only implementation to do as such.

000000d0  00 00 00 00 66 72 c3 a8  6e 63 68 5f 67 65 72 6d  |....fr..nch_germ|
000000e0  61 6e 5f 64 65 6d c3 b6  2e 72 61 72 50 41 52 32  |an_dem...rarPAR2|

I think I'll just remove the part trying to handle this, then it will behave the same broken way on all platforms.