brenthuisman / par2deep

Produce, verify and repair par2 files recursively.
GNU Lesser General Public License v3.0

Non-ASCII characters in file names cause failure #18

Closed yringot closed 5 months ago

yringot commented 3 years ago

System: Win 10 Home 64-bit, en-US locale
par2deep version: 1.9.4

When run against a folder of files, par2deep doesn't process files with characters such as ä and 我 and fails with the message createdfiles_err. This suggests to me that the windows binary of par2deep isn't Unicode aware (enough).

brenthuisman commented 3 years ago

Correct! By default par2deep uses a builtin libpar2 through a C binding, through which I pass C chars, which are the cause for this issue.

What you could try is to define an external par2 in the first screen, and see if that works. You could download (a version of) par2cmdline to test. I think it accepts unicode inputs. Let me know if that works!

yringot commented 3 years ago

Hm. Using the par2.exe from par2cmdline sort of works, but there's still some weirdness. Here is the command line output:

PS C:\Users\mememe\Desktop\par_test\original> .\par2.exe c test.par2 .\März-2020.jpg

Block size: 512
Source file count: 1
Source block count: 1999
Recovery block count: 100
Recovery file count: 7

Opening: Mõrz-2020.jpg
Computing Reed Solomon matrix.
Constructing: done.
Wrote 51200 bytes to disk
Writing recovery packets
Writing verification packets
Done
PS C:\Users\mememe\Desktop\par_test\original>

The file name is still displayed wrongly: Mõrz-2020.jpg instead of März-2020.jpg. Using CJK yields question-mark-in-box characters and this error: "Ignoring non-existent source file: C:\Users\mememe\Desktop\par_test\original\M?rz-2020_copy.jpg You must specify a list of files when creating." My PowerShell codepage is 850 (Multilingual Latin), same as in CMD.
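
The garbled name is consistent with a code page mismatch. Here is a minimal Python sketch of that idea. It assumes (this is not confirmed anywhere in the thread) that the name is written out as Windows-1252 ("ANSI") bytes and that the console decodes those bytes with code page 850. The helper `as_shown_in_console` is a made-up name for illustration.

```python
# Sketch: reproduce the garbled name from the par2 output.
# Assumption: the name leaves the program as Windows-1252 bytes and the
# console interprets those same bytes using OEM code page 850.

def as_shown_in_console(name: str, ansi: str = "cp1252", oem: str = "cp850") -> str:
    """Encode with the ANSI code page, then decode with the console's OEM code page."""
    return name.encode(ansi).decode(oem)


garbled = as_shown_in_console("März-2020.jpg")
print(garbled)  # Mõrz-2020.jpg
```

If this reading is right, only the display is wrong here: the bytes still name the correct file, which would fit the fact that the create step finished.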

I'm sorry but I don't have the time to pursue this further. But I look forward to you continuing to work on this because I like your approach of a living folder tree, in contrast to the whole-directory approach of MultiPar. I have a lot of photos to sort through and do not look forward to having to regenerate all the par files every time I add or move something.

brenthuisman commented 3 years ago

Thanks for taking a look. par2 saw its last commits in 2004, and this pre-UTF8 behavior makes sense for that timeline. par2cmdline is a hairy ball, so I'm not touching it any further (and since upstreaming such updates seems impossible, the use would be very limited). So, assume this will never be fixed.

The good news is I'm looking at a port to a new Go library, which should not have these issues.

yringot commented 3 years ago

The good news is I'm looking at a port to a new Go library, which should not have these issues.

That would be great!

brenthuisman commented 3 years ago

Personally I get by by normalizing filenames. I started building archives in an era where UTF8 was just a distant dream ;) You're right though, in 2021 there is no reason for this anymore (other than legacy, which par2's interface certainly is!). Now I have an extra reason for the Go port.
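
For readers who want to try the normalizing workaround: the comment does not say how the names are normalized, so the following is only one possible sketch in Python. `ascii_fold` is a hypothetical helper, not part of par2deep. It strips combining marks after Unicode decomposition and leaves everything else alone.

```python
import unicodedata


def ascii_fold(name: str) -> str:
    """Hypothetical helper: drop diacritics, keep all other characters as they are."""
    # NFKD splits "ä" into "a" plus a combining diaeresis (U+0308).
    decomposed = unicodedata.normalize("NFKD", name)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(ascii_fold("März-2020.jpg"))  # Marz-2020.jpg
print(ascii_fold("故宫.JPG"))        # 故宫.JPG (unchanged)
```

Names in scripts that have no ASCII counterpart pass through unchanged, so this only helps with accented Latin names. Language-specific transliterations such as "ae" for "ä" are not covered either.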

yringot commented 3 years ago

95% of my files have machine generated file names (e.g. from the camera) but as my backup also contains text documents or pictures I got from other people, there is a high chance that I'll have some files with non-ASCII names. As I also learn Chinese, I have a bunch of files with CJK-characters and renaming them would be an absolute non-starter. This is certainly going to be true for many other people in the world. So I applaud and encourage this attempt to switch to a more modern library.

I had a look at the issues section of par2cmdline and it does seem that development in the par ecosystem mostly stopped after 2010 and is just now resuming.

brenthuisman commented 3 years ago

Those are the Issues. Now look at (the absence of) merged PRs ;) (OK, I count ~2 PRs/yr over the last decade, and speed does seem to be picking up). Now check which versions Debian/Ubuntu/Fedora/etc ship....

You should definitely make an Issue there too. If it's fixed there, there's a chance that we actually might see a UTF8-proof par2cmdline in our lifetime.

yringot commented 3 years ago

You should definitely make an Issue there too.

will do :)

yringot commented 2 years ago

Hi. I just saw that you added a Go version of Par. I tried it out but the issue of choking on non-ASCII chars remains:

Problem:
C:\Users\user\Desktop\test>D:\progs\_backup\par2deep-master\par2deep\gopar_win.exe c test.par2 grünkohl.jpg
[1/1] Loaded data file "C:\\Users\\user\\Desktop\\test\\grünkohl.jpg" (644627 bytes)
Write parity error: invalid ASCII character

Success:
C:\Users\user\Desktop\test>D:\progs\_backup\par2deep-master\par2deep\gopar_win.exe c test.par2 felixpennt2.JPG
[1/1] Loaded data file "C:\\Users\\user\\Desktop\\test\\felixpennt2.JPG" (619579 bytes)
Wrote index file "test.par2" (6580 bytes)
[0+1/3] Wrote recovery file "test.vol00+01.par2" (2068 data bytes, 8648 bytes)
[1+2/3] Wrote recovery file "test.vol01+02.par2" (4136 data bytes, 10716 bytes)

(Just a status update, and to indicate that I am still interested in par2deep. Happy to see that you're still working on it.)

brenthuisman commented 2 years ago

There are problems with the Go versions, or rather, there were. Since then, it was updated significantly but I have not had time to check it out and see if I can integrate it now. I hope it will fix this and other outstanding problems ;)

brenthuisman commented 2 years ago

Two things are messing UTF8 support up. gopar does not support it. par2cmdline does, but not libpar2, since it exposes a C ABI with chars, which can't contain UTF8 characters. I've asked if gopar has any plans for unicode in the short term.

TODO
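
On the question of getting a UTF8 name through a C ABI that takes chars, here is a small Python sketch using ctypes from the standard library. It calls no par2 code. It only illustrates that a `char*` argument is a NUL-terminated byte string, so UTF8 bytes can travel through it unchanged. Whether the receiving library then opens the right file is a separate matter and depends on how it interprets those bytes (on Windows, narrow-character file APIs typically assume the ANSI code page). `roundtrip_via_char_p` is a made-up helper name.

```python
import ctypes


def roundtrip_via_char_p(name: str) -> str:
    """Pack a name as UTF-8 into a C char* and read it back out."""
    utf8 = name.encode("utf-8")   # multi-byte sequences, none of them NUL
    arg = ctypes.c_char_p(utf8)   # what a `const char *` parameter would receive
    return arg.value.decode("utf-8")


name = "grünkohl.jpg"
print(len(name), len(name.encode("utf-8")))  # 12 13
print(roundtrip_via_char_p(name) == name)    # True
```

The extra byte in the second count is the two-byte UTF8 encoding of "ü". Nothing is lost in transport, so any failure would come from the side that decodes the bytes.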

brenthuisman commented 2 years ago

@yringot I've just pushed a new version to PyPI. I tested this with a few UTF8 filenames and it works fine here. If you have time, could you update and tell me if your unicode issues still exist?

yringot commented 2 years ago

Hi. I'm not sure how to install it from PyPI. (I'm on Windows 10.)

Instead I installed v1.9.4.2 from the releases page (par2deep-1.9.4.2-amd64.msi) and tried it out on two folders. Unfortunately it still choked on CJK characters and umlauts in file and folder names:

image

brenthuisman commented 2 years ago

Could you post a file here containing (some of) those names?

And for the record, can you describe the behavior with direct use of par2cmdline?

yringot commented 2 years ago

Could you post a file here containing (some of) those names?

Here you go:

dönerladen.jpg
küche.jpg
friedhofsgärtnerei hans.jpg
wer wird millionär.jpg

and some files with CJK folder and file names:

20080428 北京\好老的街了.JPG
20080428 北京\故宫.JPG

And for the record, can you describe the behavior with direct use of par2cmdline?

Here's the output of par2cmdline (downloaded from here: https://github.com/Parchive/par2cmdline/releases/tag/v0.8.1) for a file with an umlaut:

C:\Users\me>"C:\Users\me\Downloads\par2cmdline-0.8.1-win-x64\par2.exe" create "C:\Users\me\Desktop\par2deep test\2003-05 misc\dönerladen.jpg"

Block size: 604
Source file count: 1
Source block count: 1997
Recovery block count: 100
Recovery file count: 7

Opening: d÷nerladen.jpg
Computing Reed Solomon matrix.
Constructing: done.
Wrote 60400 bytes to disk
Writing recovery packets
Writing verification packets
Done

The ÷ in place of the ö makes me think that there is some codepage thing going on in CMD. My Windows OS is in English but German language is also installed. But it fails with CJK characters:

C:\Users\me>"C:\Users\me\Downloads\par2cmdline-0.8.1-win-x64\par2.exe" create "C:\Users\me\Desktop\par2deep test\test.par2" "C:\Users\me\Desktop\par2deep test\20080428 北京\故宫.JPG"
You must specify a list of files when creating.
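
One plausible reading of the CJK failure, sketched in Python under the assumption (not verified against par2cmdline's source) that the tool receives its arguments in a single-byte ANSI code page such as Windows-1252: characters with no slot in that code page are replaced, so the resulting path no longer names any file on disk. `through_ansi` is a hypothetical helper.

```python
# Sketch: squeeze a path through a single-byte code page.
# Assumption: unmappable characters are replaced with "?" on the way in.

def through_ansi(path: str, ansi: str = "cp1252") -> str:
    """Round-trip a path through the ANSI code page, replacing unmappable characters."""
    return path.encode(ansi, errors="replace").decode(ansi)


print(through_ansi("20080428 北京\\故宫.JPG"))  # 20080428 ??\??.JPG
print(through_ansi("dönerladen.jpg"))          # dönerladen.jpg
```

Under this assumption the umlaut name survives because "ö" exists in Windows-1252, while the CJK name turns into a path that does not exist, leaving an empty file list.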

yringot commented 2 years ago

in any case I'd need to see if I can pass UTF8 strings through the C ABI (either tool). wchar_t might be the answer

* Actually, it already works, no changes needed! I tested this filename `Ballaké Sissoko - Tomora - 07 - Berekôlan.lossy.flac`, `März-2020.jpg`, `grünkohl.jpg`

Just to check, I just renamed a file to Berekôlan.lossy.flac like in the example above and par2deep failed on it:

image

Did I download the wrong thing? As I said I downloaded par2deep-1.9.4.2-amd64.msi from https://github.com/brenthuisman/par2deep/releases

brenthuisman commented 5 months ago

I've just created a new release, v1.10, which I believe addresses this issue (since it includes par2cmdline-turbo which really contains the fix). Let me know if it didn't.