Parchive / par2cmdline

Official repo for par2cmdline and libpar2
http://parchive.sourceforge.net
GNU General Public License v2.0

Known Issues #133


mdnahas commented 5 years ago

When working on the code, I found a number of issues and odd behaviors. I tried to minimize the number of changes in my big push that added the libpar2.a library. Do we want to make any of these fixes?

  1. Wildcard matching does not work. Par2 does its own wildcard matching. I imagine that is so you can run it recursively with the "-R" option and find all your mp3 files. I found that patterns like "input*.txt" and "i*p*t*.txt" did not match the file "input1.txt".

  2. Wildcards that match 0 files do not produce an error (or even warning!). If you use the pattern "*.mp3" on the commandline and no files match it, there is no error or even a warning to say that intended files were not found.

  3. Our estimates of disk usage are not good for small recovery sets. If you ask for par2 files that fill 1024 bytes of disk space, the program uses 4144 bytes! For larger files, the disk usage may be an underestimate.

  4. Options must appear before filenames. By convention, options usually come before filenames, but par2 requires it. If you do "par2 create foo.par2 -v input1.txt", it will assume "-v" is a filename, not an option! We have a specific option, "--", that lets us add files whose names begin with "-". We shouldn't disable option parsing after the first filename.

  5. Many options aren't checked well. When I do "-Bdirectory" or "-btext", I don't get an error; the trailing text is simply ignored. The correct formats use a space, as in "-B directory", or a number, as in "-b200".

  6. We may mishandle blocksizes for small recovery sets. In commandline.cpp, when (blockcount > totalsize) then blocksize=4. This should be when (blockcount*4 >= totalsize), right?

  7. Files of length 0 are ignored. They are fine by the spec: they can exist in either the recovery set or the non-recovery set. I don't know why they are skipped.

  8. You can use the -a option when doing "repair" or "verify". I'm not sure why it's needed there.

  9. The default basepath is strange. If you don't specify a basepath, it's the path of the first file. Why? Shouldn't it be the path that is common to all input files?

  10. The location of the par2 output file is odd. Most programs that create a file put it in the current directory. However, if the first input file is in a different directory, the par2 file is written there. Why?

  11. The reedsolomon.cpp code allows parpresent > datamissing. This should probably be checked to ensure that parpresent == datamissing, that is, that the number of parity blocks used exactly matches the number of missing input blocks. I don't think the code correctly handles the case where parpresent is strictly greater than datamissing.

  12. reedsolomon.cpp / galois.cpp doesn't handle negatives. This is a minor math detail, but reedsolomon.cpp should work with any type that obeys the field axioms. Ours does not: it assumes we're using a power-of-two Galois field, which has the property that -a = a (see the sketch after this list). So, if we change away from that for any reason, it won't work.
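
To illustrate item 12, here is a minimal sketch, assuming a GF(2^w) element stored in an unsigned integer as galois.h does (the helper names are hypothetical):

#include <cstdint>

// In GF(2^w), addition is bitwise XOR, so every element is its own
// additive inverse: a + a == (a ^ a) == 0, hence -a == a.
uint16_t gf_add(uint16_t a, uint16_t b) { return a ^ b; }
uint16_t gf_neg(uint16_t a) { return a; } // valid only in characteristic 2;
                                          // a general field needs a real negation

Any code path that silently drops a minus sign is relying on this characteristic-2 identity.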

Suggestions:

  1. UpdateVerificationResults is called more times than necessary in par2repairer.cpp

  2. Galois.h's function Alog should take an int and return a Galois.

  3. reedsolomon.h has a variable outputcount, which probably can be replaced with outputrows.size()

  4. parfilename is a /complete/path/filename.par2 when we do "create", but "filename.par2" when running repair. Why the difference?

  5. To create a parfile of 10MB, I use the option "-rm10". Wouldn't "-r10m" make more sense?

animetosho commented 5 years ago

I noticed that par2cmdline doesn't allow more than 32768 recovery blocks. Should it?

Also, something that's kinda bothered me: the PAR2 spec says that files containing recovery slices should end with ".volXX-YY.par2", where XX-YY specifies the range of recovery slices; however, no PAR2 client does this (so much so that my application, by following the spec, probably broke something else). Instead, a ".volXX+ZZ.par2" format, where ZZ is the number of slices in the file, has effectively become the standard.
At this point, it's probably better to just update the standard to prefer this naming style over the one already in the spec (ref #73).

mdnahas commented 5 years ago

The spec says that's the limit. Each recovery block needs an exponent: "A power of two has order 65535 if the exponent is not equal to 0 modulus 3, 5, 17, or 257. In C code, that would be (n%3 != 0 && n%5 != 0 && n%17 != 0 && n%257 != 0). Note - this is the exponent being tested, and not the constant itself. There are 32768 valid constants."

I wrote that and it is wrong, but that's the current spec.
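
For reference, a quick brute-force count reproduces the 32768 figure from that passage (a standalone sketch, not par2cmdline code):

#include <cstdio>

int main() {
  // Count exponents n in [0, 65535) that pass the test quoted above.
  int valid = 0;
  for (int n = 0; n < 65535; n++)
    if (n % 3 != 0 && n % 5 != 0 && n % 17 != 0 && n % 257 != 0)
      valid++;
  printf("%d\n", valid); // prints 32768
  return 0;
}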

Thanks for the convention. I thought that was an option in the spec, but I guess not. I'll add it to version 2.0.2 when I write the amendment.

animetosho commented 5 years ago

That bit of text is referring to the coefficients for input blocks. The exponents used for recovery blocks are just 0,1,2,3...65535. This basically means that the PAR2 spec allows up to 32768 input blocks and 65536 recovery blocks in an archive. It's rare to need more than 32768 recovery blocks (particularly since that exceeds the input count), but it's allowed by the spec, so I thought it'd make sense to support it in par2cmdline.

mdnahas commented 5 years ago

Right, sorry for screwing that up. I'll add it to the list of things to fix, if I have time after this library_dev branch is merged.

mdnahas commented 5 years ago

This is not a bug, but this old code generates a warning on gcc 8.3.
(Found by @BlackIkeEagle -- I'm just including it here so it doesn't get lost.)

src/reedsolomon.h: In instantiation of ‘bool ReedSolomon<g>::Compute(NoiseLevel, std::ostream&, std::ostream&) [with g = Galois<16, 69643, short unsigned int>; std::ostream = std::basic_ostream<char>]’:
src/par2creator.cpp:734:41:   required from here
src/reedsolomon.h:254:9: warning: ‘void* memset(void*, int, size_t)’ clearing an object of type ‘ReedSolomon<Galois<16, 69643, short unsigned int> >::G’ {aka ‘class Galois<16, 69643, short unsigned int>’} with no trivial copy-assignment; use assignment or value-initialization instead [-Wclass-memaccess]
   memset(leftmatrix, 0, outcount * incount * sizeof(G));
   ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from src/par2cmdline.h:218,
                 from src/par2creator.cpp:21:
src/galois.h:57:7: note: ‘ReedSolomon<Galois<16, 69643, short unsigned int> >::G’ {aka ‘class Galois<16, 69643, short unsigned int>’} declared here
 class Galois
       ^~~~~~
In file included from src/par2cmdline.h:222,
                 from src/par2creator.cpp:21:
src/reedsolomon.h:262:11: warning: ‘void* memset(void*, int, size_t)’ clearing an object of type ‘ReedSolomon<Galois<16, 69643, short unsigned int> >::G’ {aka ‘class Galois<16, 69643, short unsigned int>’} with no trivial copy-assignment; use assignment or value-initialization instead [-Wclass-memaccess]
     memset(rightmatrix, 0, outcount *outcount * sizeof(G));
     ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from src/par2cmdline.h:218,
                 from src/par2creator.cpp:21:
src/galois.h:57:7: note: ‘ReedSolomon<Galois<16, 69643, short unsigned int> >::G’ {aka ‘class Galois<16, 69643, short unsigned int>’} declared here
 class Galois
       ^~~~~~
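
One conventional fix for this -Wclass-memaccess warning is value-initialization instead of memset; a sketch, assuming Galois value-initializes to zero (which its role as a wrapper around an unsigned integer suggests):

#include <algorithm>

// Instead of: memset(leftmatrix, 0, outcount * incount * sizeof(G));
std::fill_n(leftmatrix, outcount * incount, G());

// Instead of: memset(rightmatrix, 0, outcount * outcount * sizeof(G));
std::fill_n(rightmatrix, outcount * outcount, G());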
mdnahas commented 5 years ago

We may mishandle blocksizes for small recovery sets In commandline.cpp, when (blockcount > totalsize) then blocksize=4. This should be when (blockcount*4 >= totalsize), right?

I looked into this and totalsize is actually divided by 4 already. Not sure why; it's confusing. So the code for this works fine as it is; it's just unclear.

mdnahas commented 5 years ago

FYI, I found out that Valgrind has a profiler: valgrind --tool=callgrind --simulate-cache=yes ./par2 create foo.par input*.txt

LONGMULTIPLY is set by default in par2cmdline.h. LONGMULTIPLY creates a table in galois.h and enables an alternate version of ReedSolomon::InternalProcess.

On small inputs, LONGMULTIPLY slows things down, but it's certainly faster by the time you get to 100MB (a CD's worth of MP3 files). With LONGMULTIPLY: 1.9 seconds (clock time), 14.3 seconds (CPU time), on an 8-core processor. Without LONGMULTIPLY: 6.5 seconds (clock time), 47.6 seconds (CPU time).

I ran Valgrind's profiles when encoding a single MP3 file. LONGMULTIPLY used 39% fewer instructions and had 90% fewer L1 cache misses. Bravo!

I'm not sure LONGMULTIPLY should be a #define option. I might move it into its own function, so that I can run unit tests to compare it to the "short" code. We might even want to switch between the two, if we know the input is small and the "short" code is going to be faster.
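
For context, the table approach is the classic log/antilog trick for Galois-field multiplication. Here is a generic GF(2^16) sketch using the 0x1100B generator polynomial (the 69643 in the Galois<16, 69643, ...> template above); this is an illustration only, not the actual LONGMULTIPLY table layout:

#include <cstdint>

static uint16_t glog[65536], gexp[65536];

void build_tables() {
  uint32_t x = 1;
  for (int i = 0; i < 65535; i++) {
    gexp[i] = (uint16_t)x;   // gexp[i] = 2^i in the field
    glog[x] = (uint16_t)i;   // inverse mapping (glog[0] is never used)
    x <<= 1;                 // multiply by the generator, 2
    if (x & 0x10000)
      x ^= 0x1100B;          // reduce modulo the field polynomial
  }
}

uint16_t gmul(uint16_t a, uint16_t b) {
  if (a == 0 || b == 0) return 0;
  return gexp[(glog[a] + glog[b]) % 65535];
}

Precomputing full product tables, rather than doing log lookups on every multiply, is presumably what buys the instruction and cache savings measured above.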

boggeroff commented 3 years ago

I'd add incomplete documentation to the list.

What is the default block size? What are the limits? Max file size? Max file count? Do par2 files find their data files by absolute or relative paths? Plus documenting known issues and limitations.

This information is essential for usage, isn't it?

animetosho commented 3 years ago

Are you talking about PAR2 or par2cmdline? This topic is about par2cmdline limitations, but you seem to be asking about documentation on the PAR2 format (unless you're suggesting par2cmdline should also include documentation on PAR2).

To answer your questions though:

What is the default block size?

There is none.
par2cmdline does actually select a default of targeting 2000 input blocks (same as -b2000) if you don't give any indication of a block size. Personally, I think it's a somewhat weird decision that shouldn't be relied upon (and it probably makes sense to remove this default entirely).

Max file size?

2^64 - 1 bytes
(this is fairly standard across many formats/utilities which use 64-bit length specifiers)

Max file count?

32768 input files per recovery set.

Do par2 files find their data files by absolute or relative paths?

In terms of the filenames specified in the PAR2, I don't think this is defined anywhere. Canonically, most clients prefer relative paths (as it helps with distribution), but absolute paths aren't ruled out, so either can be used.
par2cmdline has the rather awkward -B switch to control this.
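For example (hypothetical paths), something like

par2 create -B /data/photos photos.par2 /data/photos/*.jpg

would store the data-file paths relative to /data/photos in the generated PAR2.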

This information is essential for usage, isn't it?

This is a subjective opinion, but personally I wouldn't consider it essential. It may be handy to know, and the specifications list most of the details if you want to know them.

boggeroff commented 3 years ago

Thanks for the prompt response, it is highly appreciated. I did skim over the max file count stat. I also found the max block count in the spec which is the same number. And had I actually attempted to use the program, and not just read and read about it, I'd have realized the default 2000 source block count.

It still seems to me, though, that block counts should have a primary role in the interface (by saying that the default should be removed, is this also what you meant?), because when small files are processed concatenated (I'm not sure that's what's actually happening, but I'm glad something of this sort happens in the background), large differences in recovery data size can occur. The Arch wiki does not mention the block size factor when discussing this feature. And unless the user runs tests, he might think that when supplying multiple files, all files will be, sort of, independently processed and will suffer the "overhead of small files".

According to my tests, one really has to keep block counts in mind when processing files under 10 MiB, especially at larger-than-default redundancy levels. Nor can one simply default to a smaller block count, because files over 1 GiB presumably call for large block counts, thinking of minimizing damage to motion pictures and such. With 100 x 10 KiB files and defaults (except -n1), I get a 62,580-byte .par2 file and 495,436 bytes of vol*.par2 files; with -b200, 26,580 and 157,888 bytes. For a single file of the same overall size: 40,124 and 339,248 bytes with defaults; 4,404 and 70,424 with -b200. It's interesting that the -b200 recovery data size is 32% of the -b2000 size with 100 x 10 KiB files, but 21% with 1 x 1 MiB. Running the same tests with files 1000 times larger, both ratios come out at roughly 100%; with files only 10 times larger, it is around 67% for the single-file case. I hope the fact that I fallocated these files does not skew the results.

animetosho commented 3 years ago

I might suggest creating a new topic if you want to talk about aspects of PAR2 itself as it's somewhat unrelated to the topic here.

I did skim over the max file count stat. I also found the max block count in the spec which is the same number.

Non-empty files must consume at least one block, so yes, the figures are related that way.

by saying that the default should be removed, is this also what you meant?

I think 2000 is a very arbitrary figure which doesn't have a lot of meaning or sensibility (and considering the rest of your post, likely a key factor behind the confusion you're facing).
It may be possible to pick a reasonable default, but I don't think it can be just an arbitrary number.

when small files are processed concatenated (I'm not sure that's what's actually happening, but I'm glad something of this sort happens in the background)

It's similar to how files are laid out on a file system - sizes are effectively rounded up to the chosen block size.
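
As a concrete illustration of that rounding (a sketch with hypothetical numbers, not par2cmdline code):

#include <cstdint>

// Each non-empty file consumes ceil(filesize / blocksize) blocks, so a
// 500-byte file with a 4096-byte block size still occupies a whole block,
// just like a small file occupying a whole cluster on a file system.
uint64_t blocks_used(uint64_t filesize, uint64_t blocksize) {
  return (filesize + blocksize - 1) / blocksize;
}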

According to my tests, one really has to keep block counts in mind when processing files under 10 MiB, especially at larger-than-default redundancy levels. Nor can one simply default to a smaller block count, because files over 1 GiB presumably call for large block counts

Perhaps it'd be easier for you to think in terms of block size instead of count. If you use a constant size, the count will automatically scale based on the input.

It's interesting that the -b200 recovery data size is 32% of the -b2000 size with 100 x 10 KiB files, but 21% with 1 x 1 MiB.

The problem with your example is that the total amount of data is only 1MB, which means that at 2000 blocks, each block is only around 500 bytes. Basically, most of what you're measuring is metadata overhead.

Try putting 2000x 500 byte files on any file system - it'll consume a lot more than 1MB of disk space.

boggeroff commented 3 years ago

What's done is done; I'll try not to raise new items. Although, elaborating on what's been said: earlier you brought up whether or not par2cmdline should include the PAR2 spec's documentation. I'm trying to make the point that, yes, it should include more of it, because I think it is beneficial to the project to lower the bar of entry by not delegating the more important aspects of usage and the format to other sites.

I think 2000 is a very arbitrary figure which doesn't have a lot of meaning or sensibility (and considering the rest of your post, likely a key factor behind the confusion you're facing). It may be possible to pick a reasonable default, but I don't think it can be just an arbitrary number.

I am confused, but less and less so, and not so much in this regard. I was putting forward a use case where unintuitive behavior can be observed. According to the Arch wiki, the user should be wary of using par2 for small files because of this and that. par2cmdline doesn't state how supplying multiple inputs remedies this situation, which is a crucial usage aspect. Deciding whether files should be zipped/tarred/iso'd etc. beforehand is a question most users face, I think.

Perhaps it'd be easier for you to think in terms of block size instead of count. If you use a constant size, the count will automatically scale based on the input.

I didn't want to mix and match the two terms; I do see they are just inversely proportional and serve the same purpose. It's hard to choose a default either way.

I am considering contributing to the project (under a different name, maybe). I suspect the -R option pars files one by one and not whole levels of directories. A built-in option to process the smaller files of a directory at once could be really useful. But I am new and I'm not completely aware of where this project wants to get to (I'm aware of par2deep too, but it doesn't support multithreading, which is off-putting), so it might be that I only write scripts on top of it. But why should supplementary scripting by users be the norm when the project itself could be extended in a way that benefits everyone?

Being unable to set the output file path when supplying only one input file seems like an arbitrary limitation as well.

animetosho commented 3 years ago

I suspect the -R option pars files one by one and not whole levels of directories. A built-in option to process the smaller files of a directory at once could be really useful.

Directories are just a collection of files (and directories), so "files one by one" and "whole directories" sound synonymous to me.
I have no clue what you mean by "at once" there. If you mean that it should automatically ZIP/TAR small files or similar, I feel it's a little out-of-scope, but it sounds like it could be worthy of consideration.

I'm not completely aware of where this project wants to get to

I think it's mostly just being kept alive at this stage.

I'm aware of par2deep too, but it doesn't support multithreading, which is off-putting

My understanding is that it's just a wrapper around par2cmdline, which means that par2cmdline's multi-threading shouldn't be impacted?

But why should supplementary scripting by users be the norm when the project itself could be extended in a way that benefits everyone?

Supplementary scripting can benefit all as well. Alternatively, nothing stops you from forking this project if you're concerned about clashing with its goals (of which I don't think there are many), consulting with or forking a different PAR2 client, or even starting your own from scratch.

Being unable to set the output file path when supplying only one input file seems like an arbitrary limitation as well.

Not exactly sure what you mean there, but it doesn't sound right.

boggeroff commented 3 years ago

Directories are just a collection of files (and directories), so "files one by one" and "whole directories" sound synonymous to me. I have no clue what you mean by "at once" there. If you mean that it should automatically ZIP/TAR small files or similar, I feel it's a little out-of-scope, but it sounds like it could be worthy of consideration.

Considering the total size of the recovery files, according to my tests, "par2 c files.par2 file1 file2" and "par2 c file1" plus "par2 c file2" are not the same. By "at once" I meant the underlying mechanics that are present when a par2 invocation is supplied multiple files, as opposed to a single file.

My understanding is that it's just a wrapper around par2cmdline, which means that par2cmdline's multi-threading shouldn't be impacted?

I seem to have misinterpreted this issue. Indeed, the underlying par2cmdline threading is in effect. The particular user might have experienced a bottleneck which he attributed to suboptimal threading, but which was really the large block count / small files issue. And the non-threaded CLI is only marginally relevant. There are other limitations, though, and it's not always clear to me at this point whether a given one is upstream or in par2deep (like with this).

Not exactly sure what you mean there, but it doesn't sound right.

I was getting this message: "You must specify a list of files when creating." But I can't reproduce it now.

animetosho commented 3 years ago

Considering the total size of the recovery files, according to my tests, "par2 c files.par2 file1 file2" and "par2 c file1" plus "par2 c file2" are not the same.

Well yeah - as mentioned above, the defaults allocate blocks differently in such cases.
I suspect you're still confused about what exactly the defaults do, and it's probably best not to do comparisons without specifying an exact block size (i.e. always pass the -s flag when creating PAR2s).
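For instance (hypothetical file names), something like

par2 create -s524288 test.par2 file1 file2

fixes a 512 KiB block size for a run (the block size must be a multiple of 4), so two runs become directly comparable.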

There are other limitations though, but it's not always clear to me at this point if its an upstream one or it's with par2deep (like with this).

They sound like they could be issues worth including in this topic.

boggeroff commented 3 years ago

Well yeah - as mentioned above, the defaults allocate blocks differently in such cases.

I picked my block counts in each case.