Parchive / par2cmdline

Official repo for par2cmdline and libpar2
http://parchive.sourceforge.net
GNU General Public License v2.0

performance with hundreds of small files #163

Closed sitaramc closed 3 years ago

sitaramc commented 3 years ago

Hi

I was trying to run par2 on one of my "borg" (a backup tool) repositories. The repo is 3646 MB. For whatever reason, it has 1762 files, of which 1744 are exactly 17 bytes each!

Par2's performance in the presence of so many small files is quite sub-optimal. Here are some numbers for the time taken and space used (total size of all par2 files):

If I run par2 as-is on the repository (default 5% redundancy):

3851.39s u 3.65s s 702% 9:08.45 | 0 sw 0 maj 36063 min | 0 txt 0 dat 141 max
1491 MB (40% of original data, for 5% redundancy)

If I tar up the repo and run par2 on the single tar file:

516.79s u 1.81s s 601% 1:26.17 | 0 sw 0 maj 33887 min | 0 txt 0 dat 134 max
188 MB (just over 5%)

Similarly, if I run it only on the files that are not exactly 17 bytes each:

500.60s u 1.73s s 679% 1:13.88 | 0 sw 1 maj 34525 min | 0 txt 0 dat 135 max
total size: 188 MB (just over 5%)

Are there any tips for dealing with this and making par2 have the performance characteristics of the second or third examples above, but more directly?

animetosho commented 3 years ago

My general first recommendation with performance concerns is to try a more performance-oriented PAR2 client like MultiPar or ParPar and see how well it works for you.

In your case, a problem with including lots of small files is that PAR2 requires files to be block aligned. This means, if your block size is set to 750KB, for example, each 17 byte file gets effectively expanded to 750KB and processed that way. In other words, your 1744*17 = 28.95KB of data is treated as if it were actually 1.25GB in size (assuming 750KB block size).
Typical recommendation would be to use a smaller block size, or merge all files into one, as you've done via TAR.
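To put rough numbers on that padding effect (a back-of-the-envelope sketch; 750 KB is the example block size from above, and the file counts are from the original report):

```shell
# Back-of-the-envelope: each file occupies a whole number of blocks, so a
# 17-byte file is padded out to one full block when recovery data is computed.
block=768000                    # example 750 KB block size (750 * 1024 bytes)
files=1744                      # number of 17-byte files in the repo
payload=$((files * 17))         # real data: 29,648 bytes (~29 KB)
effective=$((files * block))    # processed as ~1.25 GB
echo "$payload bytes of data processed as $effective bytes"
```

A smaller block size reduces the padding; with par2cmdline you can set it explicitly via the `-s<n>` flag (in bytes, which must be a multiple of 4), e.g. `par2 create -s4096 -r5 ...`.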

If neither of those work for you, still, give the alternative clients a try. They likely have the same issues here as par2cmdline (they're subject to the same PAR2 limitations after all), but with a faster baseline speed, they may fall into acceptable territory for you.

sitaramc commented 3 years ago

On Sun, Aug 29, 2021 at 02:46:58AM -0700, Anime Tosho wrote:

> My general first recommendation with performance concerns is to try a more performance-oriented PAR2 client like MultiPar or ParPar and see how well it works for you.
>
> In your case, a problem with including lots of small files is that PAR2 requires files to be block aligned. This means, if your block size is set to 750KB, for example, each 17 byte file gets effectively expanded to 750KB and processed that way. In other words, your 1744*17 = 28.95KB of data is treated as if it were actually 1.25GB in size (assuming 750KB block size).
> Typical recommendation would be to use a smaller block size, or merge all files into one, as you've done via TAR.
>
> If neither of those work for you, still, give the alternative clients a try. They likely have the same issues here as par2cmdline (they're subject to the same PAR2 limitations after all), but with a faster baseline speed, they may fall into acceptable territory for you.

I understand the logic/constraint better now; thanks!

For various reasons this needs to be installable on Linux without having to build, so both those alternatives are no-go for me.

I'll work something out with par2. For my immediate need this kind of tar-ing up works so I'll just script it properly. For other future needs I'll keep in mind that files with a wide distribution of sizes, especially with many at the lower end, may need to be handled specially.
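Scripted, that tar-then-par2 workaround might look like the following (a minimal sketch; the `repo` directory and demo data are illustrative stand-ins, and `-r5` matches par2's default 5% redundancy):

```shell
# Sketch of the tar-then-par2 workaround. "repo" stands in for the backup
# repository; the demo file is only here so the sketch runs end to end.
mkdir -p repo && date > repo/example-file     # illustrative demo data
tar -cf repo.tar repo/                        # one big file, no per-file padding
if command -v par2 >/dev/null; then
  par2 create -r5 repo.tar.par2 repo.tar     # ~5% recovery data for the archive
fi
# to verify/repair later: par2 repair repo.tar.par2 && tar -xf repo.tar
```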

Thanks again!

sitaram

mdnahas commented 3 years ago

FYI, in addition to the blocksize, there is a lot of per-file overhead in Par2: for every 17-byte file, there is at least 192 bytes of overhead. So even if you set the blocksize to 20 bytes, you'll still see more than a 10-times expansion in storage.

Putting everything into a single TAR file will help with that too.
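A quick sanity check of that estimate (using the 192-byte per-file figure from the comment above; the true overhead can be higher, since PAR2 clients typically repeat critical packets):

```shell
# Rough per-file storage cost at a 20-byte block size: one padded block
# plus at least 192 bytes of PAR2 metadata per file.
block=20
overhead=192
per_file=$((block + overhead))    # at least 212 bytes stored per 17-byte file
factor=$((per_file / 17))         # integer expansion factor (~12x)
echo "each 17-byte file occupies at least $per_file bytes (~${factor}x)"
```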

sitaramc commented 3 years ago

On Mon, Aug 30, 2021 at 06:23:38PM +0000, Michael Nahas wrote:

> FYI, in addition to the blocksize, there is a lot of per-file overhead in Par2. So, for every 17-byte file, there is at least 192 bytes of overhead. So, even if you set the blocksize to 20 bytes, you'll still see more than a 10 times expansion in storage.
>
> Putting everything into a single TAR file will help with that too.

Thank you! It's a pretty easy workaround, so I'm fine with doing that long term.

mkruer commented 3 years ago

@mdnahas I see you are looking into improvements for PAR3; maybe as part of that spec there could be a way to stream small files into larger chunks for better efficiency.