Can source data be read just once?

jetelina commented 4 years ago

Hi, the file sizes are getting bigger and I often create PAR2 for 30 GB file.

There are two stages "Computing file hash" and "Creating recovery data", so I observed that source data are read twice.
For small files this was not important, but with huge files this becomes significant.

I have tried to read the PAR2 format specification, but wasn't clever from that.

The question: Is 2nd stage depending on result from 1st stage? If yes, can this be filled/calculated later?

I mean to read source data once and feed them simultaneously into 2 seperate functions.

One calculating the MD5 and second creating the recovery data.
"MD5 Hash of packet" is unknown until complete source data are read, so it can be replaced with placeholder like "ffffffffffffffffffffffffffffffff".
When all data are read and complete MD5 calculated the placeholder will be replaced with real MD5 checksum.

This would change speed from "read twice" to "read once and update PARchives on the end".

Thank you for this great tool

Yutaka-Sawada commented 4 years ago

Hello, jetelina.

Is 2nd stage depending on result from 1st stage?

No, it doesn't. 1st stage calculates hash of every source files, and write packet data on Recovery Files. 2nd stage calculates recovery data, and write it on Recovery Files. It's possible to change the task to be simultaneously, as you say.

Because I use small files mostly, I had no problem in current two-pass processing. When source files are enough small, first reading puts file data on disk cache. So, second reading has no access to file, but read data from cache on RAM. Then, I prefer simple implementation. (And, I was lazy to change, when there was no problem for me.)

This would change speed from "read twice" to "read once and update PARchives on the end".

If file access is a bottle-neck of whole task, it will be so. When many users handle so large file data often, it may be worth to try. (I cannot confirm that it will be faster.) The most problem is that, I'm lazy. Also, it will require such user's help to test speed on his PC. (process speed depends on usage and PC environment, such like file size or RAM size.)

There is a PAR tool, ParPar by animetosho, which seems to implement one-pass processing. Did you try it ? When you are satisfied with ParPar, I don't need to improve my PAR client.

If you (or other users who read this thread) want to help changing MultiPar still, post reply again please. I will try, though I cannot confirm to success.

savchenko commented 4 years ago

If you (or other users who read this thread) want to help changing MultiPar still, post reply again please.

Well, MultiPar provides convenient GUI for an end-user. Do you plan to develop it in future?

Slava46 commented 4 years ago

Hi, I can help you with tests as I did before (hope you're remember hehe) if it'll so improve speed as it could be ;). I do some big files like ~50-100-150 GB. Threadripper 1950X and 32 GB RAM in 4 channel mode (8 gb x4). SSD / HDD, of course on SSD much faster.

If you (or other users who read this thread) want to help changing MultiPar still, post reply again please. I will try, though I cannot confirm to success.

Yutaka-Sawada commented 4 years ago

Do you plan to develop it in future?

Yes. But, I cannot test so large file on my PC.

I can help you with tests as I did before

Thanks Slava46. Many users have helped me very much over an old MultiPar web-forum ago. I could not build current MultiPar without their help. Though I don't know well about GitHub, GitHub's "Issues" feature seems to provide the similar feature as web-forum. If someone has a request or bug report, he may post an issue. I will reply as possible as I can.

Now, I consider how to implement the one-pass processing. It seems to be possible, only when recovery data is smaller than RAM size.

For example, you have 30 GB of source files. When you create 1 GB recovery files, par2j.exe reads each source block one by one, and adds it on all calculate recovery blocks at once. Because source blocks are aligned on a file, it's possible to calculate Hash value simultaneously. The process flow is like below; [read source block 1] -> [read source block 2] -> [read source block 3]

On the other hand, if you create 10 GB recovery files (bigger than RAM size), it's too complex. par2j.exe splits each block virtually, and treats each group independently. In this case, it reads source blocks with skip, such like random access mode.

[read 1st source fragment 1] -> [read 1st source fragment 2] -> [read 1st source fragment 3] -> [read 2nd source fragment 1] -> [read 2nd source fragment 2] -> [read 2nd source fragment 3] -> [read 3rd source fragment 1] -> [read 3rd source fragment 2] -> [read 3rd source fragment 3]

By keeping all intermediate state, it's possible to calculate Hash value of each block. But, it's hard to calculate MD5 Hash value of each source file. MD5 calculation requires sequential file access mode. (Or I don't know a something good trick to calculate MD5 hash by random access.) So, one-pass processing is impossible for large redundancy.

This may be a fault of PAR2 design. It was not designed for one-pass processing. I see my code and try possible implementation for small redundancy case.

Yutaka-Sawada commented 4 years ago

I made a sample version of 1-pass processing. I write how it works.

[ Old flow ]

1) Read source files to calculate hash and checksum. 2) Write packets on recovery files. 3) Read source files to calculate recovery data. 4) Write recovery data on recovery files. (It reads source files 2 times and writes recovery files 2 times.)

[ New flow ]

1) Read source files to calculate hash, checksum, and recovery data. 2) Write packets and recovery data on recovery files. (It reads source files 1 time and writes recovery files 1 time.)

At this time, this sample isn't optimized and slow on my PC. Calculating hash and checksum after reading a file seems to be slow. Multi-threading may help speed. It may be hard to improve speed by their complex style. I'm not sure how is the speed at real usage.

I put the sample package (par2j_sample_2020-05-08.zip) in "MultiPar_sample" folder on my OneDrive space. Please test it with large source files. (No need to test sometimes, as it will take time.) If someone try to test, read "HowTo.txt" in the package at first. The sample of 1-pass mode is test version. Don't use it for normal usage with MultiPar.

jetelina commented 4 years ago

This is great. I will test it!

Yutaka-Sawada commented 4 years ago

I made a new sample version of 1-pass processing. (par2j_sample_2020-05-15.zip) It works with MultiPar GUI now.

When source data is 3.3 GB, 1-pass processing is slightly faster on my PC. Even when source data fits on disk cache, writing recovery files 1 time may help speed. But, it's hard to optimize. GPU may not adapt to 1-pass processing. While it's possible to add one-thread for hashing, it would require triple buffering and complex. At this time, I'm not sure the worth.

jetelina commented 4 years ago

Tested version par2j_sample_2020-05-15.zip (I have noticed that there are 2 new versions after test) Machine Core2Quad(4 cores), 4 GB RAM, GPU acceleration disabled testfile.bin 25,1 GB(windows explorer size), normal file with content, not-random or zero file

_debug 2 pass

disk read constant 95 MB/s computing hash, 110 MB/s creating recovery

elapsed time 9 min 21 sec (from GUI)

hash 310.296 sec
read   247.796 sec
write  0.203 sec
sub-thread[0] : total loop = 367439
1st encode 59.409 sec, 366419 loop, 1526 MB/s
2nd encode 0.268 sec, 1020 loop, 941 MB/s
sub-thread[1] : total loop = 360360
1st encode 58.521 sec, 360013 loop, 1522 MB/s
2nd encode 0.252 sec, 347 loop, 340 MB/s
sub-thread[2] : total loop = 201
2nd encode 0.221 sec, 201 loop, 225 MB/s
sub-thread[3] : total loop = 0
2nd encode 0.094 sec, 0 loop, 0 MB/s
total  249.328 sec

Possible bug in total time? (if you add hash + read you will get over 9 minutes)

_sample 1 pass (straight to creating recovery)

disk read around 90-98 MB/s creating recovery

elapsed time 4 min 55 sec (from GUI)

hash 0.000 sec
read   292.985 sec
write  0.125 sec
sub-thread[0] : total loop = 367507
1st encode 65.433 sec, 366596 loop, 1386 MB/s
2nd encode 0.201 sec, 911 loop, 1121 MB/s
sub-thread[1] : total loop = 360247
1st encode 64.745 sec, 359948 loop, 1375 MB/s
2nd encode 0.201 sec, 299 loop, 368 MB/s
sub-thread[2] : total loop = 244
2nd encode 0.201 sec, 244 loop, 300 MB/s
sub-thread[3] : total loop = 2
2nd encode 0.156 sec, 2 loop, 3 MB/s
total  294.219 sec

_debug and _sample PARs are identical. This great improvement.

Yutaka-Sawada commented 4 years ago

Thanks jetelina for test. From the log, I found that; file access speed must be the bottle-neck on your PC. 1st encode = encoding while file reading 2nd encode = encoding after file reading The part of "1st encode" is much longer in the both result. It means that CPU is waiting file access. In your case, 1-pass processing seems to be faster.

I have noticed that there are 2 new versions after test

There is no problem. I update a little to treat error. New version (par2j_sample_2020-05-17.zip) makes Index File. It doesn't delete tempoary files after failed creation. It cannot append recovery data to archive. At this time, sample version doen't have full feature.

Possible bug in total time?

It's a debug output of encoder function. Time of hash function is different. So, "elapsed time" on GUI is the actual total time.

Yutaka-Sawada commented 4 years ago

I made new sample versions of 1-pass processing. (par2j_sample_2020-05-23.zip) I tried multi-threading to read source files. While main-thread calculates hash & checksum of blocks, sub-thread reads source files (and calclates checksum or hash).

This is a problem of balancing tasks. I'm not sure which is fast, because it depends on PC. There is no noticeable difference on my PC with 3.3 GB data set. If someone has time to test them, please try.

Yutaka-Sawada commented 4 years ago

I made new sample versions of 1-pass processing. (par2j_sample_2020-05-31.zip) They switch processing mode (1-pass or 2-pass) automatically. When 1-pass processing is possible, it uses the mode. Because I don't know which is fast for when or what case, the sample always prefer 1-pass processing. When drive is very fast like SSD, RAM size is large, or CPU is very fast, 1-pass processing may become slower. I'm not sure the speed factor.

1-pass processing doesn't support GPU, single source block, append archive, or too big source files. In such cases, it switchs to (normal) 2-pass processing mode. If someone has time to test them, please test the speed. If sample version (1-pass processing mode) is noticeably fast, please post the case. (PC stat such like, RAM size, HDD, CPU) At this time, sample version seems to be fast on jetelina's PC. (It looks like a old PC with slow HDD, and CPU has 4-core)

jetelina commented 4 years ago

I have tested par2j_sample_2020-05-31.zip with the same testfile.bin 25,1 GB file as before.

Here are the results: par2j64_debug.exe elapsed time 00:09:04 (GUI)

hash 298.484 sec
read   242.390 sec
write  0.218 sec
sub-thread[0] : total loop = 368372
 1st encode 57.384 sec, 367352 loop, 1584 MB/s
 2nd encode 0.310 sec, 1020 loop, 814 MB/s
sub-thread[1] : total loop = 359401
 1st encode 57.027 sec, 359080 loop, 1558 MB/s
 2nd encode 0.278 sec, 321 loop, 285 MB/s
sub-thread[2] : total loop = 227
 2nd encode 0.202 sec, 227 loop, 278 MB/s
sub-thread[3] : total loop = 0
 2nd encode 0.139 sec, 0 loop, 0 MB/s
total  243.812 sec

par2j64_sample1.exe elapsed time 0:04:52 (GUI)

hash (not present)
read   290.047 sec
write  0.187 sec
sub-thread[0] : total loop = 365983
 1st encode 65.634 sec, 365080 loop, 1376 MB/s
 2nd encode 0.298 sec, 903 loop, 749 MB/s
sub-thread[1] : total loop = 361774
 1st encode 64.965 sec, 361464 loop, 1376 MB/s
 2nd encode 0.298 sec, 310 loop, 257 MB/s
sub-thread[2] : total loop = 240
 2nd encode 0.268 sec, 240 loop, 221 MB/s
sub-thread[3] : total loop = 3
 2nd encode 0.077 sec, 3 loop, 9 MB/s
total  291.344 sec

par2j64_sample2R.exe elapsed time 0:04:46 (GUI)

hash   207.467 sec
write  0.188 sec
sub-thread[0] : total loop = 369386
 1st encode 62.022 sec, 368463 loop, 1470 MB/s
 2nd encode 0.157 sec, 923 loop, 1454 MB/s
sub-thread[1] : total loop = 358359
 1st encode 61.213 sec, 358081 loop, 1447 MB/s
 2nd encode 0.157 sec, 278 loop, 438 MB/s
sub-thread[2] : total loop = 249
 2nd encode 0.157 sec, 249 loop, 392 MB/s
sub-thread[3] : total loop = 6
 2nd encode 0.076 sec, 6 loop, 19 MB/s
read   228.558 sec
total  284.500 sec

par2j64_sample3C.exe elapsed time 0:04:47 (GUI)

hash   120.158 sec
write  0.265 sec
sub-thread[0] : total loop = 365716
 1st encode 59.951 sec, 364849 loop, 1506 MB/s
 2nd encode 0.220 sec, 867 loop, 975 MB/s
sub-thread[1] : total loop = 362046
 1st encode 59.500 sec, 361807 loop, 1504 MB/s
 2nd encode 0.204 sec, 239 loop, 289 MB/s
sub-thread[2] : total loop = 237
 2nd encode 0.173 sec, 237 loop, 339 MB/s
sub-thread[3] : total loop = 1
 2nd encode 0.124 sec, 1 loop, 1 MB/s
read   282.546 sec
total  285.610 sec

par2j64_sample4M.exe elapsed time 0:04:47 (GUI)

hash   62.120 sec
write  0.266 sec
sub-thread[0] : total loop = 366289
 1st encode 59.209 sec, 365504 loop, 1527 MB/s
 2nd encode 0.265 sec, 785 loop, 733 MB/s
sub-thread[1] : total loop = 361476
 1st encode 58.757 sec, 361264 loop, 1521 MB/s
 2nd encode 0.265 sec, 212 loop, 197 MB/s
sub-thread[2] : total loop = 235
 2nd encode 0.234 sec, 235 loop, 248 MB/s
sub-thread[3] : total loop = 0
 2nd encode 0.076 sec, 0 loop, 0 MB/s
read   282.166 sec
total  285.609 sec

All par2 files are identical to these created by debug version.

For me looks best par2j64_sample2R.exe, but 1 second difference is marginal. More test results would be good, also results from 16 core CPUs with SSDs would be cool.

Yutaka-Sawada commented 4 years ago

All par2 files are identical to these created by debug version. For me looks best par2j64_sample2R.exe, but 1 second difference is marginal.

Thanks jetelina for tests. Multi-threading for hashing seems to be worthless in your case. Simple single-thread version is enough. From the log, CPU (only 2-core are used) is waiting the finish of file reading anyway. Calculation cost of hash value is almost ignorable, as compared to file asccess speed. I don't know why the drive is so slow. While 1-pass processing is good to treat large files on a slow drive, it's difficult to predict the speed.

I may include the sample version as a tool for the special case like your PC. If a user think his PC's HDD is slow, he can try the 1-pass version. When 1-pass processing is possible and faster in his case, he will use it instead of standard 2-pass version. A user needs to replace EXE files manually and test speed by himself. Is this way ok ?

jetelina commented 4 years ago

I have 3 notes/ideas: a) 1-pass processing is superior ▪ The access to HDD(or even SSD) is slow and expensive, so reading the data just once must be "always better". Small files may fit into HDD cache, larger files may fit into RAM and OS cache, so the result may not be visible. But I think that in principle 1-pass is better design.

b) Not enabled by default ▪ This is new feaure and design change, so let's be cautious and test it more. It is "experimental".

c) It is so beautiful to hide ▪ If you hide par2j64-1pass.exe in program directory nobody would find it. Also it would lead to existence of 2 versions, which one(the non-default) will propably die. ▪ How about merging it into par2j.exe and par2j64.exe and give it command line parametr for example /1p? Default behavior would be not changed, but easy to enable. ▪ The best would be to add option into MultiPar GUI - Options, so that enabling would be user friendly.

Yutaka-Sawada commented 4 years ago

Thanks jetelina for advice.

I think that in principle 1-pass is better design.

I agree that 1-pass processing is faster for (slow) HDD. From my test, 1-pass mode is mostly faster than 2-pass mode, except very small files (like several MB). But, 1-pass mode is impossible in some cases. I need to improve it to support GPU, and require more tests of different cases. I will try more.

Also it would lead to existence of 2 versions, which one(the non-default) will propably die.

Yes, I won't maintain 2 versions. Currently sample application can switch processing mode between 1-pass and 2-pass. Old (current standard) version has 2-pass mode only. New (future standard) version will switch 1-pass and 2-pass mode automatically. But sample doesn't support full feature yet.

How about merging it into par2j64.exe and give it command line parametr for example /1p?

It's possible to add command to switch encoders manually. But, I don't want to add new option on MultiPar GUI. If 1-pass mode is known to be faster than 2-pass mode in a case, it should be enabled at the case always. I want simple setting. That was why I requested users help of testing new feature.

Yutaka-Sawada commented 4 years ago

I made new 1-pass processing sample for GPU. GPU option is available for them. Though Intel GPU didn't improve speed so much on my PC, it was not slow. I updated MultiPar GUI for new 1-pass mode output. It supports new order of progress. Because I removed dummy output from sample versions, old MultiPar GUI doesn't parse the output. I put the set of new sample package (par2j_sample_2020-06-07.zip) in "MultiPar_sample" folder on my OneDrive space.

I tested with 22.4 GB data, too. I found that 1-pass processing is much faster for very large files on my PC. The time difference of 1-pass and 2-pass mode is almost same as the first file access time. The math was quite simple. I post my test result. (1st reading time includes hashing time.)

[ 13% redundancy for 186 MB on HDD (8 GB RAM, 4-core i5-3470) ] debug(2-pass) : 3.4 sec (1st reading is 1.6 sec, writing is 1.2 sec.) Without cache debug(2-pass) : 2.3 sec (1st reading is 0.6 sec, writing is 1 sec.) On cache sample(1-pass) : 1.8 sec

[ 23% redundancy for 3.3 GB on HDD (8 GB RAM, 4-core i5-3470) ] debug(2-pass) : 3 min 7 sec (1st reading is 28 sec, writing is 30 sec. Without cache) debug(2-pass) : 2 min 49 sec (1st reading is 17 sec, writing is 31 sec. On cache) debug(2-pass) : 2 min 52 sec (1st reading is 11 sec, writing is 33 sec. On cache) debug(2-pass) : 2 min 34 sec (1st reading is 17 sec, writing is 31 sec. Intel GPU) sample(1-pass) : 2 min 10 sec sample(1-pass) : 2 min 9 sec (Sub-thread for reading) sample(1-pass) : 2 min 7 sec (Intel GPU)

[ 2% redundancy for 22.4 GB on HDD (8 GB RAM, 4-core i5-3470) ] debug(2-pass) : 6 min 30 sec (1st reading is 3 min 7 sec, writing is 9 sec.) debug(2-pass) : 6 min 30 sec (1st reading is 3 min 6 sec, writing is 10 sec. Intel GPU) sample(1-pass) : 3 min 14 sec

[ 10% redundancy for 22.4 GB on HDD (8 GB RAM, 4-core i5-3470) ] debug(2-pass) : 12 min 10 sec (1st reading is 3 min, writing is 56 sec.) debug(2-pass) : 12 min 3 sec (1st reading is 3 min 6 sec, writing is 60 sec. Intel GPU) sample(1-pass) : 8 min 23 sec sample(1-pass) : 7 min 36 sec (Intel GPU)

When redundancy is low, encoding time (calculation cost) is less than file reading time (file access cost). I think that jetelina would set a small redundancy in his usage. That might be why HDD looks so slow as compared to CPU. Actually, the task of CPU was very small and finished quickly. When source files on HDD are large and creating recovery files with small redundancy, 1-pass processing will become fast relatively. Therefore 1-pass processing won't be fast on SSD. (It's possible to check SSD and switch mode, if 1-pass mode is slow on SSD.) I refer test result of others.

[ 25.1 GB on HDD by jetelina (4 GB RAM, 4-core Core2Quad) ] debug(2-pass) : 9 min 21 sec (1st reading is 5 min 10 sec) debug(2-pass) : 9 min 4 sec (1st reading is 4 min 58 sec) sample(1-pass) : 4 min 55 sec sample(1-pass) : 4 min 52 sec sample(1-pass) : 4 min 46 sec (Sub-thread for reading)

[ 10% redundancy for 60 GB on SSD by Slava46 (32 GB RAM, 16-core Threadripper 1950X) ] debug(2-pass) : 13 min 8 sec (1st reading is 2 min 20 sec) sample(1-pass) : 13 min 9 sec

When CPU is waiting file reading, calculation power isn't so important. I sets half threads for 1st encode while file reading. When I set 2% redundancy (or in jetelina's test), even half cores finished almost encoding. So, GPU cannot help speed, when CPU finishes task already. When I set 10% redundancy, 1st encode was around 23%. (so, 2nd encode treats 77% task.) In this case, GPU may help speed at 2nd encode after file reading. Because my Intel GPU is slow, there was no big improvement.

Though GPU version looks fast at 1-pass mode in my result, it seems to be caused by file writing cache. It finished creation, before file writing was complete really. (HDD was busy for a while, even when creation end.) This 1-pass mode's incident may become problem at writing on USB memory. (User must not remove a USB drive until writing end.)

Yutaka-Sawada commented 4 years ago

From Slava46's test, 1-pass processing and GPU was slower on SSD. Though I could not determine the reason, I suspect that memory usage might affect total speed. 1-pass mode requires more memory than 2-pass mode. To process same task with less working memory, 1-pass mode loops more times. While this isn't problem for CPU, many small task may be inefficient for GPU. GPU requires heavy task to perform full speed.

I made a new sample, which checks a drive (SSD or HDD), and it switches processing mode (1-pass or 2-pass). If a drive is HDD, it prefers 1-pass mode. (When there isn't enough RAM, it uses 2-pass mode still.) If a drive is SSD, it uses 2-pass mode always. I put the sample package (par2j_sample_2020-06-11.zip) for someone interested in the behavior.

I plan to release this sample as next beta version "1.3.1.0", if jetelina doesn't object. (I will remove debug output for release build.) 1-pass processing File IO will be enabled by default for HDD users. (There will be no difference for SSD users.) I will write a release note, which includes caution. Though it's experimental version yet, I cannot test all use cases myself. While many users try the beta version for their usage, they may find bug or problem, and I will fix or solve. Is this way OK ?

jetelina commented 4 years ago

You are correct. I use 0,10 % redundancy for this big files. (I use it mainly to prevent small or single bit errors for the backup archive)

For smaller files ~ 1 GB I use 10 % redundancy.
For special important small files ~ 10 MB I use 200 % redundancy.
I keep recovery data under 100 MB.

Now I understeand.

If there is higher redundancy (more CPU/GPU computation is needed), thus HDD/SSD speed is less important.
If there is low redundancy (less CPU/GPU power is needed), then more HDD/SSD speed is important.

I have not tested "par2j_sample_2020-06-11.zip", but I have no objection to release beta. It needs to be tested by more users.

r6472279 commented 4 years ago

Hello Yutaka sawada

I am the user of your software, This is my hardware

The main board is Gigabyte x79 ud7, 32g DDR3 1600 (8GB × 4), Core i7 3960x 6 core 12 thread, 3.3GHz frequency, which has been over frequency to 4.1ghz, and the actual shutdown of hyper thread is 6c6t

The main hard disk is Intel 520 240g, and the data storage hard disk is plexor m8vc 512g SSD

I often use your software to create PAR2 repair files for files about 30-100gb in size

But in the process of using, I found that even if I used SSD, and I can be sure that the read speed of SSD is not the bottleneck. In the process of creating PAR2 file, I used the hard disk read rate monitoring, and found that in the first hash and the second creation of the software, the threads used are both 2 threads. You can see the attached picture

https://imgur.com/6U0hj23 https://imgur.com/eiq0A3V https://imgur.com/gMrNFr5 https://imgur.com/u06awXU

My CPU is i7 3960x 6C 12t, the original frequency is 3.3GHz, and the hyper thread is closed

Under this condition, I read 45gb files, and the hash rate is only 330MB / s. at the same time, I checked the core occupancy rate through the task manager, and found that only two of the cores are running, and other threads are idle. Therefore, it is a bit like a waste of CPU performance to speculate that the software does not call other cores, and the hard disk cannot run at full speed

After that, I overclocked the processor to 4.1ghz,

Under this condition, reading 45gb files, the hash rate has reached 400MB / S +, still calling two cores as hash

In this case, I have encountered qbittorrent, which has a separate setting option of asynchonous I / O Threads. If this option is set low, the performance of SSD will not be fully developed. However, if it is set high, the performance of SSD will be fully developed. I have tried to set the reading rate to 260MB/s+, when it is set to 4. If it is set to 12, the reading rate of hard disk can reach 530mb / S (my SSD is sata3.0 interface rate, and reading 550MB/s is the limit value)

Therefore, I would like to ask if you can increase the hash thread of the software, or increase the number of cores called more

Another problem is that at the same frequency, I used E5 2687w, 8core 16threads, but I used 8 cores or more. The original file is 40 + GB, but the time to create PAR2 file has not been reduced. It's almost the same as the time to use i7 3960x. Do you have a better solution

r6472279 commented 4 years ago

At the same time, I found that multipar does not support multi-core very well. It seems that on processors larger than 8 cores, the speed of generating PAR2 files is not much different from that of 6 cores

In addition, I also have E5 2697v2 Dual CPU，ASUS z9pe-d16 serve mainboard， 128gb DDR3 1600 RECC memory . If you need, I can also do some more software tests for you

By the way, Thank you for your efforts again : )

Yutaka-Sawada commented 4 years ago

Thanks r6472279 for usage report.

I would like to ask if you can increase the hash thread of the software, or increase the number of cores called more

As you find already, par2j's hashing function uses only 2-thread. This is because my PC had 2-cores. Though some users with high-end PC helped me, I could not test so much. Even when more threads were used, slow HDD might be the bottle-neck.

There are 4 tasks at hashing function; File reading, File MD5, Block MD5, and Block CRC-32. Currently I put these tasks on 2 threads like below. Main thread : Block MD5, Block CRC-32 Sub thread : File reading, File MD5 This was the best setting on my old PC (2-core CPU). If CPU has more cores and HDD is slow, using 3 threads like below may be good. Main thread : Block MD5, Block CRC-32 Sub1 thread : File reading Sub2 thread : File MD5 But, this won't help your case, as you use fast SSD. Or using 4 threads like below may be good for many core CPU and SSD. Main thread : Block MD5 Sub1 thread : File reading Sub2 thread : File MD5 Sub3 thread : Block CRC-32 There might be old source code to test more threads some years ago. I will try them on my new PC.

There is another way to improve hashing speed. Because SSD is good at random access, calculating hash of multiple files will be good. (It requires switch functions for HDD or SSD.) While this method is simple and easy to use more threads, it isn't available for a single source file, like your case.

but the time to create PAR2 file has not been reduced. It's almost the same as the time to use i7 3960x. Do you have a better solution

I search those CPU's spec on Intel homepage. E5 2687w : Clock = 3.1 GHz, Core = 8, Cache = 20 MB, Bus speed = 8GT/s, Memory speed = 51.2 GB/s i7 3960x : Clock = 3.3 GHz, Core = 6, Cache = 15 MB, Bus speed = 5GT/s, Memory speed = 51.2 GB/s I feel that memory speed might be the bottle-neck in this case. The data flow is like below; Source data on SSD -> CPU's multiple cores -> Recovery data on SSD

Image a elevator and deliverymen, who brings many packages through there. 6-Core CPU means 6 deliverymen. 8-Core CPU means 8 deliverymen. When there are more deliverymen, they can bring more packages at once. 8 deliverymen will bring 33% more packages than 6 deliverymen in total.

But, there is a limit of capacity on the elevator. How many deliverymen can ride the elevator at once is the problem. If the capacity is 10, sending 8 deliverymen can be faster than 6 deliverymen. (The required time to bring 120 packages is 120 / 8 = 15, or 120 / 6 = 20.) If the capacity is 5, sending 8 deliverymen is same speed as 6 deliverymen, because exceeded deliverymen just wait the elevator returns back. (The required time to bring 120 packages is 120 / 5 = 24.)

When CPU's total calculation speed reaches memory speed (or something bottle-neck), using more cores won't improve speed anymore. There may be difference, when you set more number of blocks to require more calculation cost. CPU's shared cache or threads' synchronization cost may affect, too. Basically multi-threading is good at heavy task with small memory usage to get full CPU power. PAR2 calculation is similar to memory copy (light task with large memory usage).

At the same time, I found that multipar does not support multi-core very well.

You are right. I didn't optimize my par2j for many core CPU. It's difficult to make a feature, which I cannot test it myself. Though I appreciate that many users helped me very much to test their cases, it's not easy to try still. (And also, I'm lazy to try new things, hehe.)

If you need, I can also do some more software tests for you

Thank you. Because it will require more time to improve hashing function, I want to release current sample implementation (1-pass processing) at first. I tested the behavior of HDD on my PC already, and it worked well in the tested case. I'm waiting test result of SSD by Slava46 now. I got 2 results, and 1 more is unknown yet. 1) 1-pass mode isn't fast on SSD (mostly same speed) 2) 1-pass mode and GPU is slow on SSD (maybe memory usage effect) 3) Speed comparison of "ReadAll" and "ReadSome" methods on SSD is unknown.

If you have time, please test the sample on your PC with SSD. 2 users test on different PCs would be good. It's sample package (par2j_sample_2020-06-13.zip) in "MultiPar_sample" folder on OneDrive. Testing 3 cases is enough; par2j64_debug.exe : case of "ReadSome" method par2j64_debug2.exe : case of "ReadAll" method par2j64_sample1.exe : case of 1-pass mode (ReadSome) Creating 10% redundancy from 40 GB source data is good. If file data is too large, "ReadSome" method doesn't work. If file data is too small (fits on disk cache), there is no difference.

r6472279 commented 4 years ago

very appreciate for answer my question

Next, I will try to overclock the memory to test whether the memory bandwidth will also affect the generation time of PAR2 for now, i used only 32GB（G.skill 8GB DDR3 1600*4）Dual channel memory mode i will try use 128GB or more memory simulated as hard disk ,Test whether there is bottleneck in pure memory

And use more core and high frequency processors to test different verification results of MultiPar

Yutaka-Sawada commented 4 years ago

I released version 1.3.1.0 today. It will be a good improvement for HDD users, who creates recovery files from big source files. Thanks jetelina for idea and test on HDD. Thanks Slava46 for test on SSD. I hope that I didn't mistake badly, hehe.

Yutaka-Sawada commented 4 years ago

I refined old hash function, which uses 3 threads. I tested them by erasing Disk Cache (RAMMap's empty command). There was no difference between 2 threads and 3 threads version on 4-core CPU. I don't know why.

At this time, 2 thread version seems to be enough speed. If there are multiple source files on SSD, calculating hash of 2 files with 4 threads will be possible solution. I put the sample package (par2j_hash_2020-06-23.zip) in "MultiPar_sample" folder on OneDrive.

Yutaka-Sawada commented 4 years ago

It's impossible to use multi-threading for faster calculation of one MD5. Refer this article. When calculating 4 MD5 hashes at once, SSE may improve speed by 2 times. But, PAR2 has 2 MD5. So, this method is useless. Thus, there is no good idea of software improvement for r6472279's case. (calculating hash of single source file)

I found a report, which might indicate that Intel/AMD architecture CPU isn't good at using many cores. Some graph about "Throughput vs Cores" are interesting in Linear scalability subject. As number of using cores are many, improvement becomes small.

Currently, as I wrote yesterday, one possible solution may be hashing multiple files at once on SSD. This method is useless for HDD. (Many seeking is bad and slow on HDD.) I'm not sure the speed of multiple reading on SSD, because I don't know how SSD's cache (or prefetch) works. If someone has a PC with SSD and 4 (or more) core CPU, and wants to help implementing the hash function, please post here or send e-mail to me. Then, I will make a sample for him.

Yutaka-Sawada commented 4 years ago

I made a sample, which calculates hash of multiple source files at once. It doesn't change function between HDD or SSD yet. I tested the behavior on my PC with HDD. When files exist on disk cache already, it's 1.5 ~ 2 times faster than single version. But, when it's first time (no disk cache), it's 4 or more times slower than single version. Because this method requires many seek to read multiple files at the same time, it's very bad (slow and harmful) for HDD. I'm not sure, how it works on SSD. If someone has a PC with SSD and 4 (or more) core CPU, and wants to test, please post here or send e-mail to me. At this time, it should be fast, when there are many source files on RAM disk.

HardcoreGames commented 4 years ago

I had to use somebody else's tool to add 2% recovery blocks to a 1.12 TB collection of 2802 files. I use 8192 blocks and it took 15 hours and 75% of 32GB along with saturating my CPU.

It's high time you moved to 64-bit for better access to more memory.

Yutaka-Sawada commented 4 years ago

add 2% recovery blocks to a 1.12 TB collection of 2802 files.

Though I don't know what tool you used, the data size might be too large. I don't recommend to treat so large data in PAR2. PAR2 doesn't support partial recovery. It means that you require whole 1.12 TB data set to repair a single file's several bytes error.

I recommend that you classify files in some groups, and create recovery files for each group separately. For example, by setting 10 groups, each group contains average 280 files (115 GB). As 2% redundancy of 280 files is 5.6 files, it will be enough protection still. Even when 5 files are erased completely , you can recover them from other 275 files. Though total creating time is same, repairing time may become 10 times faster !

It's high time you moved to 64-bit for better access to more memory.

MultiPar includes 64-bit version of PAR2 client. "par2j.exe" is 32-bit. "par2j64.exe" is 64-bit. When a user uses 64-bit OS, MultiPar GUI calls 64-bit version automatically.

HardcoreGames commented 4 years ago

I found a tool called phpar2 which is command line only and after fiddling with the thing I was able to get it working

http://paulhoule.com/phpar2/index.php

The package came with a 64-bit build, which i used but it was demanding using 75% of my RAM and loading my R5 3600 hard. It took 16 hours but it eventually said done.

Multipar was not able to do that

Yutaka-Sawada commented 4 years ago

Multipar was not able to do that

Because I cannot treat so big data myself, I never tested MultiPar for the case. As I uses a 500 GB HDD, I don't know how TB level data is processed. If there is a bug or problem, please report the incident with detail. Then, I will fix it.

HardcoreGames commented 4 years ago

I use vast NAS class storage so working with 5PB is not a problem.

My PC has a pair of 8TB hard disks and a pair of 4TB disks.

Yutaka-Sawada commented 4 years ago

I'm not sure what HardcoreGames wants. He used phpar2, which had no problem. While he could not use MultiPar in his usage, he didn't tell what was bad. Thus, I don't know how was the problem. I cannot solve a problem, when I don't know where is the wrong point. MultiPar has a feature to take log of processing. Screen-shot is helpful, too.

By the way, phpar2 is a good PAR2 client, which supports Multi-Threading. My PAR2 client's MMX encoder and MD5 hashing function are based on his code. So, there may be no speed difference on old (MMX age) CPU. If there is a problem in phpar2, there is new par2cmdline. If someone wants fast speed, there is another PAR2 creating tool, ParPar. They use OpenMP for multi-threading.

HardcoreGames commented 4 years ago

I have the source code that came with the phpar2 but it is already tuned with some assembler that seems to be aimed at the Core 2 class processors. All I know is that the linear algebra is memory demanding with larger data sets so I am considering for my next box installing even more RAM to handle the workload.

I tested MultiPar and it was able to check the output which took about 3 hours. I am considering testing a simulated error by removing one file and testing the tool.

When I attempted to load 2802 files, Multipar was not able to do that.

Yutaka-Sawada commented 4 years ago

All I know is that the linear algebra is memory demanding with larger data sets so I am considering for my next box installing even more RAM to handle the workload.

When data size is larger than RAM size, PAR2 client splits file data into smaller pieces virtually. Then, it calculates each piece set one by one. For example, 1000 GB = 20 GB 50 times. If you increase RAM size of your PC, it causes less number of iteration. Double size RAM means half number of iteration, as 1000 GB = 40 GB 25 times. Because HDD is slow at random access mode, many iteration may become slow by file access time generally. File access time is the bottle neck, and large RAM can decrease the slow access effect. But, I don't know the speed difference in the case of so large data set.

I tested MultiPar and it was able to check the output which took about 3 hours.

So, verification by MultiPar was possible. It took 3 hours to calculate hash of 1.12 TB files.

When I attempted to load 2802 files, Multipar was not able to do that.

This seems to point creation task. I'm not sure what is the "load" action. You could not select those files at MultiPar GUI ? or par2j64.exe didn't start something task ? Did it show error message ? Is there a log output of par2j64.exe ? Screen-shot or detailed explanation is required to know what happened. I cannot test your case by myself. If you want to fix a bug of MultiPar, you need to post your experience.

HardcoreGames commented 4 years ago

I used phpar2 with a command line:

phpar2 c -b8192 6-pieces.par2 *.emd

15-16 hours later it was done, my rig was loaded hard but it did the job

One issue I noticed with the par2 files the recovery blocks double with each successive file, which leads to some very large files, this suggests limiting the size and making a few more 6-pieces.vol007+08 which is managable and simply make more volxxx+8 instead of doubling them over and over

ghost commented 4 years ago

I skimmed through this issue, but I am surprised that you even got it to run at all with these file sizes. I tried creating PAR2 files for a rather large file with v1.3.1.1 (latest) and I just get the error "Error: cannot parse output (0x01)". Probably shouldn't hijack this and create my own issue, but I thought that some of the "speed improvements" made for this issue could have caused this bug. My system certainly has a constraint on RAM, but phpar2/par2cmdline loads the file and creates the recovery files just fine.

EDIT: Yes, v1.3.1.0 caused my issue. As you said yourself @Yutaka-Sawada, "Because this new implementation isn't tested so much, there may be a bug or problem." I have no good errors except the one I provided above, but if you want me to debug it myself then I can do so. For now I will stick with v1.3.0.

HardcoreGames commented 4 years ago

PAR2 works best with archives that are sliced into several volumes. When I backup files to BD I use smaller archive chunks and a higher level of redundancy so that if there is a problem I can still recover the backups.

7-zip can carve up a chunk of data into any sized piece you want. Smaller pieces are easier to calculate the PAR2 recovery blocks.

ghost commented 4 years ago

Yeah I had to learn that lesson the hard way. Had a mismatch at around 60% into the process of creating recovery files (CUDA calculations were probably messed up due to undervolting) and had to start over again.. Yikes.

Do you use M-Discs by any chance? I heard that their BDs are no different from what's on the market right now, but that their DVDs are actually good for long-term storage.

HardcoreGames commented 4 years ago

No just standard BD media. I am using BD50 dual layer discs for archival backups.

Yutaka-Sawada commented 4 years ago

I tried creating PAR2 files for a rather large file with v1.3.1.1 (latest) and I just get the error "Error: cannot parse output (0x01)".

Thanks Voczi for bug report. The error indicates that it cannot recognize output at all. There may be something problem in par2j's rare output. Or unknown error might happen in checking SSD drive. Now, I need to read the output for your case.

Please save the log and send it to me by e-mail. You can save log by checking an option "Log output of clients" in "Common options" section on "Client behavior" tab of MultiPar Option window. You can see the log saved folder by clicking "Open this user's save folder" in "Folder location" section on "System settings" tab. I will read an incompatible point in the log, and try to fix this problem. I may ask more test of the incident later.

ghost commented 4 years ago

Please save the log and send it to me by e-mail.

You should have received it by mail now. Hopefully I sent it to the right e-mail address.

HardcoreGames commented 4 years ago

I tried creating PAR2 files for a rather large file with v1.3.1.1 (latest) and I just get the error "Error: cannot parse output (0x01)".

Thanks Voczi for bug report. The error indicates that it cannot recognize output at all. There may be something problem in par2j's rare output. Or unknown error might happen in checking SSD drive. Now, I need to read the output for your case.

Please save the log and send it to me by e-mail. You can save log by checking an option "Log output of clients" in "Common options" section on "Client behavior" tab of MultiPar Option window. You can see the log saved folder by clicking "Open this user's save folder" in "Folder location" section on "System settings" tab. I will read an incompatible point in the log, and try to fix this problem. I may ask more test of the incident later.

I have long used 64-bit Windows from XP x64 onwards and I have compiled all my software for x64 in principle to motivate the adoption of Win64.

I have 32GB at present but my motherboard can have up to 128GB max. DDR5 is able to do 256GB and 512GB and eventually 1TB RAM but servers will be in the front of the line.

Yutaka-Sawada commented 4 years ago

I got the log file from Voczi. Thank you.

I found a bad point in my source code. There is a switch for 2 routes; "err > 0" or "err < 0". I forgot the case of "err == 0". This happens, when drive is HDD and RAM size isn't enough. I added some lines to treat this, and I hope the bug will be fixed.

I made a sample to test the problem. I put the sample (par2j_sample_2020-10-03.zip) in "MultiPar_sample" folder on OneDrive. Please test it in your usage.

ghost commented 4 years ago

New logs have been mailed to you.

Yutaka-Sawada commented 4 years ago

Thanks Voczi for test. I found another bug in my code. I forgot to modify old code, when I tried multi-reading on SSD. Though I stopped multi-reading files, some functions were incompatible. I fixed the bad point. Because this problem may happen on more users, I put samples in public space. If someone sees the same error, please try the fixed version.

I made a sample to test the problem. I put the sample (par2j_sample_2020-10-04.zip) in "MultiPar_sample" folder on OneDrive. Please test it in your usage.

ghost commented 4 years ago

Good job! Works flawlessly now. vmconnect_yCil4DyNJR Thank you for taking your time to fix the bug. This should probably be safe to publish to master now, but perhaps you would like to do some testing on your own too. Whatever may be the case, thank you again!

Yutaka-Sawada commented 4 years ago

I released new version as v1.3.1.2 today. Thanks Voczi for bug report and test.

HardcoreGames commented 4 years ago

I released new version as v1.3.1.2 today. Thanks Voczi for bug report and test.

download link?

Yutaka-Sawada commented 4 years ago

You may download them from GitHub or OneDrive.

ghost commented 4 years ago

I released new version as v1.3.1.2 today. Thanks Voczi for bug report and test.

Thank you for taking your time to fix the issues. Is "MultiPar_par2j_1312.7z" available somewhere (source code for new version)?

Yutaka-Sawada / MultiPar

Can source data be read just once? #3