Optimization for CPU's shared L3 cache

Yutaka-Sawada commented 1 year ago

I came up with a way to improve speed for CPU's L3 cache. By limiting number of accessing source blocks to a few, a thread may share block data with other threads on L3 cache. This optimization method requires shared L3 cache for multiple cores on a CPU.

I made some sample applications to find how many blocks is good. From tests on my PC (Core i5-10400, 256 KB L2 cache and 12 MB L3 cache), calculating 48 ~ 64 blocks seems to be faster than others. From tests on neighbor's PC (Core i5-4460, 256 KB L2 cache and 6 MB L3 cache), calculating 16 ~ 32 blocks seems to be faster than others. I assume that comparison of L3 cache size against L2 cache size may affect. But, I'm not sure these incidents adapt to other CPUs.

I'm looking for helpers to get more results of the experiment. I put the sample (TestBlock_2023-08-31.zip) in "MultiPar_sample" folder on OneDrive. If someone wants to aid development, please test with them. I want to know which ones are faster than others on what CPU spec. (Because debug output is long, no need to post whole result.) When my inference is good on most CPUs, I will implmenet the new method.

animetosho commented 1 year ago

I created a random 2200MB file, and ran the following batch file:

@del output.txt
@for %%i in (par2j*.exe) do (
echo ===== %%i ===== >> output.txt
%%i c /ss1048576 /rr10 out test2200m.bin >> output.txt
@del out.*
)

Note that I also renamed your EXEs so that they're alphabetical.

AMD FX 8320: 4 modules / 8 cores, L2: 2MB/module, L3: 8MB shared
AMD Athlon II X2 245: 2 cores, L2: 2MB shared, L3: none
Intel Core i5 3320M: 2 cores / 4 threads, L2: 256KB/core, L3: 3MB shared
Intel Core i7 12700K (P-cores only): 8 cores / 16 threads, L2: 1.25MB/core, L3: 25MB shared

VTune stats on the 12700K with par2j_32, on a 4100MB file. I focused on the multiplication function:

vtune

There's not much memory bound. Port utilization seems to be good.

Yutaka-Sawada commented 1 year ago

I created a random 2200MB file, and ran the following batch file:

Thank you for long tests. Different types of example is helpful. It brought new insight to me.

From reading your test results; AMD FX 8320: 8 blocks seems to be faster. AMD Athlon II X2 245: 8 blocks seems to be faster. Intel Core i5 3320M: 8 ~ 16 blocks seems to be faster. Intel Core i7 12700K: 32 blocks seems to be faster.

While Intel CPUs are similar property with my tests, AMD CPU are different. I think this is because AMD CPUs have large L2 cache as compared to Intel CPUs. When L2 cache size is enough large, L2 cache optimization may work better than L3 cache. I see 2 points to remark.

Rate of "L3 cache szie / L2 cache size" may affect L3 cache optimization. AMD FX 8320: 8 MB / 2 MB = 4 AMD Athlon II X2 245: 0 / 1 MB = 0 Intel Core i5 3320M: 3 MB / 0.25 MB = 12 -> This may cause 12 blocks is good for L3 cache optimization. Intel Core i7 12700K: 25 MB / 1.25 MB = 20 -> This may cause 20 blocks is good for L3 cache optimization.
Many splited L2 cache may affect L2 cache optimization. AMD FX 8320: 128 KB 16 -> This may cause 8 blocks is good for L2 cache optimization. AMD Athlon II X2 245: 64 KB 16 -> This may cause 8 blocks is good for L2 cache optimization. Intel Core i5 3320M: 32 KB 8 Intel Core i7 12700K: 128 KB 10

Though this is only an assumption, combination of these 2 factors would be important. I will test more and try to implement new method little by little. Because adapting new method to GPU is difficult, I need to test GPU later.

Yutaka-Sawada commented 1 year ago

I made samples of new optimization for CPU cache. While it's faster than old versions on my PC, it may not be fast on other PCs. If someone is interested in the speed difference, he may test it. I includes debug versions and source codes in the package. I put the sample (MultiPar_sample_2023-09-24.zip) in "MultiPar_sample" folder on OneDrive. They are available at github's "alpha" and "source" directries, too.

I may change these new versions still. If someone wants to improve speed on his PC, he may try "TestBlock_2023-08-31.zip" in the "MultiPar_sample" on OneDrive. Because I cannot test fast GPU, it may become slower than old versions. If there is a problem or failure in these samples, please report it.

supply9243 commented 1 year ago

In version V1.3.3.0, pure CPU calculation speed is already a little faster than CPU+GPU.😆 Laptop configuration: i7-10750H + RTX 2060

Slava46 commented 1 year ago

AMD Ryzen Threadripper 1950X MSI 3070 SSD NVme (PCIe 3.0) 70 GB source files, memory 37% of 32 GB DDR4 (seems more memory usage don't increase speed), 33% redundancy

1.3.2.9 version: CPU low GPU enabled: 08:34 min CPU high GPU disabled: 12:36 min

1.3.3.0 beta: CPU low GPU enabled: 24:38 min (as I understand in this version GPU decreased for testing CPU more). CPU high GPU disabled: 9:22 min CPU high GPU enabled: 09:01 min

So perfomance for CPU only increased significantly. My GPU still faster but difference not so big as were before.

Yutaka-Sawada commented 1 year ago

Thanks K2M74 and Slava46 for tests. While CPU only calculation becomes faster than old version, GPU calculation requires very fast GPU to be faster than CPU only calculation. It seems that CPU oriented optimization is bad for GPU. When a GPU is enough fast, getting the best performance of the GPU would be important. I will change PAR2 calculation method for GPU in next version.

Slava46 commented 1 year ago

When a GPU is enough fast, getting the best performance of the GPU would be important.

This. Yeah, if you have at least not top but any middle GPU it could be faster than good CPU (but anyway my 16-core CPU almost 6 years old already, but beated last year model middle GPU just little with new calculation). And yeah, calculation method for GPU and CPU seems should be different.

Interesting would be compare top CPU and top GPU out recent year with last technologies.

supply9243 commented 1 year ago

When using MultiPar to create a PAR2 file, par2j64.exe CPU usage can only be between 60-80%, most of the time at 60%. The overall system CPU is slightly higher than par2j64.exe, but much lower than 100%. The hardware environment has been turned on except for the GPU. Is there any setting error that prevents the program from fully utilizing the CPU performance? 2023-10-13_14-17-06

Slava46 commented 1 year ago

Agreed, for me the same, CPU using just for ~50-60% instead GPU using 90-95%.

Yutaka-Sawada commented 1 year ago

calculation method for GPU and CPU seems should be different.

I'm writing new method, which uses GPU thread prior to CPU calculation threads. I removed all CPU based optimization to get full speed of GPU. While CPU calculation threads become 10% ~ 20% speed, GPU thread becomes 130% speed. Without CPU's cache & SIMD optimization, CPU calculation becomes very slow. It uses less number of threads for easy memory access, too. Though the GPU method is 5 ~ 7 times slower than CPU only method on my PC, it may be different for very fast GPU. I will public debug versions in a few days. (maybe tomorrow)

When using MultiPar to create a PAR2 file, par2j64.exe CPU usage can only be between 60-80%, most of the time at 60%. CPU using just for ~50-60% instead GPU using 90-95%.

While i7-10750H has 6 cores, it tries to run 12 threads at once. While Ryzen Threadripper 1950X has 16 cores, it tries to run 32 threads at once. By default, my PAR2 client uses the number of cores. So, CPU usage percent on Windows Task Manager doesn't become 100%. Though it's possible to use 100% power of CPU by using more threads, it would not improve speed.

I think that there may be a bottle-neck somewhere. Memory access speed is important than CPU's calculation speed (using more threads). This is why CPU's L3 cache optimization improved speed. CPU's cache memory is much faster than PC's RAM. When you see low CPU usage percent on your PC, it means that the CPU runs efficiently. By seeing this from positive side, it performs full speed by using only 60% power of the CPU. It's good for power management. =)

You may test speed of different number of threads by setting "lc" option at command-line. From my experience, average of # cores and # threads seems to result 100% CPU usage.

Yutaka-Sawada commented 1 year ago

I made a new sample, in which I implemented new GPU method. I put the sample (MultiPar_sample_2023-10-14.zip) in "MultiPar_sample" folder on OneDrive. I don't know that it's fast on your PC. It's 5 ~ 7 times slow on my PC, which uses Intel UHD Graphics 630 (integrated GPU). If your GPU is very fast, it may run faster than old versions. When someone tests the speed, I will change the GPU method more from the result.

Slava46 commented 1 year ago

MultiPar_sample_2023-10-14 75 GB source files, extras features enabled all. GPU enabled CPU low: 13:02 GPU enabled CPU high: 12:22

1.3.2.9: GPU enabled CPU low: 11:02

So, seems old version for GPU still faster. Also new last beta looks like more stable for CPU high enabled than previous version for me (because an some old versions were a lot of Slice Mismatch errors for High CPU).

supply9243 commented 1 year ago

My conclusion is consistent with Slava46

Test file: 25.9GB

V1.3.2.8：3:10 V1.3.2.9：3:10 V1.3.3.0：4:02 V1.3.3.0 (2023-10-14):：4:04

CPU is high, CPU+GPU settings are all on

Yutaka-Sawada commented 1 year ago

So, seems old version for GPU still faster. My conclusion is consistent with Slava46

Thanks Slava46 and K2M74 for tests. My trials might be useless. I returned the GPU method to be similar to old versions. I removed CPU cache optimization as compared to CPU only method. It may use GPU thread more often. It will be mostly same speed as that of v1.3.2.9.

I put the sample (MultiPar_sample_2023-10-15.zip) in "MultiPar_sample" folder on OneDrive. If it's noticeably slower than v1.3.2.9, there may be a bad point somewhere in my code. Because CPU only method becomes much faster at v1.3.3.0, worth of GPU acceleration (CPU & GPU method) would become small.

supply9243 commented 1 year ago

Time under the same settings: 3:16

The advantage of GPU+CPU compared to previous versions is already very small, but it can also save 20% of time (although it causes power consumption to double🤣). If there's no better way, maybe use the old method while using the GPU. When no GPU is selected, a new method is used.

In addition: Is it possible to increase the speed of reading files when creating verification files? It only reaches a maximum of 500+MB/s on NVMe SSD (generally it can reach more than 2000MB/s). Or do redundant calculations at the same time while validating the whole file, they are two independent parts (I don't know if it can be done).

Slava46 commented 1 year ago

The same 75 Gb files as previous, settings, hardware (SSD drive) etc. MultiPar_sample_2023-10-15

GPU enabled CPU low: 09:56 (2nd try 09:44) - GPU periodically 100% load, so it's much faster for a full minute than were 1.3.2.9 - 11:02 or MultiPar_sample_2023-10-14 - 13:02 !

GPU enabled CPU high: 12:26 - but in this case when CPU high going some freezing Windows and other programs and Windows going slow but not hole working time but periodically and CPU using 65-70% - it's more than before (~50-60%). GPU almost 90-95-100% full load it's seems fine because GPU only and CPU low it's fine. For MultiPar_sample_2023-10-14 it was without so freezing for GPU enabled CPU high.

CPU high, GPU disabled (just curious because my previous test for those was for different 70 Gb files), no freezing: 10:34

So, freezing Windows looks like when CPU high GPU enabled only, when CPU low GPU enabled or CPU high GPU disabled it's fine.

P.S. So today sample version is good, GPU could be much more faster than before and seems model of GPU value because K2M74 have RTX 2060 and no difference but my one year modern MSI 3070 reached more than 1 minute faster than 1.3.2.9. But anyway NVidia 30xx series much faster than 20xx series. Interesting compare with 40xx new series.

500+MB/s on NVMe SSD (generally it can reach more than 2000MB/s).

Actually fine NVMe SSD based on just PCIe 3.0 reached 3300 MB/s but based on PCIe 4.0 ~7000 MB/s and on PCIe 5.0 now can reached 11000-12000 MB/s. Anyway if it could be few more times faster it would be already better than now.

Yutaka-Sawada commented 1 year ago

Thanks Slava46 and K2M74 for quick tests. Your experience helped me very much. I put new sample (MultiPar_sample_2023-10-16.zip) in "MultiPar_sample" folder on OneDrive. It should be same (or slightly faster) speed as compared to old versions (v1.3.2). Because I tried all possible ways in my mind now, I won't be able to improve GPU method anymore.

If there's no better way, maybe use the old method while using the GPU.

Oh, I see. I gave up to adapt CPU L3 cache optimization to GPU method. I returned the GPU method to use CPU L2 cache optimization only. CPU threads may run in 40 ~ 50% speed as compared to CPU only method. Though the GPU method is slower than CPU only method, it's 10% faster than old GPU method on my PC. (The difference may come from improvement of AVX2 code.)

Is it possible to increase the speed of reading files when creating verification files?

Yes, I can. I increased max number of threads to read files on SSD. It used 3 threads ago. It uses 4 threads in new version.

But, I could not see any difference on my PC. While it can read files from RAM (disk cache) in 1900 MB/s, it reads files from SSD in 1200 MB/s. I feel that Random Access Speed seems to be slow on SSD (however the difference should be smaller than HDD). SSD's Catalog spec speed may be for Sequential Access. I don't know how is the speed of your SSD.

So, freezing Windows looks like when CPU high GPU enabled only, when CPU low GPU enabled or CPU high GPU disabled it's fine.

I feel that there might be too many memory access on your PC. When it becomes too heavy at High CPU usage, please decrease CPU usage at MultiPar Option slider. I don't know what is the best setting on which PC environment. Users need to test by themselves.

Slava46 commented 1 year ago

Because I tried all possible ways in my mind now, I won't be able to improve GPU method anymore. You made nice improvment for GPU and it's already cool, because 10% faster for MultiPar_sample_2023-10-15 version it's great! Maybe you'll come up with a new idea.

About new sample testing, the same 75 Gb files. MultiPar_sample_2023-10-16: GPU enabled CPU low: 09:19 - seems you improved GPU more! very nice (or maybe it's because of SSD speed up) GPU disabled, CPU high: 09:34 - it's faster too for 1 minute compare with 10-15 sample.

Also I tested MultiPar_sample_2023-10-15 and MultiPar_sample_2023-10-16 for speed on start operations for checking SSD speed: Computing file hash+Constructing recovery files: MultiPar_sample_2023-10-15: 01:16 MultiPar_sample_2023-10-16: the same 01:16 - ~580 MB/s when CPU low GPU enabled (when CPU high it's going around ~800-1000 MB/s).

But on the Creating Recovery main operation SSD speed increase to 730 MB/s at some periods when using GPU or CPU high.

supply9243 commented 1 year ago

On my device it's slower instead: 3:46

The speed problem is that I misunderstood before. It is only 500+MB/s when reading a file, and can reach 1.7GB/s for multiple files, but after adding threads, it increases to 2.3GB/s.

Considering that SSDs are getting faster and faster, it may be possible to allow the software to dynamically increase or decrease the number of threads according to the I/O busyness of the SSD hard disk (SSDs do not need to consider head issues anyway).

However, the code needs to be adjusted as it is mistaken for a virus by Windows Defender. https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-description?name=Trojan%3AWin32%2FScarletFlash.A&threatid=2147722029

Yutaka-Sawada commented 1 year ago

Thanks Slava46 and K2M74 for tests.

On my device it's slower instead: 3:46

There was a difference in GPU function between old v1.3.2.9 and new one. In old version, it calculated a matrix once before loop. I changed it to calculate on time for less memory usage, because there seems to be no difference on my PC. But, it's slow on your PC. So, I returned the code to be old style. I'm not sure that it caused the slow down. If new sample runs faster, it might be the bad point.

I put the sample (MultiPar_sample_2023-10-16b.zip) in "MultiPar_sample" folder on OneDrive. If this is slow still, I don't know what is wrong.

it may be possible to allow the software to dynamically increase or decrease the number of threads according to the I/O busyness of the SSD hard disk

This is difficult for me. animetosho's ParPar seems to work well for file access. My par2j is always slower than ParPar, because ParPar's file access is faster. It would require complex coding to perform such speed. I'm too lazy to try now.

You may see hash calculation speed in new debug versions. There are lines like below;

cpu_num = %d, entity_num = %d, multi_read = %d
hash %d.%03d sec, %d MB/s

At this time, it uses file access buffer of L3 cache size. The speed in Slava46's case may come from CPU's large cache memory.

However, the code needs to be adjusted as it is mistaken for a virus by Windows Defender.

Basically, Windows Defender seems to warn new .EXE file with small difference. Microsoft may think them as an original file and modified trojan by a hacker. I cannot avoid this problem myself. I don't know what was bad nor how to solve the miss-detection.

Slava46 commented 1 year ago

MultiPar_sample_2023-10-16b GPU enabled CPU low: 09:39 - slower than MultiPar_sample_2023-10-16 for me GPU disabled, CPU high: 09:25 - little faster ~9 sec, so the same

Yutaka-Sawada commented 1 year ago

Thanks Slava46 and K2M74 for tests again. GPU method seems to work correctly on K2M74 's PC, because new version (v1.3.3.1) is slightly faster than old one (v1.3.2.9). But, GPU method may be worthless, when CPU only method is almost same speed. CPU's L3 cache optimization works too well.

Because increasing number of threads to read files on SSD for 6-Core CPU was useless, I adjusted the formula to be similar to old versions.

At old versions: 4 ~ 7 cores: 3 threads 8 ~ cores: 4 threads

At new version: 3 cores: 2 threads 4 ~ 6 cores: 3 threads 7 ~ 9 cores: 4 threads 10 ~ 12 cores: 5 threads 13 ~ cores: 6 threads

I will post in my native language because Google Translate to English will misinterpret my meaning

It's OK for me. My English knowledge isn't so high, too. I use Bing online translater sometimes.

GPU enabled CPU low: 09:39 - slower than MultiPar_sample_2023-10-16 for me GPU disabled, CPU high: 09:25 - little faster ~9 sec, so the same

While I test calculation speed on my PC, the results would have +- 10% range. The difference may come from hit rate of CPU Cache memory. Also, file access speed depends on OS's disk cache largely. I use Empty Standby list command of RAMMap to clear disk cache.

Because CPU (Ryzen Threadripper 1950X) on Slava46's PC has 16 cores, low CPU usage may cause the difference. If you set CPU usage to be the highest or second, it may become faster. But, I'm not sure how it is.

Then, I made three samples to test different ways. Though they are almost same speed as previous sample on my PC, it may not on your PC. At first, I changed order of starting threads. Also, I added clEnqueueMigrateMemObjects to start memory transfur to VRAM quickly. This may help GPU start-up somehow. (however I could not see any difference on my PC.)

At old versions: CPU threads at first, then GPU thread next

At new version: GPU thread at first, then CPU threads next

Other two use a complex synchronization method to control CPU threads. Because the calculation style is similar to CPU only method, it would not freeze PC at high CPU usage. But, I felt that the mechanism might be too complex, when the speed was not fast at all on my PC.

I put the sample (MultiPar_sample_2023-10-17.zip) in "MultiPar_sample" folder on OneDrive. If they are not so fast (or even slow), I will return to old source code. When CPU only method is enough fast, GPU acceleration will be worthless.

Slava46 commented 1 year ago

When CPU only method is enough fast, GPU acceleration will be worthless. I don't agree with that because someone could have ordinary CPU but fast GPU or want to calculate on GPU instead of CPU and if both methods will doing fast and good you'll have the choice what to use etc.

Seems SSD speed increased for both of my test method ~+200 MB/s when using data read etc.

MultiPar_sample_2023-10-17 par2j64_Migrate: GPU enabled CPU low: 09:22 GPU disabled, CPU high: 09:20

par2j64_Cache: GPU enabled CPU low: 09:54 GPU disabled, CPU high: 10:10

par2j64_VRAM: GPU enabled CPU low: 19:55 GPU disabled, CPU high: 10:14

supply9243 commented 1 year ago

par2j64_Cache_CPU：03:26 par2j64_Cache_CPU+GPU：02:36

par2j64_Migrate_CPU：03:24 par2j64_Migrate_CPU+GPU：02:49

par2j64_VRAM_CPU：03:25 par2j64_VRAM_CPU+GPU：03:08

Yutaka-Sawada commented 1 year ago

Thanks Slava46 and K2M74 for testing them. Those 3 ways were not fast. When I watched a dull TV anime, I came up with another calculating style. (Because the story was too opportunism, I thought others.)

From your test results, GPU doesn't become slow at a few source blocks. In current GPU method, CPU and GPU calculate different parity blocks from same source blocks. In my new idea, CPU and GPU calculate same parity blocks from different source blocks. This is a big change of calculation style.

Though new GPU method requires more RAM to keep double parity blocks (each for CPU and GPU), CPU threads and GPU thread work independently each other. This means that CPU's L3 cache optimization is available. (It may require at least 3 cores or more on a CPU.) Because it uses GPU's VRAM efficiently, it can work with graphics boards with small VRAM size.

I made a sample encoder at first. The GPU method is 40% faster than old v1.3.2.9 one on my PC. Though it's still slower than CPU only method, it's noticeably faster than all my previous implementations. I put the sample (MultiPar_sample_2023-10-18.zip) in "MultiPar_sample" folder on OneDrive. If the encoder is faster than old others, I will implement the same style decoder. This may be the last hope. I'm tired to try many ways to improve GPU method.

animetosho commented 1 year ago

Another possible strategy is to split the blocks in half (or some other ratio) and CPU/GPU processes each half independently. This avoids double memory allocation, but it's harder to dynamically allocate calculation based on speed, since the ratio chosen is somewhat fixed.

Slava46 commented 1 year ago

MultiPar_sample_2023-10-18 GPU enabled CPU low: 17:51 GPU disabled, CPU high: 10:11

For me for GPU seems it's very slow.

Yutaka-Sawada commented 1 year ago

Another possible strategy is to split the blocks in half (or some other ratio) and CPU/GPU processes each half independently.

Thanks animetosho for the idea. I implemented the similar idea in v1.3.3.0. It splits each block into some chunks. CPU & GPU pick different chunks one by one. If GPU thread is fast, it calculates more chunks than CPU threads. For multi-core CPU's each core, this method works well. But, GPU seems to be slow to treat small chunks. GPU may not reach full speed, when it starts and stops continuously for many light tasks.

For me for GPU seems it's very slow.

Thanks Slava46 for testing new method. From the nature of new calculation style, it would be slow at low CPU usage. It may be sensitive to CPU usage. I refer to your old test results as following.

1.3.3.0 beta: CPU low GPU enabled: 24:38 min (1478 sec) CPU high GPU disabled: 9:22 min (562 sec) CPU high GPU enabled: 09:01 min (541 sec)

MultiPar_sample_2023-10-18: GPU enabled CPU low: 17:51 (1071 sec) GPU disabled, CPU high: 10:11 (611 sec)

Compare relative speed in each case. In v1.3.3.0, "CPU low GPU enabled" is 38% speed of "CPU high GPU disabled". (2.63 times slow) But, when you set CPU usage to high, "CPU high GPU enabled" is 103% speed of "CPU high GPU disabled". While GPU method is very slow at low CPU usage, it's almost same speed at high CPU usage.

Now, in MultiPar_sample_2023-10-18, "CPU low GPU enabled" is 57% speed of "CPU high GPU disabled". (1.75 times slow) Though GPU method is very slow indead, the difference is smaller than the rate of v1.3.3.0. If you set CPU usage to be higher, it may become faster as same as the v1.3.3.0.

At this time, I set same task size for CPU threads and GPU thread. If I put larger task size for GPU thread, GPU may run more efficienctly. But, it's difficult to adjust task size dynamically.

Slava46 commented 1 year ago

I see the mind, tried now MultiPar_sample_2023-10-18 GPU enabled CPU high: 08:55 and now I don't see a lot of freezing like for previous where it were. But I tested before this mode just for MultiPar_sample_2023-10-14 - 12:22 for the same 75 Gb files (first was for 70 Gb) and MultiPar_sample_2023-10-15 - 12:26.

So looks like it's much faster.

So if this method faster than previous but slower when CPU low GPU enabled for this one need to use old fast one for GPU only but when using GPU enabled and CPU high using new thing and it'll be cool for all using variants.

P.S. so looks like for now it's 2023-10-18 version the fast for CPU high GPU enabled But MultiPar_sample_2023-10-16: GPU enabled CPU low: 09:19 - was the fastest for GPU only CPU low if you don't want use high CPU.

But CPU high only was good results for: MultiPar_sample_2023-10-17 / par2j64_Migrate / GPU disabled, CPU high: 09:20 MultiPar_sample_2023-10-16b / GPU disabled, CPU high: 09:25 MultiPar_sample_2023-10-16 / GPU disabled, CPU high: 09:34

Yutaka-Sawada commented 1 year ago

GPU enabled CPU high: 08:55 and now I don't see a lot of freezing like for previous where it were.

Thank you for the trial. It's good to know that new GPU method is faster at high CPU usage. From the spec of Ryzen Threadripper 1950X, it will calculate 64 source blocks each for CPU's L3 cache optimization. When there are 2000 source blocks, 2000 / 64 = 31 tasks. CPU and GPU threads pick them one by one. If GPU is fast, it will pick more tasks than CPU. (So, it cannot work in a few blocks.)

The problem is that GPU becomes slow for small task. When you set lower CPU usage, CPU threads become slow relatively. While GPU thread picks more tasks, it may not reach full speed. (If it starts and stops quickly at small task, it cannot clock up.) Then, total speed at low CPU usage is very slow.

To solve this problem in new sample, I set GPU's task size to be double than CPU's task. Because it's difficult to predict GPU speed, I just tried double size. If CPU threads calculate 64 source blocks at once, GPU thread tries to calculate 128 blocks in the same time. This may improve GPU speed a little. (However it's same speed on my PC.)

I put the sample (MultiPar_sample_2023-10-19.zip) in "MultiPar_sample" folder on OneDrive. I implemented both encoder and decoder now. If the double size method won't improve speed at low CPU usage case, I will return their task size to be same as previous one.

Slava46 commented 1 year ago

MultiPar_sample_2023-10-19 GPU enabled CPU low: 15:32 GPU disabled, CPU high: 09:39 GPU enabled CPU high: 08:48

Yutaka-Sawada commented 1 year ago

Thank you for test. Though setting double size task on GPU is a little faster, it was not enough. I came up with a simple adjusting method for GPU task. At first, it sets same task size to both CPU and GPU. After GPU thread finished the first task, it checks how many tasks CPU threads done. Then, it sets next GPU's task size by rate for remaining blocks. For example, when GPU thread finishes before CPU threads, it will pick 1/3 of remaining blocks next. If GPU runs faster in second time, it will pick additional blocks at later tasks. For example, when GPU speed is 1/3 of CPU, it will put 1/3 blocks on GPU thread at last. This dynamic adjustment will work for both fast and slow GPUs.

I put the sample (MultiPar_sample_2023-10-21.zip) in "MultiPar_sample" folder on OneDrive. Because my using GPU is very slow, I'm not sure that new method works for fast GPU. If it works well or it's acceptable level at least, I want to finish improvement of GPU implementation.

Slava46 commented 1 year ago

Though setting double size task on GPU is a little faster, it was not enough. Yeah, just similar almost the same.

MultiPar_sample_2023-10-21 GPU enabled CPU low: 11:05 GPU disabled, CPU high: 09:29 GPU enabled CPU high: 07:17

Congrats, you found a way to fine improve speed -1.5 min.

supply9243 commented 1 year ago

我这的测试时间不怎么稳定，但我不方便重启电脑，所以仅限参考😓。 V1.3.2.9_CPU+GPU：03：49 V2023-10-16_CPU+GPU：03:42 V2023-10-18_CPU+GPU：03:02 V2023-10-22_CPU+GPU：02:45

Yutaka-Sawada commented 1 year ago

Thanks Slava46 and K2M74 for testing new method. Your effort encouraged me to try many ways. It works well on GPUs, better than my thought. Even when he uses (not so expensive) middle range GPU, it will accelerate a little. When he uses expensive very fast GPU, it becomes noticeably faster. Though it consumes more RAM on PC, GPU acceleration is worth to use now.

There is one point I fix still. When CPU usage is low, it will use GPU more often relatively. Because the result was not so good than old best, the adjustment may require a tweak. To get full speed of GPU quickly, I should have been set higher rate at second task. For example, when GPU thread finishes before CPU threads, it would pick 1/2 of remaining blocks next. Setting larger task in former step may help GPU's early clock up. This would be good for less source blocks case, too. (However this change doesn't make difference for slow GPUs.)

I put the sample (MultiPar_sample_2023-10-22.zip) in "MultiPar_sample" folder on OneDrive. This change affects very fast GPU, which requires large task to be full speed. There may be no difference in case of many source blocks. If there is no serious problem, I adopt this GPU function in next version 1.3.3.1.

supply9243 commented 1 year ago

29.5GB V2023-10-21_CPU+GPU：02:43 V2023-10-22_CPU+GPU：02:44

110GB V2023-10-21_CPU+GPU：20:30 V2023-10-22_CPU+GPU：20:12

Slava46 commented 1 year ago

Mine the same 75 Gb files. MultiPar_sample_2023-10-22 GPU enabled CPU low: 10:43 - little faster GPU disabled, CPU high: 09:08 GPU enabled CPU high: 07:17 - the same

Yutaka-Sawada commented 1 year ago

Thanks Slava46 and K2M74 for testing many times. Without your aid, I could not implement this GPU function. OpenCL perfomance is varied in every graphics boards. We tested the new GPU acceleration method on fast, very fast, and slow GPUs. It will help other users.

Yutaka-Sawada commented 1 year ago

I found a rare unknown problem in GPU method. GPU function seems to freeze rarely once. After cancel the task, it never stops next time for a while. (It's very rare, like one time per one month.) So, I cannot reproduce the failure at this time. It may be a bug of OpenCL or Graphics driver.

While I review my source code to find a fault, I found another mistake. It causes error at using GPU rarely. I fixed the bug. (But, there is unknown freeze problem still.) I put the sample (MultiPar_sample_2023-10-25.zip) in "MultiPar_sample" folder on OneDrive. When someone wants to see behavior of new encoders, he should use the latest sample package.

Slava46 commented 1 year ago

MultiPar_sample_2023-10-25 GPU enabled CPU low: 08:56 GPU disabled, CPU high: 09:20 GPU enabled CPU high: 06:35

Seems this version faster again by almost minute and almost 2 minutes for GPU only, nice!

If difference just by those bugs - results is great after fixed them, because you wrote that fixed bugs were about GPU (if so those bugs influence not so rarely) and with enabled GPU both modes (CPU high and low with GPU enabled) going much faster.

supply9243 commented 1 year ago

V2023-10-22_CPU+GPU：02:33 V2023-10-25_CPU+GPU：02:31

Yutaka-Sawada commented 1 year ago

Thanks Slava46 and K2M74 for test again. In addition to fix a bug, I changed initial GPU task size. When there are many source blocks, it sets double size task on GPU at first.

Seems this version faster again by almost minute and almost 2 minutes for GPU only, nice!

I didn't think such speed difference at GeForce RTX 3070 on Slava46 's PC. Because GeForce RTX 2060 's speed is almost same on K2M74 's PC, default task size might be too small for very fast GPU.

To test larger initial task size, I made new sample of triple size. It sets 1 ~ 3 times larger task size on GPU by number of source blocks. (It set 1 ~ 2 times in old sample of 2023-10-25.) This may get early max speed of very fast GPU like GeForce RTX 3070. (If there is no difference in new version, 2 times larger task size was enough.) I improved the last step of loop for fast GPU, too. When GPU's final task size is too few, it picks all remaining task. But, this change won't make speed difference mostly.

I put the sample (MultiPar_sample_2023-10-26.zip) in "MultiPar_sample" folder on OneDrive. K2M74 doesn't need to test new setting, because old version could not make noticeable difference. In other words, GeForce RTX 2060 would performe its max speed already.

Slava46 commented 1 year ago

MultiPar_sample_2023-10-26 GPU enabled CPU low: 08:56 GPU disabled, CPU high: 09:09 GPU enabled CPU high: 06:43

Yeah seems there no difference.

For compare results I tested for those the same 75 Gb files on 1.3.2.9 and 1.3.3.0 last beta because first tests were on different 70 Gb files also the same Windows 11 22H2 22621.2428 version and 537.58 Nvidia driver as for last tests (since 12.10.23 so all 75 Gb tests relative).

1.3.2.9 version: GPU enabled CPU low: 11:16 GPU disabled, CPU high: 15:58 GPU enabled CPU high: 16:51 Notice that for 1.3.2.9 was some freezing Windows that fine for last samples for CPU high GPU enabled/disabled.

1.3.3.0 beta: GPU enabled CPU low: 27:23 GPU disabled, CPU high: 09:30 GPU enabled CPU high: 10:10

Yutaka-Sawada commented 1 year ago

Thanks Slava46 for confirming whole results. Though each test may include dispersion range, new GPU function is noticeably faster than old versions. I updated GitHub's alpha version. If there is no serious problem, it will become v1.3.3.1 later.

From internet search, Intel Integrated GPU may have a problem in OpenCL. I updated my Intel Graphics driver to be the latest one. But, I'm not sure that the driver problem was solved or not. NVIDIA or AMD GPU would not have such failure.

Slava46 commented 1 year ago

Also from my friend test for the same files and settings AMD 7950x, DDR5 64Gb, Nvidia 3070, SSD So here PCIe 5.0 for videocard/CPU (but as I rememeber 300xx series GPU still using PCIe 4.0), faster memory DDR5 and 16-core CPU too but newest generation. Difference of course would be nice for last CPU's.

2,2-2,3 GBit SSD speed here for files, 50% CPU using.

MultiPar_sample_2023-10-26 GPU on, CPU low: 06:32 GPU off, CPU high: 04:26 GPU on, CPU high: 03:57

And some idea for modernisation program.

For now creating .par2 files waiting for calculation file hash but if you could calculate hash in another thread and create .par2 files right away but no waiting finishing hash calculation the speed could be faster because around 34 seconds going calculating hash for file for this test of 75 Gb files pack.

Another thing is that program reading from hard drive SSD etc for calculation hash after that reading parts for calculation, writing to drive etc again and again but you could read from drive before it would be needed soon so when it'll be needed it already read and ready and that could be faster program speed guess around twice but needed to change your code a lot of course.

Yutaka-Sawada commented 1 year ago

Thanks Slava46 for another test set. It's interesting to see results of different CPU speed. AMD Ryzen9 7950x seems to be much faster than Ryzen Threadripper 1950X. As CPU is faster, effect of GPU becomes less. When CPU is fast, it requires fast GPU for acceleration. Because the effectiveness is relative to both speed, the default setting of "GPU acceleration" is disabled.

if you could calculate hash in another thread and create .par2 files right away but no waiting finishing hash calculation the speed could be faster

I tried this kind of file access in 1-pass processing mode. It's possible only when creating parity data fits in PC's available RAM size. When there are enough RAM and files are on HDD, MultiPar selects the mode. While you use SSD, you may try the HDD mode manually. (If you use HDD, don't try SSD mode, because multi-reading is very slow on HDD.) On MultiPar's Option window, there is "File access mode" item in "Hardware environment" section of "System settings" tab. If you select HDD, it will try 1-pass processing mode (when there is enough memory). But, it's slower than multi-reading over SSD normally. When it uses HDD (1-pass processing) mode, one thread reads file data and calculates hash value. When it uses SSD (2-pass processing with multi-reading) mode, multiple threads read file data and calculate hash value independently. Then, multi-reading over SSD would be faster in total. So, MultiPar prefers SSD mode, when files are put on SSD. I made the manual setting as Option, because it fails to detect drive type of external device like USB.

Another thing is that program reading from hard drive SSD etc for calculation hash after that reading parts for calculation, writing to drive etc again and again but you could read from drive before it would be needed soon

I'm not sure what you say about. At this time, 2-pass processing mode uses Double Buffering at file access. It keeps 2 buffers of same size. While it reads file data to a buffer, it calculates hash value on another buffer. But, it's difficult to determine the best setting (buffer size) for speed. Though Mr. animetosho suggested to use large buffer size ago, I could not see improvement in my implementation. (But, animetosho's ParPar is faster in file access.) It may depend on drive's hardware disk cache size.

Slava46 commented 1 year ago

What about add support AVX512, theoretically it could be much improvement but of course only on new CPU's.

Yutaka-Sawada commented 1 year ago

What about add support AVX512

Because ParPar supports AVX-512, it's possible technically. (I can learn how to use AVX-512 by reading ParPar's source code.) But, AVX-512 isn't so common at this time. My PC 's CPU (Intel Core i5-10400) doesn't support AVX-512, too. I'm difficult to implement fast code for AVX-512 now. Thus, it will be a future task.

MikeSW17 commented 1 year ago

Here's the L3 Cash test data from my System: I don't have time at the moment to analyse/chart the results, but hope they're useful.

CPU: AMD Ryzen 9 7950X M/B: MSI MPG X670E Carbon Wifi RAM: 64Gb OS: Windows 11 Pro

I chose a 2139Mb File and renamed the exe files for alpha-numeric sorting

=============================================

===== par2j_004.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\" Recovery File : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (49454 MB available)

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 34.3% 69.6% 100.0% hash 3.407 sec, 636 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 96.3% 100.0% write 1.437 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 4, modulo = 0 8.4% 30.9% 53.5% 76.2% 97.3% 100.0% read 0.641 sec write 0.719 sec sub-thread : total loop = 26052 2nd encode 4.422 sec, 26052 loop, 5891 MB/s sub-thread : total loop = 27595 2nd encode 4.422 sec, 27595 loop, 6240 MB/s sub-thread : total loop = 29055 2nd encode 4.406 sec, 29055 loop, 6594 MB/s sub-thread : total loop = 29899 2nd encode 4.406 sec, 29899 loop, 6786 MB/s sub-thread : total loop = 30985 2nd encode 4.406 sec, 30985 loop, 7032 MB/s sub-thread : total loop = 31488 2nd encode 4.390 sec, 31488 loop, 7172 MB/s sub-thread : total loop = 31541 2nd encode 4.374 sec, 31541 loop, 7211 MB/s sub-thread : total loop = 32138 2nd encode 4.374 sec, 32138 loop, 7347 MB/s sub-thread : total loop = 31742 2nd encode 4.358 sec, 31742 loop, 7283 MB/s sub-thread : total loop = 31089 2nd encode 4.358 sec, 31089 loop, 7133 MB/s sub-thread : total loop = 31039 2nd encode 4.343 sec, 31039 loop, 7147 MB/s sub-thread : total loop = 29834 2nd encode 4.343 sec, 29834 loop, 6869 MB/s sub-thread : total loop = 28760 2nd encode 4.327 sec, 28760 loop, 6646 MB/s sub-thread : total loop = 27525 2nd encode 4.327 sec, 27525 loop, 6361 MB/s sub-thread : total loop = 25958 2nd encode 4.343 sec, 25958 loop, 5977 MB/s sub-thread : total loop = 25748 2nd encode 4.327 sec, 25748 loop, 5950 MB/s total 5.828 sec

Created successfully ===== par2j_008.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\" Recovery File : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (49443 MB available)

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 34.3% 69.4% 100.0% hash 2.969 sec, 730 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 92.1% 100.0% write 1.500 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 8, modulo = 0 16.1% 59.4% 97.0% 99.6% 100.0% read 0.625 sec write 1.438 sec sub-thread : total loop = 25576 2nd encode 2.313 sec, 25576 loop, 11057 MB/s sub-thread : total loop = 28153 2nd encode 2.313 sec, 28153 loop, 12172 MB/s sub-thread : total loop = 29860 2nd encode 2.297 sec, 29860 loop, 12999 MB/s sub-thread : total loop = 30105 2nd encode 2.297 sec, 30105 loop, 13106 MB/s sub-thread : total loop = 30436 2nd encode 2.297 sec, 30436 loop, 13250 MB/s sub-thread : total loop = 30848 2nd encode 2.297 sec, 30848 loop, 13430 MB/s sub-thread : total loop = 30730 2nd encode 2.297 sec, 30730 loop, 13378 MB/s sub-thread : total loop = 31193 2nd encode 2.297 sec, 31193 loop, 13580 MB/s sub-thread : total loop = 30877 2nd encode 2.282 sec, 30877 loop, 13531 MB/s sub-thread : total loop = 31043 2nd encode 2.282 sec, 31043 loop, 13603 MB/s sub-thread : total loop = 30867 2nd encode 2.266 sec, 30867 loop, 13622 MB/s sub-thread : total loop = 29872 2nd encode 2.266 sec, 29872 loop, 13183 MB/s sub-thread : total loop = 29234 2nd encode 2.266 sec, 29234 loop, 12901 MB/s sub-thread : total loop = 28671 2nd encode 2.266 sec, 28671 loop, 12653 MB/s sub-thread : total loop = 26972 2nd encode 2.250 sec, 26972 loop, 11987 MB/s sub-thread : total loop = 26010 2nd encode 2.250 sec, 26010 loop, 11560 MB/s total 4.422 sec

Created successfully ===== par2j_012.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\" Recovery File : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (49428 MB available)

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 33.9% 68.7% 100.0% hash 2.968 sec, 730 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 9.2% 83.8% 100.0% write 2.391 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 12, modulo = 8 27.9% 90.2% 97.1% 99.4% 100.0% read 0.547 sec write 2.859 sec sub-thread : total loop = 24975 2nd encode 1.610 sec, 24975 loop, 15512 MB/s sub-thread : total loop = 29361 2nd encode 1.595 sec, 29361 loop, 18408 MB/s sub-thread : total loop = 31836 2nd encode 1.595 sec, 31836 loop, 19960 MB/s sub-thread : total loop = 31881 2nd encode 1.595 sec, 31881 loop, 19988 MB/s sub-thread : total loop = 31724 2nd encode 1.595 sec, 31724 loop, 19890 MB/s sub-thread : total loop = 31085 2nd encode 1.595 sec, 31085 loop, 19489 MB/s sub-thread : total loop = 30317 2nd encode 1.595 sec, 30317 loop, 19008 MB/s sub-thread : total loop = 29944 2nd encode 1.595 sec, 29944 loop, 18774 MB/s sub-thread : total loop = 29660 2nd encode 1.595 sec, 29660 loop, 18596 MB/s sub-thread : total loop = 29804 2nd encode 1.595 sec, 29804 loop, 18686 MB/s sub-thread : total loop = 29768 2nd encode 1.595 sec, 29768 loop, 18663 MB/s sub-thread : total loop = 29126 2nd encode 1.595 sec, 29126 loop, 18261 MB/s sub-thread : total loop = 28093 2nd encode 1.595 sec, 28093 loop, 17613 MB/s sub-thread : total loop = 29193 2nd encode 1.595 sec, 29193 loop, 18303 MB/s sub-thread : total loop = 27628 2nd encode 1.595 sec, 27628 loop, 17322 MB/s sub-thread : total loop = 26056 2nd encode 1.595 sec, 26056 loop, 16336 MB/s total 5.062 sec

Created successfully ===== par2j_016.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\" Recovery File : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (49437 MB available)

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 33.9% 69.1% 100.0% hash 2.953 sec, 734 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 75.5% 100.0% write 3.609 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 16, modulo = 8 37.9% 97.4% 99.3% 100.0% read 0.547 sec write 1.829 sec sub-thread : total loop = 22456 2nd encode 1.234 sec, 22456 loop, 18198 MB/s sub-thread : total loop = 30151 2nd encode 1.234 sec, 30151 loop, 24434 MB/s sub-thread : total loop = 31504 2nd encode 1.234 sec, 31504 loop, 25530 MB/s sub-thread : total loop = 31071 2nd encode 1.234 sec, 31071 loop, 25179 MB/s sub-thread : total loop = 31944 2nd encode 1.234 sec, 31944 loop, 25887 MB/s sub-thread : total loop = 32000 2nd encode 1.234 sec, 32000 loop, 25932 MB/s sub-thread : total loop = 30854 2nd encode 1.234 sec, 30854 loop, 25004 MB/s sub-thread : total loop = 29853 2nd encode 1.234 sec, 29853 loop, 24192 MB/s sub-thread : total loop = 29923 2nd encode 1.234 sec, 29923 loop, 24249 MB/s sub-thread : total loop = 29708 2nd encode 1.219 sec, 29708 loop, 24371 MB/s sub-thread : total loop = 30100 2nd encode 1.219 sec, 30100 loop, 24693 MB/s sub-thread : total loop = 30183 2nd encode 1.219 sec, 30183 loop, 24761 MB/s sub-thread : total loop = 29970 2nd encode 1.219 sec, 29970 loop, 24586 MB/s sub-thread : total loop = 29789 2nd encode 1.219 sec, 29789 loop, 24437 MB/s sub-thread : total loop = 27405 2nd encode 1.219 sec, 27405 loop, 22482 MB/s sub-thread : total loop = 23538 2nd encode 1.219 sec, 23538 loop, 19309 MB/s total 3.656 sec

Created successfully ===== par2j_032.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\" Recovery File : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (49420 MB available)

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 34.3% 69.5% 100.0% hash 2.953 sec, 734 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 100.0% write 1.172 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 32, modulo = 24 48.6% 97.0% 99.7% 100.0% read 0.547 sec write 2.250 sec sub-thread : total loop = 21476 2nd encode 0.953 sec, 21476 loop, 22535 MB/s sub-thread : total loop = 30618 2nd encode 0.953 sec, 30618 loop, 32128 MB/s sub-thread : total loop = 31512 2nd encode 0.953 sec, 31512 loop, 33067 MB/s sub-thread : total loop = 31368 2nd encode 0.953 sec, 31368 loop, 32916 MB/s sub-thread : total loop = 31546 2nd encode 0.953 sec, 31546 loop, 33102 MB/s sub-thread : total loop = 31752 2nd encode 0.953 sec, 31752 loop, 33318 MB/s sub-thread : total loop = 31355 2nd encode 0.953 sec, 31355 loop, 32902 MB/s sub-thread : total loop = 29794 2nd encode 0.953 sec, 29794 loop, 31264 MB/s sub-thread : total loop = 29737 2nd encode 0.953 sec, 29737 loop, 31204 MB/s sub-thread : total loop = 29862 2nd encode 0.953 sec, 29862 loop, 31335 MB/s sub-thread : total loop = 29858 2nd encode 0.953 sec, 29858 loop, 31331 MB/s sub-thread : total loop = 30111 2nd encode 0.953 sec, 30111 loop, 31596 MB/s sub-thread : total loop = 30057 2nd encode 0.953 sec, 30057 loop, 31540 MB/s sub-thread : total loop = 30157 2nd encode 0.953 sec, 30157 loop, 31645 MB/s sub-thread : total loop = 29151 2nd encode 0.953 sec, 29151 loop, 30589 MB/s sub-thread : total loop = 22094 2nd encode 0.953 sec, 22094 loop, 23184 MB/s total 3.781 sec

Created successfully ===== par2j_048.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\" Recovery File : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (49416 MB available)

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 34.3% 69.5% 100.0% hash 2.937 sec, 738 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 75.5% 100.0% write 1.563 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 48, modulo = 8 38.6% 97.5% 99.7% 100.0% read 0.547 sec write 1.453 sec sub-thread : total loop = 24370 2nd encode 1.187 sec, 24370 loop, 20531 MB/s sub-thread : total loop = 29072 2nd encode 1.187 sec, 29072 loop, 24492 MB/s sub-thread : total loop = 30565 2nd encode 1.187 sec, 30565 loop, 25750 MB/s sub-thread : total loop = 31328 2nd encode 1.187 sec, 31328 loop, 26393 MB/s sub-thread : total loop = 31669 2nd encode 1.187 sec, 31669 loop, 26680 MB/s sub-thread : total loop = 31645 2nd encode 1.187 sec, 31645 loop, 26660 MB/s sub-thread : total loop = 31205 2nd encode 1.187 sec, 31205 loop, 26289 MB/s sub-thread : total loop = 29665 2nd encode 1.187 sec, 29665 loop, 24992 MB/s sub-thread : total loop = 29590 2nd encode 1.187 sec, 29590 loop, 24929 MB/s sub-thread : total loop = 29543 2nd encode 1.187 sec, 29543 loop, 24889 MB/s sub-thread : total loop = 29628 2nd encode 1.187 sec, 29628 loop, 24961 MB/s sub-thread : total loop = 29468 2nd encode 1.187 sec, 29468 loop, 24826 MB/s sub-thread : total loop = 29104 2nd encode 1.187 sec, 29104 loop, 24519 MB/s sub-thread : total loop = 29560 2nd encode 1.187 sec, 29560 loop, 24903 MB/s sub-thread : total loop = 29889 2nd encode 1.187 sec, 29889 loop, 25181 MB/s sub-thread : total loop = 24149 2nd encode 1.187 sec, 24149 loop, 20345 MB/s total 3.234 sec

Created successfully ===== par2j_064.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\" Recovery File : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (49398 MB available)

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 34.0% 69.1% 100.0% hash 2.969 sec, 730 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 75.5% 100.0% write 1.531 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 64, modulo = 56 31.4% 97.0% 97.3% 99.3% 100.0% read 0.547 sec write 2.297 sec sub-thread : total loop = 25724 2nd encode 1.453 sec, 25724 loop, 17704 MB/s sub-thread : total loop = 29620 2nd encode 1.438 sec, 29620 loop, 20598 MB/s sub-thread : total loop = 30736 2nd encode 1.438 sec, 30736 loop, 21374 MB/s sub-thread : total loop = 30360 2nd encode 1.438 sec, 30360 loop, 21113 MB/s sub-thread : total loop = 30857 2nd encode 1.438 sec, 30857 loop, 21458 MB/s sub-thread : total loop = 30759 2nd encode 1.438 sec, 30759 loop, 21390 MB/s sub-thread : total loop = 30594 2nd encode 1.438 sec, 30594 loop, 21276 MB/s sub-thread : total loop = 30088 2nd encode 1.438 sec, 30088 loop, 20924 MB/s sub-thread : total loop = 29897 2nd encode 1.438 sec, 29897 loop, 20791 MB/s sub-thread : total loop = 29954 2nd encode 1.438 sec, 29954 loop, 20830 MB/s sub-thread : total loop = 29868 2nd encode 1.438 sec, 29868 loop, 20771 MB/s sub-thread : total loop = 29592 2nd encode 1.438 sec, 29592 loop, 20579 MB/s sub-thread : total loop = 29976 2nd encode 1.438 sec, 29976 loop, 20846 MB/s sub-thread : total loop = 28716 2nd encode 1.438 sec, 28716 loop, 19970 MB/s sub-thread : total loop = 29165 2nd encode 1.438 sec, 29165 loop, 20282 MB/s sub-thread : total loop = 24542 2nd encode 1.438 sec, 24542 loop, 17067 MB/s total 4.329 sec

Created successfully ===== par2j_080.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\" Recovery File : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (49294 MB available)

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 33.9% 68.5% 100.0% hash 2.984 sec, 726 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 71.4% 100.0% write 2.844 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 80, modulo = 8 32.2% 97.0% 97.7% 99.7% 100.0% read 0.547 sec write 2.547 sec sub-thread : total loop = 26059 2nd encode 1.485 sec, 26059 loop, 17548 MB/s sub-thread : total loop = 29391 2nd encode 1.485 sec, 29391 loop, 19792 MB/s sub-thread : total loop = 30010 2nd encode 1.485 sec, 30010 loop, 20209 MB/s sub-thread : total loop = 29834 2nd encode 1.485 sec, 29834 loop, 20090 MB/s sub-thread : total loop = 30235 2nd encode 1.485 sec, 30235 loop, 20360 MB/s sub-thread : total loop = 30305 2nd encode 1.485 sec, 30305 loop, 20408 MB/s sub-thread : total loop = 30306 2nd encode 1.485 sec, 30306 loop, 20408 MB/s sub-thread : total loop = 29968 2nd encode 1.485 sec, 29968 loop, 20181 MB/s sub-thread : total loop = 30102 2nd encode 1.485 sec, 30102 loop, 20271 MB/s sub-thread : total loop = 29870 2nd encode 1.485 sec, 29870 loop, 20115 MB/s sub-thread : total loop = 29584 2nd encode 1.485 sec, 29584 loop, 19922 MB/s sub-thread : total loop = 29835 2nd encode 1.485 sec, 29835 loop, 20091 MB/s sub-thread : total loop = 30040 2nd encode 1.485 sec, 30040 loop, 20229 MB/s sub-thread : total loop = 29524 2nd encode 1.485 sec, 29524 loop, 19882 MB/s sub-thread : total loop = 29745 2nd encode 1.485 sec, 29745 loop, 20030 MB/s sub-thread : total loop = 25640 2nd encode 1.485 sec, 25640 loop, 17266 MB/s total 4.656 sec

Created successfully ===== par2j_096.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\" Recovery File : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (49353 MB available)

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 34.3% 69.6% 100.0% hash 2.937 sec, 738 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 88.0% 100.0% write 3.313 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 96, modulo = 56 30.0% 97.0% 98.8% 100.0% read 0.547 sec write 1.812 sec sub-thread : total loop = 27144 2nd encode 1.609 sec, 27144 loop, 16870 MB/s sub-thread : total loop = 28863 2nd encode 1.609 sec, 28863 loop, 17939 MB/s sub-thread : total loop = 29833 2nd encode 1.609 sec, 29833 loop, 18541 MB/s sub-thread : total loop = 29788 2nd encode 1.609 sec, 29788 loop, 18513 MB/s sub-thread : total loop = 30092 2nd encode 1.609 sec, 30092 loop, 18702 MB/s sub-thread : total loop = 30079 2nd encode 1.609 sec, 30079 loop, 18694 MB/s sub-thread : total loop = 30004 2nd encode 1.609 sec, 30004 loop, 18648 MB/s sub-thread : total loop = 29962 2nd encode 1.609 sec, 29962 loop, 18622 MB/s sub-thread : total loop = 29908 2nd encode 1.609 sec, 29908 loop, 18588 MB/s sub-thread : total loop = 29940 2nd encode 1.609 sec, 29940 loop, 18608 MB/s sub-thread : total loop = 29652 2nd encode 1.609 sec, 29652 loop, 18429 MB/s sub-thread : total loop = 29940 2nd encode 1.609 sec, 29940 loop, 18608 MB/s sub-thread : total loop = 29961 2nd encode 1.609 sec, 29961 loop, 18621 MB/s sub-thread : total loop = 29567 2nd encode 1.609 sec, 29567 loop, 18376 MB/s sub-thread : total loop = 28681 2nd encode 1.594 sec, 28681 loop, 17993 MB/s sub-thread : total loop = 27035 2nd encode 1.594 sec, 27035 loop, 16960 MB/s total 4.015 sec

Created successfully ===== par2j_128.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\" Recovery File : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (49328 MB available)

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 34.2% 69.4% 100.0% hash 2.953 sec, 734 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 83.8% 100.0% write 3.547 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 128, modulo = 120 28.6% 97.0% 97.4% 99.0% 100.0% read 0.546 sec write 2.891 sec sub-thread : total loop = 28200 2nd encode 1.688 sec, 28200 loop, 16706 MB/s sub-thread : total loop = 28617 2nd encode 1.673 sec, 28617 loop, 17105 MB/s sub-thread : total loop = 29712 2nd encode 1.673 sec, 29712 loop, 17760 MB/s sub-thread : total loop = 29560 2nd encode 1.673 sec, 29560 loop, 17669 MB/s sub-thread : total loop = 29641 2nd encode 1.688 sec, 29641 loop, 17560 MB/s sub-thread : total loop = 29684 2nd encode 1.673 sec, 29684 loop, 17743 MB/s sub-thread : total loop = 29728 2nd encode 1.673 sec, 29728 loop, 17769 MB/s sub-thread : total loop = 30167 2nd encode 1.688 sec, 30167 loop, 17871 MB/s sub-thread : total loop = 29672 2nd encode 1.688 sec, 29672 loop, 17578 MB/s sub-thread : total loop = 29968 2nd encode 1.673 sec, 29968 loop, 17913 MB/s sub-thread : total loop = 30024 2nd encode 1.673 sec, 30024 loop, 17946 MB/s sub-thread : total loop = 29343 2nd encode 1.688 sec, 29343 loop, 17383 MB/s sub-thread : total loop = 29633 2nd encode 1.688 sec, 29633 loop, 17555 MB/s sub-thread : total loop = 29790 2nd encode 1.688 sec, 29790 loop, 17648 MB/s sub-thread : total loop = 29441 2nd encode 1.688 sec, 29441 loop, 17441 MB/s sub-thread : total loop = 27266 2nd encode 1.673 sec, 27266 loop, 16298 MB/s total 5.187 sec

Created successfully ===== par2j_160.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\" Recovery File : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (49305 MB available)

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 34.2% 69.4% 100.0% hash 2.953 sec, 734 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 50.6% 100.0% write 2.766 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 160, modulo = 88 28.6% 85.9% 99.7% 100.0% read 0.532 sec write 0.922 sec sub-thread : total loop = 27697 2nd encode 1.750 sec, 27697 loop, 15827 MB/s sub-thread : total loop = 29068 2nd encode 1.734 sec, 29068 loop, 16764 MB/s sub-thread : total loop = 29110 2nd encode 1.719 sec, 29110 loop, 16934 MB/s sub-thread : total loop = 28888 2nd encode 1.734 sec, 28888 loop, 16660 MB/s sub-thread : total loop = 30010 2nd encode 1.734 sec, 30010 loop, 17307 MB/s sub-thread : total loop = 29451 2nd encode 1.750 sec, 29451 loop, 16829 MB/s sub-thread : total loop = 29992 2nd encode 1.734 sec, 29992 loop, 17296 MB/s sub-thread : total loop = 30128 2nd encode 1.735 sec, 30128 loop, 17365 MB/s sub-thread : total loop = 30012 2nd encode 1.719 sec, 30012 loop, 17459 MB/s sub-thread : total loop = 29923 2nd encode 1.734 sec, 29923 loop, 17257 MB/s sub-thread : total loop = 29425 2nd encode 1.719 sec, 29425 loop, 17118 MB/s sub-thread : total loop = 29737 2nd encode 1.750 sec, 29737 loop, 16993 MB/s sub-thread : total loop = 29894 2nd encode 1.719 sec, 29894 loop, 17390 MB/s sub-thread : total loop = 29897 2nd encode 1.735 sec, 29897 loop, 17232 MB/s sub-thread : total loop = 29040 2nd encode 1.734 sec, 29040 loop, 16747 MB/s sub-thread : total loop = 28174 2nd encode 1.734 sec, 28174 loop, 16248 MB/s total 3.265 sec

Created successfully ===== par2j_192.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\" Recovery File : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (49312 MB available)

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 34.2% 69.4% 100.0% hash 2.953 sec, 734 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 88.0% 100.0% write 1.391 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 192, modulo = 56 34.3% 97.0% 97.8% 100.0% read 0.547 sec write 1.719 sec sub-thread : total loop = 27851 2nd encode 1.765 sec, 27851 loop, 15780 MB/s sub-thread : total loop = 29205 2nd encode 1.749 sec, 29205 loop, 16698 MB/s sub-thread : total loop = 29253 2nd encode 1.749 sec, 29253 loop, 16726 MB/s sub-thread : total loop = 29443 2nd encode 1.749 sec, 29443 loop, 16834 MB/s sub-thread : total loop = 29827 2nd encode 1.749 sec, 29827 loop, 17054 MB/s sub-thread : total loop = 29808 2nd encode 1.765 sec, 29808 loop, 16888 MB/s sub-thread : total loop = 29678 2nd encode 1.749 sec, 29678 loop, 16969 MB/s sub-thread : total loop = 29748 2nd encode 1.765 sec, 29748 loop, 16854 MB/s sub-thread : total loop = 29754 2nd encode 1.765 sec, 29754 loop, 16858 MB/s sub-thread : total loop = 29824 2nd encode 1.765 sec, 29824 loop, 16897 MB/s sub-thread : total loop = 29448 2nd encode 1.765 sec, 29448 loop, 16684 MB/s sub-thread : total loop = 29840 2nd encode 1.765 sec, 29840 loop, 16907 MB/s sub-thread : total loop = 29699 2nd encode 1.765 sec, 29699 loop, 16827 MB/s sub-thread : total loop = 29797 2nd encode 1.765 sec, 29797 loop, 16882 MB/s sub-thread : total loop = 28989 2nd encode 1.765 sec, 28989 loop, 16424 MB/s sub-thread : total loop = 28284 2nd encode 1.749 sec, 28284 loop, 16172 MB/s total 4.062 sec

Created successfully ===== par2j_256.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\" Recovery File : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (49301 MB available)

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 34.3% 69.6% 100.0% hash 2.953 sec, 734 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 13.3% 88.0% 100.0% write 2.375 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 256, modulo = 120 34.3% 97.0% 97.2% 97.8% 100.0% read 0.547 sec write 2.765 sec sub-thread : total loop = 27922 2nd encode 1.828 sec, 27922 loop, 15275 MB/s sub-thread : total loop = 29261 2nd encode 1.828 sec, 29261 loop, 16007 MB/s sub-thread : total loop = 29801 2nd encode 1.828 sec, 29801 loop, 16303 MB/s sub-thread : total loop = 29672 2nd encode 1.828 sec, 29672 loop, 16232 MB/s sub-thread : total loop = 29858 2nd encode 1.828 sec, 29858 loop, 16334 MB/s sub-thread : total loop = 29346 2nd encode 1.828 sec, 29346 loop, 16054 MB/s sub-thread : total loop = 29320 2nd encode 1.828 sec, 29320 loop, 16039 MB/s sub-thread : total loop = 29830 2nd encode 1.828 sec, 29830 loop, 16318 MB/s sub-thread : total loop = 29585 2nd encode 1.828 sec, 29585 loop, 16184 MB/s sub-thread : total loop = 29712 2nd encode 1.828 sec, 29712 loop, 16254 MB/s sub-thread : total loop = 30017 2nd encode 1.828 sec, 30017 loop, 16421 MB/s sub-thread : total loop = 29912 2nd encode 1.828 sec, 29912 loop, 16363 MB/s sub-thread : total loop = 29798 2nd encode 1.828 sec, 29798 loop, 16301 MB/s sub-thread : total loop = 29386 2nd encode 1.828 sec, 29386 loop, 16075 MB/s sub-thread : total loop = 29656 2nd encode 1.828 sec, 29656 loop, 16223 MB/s sub-thread : total loop = 27372 2nd encode 1.828 sec, 27372 loop, 14974 MB/s total 5.187 sec

Created successfully ===== par2j_debug.exe ===== Parchive 2.0 client version 1.3.2.9 by Yutaka Sawada

Base Directory : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\" Recovery File : "G:\DevExperiment\MultiPar (Github.Yutaka-Sawada)\TestBlock_2023-08-31\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (49196 MB available)

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 1-pass processing is possible, 2168

matrix size = 8.8 KB 0.0% : Creating recovery slice

read some source blocks, and keep all parity blocks buffer size = 2385 MB, read_num = 2168, round = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608 4.1% 8.5% 13.0% 17.4% 21.9% 26.3% 30.8% partial encode = 724 / 2168 (33.3%), source_off = 0 75.8% 100.0% read 7.594 sec write 0.781 sec sub-thread : total loop = 36363 1st encode 6.287 sec, 16789 loop, 2670 MB/s 2nd encode 1.562 sec, 19574 loop, 12531 MB/s sub-thread : total loop = 36461 1st encode 6.303 sec, 16727 loop, 2653 MB/s 2nd encode 1.562 sec, 19734 loop, 12634 MB/s sub-thread : total loop = 36500 1st encode 6.303 sec, 16766 loop, 2660 MB/s 2nd encode 1.562 sec, 19734 loop, 12634 MB/s sub-thread : total loop = 35342 1st encode 6.303 sec, 15768 loop, 2501 MB/s 2nd encode 1.546 sec, 19574 loop, 12661 MB/s sub-thread : total loop = 36345 1st encode 6.256 sec, 16771 loop, 2680 MB/s 2nd encode 1.546 sec, 19574 loop, 12661 MB/s sub-thread : total loop = 37236 1st encode 6.256 sec, 17502 loop, 2797 MB/s 2nd encode 1.546 sec, 19734 loop, 12764 MB/s sub-thread : total loop = 39419 1st encode 6.272 sec, 19845 loop, 3164 MB/s 2nd encode 1.562 sec, 19574 loop, 12531 MB/s sub-thread : total loop = 56670 1st encode 6.257 sec, 36936 loop, 5903 MB/s 2nd encode 1.562 sec, 19734 loop, 12634 MB/s sub-thread : total loop = 19734 2nd encode 1.562 sec, 19734 loop, 12634 MB/s sub-thread : total loop = 19574 2nd encode 1.546 sec, 19574 loop, 12661 MB/s sub-thread : total loop = 19413 2nd encode 1.546 sec, 19413 loop, 12557 MB/s sub-thread : total loop = 19574 2nd encode 1.546 sec, 19574 loop, 12661 MB/s sub-thread : total loop = 19413 2nd encode 1.546 sec, 19413 loop, 12557 MB/s sub-thread : total loop = 19413 2nd encode 1.546 sec, 19413 loop, 12557 MB/s sub-thread : total loop = 19574 2nd encode 1.562 sec, 19574 loop, 12531 MB/s sub-thread : total loop = 19413 2nd encode 1.546 sec, 19413 loop, 12557 MB/s total 9.969 sec

Created successfully

MikeSW17 commented 1 year ago

A second run:

Just in case it is relevent, here is the same test run from an SSD drive rather than the first report from a slower HDD.

CPU: AMD Ryzen 9 7950X M/B: MSI MPG X670E Carbon Wifi RAM: 64Gb OS: Windows 11 Pro

===== par2j_004.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "X:\TestBlock\" Recovery File : "X:\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (50348 MB available), Fast SSD

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 35.4% 71.2% 100.0% hash 2.875 sec, 753 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 100.0% write 0.063 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 4, modulo = 0 10.3% 32.3% 54.5% 76.7% 97.7% 100.0% read 0.547 sec write 0.344 sec sub-thread : total loop = 26420 2nd encode 4.484 sec, 26420 loop, 5892 MB/s sub-thread : total loop = 28665 2nd encode 4.468 sec, 28665 loop, 6415 MB/s sub-thread : total loop = 29855 2nd encode 4.452 sec, 29855 loop, 6706 MB/s sub-thread : total loop = 29874 2nd encode 4.468 sec, 29874 loop, 6686 MB/s sub-thread : total loop = 30399 2nd encode 4.484 sec, 30399 loop, 6779 MB/s sub-thread : total loop = 30739 2nd encode 4.468 sec, 30739 loop, 6880 MB/s sub-thread : total loop = 30509 2nd encode 4.452 sec, 30509 loop, 6853 MB/s sub-thread : total loop = 30662 2nd encode 4.452 sec, 30662 loop, 6887 MB/s sub-thread : total loop = 30492 2nd encode 4.452 sec, 30492 loop, 6849 MB/s sub-thread : total loop = 30449 2nd encode 4.452 sec, 30449 loop, 6839 MB/s sub-thread : total loop = 30081 2nd encode 4.468 sec, 30081 loop, 6732 MB/s sub-thread : total loop = 29832 2nd encode 4.484 sec, 29832 loop, 6653 MB/s sub-thread : total loop = 29331 2nd encode 4.436 sec, 29331 loop, 6612 MB/s sub-thread : total loop = 29005 2nd encode 4.452 sec, 29005 loop, 6515 MB/s sub-thread : total loop = 27485 2nd encode 4.468 sec, 27485 loop, 6151 MB/s sub-thread : total loop = 26652 2nd encode 4.484 sec, 26652 loop, 5943 MB/s total 5.437 sec

Created successfully ===== par2j_008.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "X:\TestBlock\" Recovery File : "X:\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (50359 MB available), Fast SSD

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 34.3% 69.6% 100.0% hash 2.938 sec, 737 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 100.0% write 0.062 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 8, modulo = 0 20.7% 64.7% 99.5% 100.0% read 0.532 sec write 0.328 sec sub-thread : total loop = 25621 2nd encode 2.265 sec, 25621 loop, 11312 MB/s sub-thread : total loop = 28330 2nd encode 2.233 sec, 28330 loop, 12687 MB/s sub-thread : total loop = 30066 2nd encode 2.249 sec, 30066 loop, 13369 MB/s sub-thread : total loop = 29877 2nd encode 2.249 sec, 29877 loop, 13284 MB/s sub-thread : total loop = 30354 2nd encode 2.233 sec, 30354 loop, 13593 MB/s sub-thread : total loop = 30571 2nd encode 2.249 sec, 30571 loop, 13593 MB/s sub-thread : total loop = 30008 2nd encode 2.233 sec, 30008 loop, 13438 MB/s sub-thread : total loop = 30345 2nd encode 2.233 sec, 30345 loop, 13589 MB/s sub-thread : total loop = 30586 2nd encode 2.249 sec, 30586 loop, 13600 MB/s sub-thread : total loop = 30509 2nd encode 2.249 sec, 30509 loop, 13565 MB/s sub-thread : total loop = 30407 2nd encode 2.249 sec, 30407 loop, 13520 MB/s sub-thread : total loop = 30678 2nd encode 2.233 sec, 30678 loop, 13738 MB/s sub-thread : total loop = 30325 2nd encode 2.217 sec, 30325 loop, 13678 MB/s sub-thread : total loop = 29816 2nd encode 2.233 sec, 29816 loop, 13352 MB/s sub-thread : total loop = 27563 2nd encode 2.202 sec, 27563 loop, 12517 MB/s sub-thread : total loop = 25392 2nd encode 2.233 sec, 25392 loop, 11371 MB/s total 3.188 sec

Created successfully ===== par2j_012.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "X:\TestBlock\" Recovery File : "X:\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (50339 MB available), Fast SSD

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 34.5% 69.8% 100.0% hash 2.937 sec, 738 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 100.0% write 0.047 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 12, modulo = 8 29.5% 92.8% 100.0% read 0.547 sec write 0.328 sec sub-thread : total loop = 23114 2nd encode 1.578 sec, 23114 loop, 14648 MB/s sub-thread : total loop = 29070 2nd encode 1.578 sec, 29070 loop, 18422 MB/s sub-thread : total loop = 30607 2nd encode 1.578 sec, 30607 loop, 19396 MB/s sub-thread : total loop = 30866 2nd encode 1.578 sec, 30866 loop, 19560 MB/s sub-thread : total loop = 31606 2nd encode 1.578 sec, 31606 loop, 20029 MB/s sub-thread : total loop = 31408 2nd encode 1.578 sec, 31408 loop, 19904 MB/s sub-thread : total loop = 30579 2nd encode 1.578 sec, 30579 loop, 19378 MB/s sub-thread : total loop = 30472 2nd encode 1.578 sec, 30472 loop, 19311 MB/s sub-thread : total loop = 30261 2nd encode 1.578 sec, 30261 loop, 19177 MB/s sub-thread : total loop = 30369 2nd encode 1.563 sec, 30369 loop, 19430 MB/s sub-thread : total loop = 30286 2nd encode 1.563 sec, 30286 loop, 19377 MB/s sub-thread : total loop = 30560 2nd encode 1.563 sec, 30560 loop, 19552 MB/s sub-thread : total loop = 30179 2nd encode 1.563 sec, 30179 loop, 19308 MB/s sub-thread : total loop = 30108 2nd encode 1.563 sec, 30108 loop, 19263 MB/s sub-thread : total loop = 27276 2nd encode 1.563 sec, 27276 loop, 17451 MB/s sub-thread : total loop = 23688 2nd encode 1.563 sec, 23688 loop, 15155 MB/s total 2.516 sec

Created successfully ===== par2j_016.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "X:\TestBlock\" Recovery File : "X:\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (50325 MB available), Fast SSD

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 34.4% 69.5% 100.0% hash 2.953 sec, 734 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 100.0% write 0.047 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 16, modulo = 8 37.9% 99.5% 100.0% read 0.546 sec write 0.344 sec sub-thread : total loop = 22272 2nd encode 1.204 sec, 22272 loop, 18498 MB/s sub-thread : total loop = 29905 2nd encode 1.204 sec, 29905 loop, 24838 MB/s sub-thread : total loop = 31016 2nd encode 1.204 sec, 31016 loop, 25761 MB/s sub-thread : total loop = 31189 2nd encode 1.219 sec, 31189 loop, 25586 MB/s sub-thread : total loop = 31610 2nd encode 1.204 sec, 31610 loop, 26254 MB/s sub-thread : total loop = 31473 2nd encode 1.204 sec, 31473 loop, 26141 MB/s sub-thread : total loop = 30992 2nd encode 1.204 sec, 30992 loop, 25741 MB/s sub-thread : total loop = 30345 2nd encode 1.219 sec, 30345 loop, 24894 MB/s sub-thread : total loop = 30142 2nd encode 1.204 sec, 30142 loop, 25035 MB/s sub-thread : total loop = 30136 2nd encode 1.219 sec, 30136 loop, 24722 MB/s sub-thread : total loop = 30080 2nd encode 1.204 sec, 30080 loop, 24984 MB/s sub-thread : total loop = 30635 2nd encode 1.219 sec, 30635 loop, 25132 MB/s sub-thread : total loop = 30286 2nd encode 1.204 sec, 30286 loop, 25155 MB/s sub-thread : total loop = 29905 2nd encode 1.219 sec, 29905 loop, 24533 MB/s sub-thread : total loop = 27965 2nd encode 1.204 sec, 27965 loop, 23227 MB/s sub-thread : total loop = 22496 2nd encode 1.219 sec, 22496 loop, 18455 MB/s total 2.156 sec

Created successfully ===== par2j_032.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "X:\TestBlock\" Recovery File : "X:\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (50328 MB available), Fast SSD

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 34.3% 69.5% 100.0% hash 2.953 sec, 734 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 100.0% write 0.062 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 32, modulo = 24 50.1% 100.0% read 0.547 sec write 0.328 sec sub-thread : total loop = 20364 2nd encode 0.938 sec, 20364 loop, 21710 MB/s sub-thread : total loop = 30115 2nd encode 0.922 sec, 30115 loop, 32663 MB/s sub-thread : total loop = 31832 2nd encode 0.922 sec, 31832 loop, 34525 MB/s sub-thread : total loop = 31512 2nd encode 0.922 sec, 31512 loop, 34178 MB/s sub-thread : total loop = 32148 2nd encode 0.922 sec, 32148 loop, 34868 MB/s sub-thread : total loop = 31736 2nd encode 0.922 sec, 31736 loop, 34421 MB/s sub-thread : total loop = 31662 2nd encode 0.922 sec, 31662 loop, 34341 MB/s sub-thread : total loop = 30268 2nd encode 0.938 sec, 30268 loop, 32269 MB/s sub-thread : total loop = 30250 2nd encode 0.938 sec, 30250 loop, 32250 MB/s sub-thread : total loop = 30312 2nd encode 0.938 sec, 30312 loop, 32316 MB/s sub-thread : total loop = 30321 2nd encode 0.938 sec, 30321 loop, 32326 MB/s sub-thread : total loop = 30407 2nd encode 0.938 sec, 30407 loop, 32417 MB/s sub-thread : total loop = 30197 2nd encode 0.938 sec, 30197 loop, 32193 MB/s sub-thread : total loop = 29909 2nd encode 0.938 sec, 29909 loop, 31886 MB/s sub-thread : total loop = 28596 2nd encode 0.906 sec, 28596 loop, 31563 MB/s sub-thread : total loop = 20820 2nd encode 0.938 sec, 20820 loop, 22196 MB/s total 1.860 sec

Created successfully ===== par2j_048.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "X:\TestBlock\" Recovery File : "X:\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (50324 MB available), Fast SSD

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 34.1% 69.3% 100.0% hash 2.953 sec, 734 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 100.0% write 0.047 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 48, modulo = 8 38.6% 99.8% 100.0% read 0.546 sec write 0.344 sec sub-thread : total loop = 24688 2nd encode 1.188 sec, 24688 loop, 20781 MB/s sub-thread : total loop = 30707 2nd encode 1.188 sec, 30707 loop, 25848 MB/s sub-thread : total loop = 30855 2nd encode 1.188 sec, 30855 loop, 25973 MB/s sub-thread : total loop = 30962 2nd encode 1.188 sec, 30962 loop, 26063 MB/s sub-thread : total loop = 31575 2nd encode 1.188 sec, 31575 loop, 26579 MB/s sub-thread : total loop = 31219 2nd encode 1.188 sec, 31219 loop, 26279 MB/s sub-thread : total loop = 30974 2nd encode 1.188 sec, 30974 loop, 26073 MB/s sub-thread : total loop = 29439 2nd encode 1.188 sec, 29439 loop, 24781 MB/s sub-thread : total loop = 29413 2nd encode 1.188 sec, 29413 loop, 24759 MB/s sub-thread : total loop = 29577 2nd encode 1.172 sec, 29577 loop, 25237 MB/s sub-thread : total loop = 29167 2nd encode 1.172 sec, 29167 loop, 24887 MB/s sub-thread : total loop = 29082 2nd encode 1.172 sec, 29082 loop, 24814 MB/s sub-thread : total loop = 29669 2nd encode 1.172 sec, 29669 loop, 25315 MB/s sub-thread : total loop = 29200 2nd encode 1.172 sec, 29200 loop, 24915 MB/s sub-thread : total loop = 28784 2nd encode 1.172 sec, 28784 loop, 24560 MB/s sub-thread : total loop = 25139 2nd encode 1.172 sec, 25139 loop, 21450 MB/s total 2.109 sec

Created successfully ===== par2j_064.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "X:\TestBlock\" Recovery File : "X:\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (50234 MB available), Fast SSD

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 34.0% 69.2% 100.0% hash 2.953 sec, 734 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 100.0% write 0.062 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 64, modulo = 56 34.3% 97.9% 100.0% read 0.547 sec write 0.328 sec sub-thread : total loop = 25909 2nd encode 1.407 sec, 25909 loop, 18414 MB/s sub-thread : total loop = 29019 2nd encode 1.407 sec, 29019 loop, 20625 MB/s sub-thread : total loop = 30686 2nd encode 1.407 sec, 30686 loop, 21810 MB/s sub-thread : total loop = 30683 2nd encode 1.407 sec, 30683 loop, 21808 MB/s sub-thread : total loop = 30749 2nd encode 1.407 sec, 30749 loop, 21854 MB/s sub-thread : total loop = 30528 2nd encode 1.407 sec, 30528 loop, 21697 MB/s sub-thread : total loop = 30637 2nd encode 1.407 sec, 30637 loop, 21775 MB/s sub-thread : total loop = 29910 2nd encode 1.407 sec, 29910 loop, 21258 MB/s sub-thread : total loop = 29789 2nd encode 1.407 sec, 29789 loop, 21172 MB/s sub-thread : total loop = 29803 2nd encode 1.407 sec, 29803 loop, 21182 MB/s sub-thread : total loop = 29583 2nd encode 1.407 sec, 29583 loop, 21026 MB/s sub-thread : total loop = 29583 2nd encode 1.407 sec, 29583 loop, 21026 MB/s sub-thread : total loop = 29600 2nd encode 1.392 sec, 29600 loop, 21265 MB/s sub-thread : total loop = 29594 2nd encode 1.407 sec, 29594 loop, 21034 MB/s sub-thread : total loop = 29504 2nd encode 1.407 sec, 29504 loop, 20970 MB/s sub-thread : total loop = 24872 2nd encode 1.407 sec, 24872 loop, 17677 MB/s total 2.344 sec

Created successfully ===== par2j_080.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "X:\TestBlock\" Recovery File : "X:\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (50284 MB available), Fast SSD

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 34.3% 69.5% 100.0% hash 2.953 sec, 734 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 100.0% write 0.047 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 80, modulo = 8 32.2% 97.0% 100.0% read 0.547 sec write 0.328 sec sub-thread : total loop = 27333 2nd encode 1.516 sec, 27333 loop, 18030 MB/s sub-thread : total loop = 27777 2nd encode 1.516 sec, 27777 loop, 18323 MB/s sub-thread : total loop = 30014 2nd encode 1.516 sec, 30014 loop, 19798 MB/s sub-thread : total loop = 30138 2nd encode 1.516 sec, 30138 loop, 19880 MB/s sub-thread : total loop = 30165 2nd encode 1.516 sec, 30165 loop, 19898 MB/s sub-thread : total loop = 30185 2nd encode 1.516 sec, 30185 loop, 19911 MB/s sub-thread : total loop = 30050 2nd encode 1.516 sec, 30050 loop, 19822 MB/s sub-thread : total loop = 30014 2nd encode 1.500 sec, 30014 loop, 20009 MB/s sub-thread : total loop = 30024 2nd encode 1.500 sec, 30024 loop, 20016 MB/s sub-thread : total loop = 30040 2nd encode 1.500 sec, 30040 loop, 20027 MB/s sub-thread : total loop = 29206 2nd encode 1.500 sec, 29206 loop, 19471 MB/s sub-thread : total loop = 29775 2nd encode 1.500 sec, 29775 loop, 19850 MB/s sub-thread : total loop = 29785 2nd encode 1.500 sec, 29785 loop, 19857 MB/s sub-thread : total loop = 29907 2nd encode 1.516 sec, 29907 loop, 19728 MB/s sub-thread : total loop = 29195 2nd encode 1.516 sec, 29195 loop, 19258 MB/s sub-thread : total loop = 26840 2nd encode 1.516 sec, 26840 loop, 17705 MB/s total 2.438 sec

Created successfully ===== par2j_096.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "X:\TestBlock\" Recovery File : "X:\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (50280 MB available), Fast SSD

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 34.2% 69.4% 100.0% hash 2.954 sec, 733 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 100.0% write 0.046 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 96, modulo = 56 34.3% 97.0% 100.0% read 0.547 sec write 0.328 sec sub-thread : total loop = 27088 2nd encode 1.578 sec, 27088 loop, 17166 MB/s sub-thread : total loop = 29091 2nd encode 1.563 sec, 29091 loop, 18612 MB/s sub-thread : total loop = 29614 2nd encode 1.578 sec, 29614 loop, 18767 MB/s sub-thread : total loop = 29667 2nd encode 1.578 sec, 29667 loop, 18800 MB/s sub-thread : total loop = 30094 2nd encode 1.563 sec, 30094 loop, 19254 MB/s sub-thread : total loop = 30032 2nd encode 1.578 sec, 30032 loop, 19032 MB/s sub-thread : total loop = 30158 2nd encode 1.578 sec, 30158 loop, 19112 MB/s sub-thread : total loop = 29844 2nd encode 1.563 sec, 29844 loop, 19094 MB/s sub-thread : total loop = 29870 2nd encode 1.563 sec, 29870 loop, 19111 MB/s sub-thread : total loop = 30036 2nd encode 1.563 sec, 30036 loop, 19217 MB/s sub-thread : total loop = 29652 2nd encode 1.563 sec, 29652 loop, 18971 MB/s sub-thread : total loop = 30039 2nd encode 1.578 sec, 30039 loop, 19036 MB/s sub-thread : total loop = 29897 2nd encode 1.563 sec, 29897 loop, 19128 MB/s sub-thread : total loop = 29524 2nd encode 1.578 sec, 29524 loop, 18710 MB/s sub-thread : total loop = 29138 2nd encode 1.578 sec, 29138 loop, 18465 MB/s sub-thread : total loop = 26704 2nd encode 1.563 sec, 26704 loop, 17085 MB/s total 2.516 sec

Created successfully ===== par2j_128.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "X:\TestBlock\" Recovery File : "X:\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (50271 MB available), Fast SSD

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 34.6% 69.8% 100.0% hash 2.937 sec, 738 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 100.0% write 0.047 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 128, modulo = 120 34.3% 97.0% 100.0% read 0.547 sec write 0.328 sec sub-thread : total loop = 27947 2nd encode 1.687 sec, 27947 loop, 16566 MB/s sub-thread : total loop = 28859 2nd encode 1.687 sec, 28859 loop, 17107 MB/s sub-thread : total loop = 29289 2nd encode 1.687 sec, 29289 loop, 17362 MB/s sub-thread : total loop = 29626 2nd encode 1.687 sec, 29626 loop, 17561 MB/s sub-thread : total loop = 29460 2nd encode 1.687 sec, 29460 loop, 17463 MB/s sub-thread : total loop = 29759 2nd encode 1.687 sec, 29759 loop, 17640 MB/s sub-thread : total loop = 29936 2nd encode 1.687 sec, 29936 loop, 17745 MB/s sub-thread : total loop = 29840 2nd encode 1.687 sec, 29840 loop, 17688 MB/s sub-thread : total loop = 29627 2nd encode 1.687 sec, 29627 loop, 17562 MB/s sub-thread : total loop = 30068 2nd encode 1.687 sec, 30068 loop, 17823 MB/s sub-thread : total loop = 30147 2nd encode 1.671 sec, 30147 loop, 18041 MB/s sub-thread : total loop = 29597 2nd encode 1.687 sec, 29597 loop, 17544 MB/s sub-thread : total loop = 29856 2nd encode 1.687 sec, 29856 loop, 17698 MB/s sub-thread : total loop = 29175 2nd encode 1.687 sec, 29175 loop, 17294 MB/s sub-thread : total loop = 29724 2nd encode 1.687 sec, 29724 loop, 17619 MB/s sub-thread : total loop = 27538 2nd encode 1.687 sec, 27538 loop, 16324 MB/s total 2.625 sec

Created successfully ===== par2j_160.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "X:\TestBlock\" Recovery File : "X:\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (50267 MB available), Fast SSD

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 34.4% 69.6% 100.0% hash 2.937 sec, 738 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 100.0% write 0.063 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 160, modulo = 88 28.6% 85.9% 100.0% read 0.547 sec write 0.343 sec sub-thread : total loop = 27866 2nd encode 1.703 sec, 27866 loop, 16363 MB/s sub-thread : total loop = 29144 2nd encode 1.703 sec, 29144 loop, 17113 MB/s sub-thread : total loop = 28820 2nd encode 1.719 sec, 28820 loop, 16766 MB/s sub-thread : total loop = 29962 2nd encode 1.719 sec, 29962 loop, 17430 MB/s sub-thread : total loop = 30150 2nd encode 1.703 sec, 30150 loop, 17704 MB/s sub-thread : total loop = 29492 2nd encode 1.719 sec, 29492 loop, 17157 MB/s sub-thread : total loop = 29975 2nd encode 1.719 sec, 29975 loop, 17437 MB/s sub-thread : total loop = 30204 2nd encode 1.719 sec, 30204 loop, 17571 MB/s sub-thread : total loop = 29920 2nd encode 1.719 sec, 29920 loop, 17405 MB/s sub-thread : total loop = 30389 2nd encode 1.719 sec, 30389 loop, 17678 MB/s sub-thread : total loop = 29191 2nd encode 1.719 sec, 29191 loop, 16981 MB/s sub-thread : total loop = 28880 2nd encode 1.719 sec, 28880 loop, 16800 MB/s sub-thread : total loop = 29734 2nd encode 1.719 sec, 29734 loop, 17297 MB/s sub-thread : total loop = 29520 2nd encode 1.719 sec, 29520 loop, 17173 MB/s sub-thread : total loop = 29447 2nd encode 1.719 sec, 29447 loop, 17130 MB/s sub-thread : total loop = 27755 2nd encode 1.719 sec, 27755 loop, 16146 MB/s total 2.641 sec

Created successfully ===== par2j_192.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "X:\TestBlock\" Recovery File : "X:\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (50236 MB available), Fast SSD

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 34.4% 69.6% 100.0% hash 2.953 sec, 734 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 100.0% write 0.047 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 192, modulo = 56 34.3% 97.0% 100.0% read 0.532 sec write 0.328 sec sub-thread : total loop = 26853 2nd encode 1.797 sec, 26853 loop, 14943 MB/s sub-thread : total loop = 29263 2nd encode 1.797 sec, 29263 loop, 16284 MB/s sub-thread : total loop = 29638 2nd encode 1.797 sec, 29638 loop, 16493 MB/s sub-thread : total loop = 29504 2nd encode 1.812 sec, 29504 loop, 16283 MB/s sub-thread : total loop = 29873 2nd encode 1.797 sec, 29873 loop, 16624 MB/s sub-thread : total loop = 29846 2nd encode 1.797 sec, 29846 loop, 16609 MB/s sub-thread : total loop = 30047 2nd encode 1.797 sec, 30047 loop, 16721 MB/s sub-thread : total loop = 29969 2nd encode 1.812 sec, 29969 loop, 16539 MB/s sub-thread : total loop = 29975 2nd encode 1.797 sec, 29975 loop, 16681 MB/s sub-thread : total loop = 30538 2nd encode 1.812 sec, 30538 loop, 16853 MB/s sub-thread : total loop = 29727 2nd encode 1.797 sec, 29727 loop, 16543 MB/s sub-thread : total loop = 29231 2nd encode 1.797 sec, 29231 loop, 16267 MB/s sub-thread : total loop = 29777 2nd encode 1.797 sec, 29777 loop, 16570 MB/s sub-thread : total loop = 29801 2nd encode 1.812 sec, 29801 loop, 16446 MB/s sub-thread : total loop = 29255 2nd encode 1.797 sec, 29255 loop, 16280 MB/s sub-thread : total loop = 27153 2nd encode 1.797 sec, 27153 loop, 15110 MB/s total 2.734 sec

Created successfully ===== par2j_256.exe ===== Parchive 2.0 client version 1.3.3.0 by Yutaka Sawada

L3 cache: 32768 KB (16 way) L2 cache: 1024 KB (8 way) Limit size of Cache Blocking: 128 KB Core count: logical, physical, use = 32, 16, 16 Base Directory : "X:\TestBlock\" Recovery File : "X:\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (50219 MB available), Fast SSD

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -14 0.0% : Computing file hash 34.4% 69.3% 100.0% hash 2.953 sec, 734 MB/s 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 100.0% write 0.047 sec 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 partial encode = 0 / 217 (0.0%), read = 2168, skip = 0 part_start = 0 + 0, part_now = 217 cover_num = 256, modulo = 120 34.3% 97.0% 100.0% read 0.531 sec write 0.329 sec sub-thread : total loop = 28153 2nd encode 1.984 sec, 28153 loop, 14190 MB/s sub-thread : total loop = 29072 2nd encode 1.984 sec, 29072 loop, 14653 MB/s sub-thread : total loop = 29855 2nd encode 1.983 sec, 29855 loop, 15055 MB/s sub-thread : total loop = 29617 2nd encode 1.968 sec, 29617 loop, 15049 MB/s sub-thread : total loop = 29060 2nd encode 1.984 sec, 29060 loop, 14647 MB/s sub-thread : total loop = 29314 2nd encode 1.999 sec, 29314 loop, 14664 MB/s sub-thread : total loop = 30097 2nd encode 1.984 sec, 30097 loop, 15170 MB/s sub-thread : total loop = 30111 2nd encode 1.999 sec, 30111 loop, 15063 MB/s sub-thread : total loop = 29896 2nd encode 1.968 sec, 29896 loop, 15191 MB/s sub-thread : total loop = 30010 2nd encode 1.984 sec, 30010 loop, 15126 MB/s sub-thread : total loop = 29612 2nd encode 1.984 sec, 29612 loop, 14925 MB/s sub-thread : total loop = 29487 2nd encode 1.999 sec, 29487 loop, 14751 MB/s sub-thread : total loop = 29504 2nd encode 1.983 sec, 29504 loop, 14878 MB/s sub-thread : total loop = 29441 2nd encode 1.984 sec, 29441 loop, 14839 MB/s sub-thread : total loop = 29584 2nd encode 1.968 sec, 29584 loop, 15032 MB/s sub-thread : total loop = 27635 2nd encode 1.984 sec, 27635 loop, 13929 MB/s total 2.922 sec

Created successfully ===== par2j_debug.exe ===== Parchive 2.0 client version 1.3.2.9 by Yutaka Sawada

Base Directory : "X:\TestBlock\" Recovery File : "X:\TestBlock\out.par2" CPU thread : 16 / 32 CPU cache limit : 128 KB, 2048 KB CPU extra : x64 SSSE3 CLMUL AVX2 Memory usage : Auto (50198 MB available), Fast SSD

Input File count : 1 Input File total size : 2272935936 Input File Slice size : 1048576 Input File Slice count : 2168 Recovery Slice count : 217 Redundancy rate : 10.00% Recovery File count : 1 Slice distribution : 0, uniform (until 217) Packet Repetition limit : 0

read_block_num = 2168 2-pass processing is selected, -12 0.0% : Computing file hash 34.0% 69.1% 100.0% 0.0% : Making index file 100.0% 0.0% : Constructing recovery file 100.0% 0.0% : Creating recovery slice

matrix size = 8.8 KB get_io_size: part_min = 64, part_max = 217

read all source blocks, and keep some parity blocks buffer size = 2385 MB, io_size = 1048592, split = 1 cache: limit size = 131072, chunk_size = 116512, split = 9 prog_base = 484995, unit_size = 1048608, part_num = 217 5.1% partial encode = 206 / 2168 (9.5%), read = 2168, skip = 0 52.5% 90.5% 100.0% read 1.843 sec write 0.328 sec sub-thread : total loop = 30802 1st encode 1.781 sec, 4424 loop, 2484 MB/s 2nd encode 2.328 sec, 26378 loop, 11331 MB/s sub-thread : total loop = 31257 1st encode 1.781 sec, 4661 loop, 2617 MB/s 2nd encode 2.344 sec, 26596 loop, 11346 MB/s sub-thread : total loop = 31051 1st encode 1.781 sec, 4673 loop, 2623 MB/s 2nd encode 2.328 sec, 26378 loop, 11331 MB/s sub-thread : total loop = 30967 1st encode 1.781 sec, 4589 loop, 2576 MB/s 2nd encode 2.328 sec, 26378 loop, 11331 MB/s sub-thread : total loop = 31753 1st encode 1.781 sec, 4721 loop, 2650 MB/s 2nd encode 2.328 sec, 27032 loop, 11612 MB/s sub-thread : total loop = 31451 1st encode 1.781 sec, 4855 loop, 2726 MB/s 2nd encode 2.344 sec, 26596 loop, 11346 MB/s sub-thread : total loop = 31922 1st encode 1.781 sec, 5544 loop, 3112 MB/s 2nd encode 2.344 sec, 26378 loop, 11253 MB/s sub-thread : total loop = 37174 1st encode 1.781 sec, 11232 loop, 6306 MB/s 2nd encode 2.328 sec, 25942 loop, 11143 MB/s sub-thread : total loop = 27032 2nd encode 2.344 sec, 27032 loop, 11532 MB/s sub-thread : total loop = 26814 2nd encode 2.344 sec, 26814 loop, 11439 MB/s sub-thread : total loop = 25942 2nd encode 2.344 sec, 25942 loop, 11067 MB/s sub-thread : total loop = 27032 2nd encode 2.344 sec, 27032 loop, 11532 MB/s sub-thread : total loop = 26378 2nd encode 2.344 sec, 26378 loop, 11253 MB/s sub-thread : total loop = 27250 2nd encode 2.344 sec, 27250 loop, 11625 MB/s sub-thread : total loop = 26596 2nd encode 2.344 sec, 26596 loop, 11346 MB/s sub-thread : total loop = 27032 2nd encode 2.344 sec, 27032 loop, 11532 MB/s total 4.578 sec

Created successfully

Yutaka-Sawada / MultiPar

Optimization for CPU's shared L3 cache #99