Yutaka-Sawada / MultiPar

Parchive tool

Option to Increase checksum performance #21

Closed cncr04s closed 3 years ago

cncr04s commented 3 years ago

I've been using PAR and MultiPar for a long time now. I consider it an essential tool to help me avoid data loss. I appreciate all the current performance options; however, there is one that is missing! Computer specs have been steadily increasing over the years, and with SSDs that read and write multiple GB/s and systems with up to 128 CPU cores and TBs of RAM, I find that initial checksumming and verification seem to be more or less single-threaded. With PAR's chunk feature, it should be able to compute checksums on more than one chunk at a time from a disk. An option to specify a multi-threaded I/O count would greatly improve performance for me, as I have the PC specs mentioned.

Yutaka-Sawada commented 3 years ago

I tried such multi-file hashing a while ago. I wrote some code to create a PAR2 file from multiple source files at once. Though it was fast for files on RAM, I'm not sure whether the method was actually good for SSDs. Because I don't have an SSD yet, I could not test much. I planned to retry when I buy an SSD. Anyway, I need to replace my HDD with an SSD in a few years (my HDD is old).

If you have time and want to test the sample function right now, I will make and upload a sample version. Or you may wait for a future official release, after I buy an SSD and test it myself.

cncr04s commented 3 years ago

I am more than happy to test it myself; I have plenty of testing platforms of my choosing.

Yutaka-Sawada commented 3 years ago

I made a sample version. I don't know how it runs on an SSD. (It's very slow on HDD, hehe.) Because it's not tested much, there may be some problem, strange behavior, or odd result. If you see such a problem, please notify me.

I put the sample (par2j_hash_2021-01-19.zip) in "MultiPar_sample" folder on OneDrive.

JohnLGalt commented 3 years ago

I'll give it a test run as well. I recently built myself a nice AMD Ryzen 9 3950X-based machine - 128 GB RAM, 3x 1 TB PCIe4 NVMe drives. It's aeons faster than my previous Core i7 965 EE-based machine, to be sure.

https://valid.x86.fr/tf5prr

Support for multi-threaded use will definitely be good.

cncr04s commented 3 years ago

The _multi fails for me

Base Directory : "O:\"
Recovery File : "O:\test.par2"
CPU thread : 16 / 32
CPU cache : 1024 KB per set
CPU extra : x64 SSSE3 CLMUL AVX2
Memory usage : 6/8 (79012 MB available)

Input File count : 63
Input File total size : 22206177915
Input File Slice size : 2778028
Input File Slice count : 7998
Recovery Slice count : 0
Redundancy rate : 0.00%
Recovery File count : 0

check_seek_penalty = 0
cpu_num = 16, entity_num = 62, multi_read = 12
  0.0% : Computing file hash
10560.7%

Yutaka-Sawada commented 3 years ago

I'm sorry for the bug. I forgot to set the max number of threads in the old sample. (It worked on my 4-core CPU.) I set the value to 12 now. I put the sample (par2j_hash_2021-01-21.zip) in "MultiPar_sample" folder on OneDrive.

If it works, you may change the number of cores used to see the difference. I don't know how many threads is fastest.

cncr04s commented 3 years ago

That one is working, and it's much faster, at least on creation. Verification still seems limited. Following is an NVMe SSD:

check_seek_penalty = 0
cpu_num = 16, entity_num = 86, multi_read = 12
...
100.0%
hash 32.297 sec, 2605 MB/s

Following is a SATA SSD:

check_seek_penalty = 0
cpu_num = 16, entity_num = 183, multi_read = 12
...
100.0%
hash 129.407 sec, 395 MB/s

Following is a 9x RAID6 HDD array over 40G fiber:

check_seek_penalty = 2
cpu_num = 16, entity_num = 17, multi_read = 12
...
100.0%
hash 113.735 sec, 341 MB/s

Yutaka-Sawada commented 3 years ago

That one is working, and it's much faster, at least on creation.

Thank you for the tests. I haven't implemented multi-threaded verification yet. (I'm not sure it's possible.) Multi-reading & hashing seems to be fast on SSDs or RAID HDD systems. (Because I don't know your PC's baseline, I cannot say how fast.)

Then, I want to know where the best performance point is. Using too many threads is normally useless. There would be a bottleneck in data transfer or the drive's cache system. If you have time, please try different numbers of threads on the same data set.

single version = 1 thread
/lc3 = 3 cores -> 2 threads
/lc4 = 4 cores -> 3 threads
/lc6 = 6 cores -> 4 threads
/lc8 = 8 cores -> 6 threads
...

When the speed doesn't improve much, that many threads is too many. I will set the thread-usage rate or max value based on this information.

JohnLGalt commented 3 years ago

Baseline being the older 1.3.1.3 benchmarked, and then compared to the latest sample?

I can also do that lol.

Sorry it's taking me longer than normal for turnaround - I had some systems issues I needed to fix.

But I have a 16c / 32t CPU. I'll probably need to do a rather large file set.

Yutaka-Sawada commented 3 years ago

You use "par2j64_single.exe" to see base-line (single thread) speed. It's same function as old, just I enable debug output. It's good to see the time of hashing function only.

You use "par2j64_multi.exe" to see multi-thread version's speed. It's important to see how many threads perform how fast. Note, you need to clear disk cache (temporary data on RAM) to get real access speed of SSD. When PC's RAM is large, Windows OS seems to keep very large disk cache. I use "RamMap by Sysinterals" to clear cache memory.

I feel that number of CPU cores has limited effect than SSD's access speed, cache size, or data transfer speed. Normally, recent CPU is much faster than a data storage speed. Though I set current max using thread to be 12 in the sample, it doesn't mean that using 12 threads is 12 times faster than using single thread. I don't know property of general SSD so much. At least, SSD's random access speed is much faster than HDD. Multi reading gives random access and is very slow on my HDD.

cncr04s commented 3 years ago

Verification should work on the same principle as creation; however, with the multi-threaded nature, you may need to read the chunks sequentially out of a buffer of chunks × threads so that you can report completion in a serialized way at some point, with some thread synchronization required. I don't know how the software is written, so I could be off base, but I've written code similar to what I'm suggesting.

As for baselines, here they are, in the same order as before. File sizes should be the same, though I re-ran the SATA SSD one on multi; files shouldn't be cached on this controller. On the NVMe it's definitely faster. I have my own record for the HDD array, as that was over a network; the multi was faster. Your 'single' exe is also faster than the normal exe I have in there from the installer, so in these cases it's only marginally slower than the multi version, but at least 40% faster than I am used to.

I'll probably keep using the multi version myself, though an option to specify the thread count in the options GUI would let us determine whether we feel there is any benefit, as opposed to it choosing some value. The one clear case here is a really fast NVMe SSD, which would benefit massively from it being enabled automatically. There are ways to detect SSDs from the Windows API, too. There shouldn't be any performance loss from it being enabled automatically on an SSD either way. My suggestion is to detect SSDs using available means for each source-file drive; if an SSD, enable it, otherwise if an HDD, let the user choose through a GUI option.

nvme ssd
Input File total size : 86793688906
100.0%
hash 184.047 sec

sata ssd
Input File total size : 52090709816
100.0%
hash 139.687 sec

9x raid6 array hdd over fiber
Input File total size : 40723958770
100.0%
hash 130.187 sec

sata ssd multi again
Input File total size : 52090709816
check_seek_penalty = 0
cpu_num = 16, entity_num = 182, multi_read = 12
100.0%
hash 104.844 sec, 473 MB/s

JohnLGalt commented 3 years ago

You use "par2j64_single.exe" to see base-line (single thread) speed. It's same function as old, just I enable debug output. It's good to see the time of hashing function only.

You use "par2j64_multi.exe" to see multi-thread version's speed. It's important to see how many threads perform how fast. Note, you need to clear disk cache (temporary data on RAM) to get real access speed of SSD. When PC's RAM is large, Windows OS seems to keep very large disk cache. I use "RamMap by Sysinterals" to clear cache memory.

I feel that number of CPU cores has limited effect than SSD's access speed, cache size, or data transfer speed. Normally, recent CPU is much faster than a data storage speed. Though I set current max using thread to be 12 in the sample, it doesn't mean that using 12 threads is 12 times faster than using single thread. I don't know property of general SSD so much. At least, SSD's random access speed is much faster than HDD. Multi reading gives random access and is very slow on my HDD.

Good info to know. I just opened the par2j archives (both old and new) and saw that the instructions are in there as well. I'll be sure to follow them.

I'll test with both 1.3.1.3 and the last program sample you made on 18 Jan as well, to see if there is any sort of difference at all. I suspect there shouldn't be anything significantly different just because you added / modified code for dialog box generation, but who knows?

Out of curiosity, are you planning on offering both single and multi par2j executables, and enabling a setting in MP to allow the user to pick which they want to use? That would be best, I think, with a simple routine to copy single to par2j.exe if the user checks single, or multi to par2j.exe if the user checks multi. But I am not a full-time programmer / developer, so obviously what I think may not be the best solution....

For the tests I will conduct, I've created a .7z archive of 21 GB of music files, using 7-Zip and broken into 21 files of ~100M archive parts. I'll model my tests similarly to cncr04s' sampling above, just including an additional program executable (so probably double the number of tests).

JohnLGalt commented 3 years ago

Original 1.3.1.3 with the original par2j executable; files located on a SATA III SanDisk 1 TB SSD (the MultiPar application is extracted from a .ZIP archive and located on an NVMe PCIe Gen 4 SSD). System specs: https://valid.x86.fr/tf5prr

MultiPar.log

Is this what you want Yutaka?

If so, I'll conduct the rest of the tests using the exact same file set. I noticed that the default CL arguments were -rf0 instead of -rn0 and that the core count was -lc32 (my true number of threads). If I need to adjust parameters before proceeding, please let me know.

Yutaka-Sawada commented 3 years ago

Thanks, Mr. cncr04s, for the tests. From your results, 5 or 6 threads seem to be enough for hashing. The SSD's data transfer speed may be the bottleneck.

Verification should work on the same principle as creation

I will try it in the future. Because each thread needs to update the available slices at the same time, the usefulness of multi-threading is limited at verification. Maybe checking complete files can be processed independently. That would be faster when there are many complete source files on an SSD.

Your 'single' exe is also faster than the normal exe I have in there from the installer

I'm afraid that the faster case might be reading file data from the disk cache in RAM. Windows keeps file data in RAM automatically. Even when the file size is larger than the RAM size, the OS may keep partial data in RAM. But I don't know this behavior on Windows 10 well.

My suggestion is to detect SSDs using available means for each source-file drive; if an SSD, enable it, otherwise if an HDD, let the user choose through a GUI option.

Currently a user cannot know which drive type (SSD or HDD) and processing mode par2j recognizes. I will modify the output format later. It may be safe to use SSD mode (multi-reading) only when it detects that the base directory is on an SSD. It's difficult to recognize a RAID system or a network drive's construction. I may consider a manual (hidden) option for special cases as an advanced setting.

though an option to specify the thread count in the options GUI would let us determine

Here is how CPU usage works in MultiPar's options. There is a "CPU usage" slider in the "Hardware environment" section on the "System settings" tab. The 5 levels mean: /lc12 (quarter), /lc13 (half), /lc14 (75%), /lc15 (100%), and /lc0 (Hyper-Threading). Currently par2j supports a max of 16 cores (threads). Even when your PC's CPU can process 32 threads at once, par2j uses a max of 16 threads at full speed.

The option details are written in "Command_par2j.txt". When your CPU has 16 physical cores, the /lc12 option uses 4 cores and the /lc13 option uses 8 cores. So, please test 2 more settings: the "CPU usage" slider at its lowest (/lc12), and one step right from the lowest (/lc13). If the speed improvement is small, par2j doesn't need to use many threads, and I will set the max value to be small. For a slow SSD, 2 threads might be enough.

Yutaka-Sawada commented 3 years ago

I suspect there shouldn't be anything significantly different just because you added / modified code for dialog box generation, but who knows?

You can use the MultiPar GUI v1.3.1.3 as a front end for the sample par2j versions. When a later GUI is required, I will include it.

Out of curiosity, are you planning on offering both single and multi par2j executables, and enabling a setting in MP to allow the user to pick which they want to use?

No, it will be one .EXE file. Because there is currently no option switch to select the processing mode (HDD or SSD), I made 2 versions for debugging. The release version will switch automatically. (It's already possible to detect an SSD.) Though I could add a new option to the GUI, I'm not in favor of it. Because selecting SSD mode for a normal HDD causes very slow hashing, that option would not be kind to users. A manual hidden option (editing the .INI file directly) may be possible.

Is this what you want Yutaka?

Yes, making a PAR2 index file is enough. Thank you. In the GUI options, you can change the /lc option with the "CPU usage" slider in the "Hardware environment" section on the "System settings" tab. Please test 3 cases: the lowest (leftmost, /lc12), one step right from the lowest (/lc13), and the highest (rightmost, full speed).

The /rf0 option means "number of recovery files is 0". The GUI omits /rn0 or /rr0 for a 0 value. The /lc32 option means "GPU acceleration is enabled". It has no effect on file hashing in this case.

Yutaka-Sawada commented 3 years ago

I found that the data transfer speed difference between SATA and NVMe SSDs is huge. A SATA 3.0 SSD's data transfer speed is around 500 ~ 600 MB/s. An NVMe SSD's is around 1500 ~ 4500 MB/s. This matches Randy's test results. For a SATA SSD, multi-reading & hashing 2 files is enough. For an NVMe SSD, more threads may work. But it's difficult to determine an SSD's speed.

Yutaka-Sawada commented 3 years ago

I made a sample which detects SSDs and changes the file access mode automatically. It doesn't distinguish SATA from NVMe yet, as I don't know a good method. I tried STORAGE_ADAPTER_DESCRIPTOR's BusType in the sample. My SATA HDD is BusType = 11. A SATA SSD would be 11, too. I don't know what the value is for an NVMe SSD. If you have an NVMe SSD, please test it and post the device info.
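
For reference, a minimal sketch of this kind of query (not the exact code in the sample; the volume path "\\.\O:" is a placeholder, and error handling is simplified). In recent Windows SDKs, BusTypeSata is 11 and BusTypeNvme is 17; the seek-penalty query is the usual SSD check:

#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

int main(void)
{
    /* open the volume of the base directory's drive letter */
    HANDLE h = CreateFileA("\\\\.\\O:", 0, FILE_SHARE_READ | FILE_SHARE_WRITE,
                           NULL, OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    STORAGE_PROPERTY_QUERY q = {0};
    DWORD got;

    /* adapter bus type: 11 = BusTypeSata, 17 = BusTypeNvme */
    STORAGE_ADAPTER_DESCRIPTOR ad = {0};
    q.PropertyId = StorageAdapterProperty;
    q.QueryType = PropertyStandardQuery;
    if (DeviceIoControl(h, IOCTL_STORAGE_QUERY_PROPERTY, &q, sizeof(q),
                        &ad, sizeof(ad), &got, NULL))
        printf("BusType = %d\n", (int)ad.BusType);

    /* seek penalty: FALSE usually indicates an SSD */
    DEVICE_SEEK_PENALTY_DESCRIPTOR sp = {0};
    q.PropertyId = StorageDeviceSeekPenaltyProperty;
    if (DeviceIoControl(h, IOCTL_STORAGE_QUERY_PROPERTY, &q, sizeof(q),
                        &sp, sizeof(sp), &got, NULL))
        printf("IncursSeekPenalty = %d\n", (int)sp.IncursSeekPenalty);

    CloseHandle(h);
    return 0;
}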

I also made a sample GUI, which can change the file access mode by editing the .INI file manually. Because this is made for debug usage, I may change the method later. Only when automatic detection fails, you may try the manual setting.

I put the sample (par2j_hash_2021-01-25.zip) in "MultiPar_sample" folder on OneDrive.

animetosho commented 3 years ago

Sorry to intrude, but I thought you might at least find this interesting:

Here's an idea I've been toying around with to solve this - it does require your read process and hashing/computation to be decoupled (which may require some effort if the two are too tightly coupled).

You have one read thread, which reads chunks/blocks into a queue. You have multiple hashing threads which pull chunks out of the queue and perform hashing.
If the disk is faster than the CPU, the read queue will fill up, and the reader thread will stop until there is free space in the queue. If the disk is slower than the CPU, the read queue will mostly be empty as the reader thread tries to add entries as fast as possible.
The approach remains optimal on harddisks, as there's only one read going on at a time, and can automatically scale up to SSDs, as multiple threads process incoming items.

Whilst block hashes can be trivially computed in parallel, PAR2's file MD5 is problematic as you can only parallelise across multiple files (in my case, I use a 2-buffer SIMD approach to compute the block and file MD5 together, so I can't really do block hashing in parallel either). To minimise seeks whilst reading multiple files, you can use multiple queues - one for each file - and only start filling other files' queues if existing ones are full.
For example, the reader thread will try filling file A's queue, and when that gets full, it starts reading from file B and fills that queue. Once file B's queue is full, it tries filling file A's queue again. If both queues are full, it starts reading file C and filling its queue, etc.

This eliminates the need to determine the optimal number of threads to use, and problems with trying to determine storage speed (decoupled reading also has other advantages too).
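
For illustration, here's a minimal Win32 C sketch of the single-reader / multi-hasher arrangement - this is just the queue idea, not ParPar's or par2j's actual code; hash_chunk() and "data.bin" are placeholders. Free and filled buffer indices circulate through two small queues:

#include <windows.h>
#include <stdio.h>

#define SLOTS   8                 /* read buffers in flight */
#define HASHERS 4                 /* hashing threads */
#define QCAP    (SLOTS + HASHERS) /* ring capacity, incl. stop sentinels */
#define CHUNK   (256 * 1024)

typedef struct {
    int idx[QCAP], head, tail;
    CRITICAL_SECTION cs;
    HANDLE sem;                   /* counts queued entries */
} queue;

static BYTE  bufs[SLOTS][CHUNK];
static DWORD lens[SLOTS];
static queue free_q, fill_q;

static void q_init(queue *q) {
    q->head = q->tail = 0;
    InitializeCriticalSection(&q->cs);
    q->sem = CreateSemaphoreA(NULL, 0, QCAP, NULL);
}
static void q_push(queue *q, int i) {
    EnterCriticalSection(&q->cs);
    q->idx[q->head] = i; q->head = (q->head + 1) % QCAP;
    LeaveCriticalSection(&q->cs);
    ReleaseSemaphore(q->sem, 1, NULL);
}
static int q_pop(queue *q) {
    int i;
    WaitForSingleObject(q->sem, INFINITE);
    EnterCriticalSection(&q->cs);
    i = q->idx[q->tail]; q->tail = (q->tail + 1) % QCAP;
    LeaveCriticalSection(&q->cs);
    return i;
}

static void hash_chunk(const BYTE *p, DWORD n) { (void)p; (void)n; /* MD5/CRC-32 work here */ }

static DWORD WINAPI hasher(LPVOID arg) {
    (void)arg;
    for (;;) {
        int i = q_pop(&fill_q);
        if (i < 0) break;             /* sentinel: reader is done */
        hash_chunk(bufs[i], lens[i]); /* block hashes only; see the MD5 caveat below */
        q_push(&free_q, i);           /* hand the buffer back to the reader */
    }
    return 0;
}

int main(void) {
    HANDLE th[HASHERS];
    FILE *fp = fopen("data.bin", "rb");  /* placeholder input file */
    int i;
    if (fp == NULL) return 1;
    q_init(&free_q); q_init(&fill_q);
    for (i = 0; i < SLOTS; i++) q_push(&free_q, i);
    for (i = 0; i < HASHERS; i++) th[i] = CreateThread(NULL, 0, hasher, NULL, 0, NULL);
    for (;;) {                        /* single reader: sequential access */
        i = q_pop(&free_q);           /* blocks when hashing can't keep up */
        lens[i] = (DWORD)fread(bufs[i], 1, CHUNK, fp);
        if (lens[i] == 0) { q_push(&free_q, i); break; }
        q_push(&fill_q, i);
    }
    for (i = 0; i < HASHERS; i++) q_push(&fill_q, -1);
    WaitForMultipleObjects(HASHERS, th, TRUE, INFINITE);
    fclose(fp);
    return 0;
}

Note the sketch only parallelises block hashes; PAR2's whole-file MD5 still has to see data in order, which is exactly the per-file-queue complication described above.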

I did also look at what could be done to speed up MD5 itself, and was surprised to find that the existing fastest MD5 implementations can be beaten with a simple trick. It's only about 5% faster though - still, free performance for a small change.

Yutaka-Sawada commented 3 years ago

Thank you for the advice. As I'm not good at optimizing code, your aid was helpful. However, I could not improve mine to the level of your ParPar, hehe.

You have one read thread, which reads chunks/blocks into a queue. You have multiple hashing threads which pull chunks out of the queue and perform hashing.

This is an interesting idea. When I buy SSD, I will try to improve more.

Currently I use 2 threads for one file access on HDD: the 1st thread for reading and MD5 calculation, the 2nd thread for MD5 and CRC-32 calculation. Both threads include a hashing task. While I tried several multi-threading styles, this seemed to be the fastest (and simplest) on my PC's HDD. Though you might think that processing both reading and hashing in one thread is slower than reading only, there was no speed difference in my tests. I felt that the CPU's cache might affect this result. If hashing is done right after reading, the data may stay in the CPU's cache, which would be fast.

I did also look at what could be done to speed up MD5 itself

I read the article and modified the MD5 code I use. While I use simple C-language MD5 code, rotl (inline assembly) works fine. Though I implemented your trick for the G function only, it seems to be 1.5% faster than before. Thanks! Because I don't use manually optimized assembly code, the Visual C compiler seemed to handle the trick well. This comparison is the final hashing speed, which includes RAM data access and CRC-32 calculation. So, the pure MD5 hashing speedup may be larger.

I use a 2-buffer SIMD approach to compute the block and file MD5 together

I'm curious how fast calculating MD5 in parallel is. I heard that SSE2 MD5 would be 2 times slower, but it could calculate four MD5 hashes at once. Because PAR2 requires two MD5 hashes, I thought there might be less speed improvement. You seem to use AVX512 for MD5. Without AVX technology, is the SIMD approach worth trying? To support old PCs and OSes, I don't want to use recent special CPU features so much. (Also, my Visual Studio 2008 doesn't support AVX.)

cncr04s commented 3 years ago

Visual Studio 2019 Community is free. It can easily replace 2008. You can keep both, in case you use some functionality bought with that version, but I've never had any problem with the Community version not having what I needed.

Yutaka-Sawada commented 3 years ago

I started using Visual Studio 2008 to support old Windows 2000 and Windows XP. Though I tried Visual Studio 2010, the IDE was slow. The newest Visual Studio to support Windows XP seems to be Visual Studio 2017. But I don't know how many users still use Windows XP nowadays.

To support AVX2, Visual Studio 2012 or later seems to be required. To support AVX512, Visual Studio 2017 or later seems to be required. Anyway, I cannot test AVX2 or AVX512 on my PC. At this time, I use GCC to support AVX. When I use C language only, GCC is the simplest. As an open-source project, GCC may be good for other developers.

animetosho commented 3 years ago

I felt that the CPU's cache might affect this result. If hashing is done right after reading, the data may stay in the CPU's cache, which would be fast.

I wouldn't worry too much about CPU cache when talking about I/O as it's too slow for it to really matter.
You don't really have much control over the cache in this case either, because the read buffer is filled by the kernel which could be on a different core (and needs a context switch regardless).

For your particular case, you could have three threads - one read thread, one file hash thread, and a MD5+CRC32 thread. The benefit of the approach is that the two hashing threads can be scaled up to more cores, whilst you retain one read thread.
The change probably won't have much of an effect on harddisks. The idea is to remain as good on harddisks, whilst scaling up to faster storage.

I heard that SSE2 MD5 would be 2 times slower

Looking at the speed of a single hash in a 2-buffer arrangement, it's generally 10-20% slower in my tests, though it can vary by CPU. In the link, they don't provide details on what CPU was tested, so it's hard to say.

Firstly, there are actually two ways to do this: use SSE2 to compute two hashes together, or just double the number of scalar instructions to enable two interleaved MD5s to be computed in parallel.

SSE2 approach: the main downside is CPUs which have high latency SSE instructions (like AMD K10/Bulldozer having 2 cycle latency on these) - if they're using such a CPU, this could be the reason for their slowdown. The next issue is the lack of a rotate instruction (though you can cheat a little for 2 buffer), which impacts speed a bit.

2x scalar approach: the main downside is on CPUs with less instruction level parallelism (like low power cores such as Intel Atom). On modern cores, this likely performs slightly better than SSE2 due to a native rotate instruction.
This approach should always be faster than computing 2x MD5s separately (on a single thread).

So to get the best performance, you'll need both unfortunately, and choose the method depending on CPU. It's probably best to stick with 2x scalar approach, and switch to SSE2 for narrower cores.
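
As a rough sketch of what the 2x scalar interleaving looks like in C (hypothetical macros, not code from either project; assumes uint32_t state words): each lane's MD5 step is an independent dependency chain, so a wide out-of-order core can overlap them:

#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))
#define F1(x, y, z)  ((((y) ^ (z)) & (x)) ^ (z))

/* one interleaved MD5 round step for two independent states (lane 0, lane 1);
   a = b + rotl(a + F(b,c,d) + msg + k, s), duplicated with no cross-lane dependency */
#define STEP2(F, a0,b0,c0,d0,m0, a1,b1,c1,d1,m1, k, s) do { \
    (a0) += F((b0),(c0),(d0)) + (m0) + (k); \
    (a1) += F((b1),(c1),(d1)) + (m1) + (k); \
    (a0) = ROTL32((a0),(s)) + (b0); \
    (a1) = ROTL32((a1),(s)) + (b1); \
} while (0)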

You seem to use AVX512 for MD5

AVX512 is interesting in that it's faster even for single buffer MD5, and by a fair amount (+23% measured). Which also means that you get 2 buffer for free, making it better still.

But measurements I gave earlier (+5%) is without AVX512, as most CPUs don't support it. If AVX512 is available, it's always the best choice.

Though I implement your trick for G function only, it seems to become 1.5% faster than before

Thanks for trying it and reporting your results!

The newest Visual Studio to support Windows XP seems to be Visual Studio 2017

Note that you can actually install the VS2017 XP build tools via Visual Studio 2019.
Many applications have dropped support for Windows XP these days (in fact, dropping Windows 7 is also a thing now), so I personally wouldn't bother, but it's there if you need it.

The current Visual Studios insist on an online installer, if that's an issue for you (though there is still a way to do it offline I think).
Can't comment much about slowness though.

Yutaka-Sawada commented 3 years ago

I tried to implement SSE2 code for the MD5 function. I planned to calculate two MD5s at once: the file's MD5 and the block's MD5. But I found a problem of data alignment. MD5 compresses 64-byte chunks of data into a 16-byte resulting hash. When the block size (slice size) isn't a multiple of 64, the file's MD5 may have a different alignment from the block's MD5.

For example, say the block size is 1000 bytes. The first block's MD5 compresses 15 times (64 × 15 = 960) and finishes with the last 40 bytes. Then the second block's MD5 starts from offset 1000. At this time, the file's MD5 keeps 40 bytes in an internal buffer, waiting for 24 more bytes to compress. While the block's MD5 has no data yet, the file's MD5 has 40 bytes already. Because their input data sizes are different, it's impossible to compress the two hashes at once. I cannot come up with a good solution.

JohnLGalt commented 3 years ago

I made a sample which detects SSDs and changes the file access mode automatically. It doesn't distinguish SATA from NVMe yet, as I don't know a good method. I tried STORAGE_ADAPTER_DESCRIPTOR's BusType in the sample. My SATA HDD is BusType = 11. A SATA SSD would be 11, too. I don't know what the value is for an NVMe SSD. If you have an NVMe SSD, please test it and post the device info.

I also made a sample GUI, which can change the file access mode by editing the .INI file manually. Because this is made for debug usage, I may change the method later. Only when automatic detection fails, you may try the manual setting.

I put the sample (par2j_hash_2021-01-25.zip) in "MultiPar_sample" folder on OneDrive.

I will give the newest test version a shot.

As to SSD versus NVMe speed comparisons, I can give you both, as I have a regular SATA III SSD as well as 3 NVMe (PCIe generation 4) drives, with speeds as high as 5 GB/s read and 4 GB/s write.

To be sure - do I need to test only with the data residing on the SSD, or also with the MultiPar executable on the same drive as the data? Or test 4 ways, like:

  1. MultiPar (test version) on NVMe, data on SSD
  2. MP and data both on NVMe
  3. MP on SSD, data on NVMe
  4. MP and data both on SSD

I can do this, as I use the .ZIP of binaries extracted to a folder instead of installing MP; currently the folder is on NVMe, but I can copy it to an SSD folder.

However, the Windows 10 user profile is also on NVMe (a different drive from the OS), which means the %TEMP% folder resides on NVMe, so that might make the above 4-way comparison moot, right?

animetosho commented 3 years ago

When the block size (slice size) isn't a multiple of 64, the file's MD5 may have a different alignment from the block's MD5.

Yeah, handling misalignment is a bit annoying.

For your example, after 960 bytes have been processed in the first PAR2 block, behaviour needs to diverge a little. The block hash needs to be finalised, whilst the file hash needs to be put on hold.
When you get the second PAR2 block, you can just first process the overlapping MD5 block (40 bytes + 24 bytes) by itself, then continue with processing the file/block hash via SIMD at different offsets. In other words, the file hash will start at 1024, whilst the block hash will start at 1000. With this, you can process a further 960 bytes using SIMD, so the file hash will end at position 1984 whilst the block hash ends at 1960. Again, finalise the block hash and carry over the remaining 16 bytes in the file hash when processing the third PAR2 block.
The third PAR2 block is a little interesting as you'll need to handle a case where the offsets drift too far apart. After processing the overlapped PAR2 block, the file hash would start at 2048, and block hash at 2000. This time, you can only process 896 bytes via SIMD, so the file hash ends at 2944 and block hash at 2896. To get the block hash, you need to process a full MD5 block by itself, to bring it to position 2960, then finalise with the last 40 bytes.
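
In C, the per-block handling could be sketched like this (hypothetical helpers: md5_update() is a normal buffering update, md5_compress2() compresses one 64-byte chunk per lane from two independent pointers, md5_compress1() is the scalar fallback, and ctx->buffered is the count of bytes held in the context's internal buffer; assumes the block is comfortably larger than 128 bytes):

void hash_par2_block(MD5_CTX *file, MD5_CTX *block, const BYTE *data, size_t bsize)
{
    /* bytes needed to bring the file hash back to a 64-byte boundary */
    size_t skip = (64 - file->buffered) % 64;
    size_t f, b;

    md5_update(file, data, skip);      /* the overlapping MD5 block, by itself */

    /* both lanes advance 64 bytes per compression, at different offsets */
    for (f = skip, b = 0; f + 64 <= bsize && b + 64 <= bsize; f += 64, b += 64)
        md5_compress2(file, data + f, block, data + b);

    /* lanes drifted too far apart: scalar catch-up for the block hash */
    for (; b + 64 <= bsize; b += 64)
        md5_compress1(block, data + b);

    md5_update(file, data + f, bsize - f);   /* carry the file-hash tail over */
    md5_update(block, data + b, bsize - b);  /* feed the block-hash tail */
    md5_finalize(block);                     /* block hash done; file hash continues */
}

Tracing this with the 1000-byte example reproduces the positions above: skip = 24 for the second block, 15 joint compressions, a 16-byte carry-over, then skip = 48, 14 joint compressions, and one scalar catch-up for the third block.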

Yutaka-Sawada commented 3 years ago

I will give the newest test version a shot.

Thank you.

To be sure - do I need to test only with the data residing on the SSD, or also with the MultiPar executable on the same drive as the data?

My implemented auto-detection refers to the drive of the "base directory". When the directory is "C:\something~", it checks the C drive. When the directory is "D:\something~", it checks the D drive. It simply picks the first character (the drive letter). It doesn't support network drive formats.

So, you can test each drive with one MultiPar instance. Changing the base directory (the directory of the source files) is enough. When the data files are put on an SSD, it checks the drive and will recognize the type. When a drive is detected as an SSD, the word "SSD" is appended after the "Memory usage" line in the log file. Then a user (and I) can know the drive type by reading the log.

When you want to test the HDD access mode on your SSD as a baseline speed, you may use "par2j64_single.exe" from the old sample package, or use the new MultiPar.exe with a "FileAccess=1" line in the MultiPar.ini file.

Yutaka-Sawada commented 3 years ago

When you get the second PAR2 block, you can just first process the overlapping MD5 block (40 bytes + 24 bytes) by itself, then continue with processing the file/block hash via SIMD at different offsets.

Oh, I see. I understand the method now. The offset can be different even when the same data is input. I tested such a model in C, and it seems to work for my usage. (I just use it for one limited case, PAR2.) I will test with SSE2 registers later. Thank you very much. You are always wiser than me, hehe.

Yutaka-Sawada commented 3 years ago

I implemented MD5 with SSE2 and am trying to make a parallel version. I found that the SSE2 version is much faster than the normal C code. As C code, animetosho's "dependency shortcut in G function" runs at 102.2% of the speed of the original MD5 code by Paul Houle. So the speedup is 2.2%. It was a small improvement.

I tested my SSE2 version (single MD5 calculation) for comparison. The speed is 128.8% of the C version. A 28.8% speedup!? While this is nice, I feel that it's too good. The C code I currently use might be very slow, or SSE2 code happens to be fast on recent CPUs. Or I may have missed something. So I want to ask Mr. Anime Tosho to check my code, but there is no issue page on his GitHub page. I don't know GitHub well. Is there a method to contact him over GitHub or e-mail?

I put the sample MD5 code (md5_sse2_2021-01-31.7z) in the "Tool" folder on OneDrive. The C code is basically the same as what I use in MultiPar. I will improve it more for parallel calculation, because the 64-bit version is still very slow.

animetosho commented 3 years ago

I can read your comments here, so no need to reach out specifically 😆
If you prefer to take the conversation elsewhere, you can start a topic here.

SSE2 for one buffer should theoretically be slower than scalar, due to lack of a rotate instruction. You'll need to check the assembly to find out why it's faster. I don't have VS2008 to see what it's generating.

You can use the theoretical speed limit to gauge whether something is too fast (or too slow). For MD5, it's 4.75 cycles per byte (cpb), or 4.5 cpb if the G function shortcut is being used. For SSE2, add 1 cpb due to lack of rotate (so 5.75 cpb or 5.5 cpb).
So if your CPU's maximum frequency (best to lock it if possible) is 4GHz, the MD5 cannot be faster than 4000/4.5 = 888.89MB/s scalar, or 727.27MB/s SSE2.
Of course, speed will be slower if the compiler doesn't optimise correctly.

Your code looks fine and seems to work for me. I do see a duplicate w = in MD5STEP in phmd5s1.c, though I don't think it makes any difference.

By the way, for F1 you want to use the alternative representation (delays x dependency):

#define F1(x, y, z) (((y ^ z) & x) ^ z)
#define F1(x, y, z) _mm_xor_si128(_mm_and_si128(_mm_xor_si128(y, z), x), z)

I don't know whether the compiler will do anything differently, but it's better to XOR the last two parameters for F3 first (again, delays x dependency):

#define F3(x, y, z) _mm_xor_si128(x, _mm_xor_si128(y, z))

I adapted the code into my MD5 speed tester application, compiled using GCC 9.3.0 (note, my methods use assembly) - here are the results:

PH    :    794.4 MB/s   (Phmd5Process)
PH1   :    660.0 MB/s   (Phmd5Process1)
Std   :    772.9 MB/s   (OpenSSL-esque implementation)
GOpt  :    817.9 MB/s   (G function shortcut applied)
GHOpt :    813.2 MB/s   (G+H shortcut)
NoLEA :    804.3 MB/s   (remove LEA instruction)
NoL-G :    846.3 MB/s   (remove LEA + G shortcut)
NoL-GH:    848.0 MB/s   (remove LEA + G+H shortcut)
AVX512:    947.1 MB/s   (AVX512)

I tried with Clang 11, and it looks like the scalar version is much slower than with GCC. It seems GCC can transform F1 as I described above, whereas Clang doesn't, which makes it slower. In this case, SSE's PANDN instruction would help. There's probably more to it, but I haven't really checked the assembly.

Yutaka-Sawada commented 3 years ago

Your code looks fine and seems to work for me. I do see a duplicate w = in MD5STEP in phmd5s1.c, though I don't think it makes any difference.

Thank you for testing and for finding my mistake.

for F1 you want to use the alternative representation (delays x dependency): but it's better to XOR the last two parameters for F3 first (again, delays x dependency):

Oh, I see. I modified the code as you mentioned. It became even faster on my PC. Thanks!

It seems that speed may vary by compiler or CPU age. I didn't know that recent GCC was so good. Anyway, I'm satisfied that there was no problem in my code.

Yutaka-Sawada commented 3 years ago

I improved my MD5 code to be faster with SSE2 on both 32-bit and 64-bit. Thanks to Mr. Anime Tosho for the help.

I put the sample MD5 code (md5_sample_2021-02-03.zip) in the "Tool" folder on OneDrive. The C code is basically the same as what I use in MultiPar. I will change the MD5 code to the SSE2 version in MultiPar later. If the SSE2 version is slower than the old C version on your PC, please report your CPU. Then I may switch functions by checking the CPU ID.

Yutaka-Sawada commented 3 years ago

I implemented the new MD5 SSE2 function in par2j. File hashing becomes around 20% faster for files in RAM (the OS's disk cache). But there is no difference for files on HDD.

In the sample, I used the OS's built-in FILE_FLAG_OVERLAPPED mechanism instead of controlling a sub-thread. It's a simpler way to read file data asynchronously. But I could not see any speed improvement, since HDD access speed is slower than hash calculation speed. Maybe it causes less CPU usage. If there is no problem, I will use the method in other parts.
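
As a rough illustration of the idea (not the actual par2j code; the chunk size, hash_chunk(), and error handling are simplified), double-buffered reading with FILE_FLAG_OVERLAPPED lets the next read proceed while the current chunk is hashed:

#include <windows.h>
#include <string.h>

#define CHUNK (256 * 1024)
static BYTE buf[2][CHUNK];

static void hash_chunk(const BYTE *p, DWORD n) { (void)p; (void)n; }

int hash_file_overlapped(const char *path)
{
    OVERLAPPED ov[2];
    unsigned long long off = 0;
    int cur = 0;
    HANDLE h = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
               OPEN_EXISTING, FILE_FLAG_OVERLAPPED | FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;
    memset(ov, 0, sizeof(ov));
    ov[0].hEvent = CreateEventA(NULL, TRUE, FALSE, NULL);
    ov[1].hEvent = CreateEventA(NULL, TRUE, FALSE, NULL);

    ReadFile(h, buf[0], CHUNK, NULL, &ov[0]);   /* first read in flight */
    for (;;) {
        DWORD got;
        int next = cur ^ 1;
        if (!GetOverlappedResult(h, &ov[cur], &got, TRUE) || got == 0)
            break;                              /* EOF or error */
        off += got;
        ov[next].Offset = (DWORD)off;
        ov[next].OffsetHigh = (DWORD)(off >> 32);
        ResetEvent(ov[next].hEvent);
        ReadFile(h, buf[next], CHUNK, NULL, &ov[next]); /* queue next read */
        hash_chunk(buf[cur], got);              /* hash while the OS reads ahead */
        cur = next;
    }
    CancelIo(h);                                /* drop any still-pending read */
    CloseHandle(ov[0].hEvent); CloseHandle(ov[1].hEvent); CloseHandle(h);
    return 0;
}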

I didn't test it much yet. There may be a bug in the sample. I put the samples (par2j_hash_2021-02-06.zip in the "MultiPar_sample" folder, and md5_sample_2021-02-05.zip in the "Tool" folder) on OneDrive. If there is a problem (such as slowness) on your PC, please report the incident.

animetosho commented 2 years ago

This eliminates the need to determine the optimal number of threads to use, and problems with trying to determine storage speed

Sorry to bring this up, but if you're still interested, I actually recently implemented this.

The design is limited by the number of read buffers that are allocated (because each thread needs its own buffer to hash from, and there are also queued buffers).
The code isn't too different from the original plan - each open file has its own hashing thread, which pulls buffers off the file's queue and hashes the data.
The read process first tries to fill the queue of the currently 'active' file. Once that's filled, it'll switch to the next opened file with an empty queue (or opening a new one if no queues are empty).

On a hard disk that's slower than MD5 hashing, this maintains sequential access (if data isn't cached to RAM) because the hashing thread pulls buffers from the queue faster than the reader can add them, so the reader keeps reading the same file.
If re-running the process, with the disk files cached in RAM, the process will automatically use multiple threads, because it's getting data faster than it can hash it. This is a key advantage over detecting disk type, which can't easily tell if files on a hard disk are actually cached in RAM.

On the downside, this does introduce two tuning parameters: the maximum number of read buffers, and the maximum file queue size. I've found that the read queue may need to be relatively large to avoid unnecessary HDD seeking (around ~20MB of queued data on this system), so it can end up using a fair bit of RAM.
Another issue is that SSDs may prefer multiple random reads at a time. This design only ever does one read at a time, as it doesn't know the disk type.
As such, it may still be worth it to detect the disk type, and lower the queue size for SSDs, and/or enable concurrent reading.

To be clear, I'm not asking you to write it - I'm just commenting here in case you still find the idea interesting.

Yutaka-Sawada commented 2 years ago

Thank you for the idea. But I have no plan to improve my code at this time. I maintain par2j mostly with bug fixes only.

JohnLGalt commented 2 years ago

I suppose it is not possible to substitute ParPar as the CLI executable used with MP, is it?

animetosho commented 2 years ago

Unless @Yutaka-Sawada intends to add it, no, as it and par2j have different interfaces/options/features.

Were you just looking for a GUI interface over using the CLI?

JohnLGalt commented 2 years ago

Unless @Yutaka-Sawada intends to add it, no, as it and par2j have different interfaces/options/features.

Were you just looking for a GUI interface over using the CLI?

Sorry, I thought I replied to this already.

In answer - well, specifically MP, but a decent GUI for parpar would be good as well.

Yutaka-Sawada commented 2 years ago

In answer - well, specifically MP, but a decent GUI for parpar would be good as well.

If you want to select source files via the MultiPar GUI, it's not so difficult. Usage would be something like drag & drop of source files onto a batch file. It's possible to call another .EXE file (par2cmdline, ParPar, or anything). But it's difficult to change settings, as Mr. Anime Tosho wrote. A preview feature is difficult, too. Furthermore, the creating task would be automatic only (cannot pause nor cancel). So, it's the same as using a batch file.

If you want to change settings and see how they will turn out, you should ask Mr. Anime Tosho to make a new GUI interface for ParPar. That would be a much more decent one.

animetosho commented 2 years ago

specifically MP, but a decent GUI for parpar would be good as well.

Unfortunately I don't have access to the source code for MultiPar, so there's little I can do to change it. But as @Yutaka-Sawada pointed out, it's probably best not to bind it to MultiPar anyway.

So I decided to put together a GUI, which I'm guessing is the next best thing. It has a few rough edges at the moment, but hopefully covers most of the typical use cases.

Getting back to this topic, if you want to play around with configuration relating to hashing on multiple threads, look in the Options dialog -> Performance Tuning. Specifically the 'Read buffers' and 'Hash queue' options are relevant.
From what I understand, par2j doesn't have a hash queue, so if you want to simulate par2j-like behaviour, set the 'Hash queue' to 1 and 'Read buffers' to the number of threads (e.g. 2 for SSD). This 'simulation' isn't exactly the same, due to design differences (I think par2j issues concurrent reads, whilst ParPar only ever issues one read at a time), but probably the closest approximation.
As for what the options exactly do, I've put info in the textbox tooltips to give a hopefully not-too-confusing explanation.

In terms of ParPar's defaults, the 'Hash queue' value of 5 is mostly tuned to my hard disk, where SSDs may work better with smaller values (or alternatively, a higher number of 'Read buffers' if higher memory usage isn't an issue). Although the point of this approach is to adjust thread allocation based on the speed of the disk (as opposed to detecting the disk type), balancing that with RAM usage is a little tricky.

Hopefully this gives us something to experiment with, should the concept prove to be worthwhile (and if @Yutaka-Sawada ever wants to look at it in the future).

Yutaka-Sawada commented 2 years ago

Thanks, Mr. Anime Tosho, for the description of the hashing task. It's an interesting method and will work well for hashing multiple files. However, it looks complex to me, a lazy programmer. If a user requests speed, I would suggest ParPar.

Yutaka-Sawada commented 2 years ago

I was interested in the large buffer size in ParPar's design. Though the reading method is different, I tested how large a buffer size is good on my PC.

When it uses single-read mode, or when it reads file data from the disk cache (RAM), buffer size doesn't affect hash calculation speed. I compared 64 KB and 512 KB. The single-read speed is around 500 MB/s. The single-read-from-RAM speed is around 530 MB/s. Multi-reading 4 files from RAM runs at around 1900 MB/s.

When it uses multi-read mode on the SSD, buffer size affects hash calculation speed very much. I tested 64 KB, 128 KB, 256 KB, 512 KB, 768 KB, and 1 MB.

Reading 2 files at once:
64 KB : around 460 MB/s
128 KB : around 650 MB/s
256 KB : around 850 MB/s
512 KB : around 900 MB/s
768 KB : around 1000 MB/s
1 MB : around 1000 MB/s

Reading 3 files at once:
64 KB : around 650 MB/s
128 KB : around 900 MB/s
256 KB : around 1000 MB/s
512 KB : around 1300 MB/s
768 KB : around 1400 MB/s
1 MB : around 1400 MB/s

Reading 4 files at once:
64 KB : around 750 MB/s
128 KB : around 1000 MB/s
256 KB : around 1200 MB/s
512 KB : around 1400 MB/s
768 KB : around 1400 MB/s
1 MB : around 1450 MB/s

When the buffer size is small, multi-read speed happens to be slow. Multi-reading requires at least a 128 KB buffer to be faster than single-reading. A 768 KB buffer seems to be enough on my PC. Because my SSD doesn't have DRAM cache, it uses NVMe's Host Memory Buffer. The limit might therefore come from the CPU's L3 cache size. But I'm not sure.

I found that hashing 3 files at once was the max speed on my PC. Though my SSD's sequential read speed spec is 2000 MB/s, its random-access read speed seems to be slower, around 1500 MB/s. Because it's difficult to determine an SSD's speed, it will switch the number of files by the CPU's number of cores: when the CPU has 8 or more cores, it hashes 4 files at once; when the CPU has 7 or fewer cores, it hashes 3 files at once; when the SSD isn't NVMe, it hashes 2 files at once.
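
In code form, this heuristic is just a small function like the following sketch (names are hypothetical, not the actual par2j source):

/* number of files to read & hash at once */
int multi_read_count(int cpu_cores, int is_nvme)
{
    if (!is_nvme)       return 2;  /* SATA SSD */
    if (cpu_cores >= 8) return 4;  /* NVMe with many cores */
    return 3;                      /* NVMe with fewer cores */
}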

animetosho commented 2 years ago

The point of the 4MB buffer was actually for hard disks, so it's surprising you see much of a benefit for SSDs. Large buffers help encourage sequential reads (particularly if the hard disk is being used by another program at the same time).

Sequential reading is still faster than random reads on SSDs, but the difference is much smaller compared to hard disks. Maybe there's some context switch inefficiencies going on (check thread CPU usage to see if each thread is using 100% of the CPU core). CPU cache doesn't sound like a likely issue.

Yutaka-Sawada commented 2 years ago

Large buffers help encourage sequential reads

Yes, it seems to work in the background. I set the FILE_FLAG_SEQUENTIAL_SCAN flag when calling the CreateFile Win32 API. Windows may read more data into the disk cache than the specified read size. Even when an application reads 64 KB of data from a file, Windows would read more and put it in the disk cache. That may be why buffer size doesn't affect speed in single-read mode (sequentially reading one file).
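
For reference, the flag is just passed at open time; a minimal sketch (file name is a placeholder, error handling omitted):

HANDLE h = CreateFileA("data.bin", GENERIC_READ, FILE_SHARE_READ, NULL,
                       OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
/* the OS read-ahead now prefetches beyond each ReadFile() call */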

But there is no noticeable difference when I set the FILE_FLAG_SEQUENTIAL_SCAN or FILE_FLAG_RANDOM_ACCESS flags while multi-reading files. I don't know how the read-ahead cache actually works. Anyway, thanks Mr. Anime Tosho for the ideas. Though I cannot implement a complex system, I improved mine somehow.

animetosho commented 2 years ago

Yeah, FILE_FLAG_SEQUENTIAL_SCAN will help, particularly in your case since it allows reading in the background. (In my case, JavaScript doesn't have direct access to the Windows API, and async reading naturally happens in the background. Also, using specific buffer sizes gives more control to the application, over relying on the OS.)

If you want to test how well it works with a busy disk, you can do a file copy whilst doing a PAR2 create on the same disk. For hard drives, this is where larger buffers should show the most benefit.

If you're interested in tweaking I/O though, take a look at MapViewOfFile. I don't think it would make much of a difference (it mostly saves a kernel -> user memory copy), but it might help if you're ultimately reading the whole file into memory anyway.
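
A minimal sketch of that approach, assuming a 64-bit process so the whole file fits in the address space (file name is a placeholder, error handling omitted):

#include <windows.h>

void hash_mapped(const char *path)
{
    HANDLE f = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    HANDLE m = CreateFileMappingA(f, NULL, PAGE_READONLY, 0, 0, NULL);
    const BYTE *p = (const BYTE *)MapViewOfFile(m, FILE_MAP_READ, 0, 0, 0);
    LARGE_INTEGER size;
    GetFileSizeEx(f, &size);
    /* hash p[0 .. size.QuadPart - 1]; pages fault in on demand, which
       saves the kernel-to-user copy that ReadFile performs */
    UnmapViewOfFile(p);
    CloseHandle(m);
    CloseHandle(f);
}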