Yutaka-Sawada / MultiPar

Parchive tool

High random disk reads when creating recovery data #75

Closed gareth41 closed 1 year ago

gareth41 commented 1 year ago

Creating recovery data from a large set of files (> 1 TB) results in very high disk load with lots of random, non-sequential disk reads. This is an issue when using a single spinning disk, as the actuator arm has to constantly seek for each read. The result is 100% disk utilization, a very slow read rate of about 10 MB/s, and extra stress on the drive. It also takes around 30 hours to create recovery data for a file set totaling 1 TB.

This is not an issue if you're using an SSD, and the problem can be somewhat mitigated with RAID 0, although even then you need a large number of disks in the array to get an acceptable read rate and bring the total time needed to create the recovery files under 6 hours for a 1 TB file set.

However, not everyone has access to large-capacity SSDs or RAID 0 arrays. Is it possible to have MultiPar create the recovery data while reading sequentially from the file set? This would significantly speed things up when only a single spinning disk is involved.

Yutaka-Sawada commented 1 year ago

Is it possible to have MultiPar create the recovery data while reading sequentially from the file set?

It depends on both the recovery data size and your PC's RAM size. Only when the recovery data is smaller than the free memory size does it try sequential file access.

For example, suppose you create recovery data with 5% redundancy for 1000 GB of source data. The recovery data size will be 50 GB, and sequential file access would then require about 128 GB of RAM in your PC. (Note that free memory space is smaller than the installed RAM size.)

So, there are two solutions: A) put more RAM in your PC for a larger free memory space, or B) treat fewer files at a time for a smaller source data size.
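A rough way to check which case applies, based on the rule above (sequential access only when the recovery data fits in free memory). This is only a sketch of the stated rule; the free-memory figure below is a made-up example, not a measurement:

```python
# Back-of-the-envelope check, assuming sequential access requires the whole
# recovery data to fit in free memory (as explained above).

def recovery_size_gb(source_gb: float, redundancy_percent: float) -> float:
    """Recovery data size is roughly source size times the redundancy rate."""
    return source_gb * redundancy_percent / 100.0

source_gb = 1000        # 1 TB of source files
redundancy = 5          # 5 % redundancy
free_memory_gb = 48     # hypothetical free memory on this machine

needed_gb = recovery_size_gb(source_gb, redundancy)
print(f"Recovery data: {needed_gb:.0f} GB")
if needed_gb < free_memory_gb:
    print("Fits in free memory -> sequential file access is possible")
else:
    print("Larger than free memory -> expect random reads over the source set")
```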

gareth41 commented 1 year ago

Thanks, I have since found your Python script which creates a recovery set for each group of 1000 files. I've been testing it, and it seems to do what I need. My initial attempt was to create a recovery set for a large number of files; this failed because the file count was too high, so I RAR'd the whole lot and split it across multiple 2 GB RAR volumes totaling up to 1 TB. That worked, but as described in my first post, it caused heavy I/O on the disk.

Now I'm back to using the original files along with your Python script to generate a recovery set for every 1000 files, although I have edited the script so the redundancy is 3% and increased the group size to 2000 files instead of 1000. This seems to generate the recovery sets much faster, as the disk doesn't have to continuously seek over the entire 1 TB of data, but only over 70 to 80 GB at a time for each recovery set.
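For reference, the grouping approach can be reproduced with a few lines of Python. This is not the original script: the directory name, group size, and the par2cmdline-style `par2 create -r3` command are assumptions for illustration; substitute MultiPar's console client and its own options if you use that instead.

```python
# Sketch of the grouping approach described above: split the source files
# into groups of 2000 and create a 3 % recovery set per group.
# NOT the author's script; the "par2 create" command assumes par2cmdline.
import subprocess
from pathlib import Path

GROUP_SIZE = 2000
REDUNDANCY = 3  # percent

files = sorted(p for p in Path("source_dir").rglob("*") if p.is_file())

for i in range(0, len(files), GROUP_SIZE):
    group = files[i:i + GROUP_SIZE]
    par2_name = f"set_{i // GROUP_SIZE:04d}.par2"
    subprocess.run(
        ["par2", "create", f"-r{REDUNDANCY}", par2_name, *map(str, group)],
        check=True,
    )
```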

Yutaka-Sawada commented 1 year ago

This seems to generate the recovery sets much faster, as the disk doesn't have to continuously seek over the entire 1 TB of data, but only over 70 to 80 GB at a time for each recovery set.

Oh, I see. I'm glad to hear that the Python script is useful. Because I don't treat that many files on my own PC, I didn't test this case myself. Users are expected to edit the script to suit their own usage. You may modify the script to limit the source files in each recovery set by their total data size.
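A minimal sketch of that size-based grouping idea. The 80 GB cap and the par2cmdline-style command are assumptions for illustration, not part of the actual script; adjust both to your setup.

```python
# Sketch: build each recovery set from files until a size cap is reached,
# so every set covers roughly the same amount of data (here ~80 GB).
# The cap and the "par2 create" command are assumptions, not the real script.
import subprocess
from pathlib import Path

SIZE_CAP = 80 * 1024**3   # ~80 GB of source data per recovery set
REDUNDANCY = 3            # percent

files = sorted(p for p in Path("source_dir").rglob("*") if p.is_file())

group, group_bytes, set_no = [], 0, 0
for f in files + [None]:  # trailing None flushes the last group
    if group and (f is None or group_bytes + f.stat().st_size > SIZE_CAP):
        subprocess.run(
            ["par2", "create", f"-r{REDUNDANCY}", f"set_{set_no:04d}.par2",
             *map(str, group)],
            check=True,
        )
        group, group_bytes, set_no = [], 0, set_no + 1
    if f is not None:
        group.append(f)
        group_bytes += f.stat().st_size
```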