Check does not check the entire repository

SiddharthManthan commented 1 month ago

This is continuation of #75.

Problem

Currently check randomly selects a part of repository.
This does not guarantee the all the data will be check after sufficient runs. (Source)

Solution

Implement --read-data-subset=n/t (Documentation).
It will check all the data after sufficient runs.

I have filed it as bug report because the UI currently does not mention that all data is not checked. There should be a warning or proper check support should be implemented.

Thinkscape commented 1 month ago

Does this repo setting not fit your use case? Setting this to 100% will re-download the whole thing ...

SiddharthManthan commented 1 month ago

Does this repo setting not fit your use case? Setting this to 100% will re-download the whole thing ...

The repo is quite large. Downloading all the data will take a long time. It might not even be possible because the machine is not running 24*7.

garethgeorge commented 1 month ago

hey, --read-data-subset=n/t is only guaranteed to check the whole repo if you systematically increment n which isn't something I expect to implement as that value would need to be tracked somewhere which adds complexity relative to the value of the feature. If you're using storage that ensures it's integrity e.g. S3 etc I think there's also very limited value to actually verifying data integrity by downloading it, this is supported primarily for users of local storage or sftp.

The alternative would be checking random chunks of the repo, but this likely isn't better than using a random percentage.

SiddharthManthan commented 1 month ago

hey, --read-data-subset=n/t is only guaranteed to check the whole repo if you systematically increment n which isn't something I expect to implement as that value would need to be tracked somewhere which adds complexity relative to the value of the feature. If you're using storage that ensures it's integrity e.g. S3 etc I think there's also very limited value to actually verifying data integrity by downloading it, this is supported primarily for users of local storage or sftp.

The alternative would be checking random chunks of the repo, but this likely isn't better than using a random percentage.

Check not only checks the integrity of repository files, but also checks for repository corruption. There have been many cases where the repository gets corrupted even though the blobs themselves are error free (due to bug in restic code or other reasons). So check is very useful for cloud storage, even if cloud storage ensures integrity. If corruption is detected before data needs to be restored, it can be fixed in timely manner. The feature does have value, even if it increases the complexity.

Thinkscape commented 1 month ago

Check not only checks the integrity of repository files, but also checks for repository corruption.

That is true even in the current version of backrest. If you had set it to 0.001% (0% might work as well) it already checks the indexes, and stats the packs.

SiddharthManthan commented 1 month ago

Check not only checks the integrity of repository files, but also checks for repository corruption.

That is true even in the current version of backrest. If you had set it to 0.001% (0% might work as well) it already checks the indexes, and stats the packs.

But it will miss the data blobs with 0.001 %. Only by reading the data, we can detect errors.

Here is an example of data corruption. All files in repo (index, data blobs, etc) appear fine and checksum correctly but the actual data (after restore) will have different checksum. There have been such bugs in the past (look in forums).

These corruption can only be detected by full data read.

garethgeorge / backrest

Check does not check the entire repository #397

Problem

Solution