borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

Prune or delete : not found, but listed in compaction data with OSError: [Errno 5] Input/output error #5120

Closed pierrehenrymuller closed 4 years ago

pierrehenrymuller commented 4 years ago

Have you checked borgbackup docs, FAQ, and open Github issues?

Yes

Is this a BUG / ISSUE report or a QUESTION?

BUG

System information. For client/server mode post info for both machines.

Client: Debian 8.11 4.9.0-0.bpo.6-amd64 #1 SMP Debian 4.9.88-1+deb9u1~bpo8+1 (2018-05-13) x86_64 GNU/Linux
Server: Debian 9.12 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 GNU/Linux
Prune: Debian 9.12 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 GNU/Linux

Your borg version (borg -V).

1.1.10 when the corruption occurred, 1.1.11 now, with the same problem. Each version was downloaded from the GitHub releases page in linux64 format, on both servers and clients.

Operating system (distribution) and version.

Debian 9.12 on the Prune server and the backup Server
Debian 8.11 on the Client

Hardware / network configuration, and filesystems used.

Server: dedicated bare metal, software RAID 5, smartctl OK for all disks, RAID OK, backup storage on ext4, fsck.ext4 OK
Client: VPS, ext4
Prune: VPS, ext4

How much data is handled by borg?

The Server holds backups for several hundred servers running Debian 8, Debian 9.12 or Debian 10; only one client's backups have this problem. All backups together are around 6 TB; this client accounts for around 500 GB for one week of backups. Each borg client has its own SSH key and its own repository, the /backup/XXX/XXX path is unique for each client, and there is no concurrent access.

Full borg commandline that lead to the problem (leave away excludes and passwords)

On Client side:
export BORG_PASSPHRASE="XXX"
export BORG_RSH="ssh -q -i /root/.ssh/id_ed25519-backup"
export BORG_CONFIG_DIR="/var/backups/borg/config"
export BORG_CACHE_DIR="/var/backups/borg/cache"
export BORG_SECURITY_DIR="/var/backups/borg/config/security"
export BORG_KEYS_DIR="/var/backups/borg/config/keys"
export REPOSITORY=borgbackup@SERVER:/backup/XXX/XXX

/usr/local/bin/borg create --lock-wait 3000 --compression zlib,6 --exclude-caches --exclude "/swap" --exclude "/dev/" --exclude "/proc/" --exclude "/run/" --exclude "/sys/" --exclude "/tmp/" --exclude "/mnt/" --exclude "/var/cache/apt/" --exclude "/var/lib/apt/{mirrors,periodic}/" --exclude "/var/backups/borg/" $REPOSITORY::XXX_$date /

On Server side:
command="cd /backup/XXX/XXX;borg serve --append-only --restrict-to-path /backup/XXX/XXX",no-port-forwarding,no-X11-forwarding,no-pty,no-agent-forwarding,no-user-rc ssh-ed25519 XXXX root@XXX

On Prune side :

export BORG_PASSPHRASE="XXX"
export BORG_RSH="ssh -q -i /home/users/XXX/.ssh/XXX"
export REPOSITORY=borgbackup@SERVER:/backup/XXX/XXX
/usr/local/bin/borg prune --force -v $REPOSITORY --keep-within 3d -H 4 -d 7 -w 0 -m 0 -y 0

Describe the problem you're observing.

Backups are made every 6 hours with the same command for all client servers. On this server I saw that the disk space used by this Client was growing abnormally. On the Prune server, the logs show:

Remote: segment 5586 not found, but listed in compaction data
Remote: segment 6650 not found, but listed in compaction data
Remote: segment 6651 not found, but listed in compaction data
[x ~100]
Traceback (most recent call last):
  File "borg/remote.py", line 247, in serve
  File "borg/repository.py", line 461, in commit
  File "borg/repository.py", line 742, in compact_segments
  File "borg/repository.py", line 1408, in iter_objects
  File "borg/repository.py", line 1500, in _read
OSError: [Errno 5] Input/output error

Borg server: Platform: Linux SERVER 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64
Borg server: Linux: debian 9.12
Borg server: Borg: 1.1.11  Python: CPython 3.5.9 msgpack: 0.5.6
Borg server: PID: 14834  CWD: /backup
Borg server: sys.argv: ['borg', 'serve', '--umask=077']
Borg server: SSH_ORIGINAL_COMMAND: None
Platform: Linux PRUNE 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64
Linux: debian 9.12
Borg: 1.1.11  Python: CPython 3.5.9 msgpack: 0.5.6

So I ran a check --repair command with the same exports as for prune:
/usr/local/bin/borg check --repair $REPOSITORY
and the same output was printed.

I also tried to delete the oldest backup with

/usr/local/bin/borg delete $REPOSITORY YYY
Archive YYY not found (1/1).                                                                                                                                                                                                                        
Remote: segment 5586 not found, but listed in compaction data

followed by the same output as from prune: many "segment not found" lines and the input/output error.
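To see exactly which segments the server complains about, the numbers can be pulled out of a saved log and checked against the repository's data directory. This is only a sketch: the function names and the log file name `prune.log` are hypothetical, and it assumes the usual borg 1.x layout in which segment files are named by their number under `data/<subdir>/` inside the repository.

```shell
# missing_segments LOGFILE
# Print the segment numbers reported as "not found", one per line,
# sorted numerically and de-duplicated. The "Remote: " prefix is optional.
missing_segments() {
    sed -n 's/^\(Remote: \)\{0,1\}segment \([0-9][0-9]*\) not found, but listed in compaction data.*$/\2/p' "$1" | sort -n -u
}

# segment_present REPODIR NUMBER
# Succeed if a segment file with that number exists below REPODIR/data.
# Assumption: segment files are plain files named by their number.
segment_present() {
    [ -n "$(find "$1/data" -type f -name "$2" 2>/dev/null)" ]
}
```

Run on the backup server, something like `missing_segments prune.log | while read -r n; do segment_present REPODIR "$n" || echo "$n is really absent"; done` would show whether the compaction data and the files on disk disagree.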

When I run borg list I can see all backups, and I can mount an archive and navigate through its directories.

I tried deleting only the cache, but afterwards all commands produce the same output. This has happened to me before, but I did not investigate at the time; I reset the repository and started making new backups. This time I would like to understand the cause and avoid deleting all the backups of this Client.

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

I think the errors are due to this client's backup data.

Include any warning/errors/backtraces from the system logs

ThomasWaldmann commented 4 years ago

I/O error in general points to an error below borg.

I had a look into the source and this specific I/O error occurred when trying to read some more data from an open file that already had a successful read right before.

Did you look into kernel log?

You said SMART data is ok, did you mean overall status or also detailed entries like "pending sectors" etc.?

Can you give me the full output of 1.1.11 borg check --repair ... for that repo?

Can you try it also directly on the repo server like borg check --repair --repository-only ...?
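Since the I/O error is raised while reading an already open file, one way to check whether the problem sits below borg is to read every file in the repository directory with plain system tools and see which reads fail. A minimal sketch, to be run on the repo server; the function name is hypothetical and REPODIR stands for the repository path:

```shell
# find_unreadable_files DIR
# Read every regular file below DIR from start to end and print the
# path of each file that cannot be read completely (for example
# because the kernel returns EIO). Prints nothing if all reads succeed.
find_unreadable_files() {
    find "$1" -type f | sort | while IFS= read -r f; do
        if ! cat -- "$f" > /dev/null 2>&1; then
            printf '%s\n' "$f"
        fi
    done
}
```

Calling `find_unreadable_files REPODIR` without borg involved would show whether the same error appears for a specific segment file, and the kernel log can then be checked right after the failing read.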

pierrehenrymuller commented 4 years ago

Hello, there is nothing in kern.log concerning disks or filesystems, only iptables drops.

For smartctl, no value indicates an error:

for i in a b c d; do smartctl -a /dev/sd$i | grep -iE '(pending|read_error|Reallocated_Sector_Ct|Seek_Error_Rate)'; done
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
If Selective self-test is pending on power-up, resume after 0 minute delay.
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
If Selective self-test is pending on power-up, resume after 0 minute delay.
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
If Selective self-test is pending on power-up, resume after 0 minute delay.
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
If Selective self-test is pending on power-up, resume after 0 minute delay.
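With several disks, a filter that prints only the counters whose raw value is not zero can be easier to scan than the full grep output. A sketch, assuming the ten-column attribute table layout shown above; the function name and the choice of attributes are illustrative, not a complete health check:

```shell
# nonzero_smart_counters
# Read `smartctl -A` output on stdin and print "NAME RAW_VALUE" for the
# selected attributes whose raw value (column 10) is not 0.
nonzero_smart_counters() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Reallocated_Event_Count|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ && $10 != 0 { print $2, $10 }'
}
```

For example, `smartctl -A /dev/sda | nonzero_smart_counters` prints nothing when all of these counters are zero.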

Here is the log of last night's prune:

Synchronizing chunks cache...
Archives: 26, w/ cached Idx: 25, w/ outdated Idx: 3, w/o cached Idx: 4.
Reading cached archive chunk index for CLIENTHOST_2020-04-17-02H27 ...
Merging into master chunks index ...
(... some more ...)
Reading cached archive chunk index for CLIENTHOST_2020-04-10-20H27 ...
Merging into master chunks index ...
Done.
Remote: segment 5586 not found, but listed in compaction data
Remote: segment 6650 not found, but listed in compaction data
Remote: segment 6651 not found, but listed in compaction data
Remote: segment 6653 not found, but listed in compaction data
Remote: segment 6881 not found, but listed in compaction data
Remote: segment 8476 not found, but listed in compaction data
Remote: segment 9259 not found, but listed in compaction data
Remote: segment 10294 not found, but listed in compaction data
Remote: segment 10571 not found, but listed in compaction data
Remote: segment 10575 not found, but listed in compaction data
Remote: segment 10609 not found, but listed in compaction data
Remote: segment 11068 not found, but listed in compaction data
Remote: segment 11428 not found, but listed in compaction data
Remote: segment 11530 not found, but listed in compaction data
Remote: segment 11656 not found, but listed in compaction data
Remote: segment 11658 not found, but listed in compaction data
Remote: segment 11694 not found, but listed in compaction data
Remote: segment 11700 not found, but listed in compaction data
Remote: segment 11702 not found, but listed in compaction data
Remote: segment 11710 not found, but listed in compaction data
Remote: segment 11723 not found, but listed in compaction data
Remote: segment 11737 not found, but listed in compaction data
Remote: segment 11752 not found, but listed in compaction data
Remote: segment 11756 not found, but listed in compaction data
Remote: segment 11757 not found, but listed in compaction data
Remote: segment 11781 not found, but listed in compaction data
Remote: segment 11782 not found, but listed in compaction data
Remote: segment 11783 not found, but listed in compaction data
Remote: segment 11784 not found, but listed in compaction data
Remote: segment 11785 not found, but listed in compaction data
Remote: segment 11786 not found, but listed in compaction data
Remote: segment 11787 not found, but listed in compaction data
Remote: segment 11788 not found, but listed in compaction data
Remote: segment 11848 not found, but listed in compaction data
Remote: segment 11860 not found, but listed in compaction data
Remote: segment 11862 not found, but listed in compaction data
Remote: segment 11878 not found, but listed in compaction data
Remote: segment 11894 not found, but listed in compaction data
Remote: segment 11895 not found, but listed in compaction data
Remote: segment 11896 not found, but listed in compaction data
Remote: segment 11897 not found, but listed in compaction data
Remote: segment 11898 not found, but listed in compaction data
Remote: segment 11899 not found, but listed in compaction data
Remote: segment 11900 not found, but listed in compaction data
Remote: segment 11901 not found, but listed in compaction data
Remote: segment 11994 not found, but listed in compaction data
Remote: segment 11995 not found, but listed in compaction data
Remote: segment 11998 not found, but listed in compaction data
Remote: segment 11999 not found, but listed in compaction data
Remote: segment 12001 not found, but listed in compaction data
Remote: segment 12027 not found, but listed in compaction data
Remote: segment 12029 not found, but listed in compaction data
Remote: segment 12030 not found, but listed in compaction data
Remote: segment 12031 not found, but listed in compaction data
Remote: segment 12032 not found, but listed in compaction data
Remote: segment 12034 not found, but listed in compaction data
Remote: segment 12062 not found, but listed in compaction data
Remote: segment 12076 not found, but listed in compaction data
Remote: segment 12077 not found, but listed in compaction data
Remote: segment 12078 not found, but listed in compaction data
Remote: segment 12079 not found, but listed in compaction data
Remote: segment 12080 not found, but listed in compaction data
Remote: segment 12081 not found, but listed in compaction data
Remote: segment 12082 not found, but listed in compaction data
Remote: segment 12083 not found, but listed in compaction data
Remote: segment 12093 not found, but listed in compaction data
Remote: segment 12095 not found, but listed in compaction data
Remote: segment 12112 not found, but listed in compaction data
Remote: segment 12114 not found, but listed in compaction data
Remote: segment 12125 not found, but listed in compaction data
Remote: segment 12128 not found, but listed in compaction data
Remote: segment 12168 not found, but listed in compaction data
Remote: segment 12170 not found, but listed in compaction data
Remote: segment 12173 not found, but listed in compaction data
Remote: segment 12175 not found, but listed in compaction data
Remote: segment 12200 not found, but listed in compaction data
Remote: segment 12206 not found, but listed in compaction data
Remote: segment 12220 not found, but listed in compaction data
Remote: segment 12221 not found, but listed in compaction data
Remote: segment 12222 not found, but listed in compaction data
Remote: segment 12223 not found, but listed in compaction data
Remote: segment 12224 not found, but listed in compaction data
Remote: segment 12225 not found, but listed in compaction data
Remote: segment 12226 not found, but listed in compaction data
Remote: segment 12254 not found, but listed in compaction data
Remote: segment 12256 not found, but listed in compaction data
Remote: segment 12284 not found, but listed in compaction data
Remote: segment 12316 not found, but listed in compaction data
Remote: segment 12318 not found, but listed in compaction data
Remote: segment 12320 not found, but listed in compaction data
Remote: segment 12329 not found, but listed in compaction data
Remote: segment 12335 not found, but listed in compaction data
Remote: segment 12343 not found, but listed in compaction data
Remote: segment 12345 not found, but listed in compaction data
Remote: segment 12353 not found, but listed in compaction data
Remote: segment 12354 not found, but listed in compaction data
Remote: segment 12355 not found, but listed in compaction data
Remote: segment 12356 not found, but listed in compaction data
Remote: segment 12357 not found, but listed in compaction data
Remote: segment 12358 not found, but listed in compaction data
Remote: segment 12359 not found, but listed in compaction data
Remote: segment 12381 not found, but listed in compaction data
Remote: segment 12408 not found, but listed in compaction data
Remote: segment 12416 not found, but listed in compaction data
Remote: segment 12422 not found, but listed in compaction data
Remote: segment 12424 not found, but listed in compaction data
Remote: segment 12426 not found, but listed in compaction data
Remote: segment 12436 not found, but listed in compaction data
Remote: segment 12437 not found, but listed in compaction data
Remote: segment 12438 not found, but listed in compaction data
Remote: segment 12439 not found, but listed in compaction data
Remote: segment 12440 not found, but listed in compaction data
Remote: segment 12441 not found, but listed in compaction data
Remote: segment 12442 not found, but listed in compaction data
Remote: segment 12468 not found, but listed in compaction data
Remote: segment 12470 not found, but listed in compaction data
Remote: segment 12473 not found, but listed in compaction data
Remote: segment 12480 not found, but listed in compaction data
Remote: segment 12503 not found, but listed in compaction data
Remote: segment 12529 not found, but listed in compaction data
Remote: segment 12531 not found, but listed in compaction data
Remote: segment 12533 not found, but listed in compaction data
Remote: segment 12535 not found, but listed in compaction data
Remote: segment 12545 not found, but listed in compaction data
Remote: segment 12547 not found, but listed in compaction data
Remote: segment 12555 not found, but listed in compaction data
Remote: segment 12556 not found, but listed in compaction data
Remote: segment 12557 not found, but listed in compaction data
Remote: segment 12558 not found, but listed in compaction data
Remote: segment 12559 not found, but listed in compaction data
Remote: segment 12560 not found, but listed in compaction data
Remote: segment 12561 not found, but listed in compaction data
Remote: segment 12562 not found, but listed in compaction data
Remote: segment 12575 not found, but listed in compaction data
Remote: segment 12590 not found, but listed in compaction data
Remote: segment 12620 not found, but listed in compaction data
Remote: segment 12621 not found, but listed in compaction data
Remote: segment 12634 not found, but listed in compaction data
Remote: segment 12636 not found, but listed in compaction data
Remote: segment 12638 not found, but listed in compaction data
Traceback (most recent call last):
  File "borg/remote.py", line 247, in serve
  File "borg/repository.py", line 461, in commit
  File "borg/repository.py", line 742, in compact_segments
  File "borg/repository.py", line 1408, in iter_objects
  File "borg/repository.py", line 1500, in _read
OSError: [Errno 5] Input/output error

Borg server: Platform: Linux dd-bck-1 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64
Borg server: Linux: debian 9.12 
Borg server: Borg: 1.1.11  Python: CPython 3.5.9 msgpack: 0.5.6
Borg server: PID: 3188  CWD: /backup
Borg server: sys.argv: ['borg', 'serve', '--umask=077', '--info']
Borg server: SSH_ORIGINAL_COMMAND: None
Platform: Linux dd-bckmaster-1 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64
Linux: debian 9.12 
Borg: 1.1.11  Python: CPython 3.5.9 msgpack: 0.5.6
PID: 19773  CWD: /home/users/borgmaster
sys.argv: ['/usr/local/bin/borg', 'prune', '--force', '-v', 'borgbackup@SERVERHOST:/backup/XXX/CLIENTHOST', '--keep-within', '3d', '-H', '4', '-d', '7', '-w', '0', '-m', '0', '-y', '0']
SSH_ORIGINAL_COMMAND: None

And here is the repair run directly on the backup server, which fails sooner:

/usr/local/bin/borg check --repair --repository-only REPODIR/
'check --repair' is an experimental feature that might result in data loss.
Type 'YES' if you understand this and want to continue: YES (from BORG_CHECK_I_KNOW_WHAT_I_AM_DOING)
YES
Local Exception
Traceback (most recent call last):
  File "borg/archiver.py", line 4529, in main
  File "borg/archiver.py", line 4461, in run
  File "borg/archiver.py", line 166, in wrapper
  File "borg/archiver.py", line 328, in do_check
  File "borg/repository.py", line 957, in check
  File "borg/repository.py", line 1408, in iter_objects
  File "borg/repository.py", line 1500, in _read
OSError: [Errno 5] Input/output error

Platform: Linux SERVERHOST 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64
Linux: debian 9.12
Borg: 1.1.11  Python: CPython 3.5.9 msgpack: 0.5.6
PID: 41377  CWD: /backup/cef
sys.argv: ['/usr/local/bin/borg', 'check', '--repair', '--repository-only', 'REPODIR/']
SSH_ORIGINAL_COMMAND: None

Those are the more specific details. What do you think?

ThomasWaldmann commented 4 years ago

Hmm, root cause still unclear.

SMART looks good AFAICS. Did you also look into SMART Logs (same place, below that attribute table)?

Can you copy the whole repo to another disk / ssd / machine?

That might be useful for 2 reasons:

BTW, there is PR #4940 which tries to improve this, but it is not merged yet and also not clear whether we actually can improve this (because issue is very likely below borg). Getting more information about the root cause would help.

Of course, borg check should not crash, but we must make sure that we actually improve the situation. Just not crashing and making it worse would help nobody.

pierrehenrymuller commented 4 years ago

I'm back after transferring all the data to another location, where I was able to run borg check --repair --repository-only without errors.

Here is the output:

borg check --repair --repository-only directory/
'check --repair' is an experimental feature that might result in data loss.
Type 'YES' if you understand this and want to continue: YES (from BORG_CHECK_I_KNOW_WHAT_I_AM_DOING)
Data integrity error: Segment entry checksum mismatch [segment 12665, offset 206673397]

This command decreased the size of the repo, to about the size this repo has on the other backup server. I have started an rsync from the corrected repo back to the original server to get a working repository there. In parallel, I ran the same command again on the source server, but it failed as before.
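Because the source server is the suspect, it may be worth confirming after the rsync that the files there read back identical to the corrected copy. A minimal sketch using the POSIX `cksum` utility; the function names are hypothetical, and for two different hosts the output of `checksum_tree` would be saved on each side and compared with diff instead of calling `same_tree`:

```shell
# checksum_tree DIR
# Print "checksum size ./relative/path" for every regular file below
# DIR, sorted by path, so that two trees can be compared line by line.
checksum_tree() {
    ( cd "$1" && find . -type f | sort | while IFS= read -r f; do
        cksum "$f"
    done )
}

# same_tree DIR1 DIR2
# Succeed only if both trees contain the same files with the same
# checksums and sizes.
same_tree() {
    [ "$(checksum_tree "$1")" = "$(checksum_tree "$2")" ]
}
```

A mismatch, or a read error reported by `cksum` on the source server only, would point to the storage stack of that machine.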

So we should focus on a problem on the source server, one that is detectable neither in the SMART data of the disks, nor in the state of the RAID, nor in the state of the ext4 filesystem.

Here is the complete SMART output for all disks:

###### sda ######
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-11-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     HGST HUS726T4TALA6L1
Serial Number:    V6JHYM3S
LU WWN Device Id: 5 000cca 097e36fc0
Firmware Version: VLGNX460
User Capacity:    4 000 787 030 016 bytes [4,00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Apr 29 09:41:42 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (   87) seconds.
Offline data collection
capabilities:            (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    ( 511) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   130   130   054    Pre-fail  Offline      -       100
  3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       2
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       3743
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       155
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       155
194 Temperature_Celsius     0x0002   157   157   000    Old_age   Always       -       38 (Min/Max 25/41)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

###### sdb ######
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-11-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     HGST HUS726T4TALA6L1
Serial Number:    V6HSE82S
LU WWN Device Id: 5 000cca 097d8bdd1
Firmware Version: VLGNX460
User Capacity:    4 000 787 030 016 bytes [4,00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Apr 29 09:41:42 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (   87) seconds.
Offline data collection
capabilities:            (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    ( 525) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   130   130   054    Pre-fail  Offline      -       100
  3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       5
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       3806
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       5
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       161
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       161
194 Temperature_Celsius     0x0002   157   157   000    Old_age   Always       -       38 (Min/Max 25/41)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

###### sdc ######
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-11-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi/HGST Ultrastar 7K4000
Device Model:     HGST HUS724040ALA640
Serial Number:    PN1334PEHY184S
LU WWN Device Id: 5 000cca 250db4b01
Firmware Version: MFAOABY0
User Capacity:    4 000 787 030 016 bytes [4,00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Apr 29 09:41:43 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (   24) seconds.
Offline data collection
capabilities:            (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   1) minutes.
Extended self-test routine
recommended polling time:    ( 540) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   138   138   054    Pre-fail  Offline      -       76
  3 Spin_Up_Time            0x0007   169   169   024    Pre-fail  Always       -       515 (Average 401)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       64
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   142   142   020    Pre-fail  Offline      -       25
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       27438
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       64
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       470
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       470
194 Temperature_Celsius     0x0002   157   157   000    Old_age   Always       -       38 (Min/Max 14/49)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     22116         -
# 2  Short offline       Completed without error       00%     22100         -
# 3  Short offline       Completed without error       00%     22100         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

###### sdd ######
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-11-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi/HGST Ultrastar 7K4000
Device Model:     HGST HUS724040ALA640
Serial Number:    PN1334PEHXD16S
LU WWN Device Id: 5 000cca 250daff16
Firmware Version: MFAOABY0
User Capacity:    4 000 787 030 016 bytes [4,00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Apr 29 09:41:46 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (   24) seconds.
Offline data collection
capabilities:            (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   1) minutes.
Extended self-test routine
recommended polling time:    ( 548) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   137   137   054    Pre-fail  Offline      -       78
  3 Spin_Up_Time            0x0007   169   169   024    Pre-fail  Always       -       513 (Average 400)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       53
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   140   140   020    Pre-fail  Offline      -       26
  9 Power_On_Hours          0x0012   096   096   000    Old_age   Always       -       28081
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       53
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       80
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       80
194 Temperature_Celsius     0x0002   166   166   000    Old_age   Always       -       36 (Min/Max 15/55)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     22759         -
# 2  Short offline       Completed without error       00%     22743         -
# 3  Short offline       Completed without error       00%     22743         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
ThomasWaldmann commented 4 years ago

SMART status looks good, but you have never run a SMART long test. Of course that takes a long time, but I guess that would be the next step if there is no better idea.
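A long test is normally started with `smartctl -t long /dev/sdX` and shows up afterwards in the self-test log as an "Extended offline" entry. As a rough sketch (not part of smartmontools or borg; the function name `extended_tests` is made up for illustration, and the regex only assumes the log layout pasted in this thread), one could check a saved `smartctl` report for such entries like this:

```python
import re

# Matches self-test log rows such as:
# "# 1  Extended offline    Completed without error       00%      3758         -"
_ROW = re.compile(
    r"^#\s*\d+\s+Extended offline\s+(.+?)\s+\d+%\s+(\d+)\s+\S+\s*$"
)


def extended_tests(report):
    """Return a list of (status, lifetime_hours) for each extended self-test row."""
    results = []
    for line in report.splitlines():
        match = _ROW.match(line)
        if match:
            results.append((match.group(1), int(match.group(2))))
    return results
```

An empty list means that no long test has been logged for that disk yet, which is the situation described above.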

pierrehenrymuller commented 4 years ago

Tests launched, results tomorrow.

However, I don't understand how this could be a storage problem: when I copy the data elsewhere, the copy is valid, so it must have been written to the source disks correctly and without error. Does Borg do anything at a lower level than the filesystem? Bad sectors seem unlikely too, since the disk remaps them, and in any case the data still reads back as valid.

pierrehenrymuller commented 4 years ago

The long tests finished without errors. The disks are in good health and do not need to be replaced.

So something else must be causing these problems; it is not necessarily the hardware. What should we investigate next?

I have also discovered that other repositories have the same problem, but because their volumes are much smaller, I had not noticed the gap of a few gigabytes or tens of gigabytes between the two backup servers.

Complete smartctl status:

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-11-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     HGST HUS726T4TALA6L1
Serial Number:    V6JHYM3S
LU WWN Device Id: 5 000cca 097e36fc0
Firmware Version: VLGNX460
User Capacity:    4 000 787 030 016 bytes [4,00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Apr 30 16:02:24 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (   87) seconds.
Offline data collection
capabilities:            (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    ( 511) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   130   130   054    Pre-fail  Offline      -       100
  3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       2
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       3773
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       156
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       156
194 Temperature_Celsius     0x0002   162   162   000    Old_age   Always       -       37 (Min/Max 25/41)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      3758         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-11-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     HGST HUS726T4TALA6L1
Serial Number:    V6HSE82S
LU WWN Device Id: 5 000cca 097d8bdd1
Firmware Version: VLGNX460
User Capacity:    4 000 787 030 016 bytes [4,00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Apr 30 16:02:24 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (   87) seconds.
Offline data collection
capabilities:            (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    ( 525) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   130   130   054    Pre-fail  Offline      -       100
  3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       5
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       3836
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       5
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       162
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       162
194 Temperature_Celsius     0x0002   157   157   000    Old_age   Always       -       38 (Min/Max 25/41)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      3821         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-11-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi/HGST Ultrastar 7K4000
Device Model:     HGST HUS724040ALA640
Serial Number:    PN1334PEHY184S
LU WWN Device Id: 5 000cca 250db4b01
Firmware Version: MFAOABY0
User Capacity:    4 000 787 030 016 bytes [4,00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Apr 30 16:02:24 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (   24) seconds.
Offline data collection
capabilities:            (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   1) minutes.
Extended self-test routine
recommended polling time:    ( 540) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   138   138   054    Pre-fail  Offline      -       76
  3 Spin_Up_Time            0x0007   169   169   024    Pre-fail  Always       -       515 (Average 401)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       64
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   145   145   020    Pre-fail  Offline      -       24
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       27468
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       64
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       470
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       470
194 Temperature_Celsius     0x0002   157   157   000    Old_age   Always       -       38 (Min/Max 14/49)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     27454         -
# 2  Short offline       Completed without error       00%     22116         -
# 3  Short offline       Completed without error       00%     22100         -
# 4  Short offline       Completed without error       00%     22100         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-11-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi/HGST Ultrastar 7K4000
Device Model:     HGST HUS724040ALA640
Serial Number:    PN1334PEHXD16S
LU WWN Device Id: 5 000cca 250daff16
Firmware Version: MFAOABY0
User Capacity:    4 000 787 030 016 bytes [4,00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Apr 30 16:02:26 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (   24) seconds.
Offline data collection
capabilities:            (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   1) minutes.
Extended self-test routine
recommended polling time:    ( 548) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   137   137   054    Pre-fail  Offline      -       78
  3 Spin_Up_Time            0x0007   169   169   024    Pre-fail  Always       -       513 (Average 400)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       53
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   142   142   020    Pre-fail  Offline      -       25
  9 Power_On_Hours          0x0012   096   096   000    Old_age   Always       -       28111
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       53
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       80
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       80
194 Temperature_Celsius     0x0002   171   171   000    Old_age   Always       -       35 (Min/Max 15/55)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     28097         -
# 2  Short offline       Completed without error       00%     22759         -
# 3  Short offline       Completed without error       00%     22743         -
# 4  Short offline       Completed without error       00%     22743         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
ThomasWaldmann commented 4 years ago

I agree, the disks look fine. So I guess we can rule out physical disks and disk cables (cable/interface errors usually show up as ICRC log entries [and/or UDMA_CRC_Error?]).
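To make that check repeatable across many disks, here is a minimal sketch that scans a saved `smartctl` attribute table for the counters that usually point at media or cable trouble. The helper names and the `WATCHED` list are illustrative assumptions, not an official health criterion, and the parsing assumes the 10-column layout shown in this thread:

```python
# Attributes whose non-zero raw value usually deserves a closer look.
# This selection is an assumption made for illustration.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)


def parse_smart_attributes(report):
    """Map attribute name -> integer raw value for each attribute row."""
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the 10th column holds the leading number of RAW_VALUE.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def suspicious_attributes(report):
    """Return the watched attributes whose raw value is greater than zero."""
    attrs = parse_smart_attributes(report)
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}
```

For the four disks in this thread the result would be empty, because all watched raw values are 0.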

borg doesn't do anything below the filesystem layer. The most special thing it does is stuff like fadvise/dontneed and sync_file_range to avoid eating up lots of memory for disk caching, and opening files without touching atime (if possible). But even if these calls did not do their special thing, this would not affect correctness.
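For readers unfamiliar with these calls, here is a rough illustration of the idea using Python's standard `os` module. This is not borg's actual code: the function name is made up, and `sync_file_range` is left out because the `os` module does not expose it. Both the `O_NOATIME` flag and the fadvise call are only hints, so the data returned is the same whether or not they take effect:

```python
import os


def read_without_cache_pollution(path, chunk_size=1 << 20):
    """Read a whole file, then hint the kernel to drop its cached pages."""
    noatime = getattr(os, "O_NOATIME", 0)  # Linux-only flag; 0 elsewhere
    try:
        fd = os.open(path, os.O_RDONLY | noatime)
    except PermissionError:
        # O_NOATIME is only permitted for the file owner or a privileged
        # process, so fall back to a plain read-only open.
        fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            chunks.append(chunk)
        if hasattr(os, "posix_fadvise"):
            try:
                # Offset 0 with length 0 means "the whole file".
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
            except OSError:
                pass  # purely advisory: a failure here changes nothing
        return b"".join(chunks)
    finally:
        os.close(fd)
```

A genuine read failure such as `[Errno 5] Input/output error` would surface from `os.read` as an `OSError`, independently of the hints.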

I can't explain why copying the repo worked while using the original repo with borg led to that IOError. The simplest explanation may be a random failure. But as using the repo copy with borg also worked, there does not seem to be a borg problem.

ThomasWaldmann commented 4 years ago

BTW, did you read and follow the advisory in the changelog when upgrading to 1.1.11 (see top of file)?

https://github.com/borgbackup/borg/blob/1.1.11/docs/changes.rst#pre-1111-potential-index-corruption--data-loss-issue

Hmm, as that includes borg check and you also tried borg check ...

Also, as the copied repo worked, it does not look like the kind of corruption the advisory is trying to fix.

ThomasWaldmann commented 4 years ago

@pierrehenrymuller Did you find out anything more about this issue?

pierrehenrymuller commented 4 years ago

I have not found any further cause or other logs. For the corrupted backups, I dropped all content on the server for the repos that had this problem; only one was moved to another server to be repaired. Because transfer and processing were too slow, I did not do this for all the backed-up OSes.

Since then I have had no more problems on the same hardware. The only thing that changed between before and after is the borg version, from 1.1.10 to 1.1.11.

Thanks

pierrehenrymuller commented 4 years ago

There's really a problem with these (seemingly) file system errors.

The problem has just happened again on the same server, and also on another server that has never had this kind of problem.

The first server is bare metal, so a hardware fault is conceivable, even though I just did a complete check of the filesystem, RAID and disks without finding any problem. In particular, I did a search for defective sectors without finding anything.

The second server is a VM in a cloud. There's no way to check anything other than the fsck.

In both cases I can fully copy the data from the problematic repo. Checking with md5sum shows the copied files are identical, and the copy works fine from another local directory on the same server.

All this happened with 1.1.11; I upgraded client and server to 1.1.13 without any change in behavior.

Note that these problems occur only on the rests holding more than 500GB of data (up to a maximum of 1.2TB); on the smaller rests I have never had any problems.

Ideally, there should be more detail on what the code fails to do and why it considers this to be an error. Is there an advanced debug mode that I can test for more details?

ThomasWaldmann commented 4 years ago

Sorry, I don't understand "rests" in that context. Did you mean "repositories"?

ThomasWaldmann commented 4 years ago

Can you update with log output you see with borg 1.1.13?

Is this for a fresh repo, created with 1.1.13 or did you recreate repo index and caches, as advised in the advisory? I just want to rule out issues that already were fixed in the code.

About more details: if there is an IOError, that comes from the OS, and borg / python does not get more than the error number (errno), which is then translated to the textual "I/O Error". We always try to give as much and as original information as possible.
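As a rough illustration of the point above (this is not borg's actual code), the sketch below shows what Python exposes on an OSError: the numeric errno, its symbolic name, the text produced by os.strerror(), and an optional filename. The helper name describe_oserror is hypothetical, and the exception is built by hand here rather than coming from a real failing disk.

```python
import errno
import os


def describe_oserror(exc):
    """Collect the fields an OSError carries (hypothetical helper)."""
    return {
        "errno": exc.errno,                                   # numeric code from the OS
        "symbol": errno.errorcode.get(exc.errno, "unknown"),  # e.g. "EIO"
        "message": exc.strerror,                              # text for that code
        "filename": exc.filename,                             # None if no path was involved
    }


# Construct the same kind of exception the OS layer would produce for EIO.
example = OSError(errno.EIO, os.strerror(errno.EIO))
info = describe_oserror(example)
print(info)
```

On Linux, errno.EIO is 5, which matches the "[Errno 5] Input/output error" in the report. Nothing beyond these fields is available to the application, which is why the kernel logs are the next place to look.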

There might be more in the kernel logs, have a look at dmesg and /var/log/syslog.

pierrehenrymuller commented 4 years ago

> Sorry, I don't understand "rests" in that context. Did you mean "repositories"?

Yes.

This repo was not created with 1.1.13; the last time I deleted and recreated it was with 1.1.11.

But for the first time, the IOError is clearly detectable with a shell command. With this command I see two files that are not readable:

find repo -type f -print -exec cat {} >/dev/null \;

The same command did not output any error during the past events, and last time I copied the whole repo to another server to get the repair to succeed.

Then I ran fsck and a full smartctl test without errors. I have started a badblocks search on all disks.

It seems borg detected the IOError before the data became unreadable by other means. So it's not a borg problem, but I suggest catching the exception and printing the file that has the problem. That would make it easier to investigate without having to scan with the find command. Thanks
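The find-based scan described in this comment can also be sketched in Python, which makes it easy to record the errno next to each failing path. This is a minimal sketch and not part of borg; the function name find_unreadable_files and the 1 MiB chunk size are assumptions made for illustration.

```python
import os


def find_unreadable_files(root, chunk_size=1024 * 1024):
    """Read every file under root and return (path, errno, message) for failures."""
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    # Read the whole file in chunks so a bad block anywhere is hit.
                    while f.read(chunk_size):
                        pass
            except OSError as exc:
                failures.append((path, exc.errno, exc.strerror))
    return failures
```

Calling find_unreadable_files("repo") on the repository directory should list the same files that the cat-based command complains about, assuming the read error is reproducible at the time of the scan.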

ThomasWaldmann commented 4 years ago

Note: OSError contains a filename attribute (but only for calls that involve a filename, so maybe not for read()).
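A small sketch of what that means in practice: a failing os.open() already carries the path in exc.filename, while a failing os.read() only knows the file descriptor, so a caller that wants the path in the message has to attach it itself. The helper read_segment is hypothetical and is not how borg reads its segment files.

```python
import os


def read_segment(path, size=4096):
    """Read up to size bytes from path; attach the path to read errors."""
    fd = os.open(path, os.O_RDONLY)  # an error here already has exc.filename set
    try:
        try:
            return os.read(fd, size)
        except OSError as exc:
            # os.read() works on a bare fd, so exc.filename is None; add the path.
            raise OSError(exc.errno, exc.strerror, path) from exc
    finally:
        os.close(fd)
```

With a wrapper like this, an EIO during read would be reported together with the path of the file being read, which is the information asked for earlier in the thread.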

ThomasWaldmann commented 4 years ago

Closing this: the error is below borg (maybe hardware or OS).