LINBIT / drbd

LINBIT DRBD kernel module
https://docs.linbit.com/docs/users-guide-9.0/
GNU General Public License v2.0

Full resync always gets stuck in congested (Behind) state after a few days #46

Open · Lathanderjk opened this issue 1 year ago

Lathanderjk commented 1 year ago

Hi,

When the on-congestion policy is "pull-ahead", the DRBD resource always gets stuck in the Behind state after 2-3 days, and the sync progress starts decreasing (98.20% -> 98.19% ... 98.12%). Unlike when a configured limit is actually hit, there is no entry in the kernel log about congestion-fill or congestion-extents being reached.

I tried increasing congestion-fill to absurd values (100M -> 200M, 500M, or 0 to disable it) and congestion-extents (even to a value higher than al-extents), and I also tried commenting both out of the configuration entirely. None of it helped; the outcome was the same.
Commenting out "on-congestion pull-ahead" (i.e. switching to the default policy, block) does help, and the resync starts progressing again.

When congested, the kernel log on the primary fills with thousands of identical entries repeating in a loop:

    [ +0.104537] drbd storage/0 drbd1 backup-dc: repl( PausedSyncS -> Ahead )
    [ +1.026862] drbd storage/0 drbd1 backup-dc: helper command: /sbin/drbdadm before-resync-source
    [ +0.002791] drbd storage/0 drbd1 backup-dc: helper command: /sbin/drbdadm before-resync-source exit code 0
    [ +0.001076] drbd storage/0 drbd1 backup-dc: repl( Ahead -> PausedSyncS )
    [ +0.000718] drbd storage/0 drbd1 backup-dc: Began resync as PausedSyncS (will sync 12844472428 KB [3211118107 bits set]).
    [ +0.043059] drbd storage/0 drbd1 backup-dc: repl( PausedSyncS -> Ahead )
    [ +1.040371] drbd storage/0 drbd1 backup-dc: helper command: /sbin/drbdadm before-resync-source
    [ +0.004944] drbd storage/0 drbd1 backup-dc: helper command: /sbin/drbdadm before-resync-source exit code 0
    [ +0.000894] drbd storage/0 drbd1 backup-dc: repl( Ahead -> PausedSyncS )
    [ +0.000695] drbd storage/0 drbd1 backup-dc: Began resync as PausedSyncS (will sync 12844472428 KB [3211118107 bits set]).
    [ +0.098964] drbd storage/0 drbd1 backup-dc: repl( PausedSyncS -> Ahead )
    [ +1.046465] drbd storage/0 drbd1 backup-dc: helper command: /sbin/drbdadm before-resync-source
    [ +0.003419] drbd storage/0 drbd1 backup-dc: helper command: /sbin/drbdadm before-resync-source exit code 0
    [ +0.004561] drbd storage/0 drbd1 backup-dc: repl( Ahead -> PausedSyncS )
    [ +0.010005] drbd storage/0 drbd1 backup-dc: Began resync as PausedSyncS (will sync 12844472428 KB [3211118107 bits set]).
    [ +0.046983] drbd storage/0 drbd1 backup-dc: repl( PausedSyncS -> Ahead )
    [ +1.022996] drbd storage/0 drbd1 backup-dc: helper command: /sbin/drbdadm before-resync-source
    [ +0.006174] drbd storage/0 drbd1 backup-dc: helper command: /sbin/drbdadm before-resync-source exit code 0
    [ +0.011331] drbd storage/0 drbd1 backup-dc: repl( Ahead -> PausedSyncS )
    [ +0.009500] drbd storage/0 drbd1 backup-dc: Began resync as PausedSyncS (will sync 12844472428 KB [3211118107 bits set]).
    [ +0.264396] drbd storage/0 drbd1 backup-dc: repl( PausedSyncS -> Ahead )
    [ +1.052604] drbd storage/0 drbd1 backup-dc: helper command: /sbin/drbdadm before-resync-source
    [ +0.004883] drbd storage/0 drbd1 backup-dc: helper command: /sbin/drbdadm before-resync-source exit code 0
    [ +0.008559] drbd storage/0 drbd1 backup-dc: repl( Ahead -> PausedSyncS )
    [ +0.008532] drbd storage/0 drbd1 backup-dc: Began resync as PausedSyncS (will sync 12844472428 KB [3211118107 bits set]).
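To spot this PausedSyncS <-> Ahead loop without scrolling through thousands of lines, the kernel log can be scanned for the "repl( X -> Y )" transitions shown above. The following is a minimal sketch, assuming the log lines have exactly that format; the function names and the "at least 3 round trips" threshold are hypothetical choices for illustration, not anything DRBD itself provides.

```python
import re
from collections import Counter

# Matches replication-state transitions such as "repl( PausedSyncS -> Ahead )".
# Assumes the exact spacing seen in the kernel log excerpt above.
REPL_RE = re.compile(r"repl\( (\w+) -> (\w+) \)")


def count_repl_transitions(lines):
    """Count each (old_state, new_state) replication transition in the given lines."""
    counts = Counter()
    for line in lines:
        match = REPL_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


def is_flapping(counts, threshold=3):
    """Hypothetical heuristic: report a loop when both directions of the
    PausedSyncS <-> Ahead transition occur at least `threshold` times."""
    forward = counts[("PausedSyncS", "Ahead")]
    backward = counts[("Ahead", "PausedSyncS")]
    return min(forward, backward) >= threshold


if __name__ == "__main__":
    # Two lines taken from the log excerpt, repeated to mimic the loop.
    sample = [
        "[ +0.104537] drbd storage/0 drbd1 backup-dc: repl( PausedSyncS -> Ahead )",
        "[ +0.001076] drbd storage/0 drbd1 backup-dc: repl( Ahead -> PausedSyncS )",
    ] * 5
    counts = count_repl_transitions(sample)
    for (old, new), n in counts.most_common():
        print(f"{old} -> {new}: {n}")
    print("loop detected" if is_flapping(counts) else "no loop detected")
```

In practice the sample list would be replaced by the output of dmesg (for example, the lines of a saved dmesg dump read from a file).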

Environment: DRBD 9.1.7 on Oracle Linux 8.6 (latest updates, but the same happens with packages that are a few months old). The full configuration is attached (storage.txt). Congestion handling is configured only for the backup storage node (backup-dc), because it is much slower.

Lathanderjk commented 1 year ago

The zero progress was actually caused by the dynamic resync-rate controller. After switching off congestion control, the resync started but never finished either. After setting a fixed rate ("c-plan-ahead 0" and "resync-rate 50M"), everything works as expected; a fixed rate is still better than no rate at all.
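With a fixed rate it becomes possible to estimate how long the full resync from the log ("12844472428 KB [3211118107 bits set]") should take. The sketch below assumes DRBD's 4 KiB-per-bitmap-bit granularity (which matches the log: 3211118107 × 4 = 12844472428), that "resync-rate 50M" means 50 MiB/s, and that the rate is sustained the whole time; competing application I/O would make the real duration longer. The helper names are hypothetical.

```python
# Each bit in the DRBD bitmap tracks 4 KiB of backing storage.
KIB_PER_BIT = 4


def bits_to_kib(bits_set):
    """Convert the 'bits set' count from the resync log line into KiB."""
    return bits_set * KIB_PER_BIT


def resync_seconds(kib_to_sync, rate_mib_per_s):
    """Lower-bound resync duration, assuming a constant rate in MiB/s."""
    return kib_to_sync / (rate_mib_per_s * 1024)


if __name__ == "__main__":
    kib = bits_to_kib(3211118107)  # value from the "Began resync" log line
    secs = resync_seconds(kib, 50)  # assumes resync-rate 50M == 50 MiB/s
    print(f"{kib} KiB at 50 MiB/s: {secs / 3600:.1f} h ({secs / 86400:.2f} days)")
```

Under these assumptions the full resync needs roughly 69.7 hours (about 2.9 days) of uninterrupted transfer.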