LINBIT / drbdtop

CLI management tool for DRBD. Like top, but for DRBD resources.
GNU General Public License v2.0

Negative value in sync progress bar #7

Closed (elonen closed this issue 5 years ago)

elonen commented 6 years ago

I have some resources for which drbdtop shows a negative percentage in the progress bar:

DRBDTOP v0.2.1-3-g70075d0 (kernel: v9.0.15/utils: v9.5.0) - Details for vm-100-disk-1
┌─────────────────────────────────────┐┌In Sync────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│Vol 0 (/dev/drbd1018)                ││                                                   -549739692404%                                                  │
└─────────────────────────────────────┘└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

It also shows some warnings/errors on one node in the main view, but "all ok" on another:

│  Name                  | Role      | Disks | Peer Disks | Connections | Overall | Quorum                                                                 │
│  vm-100-disk-1         | Secondary | ✓     | ✗ (43)     | ✓           | ✗ (43)  | ✓

vs.

  vm-100-disk-1         | Secondary | ✓     | ✓          | ✓           | ✓       | ✓

Detailed status claims both are UpToDate:

DRBDTOP v0.2.1-3-g70075d0 (kernel: v9.0.15/utils: v9.5.0) - Details for vm-100-disk-1
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│Resource: vm-100-disk-1: (Overall danger score: 43)                                                                                                       │
│ Local Disc(Secondary):                                                                                                                                   │
│  volume 0 (/dev/drbd1018): UpToDate(normal disk state)                                                                                                   │
│    size: 32.0GiB total-read:0.0KiB read/Sec:0.0KiB total-written:0.0KiB written/Sec:0.0KiB                                                               │
│                                                                                                                                                          │
│ Connection to mox-a(Secondary): Connected(connected to mox-a)                                                                                            │
│  volume 0:                                                                                                                                               │
│   UpToDate(normal disk state)                                                                                                                            │
│   Sent: total:0.0KiB Per/Sec:0.0KiB                                                                                                                      │
│   Received: total:0.0KiB Per/Sec:0.0KiB                                                                                                                  │
│   OutOfSync: current:8.0ZiB average:1.1ZiB min:8.0ZiB max:8.0ZiB                                                                                         │
│   PendingWrites: current:0.0 average:0.0 min:0.0 max:0.0                                                                                                 │
│   UnackedWrites: current:0.0 average:0.0 min:0.0 max:0.0

Verify shows out-of-sync counts larger than the size of the resource, though, so maybe this is actually more of a DRBD9 issue than a drbdtop UI problem..?

[37867.086140] drbd vm-100-disk-1: Preparing cluster-wide state change 1052119874 (0->1 496/288)
[37867.086266] drbd vm-100-disk-1: State change 1052119874: primary_nodes=0, weak_nodes=0
[37867.086266] drbd vm-100-disk-1: Committing cluster-wide state change 1052119874 (0ms)
[37867.086269] drbd vm-100-disk-1/0 drbd1018 mox-a: repl( Established -> VerifyS )
[37867.086271] drbd vm-100-disk-1/0 drbd1018 mox-a: Starting Online Verify from sector 0
[37958.172278] drbd vm-100-disk-1/0 drbd1018 mox-a: Online verify done (total 91 sec; paused 0 sec; 368740 K/sec)
[37958.172279] drbd vm-100-disk-1/0 drbd1018 mox-a: Online verify found 18446744073709550592 4k blocks out of sync!
[37958.172381] drbd vm-100-disk-1/0 drbd1018 mox-a: repl( VerifyS -> Established )
[37958.172388] drbd vm-100-disk-1/0 drbd1018 mox-a: helper command: /sbin/drbdadm out-of-sync
[37958.174733] drbd vm-100-disk-1/0 drbd1018 mox-a: helper command: /sbin/drbdadm out-of-sync exit code 0 (0x0)
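
For what it's worth, 18446744073709550592 is exactly 2^64 - 1024, i.e. a count of -1024 4k blocks printed as an unsigned 64-bit integer. A minimal Go sketch of that wrap-around (purely illustrative; the real counter lives in the C kernel module):

```go
package main

import "fmt"

func main() {
	// Hypothetical out-of-sync counter in 4k blocks. If it is decremented
	// 1024 blocks below zero, unsigned 64-bit arithmetic wraps around 2^64.
	var outOfSyncBlocks uint64 = 0
	outOfSyncBlocks -= 1024
	fmt.Println(outOfSyncBlocks) // 18446744073709550592, the value in the verify log above
}
```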
rck commented 6 years ago

Yes, in this case drbdtop is already being fed wrong information.

elonen commented 6 years ago

OK. Any idea where I can report this as a DRBD9 issue (https://github.com/LINBIT/drbd-9.0 apparently doesn't have Issues enabled)? I guess verify should never show bogus numbers like that.

rck commented 6 years ago

Making me aware of it here is good enough; I also work on the other projects. Yes, for kernel stuff we use mailing lists; in that case you would report it to drbd-user.

elonen commented 6 years ago

Reopening. It turns out there's one resource that exists only on a single node (and doesn't report any out-of-sync sectors), but it is still shown with a huge negative percentage in the progress bar in drbdtop:

DRBDTOP v0.2.1-3-g70075d0 (kernel: v9.0.15/utils: v9.5.0) - Details for vm-101-disk-2
┌───────────────────────────────────────────────┐┌In Sync──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│Vol 0 (/dev/drbd1000)                          ││                                                               -9223372036854775708%                                                             │
└───────────────────────────────────────────────┘└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

---

│Resource: vm-101-disk-2: (Overall danger score: 0)                                                                                                                                                │
│ Local Disc(Primary):                                                                                                                                                                             │
│  volume 0 (/dev/drbd1000): UpToDate(normal disk state)                                                                                                                                           │
│    size: 200.0GiB total-read:0.0KiB read/Sec:0.0KiB total-written:0.0KiB written/Sec:0.0KiB

Should probably show "N/A (no peers)" or something to that effect?
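
A minimal sketch of what such a guard could look like before drawing the bar (the helper name and wording here are hypothetical, not drbdtop's actual code):

```go
package main

import "fmt"

// syncBarLabel is a hypothetical formatting helper: fall back to "N/A"
// when there are no peers, or when the computed percentage is clearly
// out of range (e.g. derived from a wrapped out-of-sync counter).
func syncBarLabel(numPeers int, percent float64) string {
	if numPeers == 0 {
		return "N/A (no peers)"
	}
	if percent < 0 || percent > 100 {
		return "N/A (inconsistent statistics)"
	}
	return fmt.Sprintf("%.1f%%", percent)
}

func main() {
	fmt.Println(syncBarLabel(0, 100))                  // N/A (no peers)
	fmt.Println(syncBarLabel(1, -9223372036854775708)) // N/A (inconsistent statistics)
	fmt.Println(syncBarLabel(1, 87.5))                 // 87.5%
}
```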

elonen commented 6 years ago

The one-node occurrence doesn't really make much sense, but in general, this is starting to look like some sort of integer wrap-around bug in DRBD 9. Here's a summary of my emails to drbd-user concerning this and related problems. See especially the last one.


I've got several DRBD 9 resources that constantly show UpToDate with 9223372036854774304 KiB (roughly 8 ZiB) of out-of-sync data.

Example:

# drbdsetup status --verbose --statistics vm-106-disk-1
vm-106-disk-1 node-id:0 role:Primary suspended:no
    write-ordering:flush
  volume:0 minor:1003 disk:UpToDate quorum:yes
      size:16777688 read:215779 written:22369564 al-writes:89 bm-writes:0 upper-pending:0
      lower-pending:0 al-suspended:no blocked:no
  mox-a node-id:1 connection:Connected role:Secondary congested:no ap-in-flight:0
      rs-in-flight:18446744073709549808
    volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
        received:215116 sent:22368903 out-of-sync:9223372036854774304 pending:0 unacked:0
  mox-c node-id:2 connection:Connected role:Secondary congested:no ap-in-flight:0
      rs-in-flight:18446744073709549808
    volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
        received:1188 sent:19884428 out-of-sync:0 pending:0 unacked:0

Okay, today one of these resources got a sudden, severe filesystem corruption on the primary.

On the other hand, the secondaries (which showed 8 ZiB out-of-sync) were still mountable after I disconnected the corrupted primary. No idea how current the secondaries' data was, but drbdtop still showed them as connected and 8 ZiB out-of-sync.


Here's some more info. Dmesg shows some suspicious-looking log messages, such as:

1) FIXME drbd_s_vm-117-s[2830] op clear, bitmap locked for 'receive bitmap' by drbd_r_vm-117-s[5038]

2) Wrong magic value 0xffff0007 in protocol version 114

3) peer request with dagtag 399201392 not found got_peer_ack [drbd] failed

4) Rejecting concurrent remote state change 2226202936 because of state change 2923939731 Ignoring P_TWOPC_ABORT packet 2226202936.

5) drbd_r_vm-117-s[5038] going to 'detect_finished_resyncs()' but bitmap already locked for 'write from resync_finished' by drbd_w_vm-117-s[2812] md_sync_timer expired! Worker calls drbd_md_sync().

6) incompatible discard-my-data settings conn( Connecting -> Disconnecting ) error receiving P_PROTOCOL, e: -5 l: 7!

Two of the four nodes have DRBD 9.0.15-1 and two have 9.0.16-1. All of them are on transport API v16:

mox-a

version: 9.0.15-1 (api:2/proto:86-114)
GIT-hash: c46d27900f471ea0f5ba587592439a9ddde1d08b build by root@mox-a, 2018-10-28 03:08:58
Transports (api:16): tcp (9.0.15-1)

mox-b

version: 9.0.15-1 (api:2/proto:86-114)
GIT-hash: c46d27900f471ea0f5ba587592439a9ddde1d08b build by root@mox-b, 2018-10-10 17:50:25
Transports (api:16): tcp (9.0.15-1)

mox-c

version: 9.0.16-1 (api:2/proto:86-114)
GIT-hash: ab9777dfeaf9d619acc9a5201bfcae8103e9529c build by root@mox-c, 2018-10-28 05:45:05
Transports (api:16): tcp (9.0.16-1)

mox-d

version: 9.0.16-1 (api:2/proto:86-114)
GIT-hash: ab9777dfeaf9d619acc9a5201bfcae8103e9529c build by root@mox-d, 2018-10-29 00:22:23
Transports (api:16): tcp (9.0.16-1)

Running Proxmox (5.2-2), as you'd guess from the host names. DRBD resources are being managed by LINSTOR.


Just witnessed a resync (after an invalidate) steadily go from 100% out-of-sync to 0% (after several automatic disconnects and reconnects). Immediately after reaching 0%, it went to a negative percentage! After that, drbdtop started showing 8.0 ZiB out-of-sync.

kvaps commented 5 years ago

@elonen did you report this problem to the drbd-user mailing list? Do you have a link to that?

kvaps commented 5 years ago

I have a similar problem with some resources: everything is UpToDate, but out-of-sync is far too high:

drbdtop shows:

┌◉ (LIVE UPDATING) Resource List───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│  Name    | Role      | Disks | Peer Disks | Connections | Overall | Quorum                                                                                             │
│  myres-0 | Secondary | ✓     | ✗ (18)     | ✓           | ✗ (18)  | ✓                                                                                                  

drbdsetup status --verbose --statistics myres-0 shows:

myres-0 node-id:3 role:Secondary suspended:no
    write-ordering:flush
  volume:0 minor:1004 disk:UpToDate quorum:yes
      size:97656408 read:1226333792 written:268424740 al-writes:2907721 bm-writes:0 upper-pending:0 lower-pending:0 al-suspended:no blocked:no
  m7c7 node-id:0 connection:Connected role:Secondary congested:no ap-in-flight:0 rs-in-flight:0
    volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
        received:104533176 sent:0 out-of-sync:0 pending:0 unacked:0
  m8c27 node-id:2 connection:Connected role:Primary congested:no ap-in-flight:0 rs-in-flight:0
    volume:0 replication:Established peer-disk:Diskless peer-client:yes resync-suspended:no
        received:491053796 sent:257236 out-of-sync:94307148 pending:0 unacked:0
  m8c30 node-id:1 connection:Connected role:Secondary congested:no ap-in-flight:0 rs-in-flight:0
    volume:0 replication:Established peer-disk:Diskless peer-client:yes resync-suspended:no
        received:825266580 sent:360933148 out-of-sync:0 pending:0 unacked:0

What is wrong?

deleteriousEffect commented 5 years ago

In the output above, you have 94307148 out-of-sync blocks on node-id:2 for myres-0.

If you happen to see a negative out-of-sync value again, it would be awesome to get the output of drbdsetup events2 --timestamps --statistics --now

kvaps commented 5 years ago

@haySwim sorry, I was a little confused. In my case I have a different problem: out-of-sync is always 94307148 and never drops to lower values, although all resources are UpToDate.

deleteriousEffect commented 5 years ago

As far as I can tell based on the output you posted, the underlying information that's being reported to drbdtop has that value of 94307148, so that's the state of the world as far as drbdtop can know.

It's strange that the resource is reported as connected, the resync isn't suspended, and yet the number of out-of-sync blocks isn't changing. @rck, any ideas?

rck commented 5 years ago

We are working on it. We have seen that in non-drbdtop scenarios, so it is not a drbdtop bug per se.

kvaps commented 5 years ago

@rck, can I provide you with some useful information about these kinds of volumes?

raltnoeder commented 5 years ago

I can confirm that this is basically an integer wrap-around bug in the DRBD kernel module (the result of a subtraction r = a - b with b > a, to be more precise). It is normally triggered in setups with 3 or more nodes when there are at least three different data generations, Gen1 ahead of Gen2 ahead of Gen3: while Gen3 is resyncing to Gen2, Gen1 joins afterwards and updates the sync statistics on Gen2 and Gen3, but does not update some resync counters, which eventually leads to incorrect statistics being calculated. What's new in the case that @elonen reported is that it was apparently seen on a single-node resource, which is very surprising, since a single-node resource does not even have any peers to start a resync with. Could that be a leftover from a time when the resource still had peers?

The bug in DRBD has been tracked for a while but has not been completely analyzed, and as of now, no fix is available yet.
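
For illustration, a minimal Go sketch of the r = a - b underflow described above and of the kind of bogus percentage it can then produce (the counters, units, and formula here are made up; the real code is in the C kernel module and in drbdtop's rendering):

```go
package main

import "fmt"

func main() {
	// Made-up counters in KiB, standing in for "r = a - b with b > a":
	// the amount considered done exceeds the total by a small margin.
	var total, done uint64 = 33554432, 33554432 + 1504
	outOfSync := total - done // unsigned underflow: wraps to 2^64 - 1504

	fmt.Println(outOfSync) // 18446744073709550112, a zebibyte-range figure

	// A naive signed "percent in sync" derived from the wrapped counter
	// comes out hugely negative, much like the progress bars above.
	percent := 100 - int64(outOfSync/total*100)
	fmt.Println(percent, "%")
}
```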

crystalcross commented 4 years ago

I can provide some more input to this. This is happening to me extensively at the moment. The scenario is reproducible in a 3-node environment: Node 1 = Primary, Node 2 = Secondary (UpToDate), Node 3 = Secondary (Inconsistent).

Protocol A, on-congestion=pull-ahead

What seems to happen in this scenario is that Node 1 is syncing to Node 3. Then congestion occurs and Node 1 gets ahead of Node 2, which triggers a resync, and that consistently causes this error to occur.

Has there been any work towards correcting it? I'm on 9.0.21-1

raltnoeder commented 4 years ago

Thanks for the report. That does indeed seem to fall in the same category of error, probably incorrectly updated resync statistics counters at some point. We'll take a look at it.