Closed: elonen closed this issue 5 years ago.
Yes, here drbdtop already gets wrong information.
Ok. Any idea where I can report this as a DRBD9 issue (https://github.com/LINBIT/drbd-9.0 doesn't have Issues enabled apparently)? I guess verify should never show bogus numbers like that.
Making me aware here is good enough; I also work on the other projects. Yes, for kernel stuff we use mailing lists; in that case you would report it to drbd-user.
Reopening. Turns out there's one resource that exists only on a single node (and doesn't report any out-of-sync sectors), yet it is still shown with a negative-infinity progress bar in drbdtop:
DRBDTOP v0.2.1-3-g70075d0 (kernel: v9.0.15/utils: v9.5.0) - Details for vm-101-disk-2
┌───────────────────────────────────────────────┐┌In Sync──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│Vol 0 (/dev/drbd1000) ││ -9223372036854775708% │
└───────────────────────────────────────────────┘└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
---
│Resource: vm-101-disk-2: (Overall danger score: 0) │
│ Local Disc(Primary): │
│ volume 0 (/dev/drbd1000): UpToDate(normal disk state) │
│ size: 200.0GiB total-read:0.0KiB read/Sec:0.0KiB total-written:0.0KiB written/Sec:0.0KiB
Should probably show "N/A (no peers)" or something to that effect?
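For what it's worth, here's a rough sketch in Go (drbdtop's language) of how the "In Sync" field could be guarded. This is a hypothetical helper with made-up names, not drbdtop's actual code or API:

// formatSyncPercent is a hypothetical helper, not drbdtop code: a sketch of
// how the "In Sync" field could avoid printing wrapped or meaningless values.
package main

import "fmt"

func formatSyncPercent(peerCount int, outOfSyncKiB, sizeKiB uint64) string {
	if peerCount == 0 {
		return "N/A (no peers)"
	}
	if sizeKiB == 0 || outOfSyncKiB > sizeKiB {
		// A wrapped counter (e.g. ~2^63 out of sync on a 200 GiB volume)
		// can never yield a sane percentage; flag it instead of rendering it.
		return "N/A (implausible out-of-sync count)"
	}
	return fmt.Sprintf("%.1f%%", 100*float64(sizeKiB-outOfSyncKiB)/float64(sizeKiB))
}

func main() {
	fmt.Println(formatSyncPercent(0, 0, 209715200))                   // single-node resource: "N/A (no peers)"
	fmt.Println(formatSyncPercent(2, 9223372036854774304, 209715200)) // wrapped counter from above: flagged
}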
The one-node occurrence doesn't really make much sense, but in general this is starting to look like some sort of integer wrap-around bug in DRBD 9. Here's a summary of my emails to drbd-user concerning this and related problems. See especially the last one.
I've got several DRBD 9 resources that constantly show UpToDate with an out-of-sync count of 9223372036854774304 (just under 2^63, roughly 8 ZiB).
Example:
# drbdsetup status --verbose --statistics vm-106-disk-1
vm-106-disk-1 node-id:0 role:Primary suspended:no
write-ordering:flush
volume:0 minor:1003 disk:UpToDate quorum:yes
size:16777688 read:215779 written:22369564 al-writes:89 bm-writes:0 upper-pending:0
lower-pending:0 al-suspended:no blocked:no
mox-a node-id:1 connection:Connected role:Secondary congested:no ap-in-flight:0
rs-in-flight:18446744073709549808
volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
received:215116 sent:22368903 out-of-sync:9223372036854774304 pending:0 unacked:0
mox-c node-id:2 connection:Connected role:Secondary congested:no ap-in-flight:0
rs-in-flight:18446744073709549808
volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
received:1188 sent:19884428 out-of-sync:0 pending:0 unacked:0
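(An aside, not something DRBD prints: both odd counters above sit just below a power of two, which is the signature of wrapped unsigned arithmetic. A quick check in Go, assuming the counters are plain 64-bit values:)

// Quick check of the two suspicious counters from the status output above:
// both are within a couple of thousand of 2^64 and 2^63 respectively,
// i.e. they look like wrapped 64-bit values rather than real statistics.
package main

import "fmt"

func main() {
	const rsInFlight uint64 = 18446744073709549808
	const outOfSync uint64 = 9223372036854774304

	fmt.Println((^uint64(0) - rsInFlight) + 1) // 1808 below 2^64
	fmt.Println(uint64(1)<<63 - outOfSync)     // 1504 below 2^63
}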
Okay, today one of these resources got a sudden, severe filesystem corruption on the primary.
On the other hand, the secondaries (the ones showing 8 ZiB out-of-sync) were still mountable after I disconnected the corrupted primary. No idea how current the secondaries' data was, but drbdtop still showed them as connected and 8 ZiB out-of-sync.
Here's some more info. Dmesg shows some suspicious-looking log messages, such as:
1) FIXME drbd_s_vm-117-s[2830] op clear, bitmap locked for 'receive bitmap' by drbd_r_vm-117-s[5038]
2) Wrong magic value 0xffff0007 in protocol version 114
3) peer request with dagtag 399201392 not found got_peer_ack [drbd] failed
4) Rejecting concurrent remote state change 2226202936 because of state change 2923939731 Ignoring P_TWOPC_ABORT packet 2226202936.
5) drbd_r_vm-117-s[5038] going to 'detect_finished_resyncs()' but bitmap already locked for 'write from resync_finished' by drbd_w_vm-117-s[2812] md_sync_timer expired! Worker calls drbd_md_sync().
6) incompatible discard-my-data settings conn( Connecting -> Disconnecting ) error receiving P_PROTOCOL, e: -5 l: 7!
Two of the four nodes have DRBD 9.0.15-1 and two have 9.0.16-1. All of them report transport API 16:
mox-a
version: 9.0.15-1 (api:2/proto:86-114)
GIT-hash: c46d27900f471ea0f5ba587592439a9ddde1d08b build by root@mox-a, 2018-10-28 03:08:58
Transports (api:16): tcp (9.0.15-1)
mox-b
version: 9.0.15-1 (api:2/proto:86-114)
GIT-hash: c46d27900f471ea0f5ba587592439a9ddde1d08b build by root@mox-b, 2018-10-10 17:50:25
Transports (api:16): tcp (9.0.15-1)
mox-c
version: 9.0.16-1 (api:2/proto:86-114)
GIT-hash: ab9777dfeaf9d619acc9a5201bfcae8103e9529c build by root@mox-c, 2018-10-28 05:45:05
Transports (api:16): tcp (9.0.16-1)
mox-d
version: 9.0.16-1 (api:2/proto:86-114)
GIT-hash: ab9777dfeaf9d619acc9a5201bfcae8103e9529c build by root@mox-d, 2018-10-29 00:22:23
Transports (api:16): tcp (9.0.16-1)
Running Proxmox (5.2-2), as you'd guess from the host names. The DRBD resources are managed by LINSTOR.
Just witnessed a resync (after an invalidate) go steadily from 100% out-of-sync to 0% (after several automatic disconnects and reconnects). Immediately after reaching 0%, it went negative.
@elonen did you report this problem to the drbd-user mailing list? Do you have a link to that?
I have a similar problem with some resources: everything is UpToDate, but out-of-sync is too high. drbdtop shows:
┌◉ (LIVE UPDATING) Resource List───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Name | Role | Disks | Peer Disks | Connections | Overall | Quorum │
│ myres-0 | Secondary | ✓ | ✗ (18) | ✓ | ✗ (18) | ✓
drbdsetup status --verbose --statistics myres-0 shows:
myres-0 node-id:3 role:Secondary suspended:no
write-ordering:flush
volume:0 minor:1004 disk:UpToDate quorum:yes
size:97656408 read:1226333792 written:268424740 al-writes:2907721 bm-writes:0 upper-pending:0 lower-pending:0 al-suspended:no blocked:no
m7c7 node-id:0 connection:Connected role:Secondary congested:no ap-in-flight:0 rs-in-flight:0
volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
received:104533176 sent:0 out-of-sync:0 pending:0 unacked:0
m8c27 node-id:2 connection:Connected role:Primary congested:no ap-in-flight:0 rs-in-flight:0
volume:0 replication:Established peer-disk:Diskless peer-client:yes resync-suspended:no
received:491053796 sent:257236 out-of-sync:94307148 pending:0 unacked:0
m8c30 node-id:1 connection:Connected role:Secondary congested:no ap-in-flight:0 rs-in-flight:0
volume:0 replication:Established peer-disk:Diskless peer-client:yes resync-suspended:no
received:825266580 sent:360933148 out-of-sync:0 pending:0 unacked:0
What is wrong?
In the output above, you have 94307148 out of sync blocks on node-id:2 for myres-0.
If you happen to see a negative out-of-sync value again, it would be awesome to get the output of drbdsetup events2 --timestamps --statistics --now.
@haySwim sorry, I was a little confused. In my case I have a different problem: out-of-sync is always 94307148 and never drops to a lower value, even though all resources are UpToDate.
As far as I can tell based on the output you posted, the underlying information that's being reported to drbdtop has that value of 94307148, so that's the state of the world as far as drbdtop can know.
It's strange that it's reporting that while the resource is connected, the resync isn't suspended, and the number of out-of-sync blocks isn't changing. @rck, any ideas?
We are working on it. We have seen that in non-drbdtop scenarios, so it is not a drbdtop bug per se.
@rck, can I provide you with some useful information about these kinds of volumes?
I can confirm that this is basically an integer wrap-around bug in the DRBD kernel module (the result of a subtraction r = a - b with b > a, to be more precise). It is normally triggered in setups with 3+ nodes when there are at least three different data generations, Gen1 ahead of Gen2 ahead of Gen3, and Gen3 is resyncing to Gen2; Gen1 then joins afterwards, updating the sync statistics on Gen2 and Gen3 but not some of the resync counters, which eventually causes incorrect statistics to be calculated.
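To make the arithmetic concrete, here is a minimal sketch of that failure mode in plain Go (not the kernel code, and the counter values are invented):

// Minimal sketch of the failure mode described above: an unsigned
// subtraction r = a - b with b > a wraps modulo 2^64 instead of going
// negative. The counter values here are invented.
package main

import "fmt"

func main() {
	var done, pending uint64 = 1000, 2808 // hypothetical resync counters

	r := done - pending // wraps: 2^64 - 1808
	fmt.Println(r)      // 18446744073709549808, the same shape as rs-in-flight above

	// The same bits reinterpreted as a signed value are a small negative
	// number; any percentage or progress math built on either view then
	// produces the absurd figures seen in the status output.
	fmt.Println(int64(r)) // -1808
}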
What's new in the case that @elonen reported is that it was apparently seen on a single-node resource, which is very surprising since a single-node resource does not even have any peers to start a resync. Could that be a leftover from a time when the resource still had peers?
The bug in DRBD has been tracked for a while but has not been completely analyzed, and as of now, no fix is available yet.
I can provide some more input on this. This is happening to me extensively at the moment. The scenario is reproducible in a 3-node environment: Node 1 = Primary, Node 2 = Secondary (UpToDate), Node 3 = Secondary (Inconsistent).
Protocol version A, on-congestion=pull-ahead.
What seems to happen is that in this scenario, Node 1 is syncing to Node 3. Then congestion happens and Node 1 gets ahead of Node 2. This triggers a resync, and that consistently causes the error to occur.
Has there been any work towards correcting it? I'm on 9.0.21-1.
Thanks for the report. That does indeed seem to fall in the same category of error, probably incorrectly updated resync statistics counters at some point. We'll take a look at it.
I have some resources for which drbdtop shows a negative percentage in the progress bar:
It also shows some warnings/errors on one node in the main view, but "all ok" on another.
Detailed status claims both are UpToDate.
Verify shows larger out-of-sync counts than the size of the resource, though, so maybe this is actually more of a DRBD 9 issue than a drbdtop UI problem?