LINBIT / drbd

LINBIT DRBD kernel module
https://docs.linbit.com/docs/users-guide-9.0/
GNU General Public License v2.0

UpToDate and out-of-sync inconsistence #30

Closed shutsutsumi closed 2 years ago

shutsutsumi commented 2 years ago

Hi all,

On drbd-9.0.23 (CentOS 8.2), we are seeing an inconsistency in node status: although the disk state of both DRBD devices is UpToDate, a nonzero out-of-sync value remains.

#drbdsetup status --verbose --statistics
 r008 node-id:0 role:Primary suspended:no
 write-ordering:flush
 volume:0 minor:8 disk:UpToDate quorum:yes
 size:109757124 read:3065 written:105640 al-writes:27 bm-writes:0
 upper-pending:0 lower-pending:0 al-suspended:no blocked:no
 fs100 node-id:1 connection:Connected role:Secondary congested:no
 ap-in-flight:0 rs-in-flight:0
 volume:0 replication:Established peer-disk:UpToDate
 resync-suspended:no
 received:0 sent:105028 out-of-sync:12 pending:0 unacked:0
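
To check for this condition on many systems, the output above can be scanned automatically. The following is only a minimal sketch: it assumes the `key:value` layout shown in the output above, with `peer-disk` appearing before `out-of-sync` for each peer volume, and the function name `uptodate_but_out_of_sync` is hypothetical.

```python
import re

# Sample text copied from the drbdsetup output in this report (shortened).
STATUS = """\
r008 node-id:0 role:Primary suspended:no
 volume:0 minor:8 disk:UpToDate quorum:yes
 fs100 node-id:1 connection:Connected role:Secondary congested:no
 volume:0 replication:Established peer-disk:UpToDate
 received:0 sent:105028 out-of-sync:12 pending:0 unacked:0
"""


def uptodate_but_out_of_sync(status_text):
    """Return the out-of-sync values of peer volumes reported as UpToDate.

    Assumes each peer volume lists peer-disk before out-of-sync,
    as in the sample output.
    """
    suspects = []
    peer_disk = None
    for key, value in re.findall(r"([\w-]+):(\S+)", status_text):
        if key == "peer-disk":
            peer_disk = value
        elif key == "out-of-sync":
            if peer_disk == "UpToDate" and int(value) > 0:
                suspects.append(int(value))
            peer_disk = None  # reset for the next peer volume
    return suspects


print(uptodate_but_out_of_sync(STATUS))  # [12]
```

An empty list means no peer volume is both UpToDate and out of sync in the scanned text.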

I see this issue on several systems running drbd-9.0, and I made a stable reproducer (attached for reference: drbd-out-of-sync-mountumount.zip). It repeatedly unmounts and mounts the DRBD disk on the primary node while disconnecting and reconnecting on the secondary node at the same time.

When I run this reproducer on drbd-8.9 for comparison, I never see the issue. On the other hand, I see it not only on drbd-9.0.23 but also on the latest 9.1.

Am I doing something wrong, or do we have a bug in the out-of-sync calculation and the state transition to UpToDate? Any feedback would be greatly appreciated.

shutsutsumi commented 2 years ago

Hi all,

I am continuing to investigate this and have found one thing.

This problem only occurs when the state transitions from "Outdated" to "UpToDate". It never occurs when the state transitions from "Inconsistent" to "UpToDate".

I presume that, in the transition from "Outdated" to "UpToDate", the state change and the asynchronous out-of-sync calculation are not mutually exclusive (that is, they are not serialized against each other).

Is this intended behavior, or is it a bug? If it's a bug, where in the source code should I add the mutual exclusion?

I'm worried because I don't know how to fix it. I would be grateful if anyone could tell me.

rck commented 2 years ago

Sorry, but 9.0.23 is very very very outdated, I don't think anybody is really up to reproducing bugs in outdated versions that hopefully got fixed in the ~10 releases since then.

shutsutsumi commented 2 years ago

I have confirmed that this problem can be reproduced with the latest drbd-9.1.16.

On both the primary and secondary DRBD devices, the disk state is UpToDate, but there are still out-of-sync values.

Any feedback would be greatly appreciated.

rck commented 2 years ago

> I have confirmed that this problem can be reproduced with the latest drbd-9.1.16.

There is no such version. Are you talking about 9.1.6?

Quite frankly, last time I only read up to 9.0.23 and stopped. Now, reading the rest of the message: that can be fine. UpToDate is based on UUID comparison; out-of-sync means there are bits set in the bitmap. You could run a verify and check what it has to say.

shutsutsumi commented 2 years ago

> There is no such version, are you talking about 9.1.6?

9.1.6 is correct.

> UpToDate is based on UUID comparison

Even though both transitions end in the same UpToDate state, does whether the UUID is updated depend on the state being transitioned from?

On Outdated -> UpToDate, the UUID is not updated. On Inconsistent -> UpToDate, the UUID is updated.

Is this the intended behavior?

shutsutsumi commented 2 years ago

I will add to the above comment.

If the UUID is updated, the synchronization process runs; if the UUID is not updated, it does not.

Therefore, on "Outdated -> UpToDate", the synchronization process does not run. Is it expected that no synchronization runs at this state transition?

I'm worried because I don't know how to fix it. I would be grateful if anyone could tell me.

JoelColledge commented 2 years ago

You can find a description of how DRBD UUIDs work here. I hope that answers your questions.

As rck mentioned, the out-of-sync counter shows how many bits are set in the bitmap. DRBD is cautious in clearing bitmap bits, so it is possible for them to still be set even when the data is actually in sync. Whether a resync occurs is determined by the UUID comparison.
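
To make the distinction concrete, here is a deliberately simplified toy model. It is not DRBD's implementation: the class `ToyPeerView` and its fields are hypothetical, and the real UUID comparison involves more than a single equality check. It only illustrates that the resync decision and the out-of-sync counter come from two independent pieces of state.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPeerView:
    """Toy stand-in for what one node tracks about a peer (hypothetical)."""

    local_uuid: int
    peer_uuid: int
    # One flag per block; True means the block is marked out of sync.
    bitmap: list = field(default_factory=list)

    def needs_resync(self):
        # Toy rule: only a UUID mismatch triggers a resync.
        return self.local_uuid != self.peer_uuid

    def out_of_sync_count(self):
        # The counter is simply the number of set bits.
        return sum(self.bitmap)


view = ToyPeerView(
    local_uuid=0xABCD,
    peer_uuid=0xABCD,
    bitmap=[False, True, True, False, True],
)
print(view.needs_resync())       # False
print(view.out_of_sync_count())  # 3
```

In this model the UUIDs match, so no resync is started, yet three bits remain set. Whether the underlying data actually differs is a separate question that only a data comparison can answer.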

Please compare the data with verify or by reading from each node separately. If this shows that there is really inconsistent data, then we are certainly interested in investigating further.
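
For the "read from each node separately" approach, one option is to compute per-block digests on each node and compare the two lists. The sketch below is only an illustration: the helper names `block_digests` and `differing_blocks` are hypothetical, the 4096-byte block size is an assumption, and how to get a readable, quiescent view of the data on each node is left out. The demo uses temporary files in place of real devices.

```python
import hashlib
import os
import tempfile


def block_digests(path, block_size=4096):
    """Return one SHA-256 hex digest per block of the file or device."""
    digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            digests.append(hashlib.sha256(chunk).hexdigest())
    return digests


def differing_blocks(digests_a, digests_b):
    """Return indices of blocks whose digests differ or exist on one side only."""
    diffs = []
    for i in range(max(len(digests_a), len(digests_b))):
        a = digests_a[i] if i < len(digests_a) else None
        b = digests_b[i] if i < len(digests_b) else None
        if a != b:
            diffs.append(i)
    return diffs


def demo():
    """Compare two 4-block images that differ by one byte in block 2."""
    with tempfile.TemporaryDirectory() as tmp:
        path_a = os.path.join(tmp, "node_a.img")
        path_b = os.path.join(tmp, "node_b.img")
        data = bytearray(4096 * 4)
        with open(path_a, "wb") as f:
            f.write(data)
        data[4096 * 2] = 0xFF  # change one byte inside block 2
        with open(path_b, "wb") as f:
            f.write(data)
        return differing_blocks(block_digests(path_a), block_digests(path_b))


print(demo())  # [2]
```

In practice `block_digests` would run once per node and only the digest lists would be exchanged and compared.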

shutsutsumi commented 2 years ago

Thanks for the reply. I understand the explanation. However, I still have doubts about the following two points.

  1. This does not occur with DRBD 8.x.

    Was this behavior introduced by design in DRBD 9.x, or is it the same as in DRBD 8.x?

  2. Running verify in this case shows that there are out-of-sync blocks.

    When this phenomenon occurs, running the drbdadm verify command shows that out-of-sync blocks remain. It seems strange that the disk status is nevertheless UpToDate.

Can anyone give us an opinion on the above two points?