Comment from rmeggins (@richm) at 2015-08-28 20:53:50
What is the exact version of 389-ds-base you are using? What is the platform?
Comment from nhosoi (@nhosoi) at 2016-05-13 00:12:23
Per triage, push the target milestone to 1.3.6.
Comment from rmeggins (@richm) at 2017-02-11 22:49:42
Metadata Update from @richm:
Comment from mreynolds (@mreynolds389) at 2017-07-05 17:37:55
Metadata Update from @mreynolds389:
Comment from mreynolds (@mreynolds389) at 2019-08-23 20:38:11
Metadata Update from @mreynolds389:
Comment from vashirov (@vashirov) at 2020-03-18 16:27:58
Metadata Update from @vashirov:
This issue needs some investigation.
Per triage meeting, closing as duplicate of #1317.
Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/48261
This is regarding data loss in a replication topology:
I have noticed data loss during replication between the supplier/hub and the consumer when the master/hub changelog db file or replica entry has been deleted for some reason.
Please note that the hub and the consumer are imported with some stale data, and the consumer should not need to be initialized when the new replication agreement is created. The test scenario is outlined below.
My topology looks like this:
(o=dev and c=test)    (o=dev and c=test)    (o=dev and c=test)    (o=dev and c=test)
Master ============== Hub1 ================ Hub2 ================ Consumer
I created two suffixes (o=dev and c=test) on all instances and set up replication for both, so both suffixes are replicated from the master all the way down to the consumer. Assume that 10 entries have now been added to each suffix (o=dev and c=test) in the topology (i.e. CSN1-CSN10) and that all instances are in sync at this point.
Reproducer steps:
1) Take a db2ldif export with the "-r" option from the hub/supplier for both suffixes (o=dev, o=test). Make sure the replica instance is stopped before performing this step.
2) Delete the changelog and recreate it on the Hub2 side.
3) Delete the supplier DN (cn=Replication manager,cn=config) and re-add it on the Hub2 side.
4) Delete the replication agreements between Hub2 and the consumer for both suffixes (o=dev and o=test).
5) Delete the replica and re-add it for both suffixes (o=dev, o=test) on the Hub2 side.
6) Add 5 more entries (CSN11-CSN15) to suffix o=dev ONLY on the supplier side and check that they are replicated to the supplier, Hub1, and Hub2. Suffix o=dev now has CSN1-CSN15 and suffix o=test has CSN1-CSN10 (as the 5 new entries were added only to the o=dev suffix).
7) Stop both the Hub2 and consumer slapd instances.
8) On the Hub2 side, import the data with the ldif2db command from the ldif file taken in step 1.
9) Start both slapd instances, Hub2 and consumer.
10) Delete the supplier DN (cn=Replication manager,cn=config) and re-add it on the consumer side.
11) Delete the replica and re-add it for both suffixes (o=dev, o=test) on the consumer side.
12) Add the replication agreements between Hub2 and the consumer for both suffixes (o=dev and o=test).
13) Stop both slapd instances, Hub2 and consumer.
14) On the consumer side, import the data from the same ldif file taken in step 1.
15) Start the Hub2 and consumer slapd instances.
16) Add another 5 entries to both suffixes (o=dev, o=test) on the master side (CSN16-CSN20).
Check the entries on the supplier, Hub2, and the consumer. You can now see that the entries added in step 6 (CSN11-CSN15) are missing on the consumer side for one suffix (o=dev).
Output on supplier/Hub1/Hub2:
o=dev: CSN1-CSN20 entries
o=test: CSN1-CSN15 entries
Output on consumer side:
o=dev: CSN1-CSN10 entries only (5 entries are missing here)
o=test: CSN1-CSN15 entries
I have verified this against the latest code base and see the same behavior. Any suggestions are welcome.
I found the root cause of the issue:
Root cause:
When we delete the changelog and the replica entry (steps 2 and 5), the changelog is removed and no longer holds the previous max CSN of the o=test suffix. After the ldif file is imported on the consumer (step 14), the consumer advertises an old max CSN that cannot be located in the hub changelog. From the code, I can see that this is a condition that can occur in a replication session: the consumer's maxCSN is locatable neither in the changelog database nor in the purge RUV list. When this condition occurs, the supplier assumes it happened because its database was initialized or reloaded. On that premise, it tries to determine the cursor value from its own RUV (maxCSN).
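To make the suspected failure mode concrete, here is a minimal toy model of the cursor selection described above. It is only a sketch of my reading of the behavior, not the server's code: integers stand in for CSNs, and the function name `changes_to_send` and the sample values are made up for illustration.

```python
from bisect import bisect_right


def changes_to_send(changelog, consumer_max_csn, supplier_max_csn):
    """Toy model: return the CSNs a supplier would replay to a consumer.

    `changelog` is a sorted list of integer CSNs still present in the
    supplier's changelog. Integers are stand-ins for real CSN values.
    """
    if consumer_max_csn in changelog:
        # Normal case: resume right after the consumer's last known change.
        start = bisect_right(changelog, consumer_max_csn)
    else:
        # Assumed fallback (as described in this report): the consumer's
        # max CSN cannot be located, so the cursor is derived from the
        # supplier's own max CSN instead.
        start = bisect_right(changelog, supplier_max_csn)
    return changelog[start:]


# Illustrative values: the changelog was recreated after CSN10, so it only
# holds CSN11-CSN15, while the reloaded consumer still advertises CSN10.
recreated_changelog = [11, 12, 13, 14, 15]
sent = changes_to_send(recreated_changelog, consumer_max_csn=10, supplier_max_csn=15)
print(sent)  # the changes between CSN10 and CSN15 are not replayed

# For comparison: a consumer whose max CSN is still in the changelog.
print(changes_to_send(recreated_changelog, consumer_max_csn=12, supplier_max_csn=15))
```

In this model the first call returns an empty list, which matches the symptom: the supplier starts from its own max CSN and never replays the changes the consumer is actually missing.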
I have identified the exact function where I can see the problem: ruv_get_min_or_max_csn() in repl5_ruv.c.
/*
*/
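For readers not familiar with what a function like ruv_get_min_or_max_csn() has to compute, below is a rough Python sketch of picking the smallest or largest CSN out of an RUV. It is not the code from repl5_ruv.c. It assumes a CSN is a 20-character hex string (8 digits timestamp, 4 sequence number, 4 replica ID, 4 sub-sequence number) compared field by field in that order, and it models the RUV as a plain dict; the names `parse_csn` and `min_or_max_csn` and the sample CSN values are hypothetical.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class CSN(NamedTuple):
    # Field order doubles as the comparison order (assumption, see above).
    timestamp: int
    seqnum: int
    rid: int
    subseqnum: int


def parse_csn(text: str) -> CSN:
    """Split an assumed 20-hex-digit CSN string into its four fields."""
    if len(text) != 20:
        raise ValueError("expected a 20-character CSN string: %r" % text)
    return CSN(
        int(text[0:8], 16),
        int(text[8:12], 16),
        int(text[12:16], 16),
        int(text[16:20], 16),
    )


def min_or_max_csn(
    ruv: Dict[int, Tuple[str, str]], want_max: bool, rid: Optional[int] = None
) -> Optional[CSN]:
    """Return the min or max CSN in a toy RUV.

    `ruv` maps replica ID -> (min CSN string, max CSN string).
    If `rid` is given, only that replica's element is considered.
    """
    index = 1 if want_max else 0
    candidates = [
        parse_csn(pair[index])
        for replica_id, pair in ruv.items()
        if rid is None or replica_id == rid
    ]
    if not candidates:
        return None  # no matching RUV element
    return max(candidates) if want_max else min(candidates)


# Made-up RUV with two replica elements.
ruv = {
    1: ("55e0c4b2000000010000", "55e0c4f0000000010000"),
    2: ("55e0c4c1000000020000", "55e0c4d5000300020000"),
}
print(min_or_max_csn(ruv, want_max=True))
print(min_or_max_csn(ruv, want_max=True, rid=2))
```

The point of the sketch is that the max CSN taken from the supplier's own RUV says nothing about what the consumer has already received, so using it as the replay starting point can skip changes.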
Please let me know your thoughts.