389ds / 389-ds-base

The enterprise-class Open Source LDAP server for Linux
https://www.port389.org/
Other
210 stars 90 forks source link

IPA server dirsrv RUV entry data excluded from replication #705

Closed 389-ds-bot closed 4 years ago

389-ds-bot commented 4 years ago

Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/47368


Ticket was cloned from Red Hat Bugzilla (product Fedora): Bug 966562

Description of problem:

This may affect 389 Directory Server in general but, I specifically saw this
behavior in my IPA test env.

After a good bit of troubleshooting, and getting dev involved, Rich found that
the RUV data was excluded from replication (at least in some cases):

From Rich:

However, I do see it in the changelog on f18-2:

dbid: 519b96d3000000030000
     replgen: 1369151185 Tue May 21 10:46:25 2013
     csn: 519b96d3000000030000
     uniqueid: c4a8008c-c15d11e2-81caa921-5c226755
     dn: fqdn=f18-2.testrelm.com,cn=computers,cn=accounts,dc=testrelm,dc=com
     operation: modify
         krbLastSuccessfulAuth: 20130521154625Z
         modifiersname: cn=Directory Manager
         modifytimestamp: 20130521154625Z
         entryusn: 1068

All of these attributes are excluded from replication, which means it
will be in the local RUV but not in any other RUVs.

Version-Release number of selected component (if applicable):
389-ds-base-1.3.0.6-1.fc18.x86_64

How reproducible:
very

Steps to Reproduce:
1.  Setup a few IPA servers
2.  use ldapsearch (like below) to compare the RUV data for each server on each
server

for RUV in $MASTER $REPLICA1 $REPLICA2 $REPLICA3 $REPLICA4; do
    for SERVER in $MASTER $REPLICA1 $REPLICA2 $REPLICA3 $REPLICA4; do
        RUVCHK=$(ldapsearch -o ldif-wrap=no -h $SERVER \
        -xLLL -D "$ROOTDN" -w $ROOTDNPWD -b $BASEDN \
'(&(objectclass=nstombstone)(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff))'
\
        nsds50ruv|grep $RUV);   echo "$(echo $SERVER|cut -f1 -d.): $RUVCHK"
    done
    echo
done

Actual results:

You see some differences in the last field (MaxCSN):

f18-1: nsds50ruv: {replica 4 ldap://f18-1.testrelm.com:389}
519a3a2b000000040000 519a8ec5000000040000
f18-2: nsds50ruv: {replica 4 ldap://f18-1.testrelm.com:389}
519a3a2b000000040000 519a8ec5000000040000
f18-3: nsds50ruv: {replica 4 ldap://f18-1.testrelm.com:389}
519a3a2b000000040000 519a8ec5000000040000
f18-4: nsds50ruv: {replica 4 ldap://f18-1.testrelm.com:389}
519a3a2b000000040000 519a8ec5000000040000
f18-5: nsds50ruv: {replica 4 ldap://f18-1.testrelm.com:389}
519a3a2b000000040000 519a8ec5000000040000

f18-1: nsds50ruv: {replica 3 ldap://f18-2.testrelm.com:389}
519a39ea000000030000 519a7c32000400030000
f18-2: nsds50ruv: {replica 3 ldap://f18-2.testrelm.com:389}
519a39ea000000030000 519a81b2000200030000
f18-3: nsds50ruv: {replica 3 ldap://f18-2.testrelm.com:389}
519a39ea000000030000 519a7c32000400030000
f18-4: nsds50ruv: {replica 3 ldap://f18-2.testrelm.com:389}
519a39ea000000030000 519a7c32000400030000
f18-5: nsds50ruv: {replica 3 ldap://f18-2.testrelm.com:389}
519a39ea000000030000 519a7c32000400030000

f18-1: nsds50ruv: {replica 5 ldap://f18-3.testrelm.com:389}
519a3cfc000000050000 519a447b000500050000
f18-2: nsds50ruv: {replica 5 ldap://f18-3.testrelm.com:389}
519a3cfc000000050000 519a447b000500050000
f18-3: nsds50ruv: {replica 5 ldap://f18-3.testrelm.com:389}
519a3cfc000000050000 519a8138000100050000
f18-4: nsds50ruv: {replica 5 ldap://f18-3.testrelm.com:389}
519a3cfc000000050000 519a8138000100050000
f18-5: nsds50ruv: {replica 5 ldap://f18-3.testrelm.com:389}
519a3cfc000000050000 519a8138000100050000

f18-1: nsds50ruv: {replica 6 ldap://f18-4.testrelm.com:389}
519a40fc000000060000 519a4477000600060000
f18-2: nsds50ruv: {replica 6 ldap://f18-4.testrelm.com:389}
519a40fc000000060000 519a4477000600060000
f18-3: nsds50ruv: {replica 6 ldap://f18-4.testrelm.com:389}
519a40fc000000060000 519a4477000600060000
f18-4: nsds50ruv: {replica 6 ldap://f18-4.testrelm.com:389}
519a40fc000000060000 519a813b000300060000
f18-5: nsds50ruv: {replica 6 ldap://f18-4.testrelm.com:389}
519a40fc000000060000 519a813b000300060000

f18-1: nsds50ruv: {replica 7 ldap://f18-5.testrelm.com:389}
519a43a1000000070000 519a4483000000070000
f18-2: nsds50ruv: {replica 7 ldap://f18-5.testrelm.com:389}
519a43a1000000070000 519a4483000000070000
f18-3: nsds50ruv: {replica 7 ldap://f18-5.testrelm.com:389}
519a43a1000000070000 519a4483000000070000
f18-4: nsds50ruv: {replica 7 ldap://f18-5.testrelm.com:389}
519a43a1000000070000 519a4483000000070000
f18-5: nsds50ruv: {replica 7 ldap://f18-5.testrelm.com:389}
519a43a1000000070000 519a813e000000070000

Expected results:

Entries match or some other definitive way to confirm that the directory
servers are in sync.

Additional info:
389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2013-10-15 01:45:48

Simplified steps to reproduce the problem:

[1] Setup a master and a replica [2] Setup replication [3] Add these attributes to the replication agreement:

nsDS5ReplicatedAttributeList: (objectclass=*) $ EXCLUDE sn nsds5ReplicaStripAttrs: modifiersname modifytimestamp

[4] Create a new user. [5] modify the "sn" attribute in the user. [6] Look at the database ruvs: the consumer is missing the last update

master: nsds50ruv: {replica 1 ldap://localhost.localdomain:389} 52583d80000000010000 525c4644000000010000

replica: nsds50ruv: {replica 1 ldap://localhost.localdomain:389} 52583d80000000010000 525c4532000000010000

[7] However, if you look at the ruv in the replication agreement it is the same as the replica:

nsds50ruv: {replica 1 ldap://localhost.localdomain:389} 52583d80000000010000 525c4532000000010000

So maybe, the solution is to not look at the database ruv's, but at the agreement ruv's? This of course requires more work, as you would have to check each agreement on each server, instead of the server as a whole. Maybe revise repl-monitor.pl, or create a new task?

Still investigating...

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2013-10-24 03:38:57

Basic overview:

Replica 1 dn: cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config nsds50ruv: {replica 111 ldap://localhost.localdomain:389} 5267db120000006f0000 52683d810000006f0000 nsds50ruv: {replica 222 ldap://localhost.localdomain:22222} 5267e90e000000de0000 52683d89000100de0000 nsds5AgmtMaxCSN: cn=to replica2;localhost.localdomain;22222;111;52683d810000006f0000

Replica 2 dn: cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config nsds50ruv: {replica 222 ldap://localhost.localdomain:22222} 5267e90e000000de0000 52683d89000100de0000 nsds50ruv: {replica 111 ldap://localhost.localdomain:389} 5267db120000006f0000 52683d810000006f0000 nsds5AgmtMaxCSN: cn=to replica1;localhost.localdomain;389;222;52683d89000100de0000

Check the agmt maxcsn against the replica's RUV(supplier element). So on Replica 1, look at the agmt maxcsn for Replica 2(maxcsn is 52683d810000006f0000)> Compare this on Replica 2 with its local ruv element for replica 111(replica 1 rid).

Attaching patch next...

389-ds-bot commented 4 years ago

Comment from lkrispen (@elkris) at 2013-10-24 18:23:04

It looks good to me, just one concern.

Checking all replication agreements for any update could be some overhead and might not always be needed (eg no fractional agreements) or wanted 8no requirement to fine grained repl monitoring) - so could this additional ruv be optional ?

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2013-10-24 19:09:01

Replying to [comment:6 elkris]:

It looks good to me, just one concern.

Checking all replication agreements for any update could be some overhead and might not always be needed (eg no fractional agreements) or wanted 8no requirement to fine grained repl monitoring) - so could this additional ruv be optional ?

Originally I just did this for fractional agreements, but then I thought it would also be useful for non-fractional agreements. I could make it configurable: all agmts or just fractional agmts.

Looping over all the agreements I don't think is that bad - it's basically just a linked list so I wouldn't expect too much overhead. But, yes, it was also a concern of mine. So, today I plan on doing performance testing to see how much of an impact it has when there are many agreements.

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2013-10-24 20:42:57

1st round of performance testing (precise etime):

11 fractional agmts(4 -6 stripped attributes per agmt). Simply performing a mod on the same entry.

Master: average: 0.056 Patched server: average: 0.055

There is no significant change between the two versions.

Going to test with "callgrind" next...

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2013-10-25 22:13:45

callgrind results

write_changelog_and_ruv() - sysTime 0.018

    -> agmt_update_maxcsn() - sysTime 0.0491

    -> cl5WriteOperationTxn() - sysTime 24.272

0.0491 is insignificant.

Since there is basically no impact, I will not make it configurable for fractional vs. non-fractional agmts. So it will be global, and this will also make rewriting repl-monitor.pl easier and more consistent.

389-ds-bot commented 4 years ago

Comment from rmeggins (@richm) at 2013-10-26 05:01:18


    2750                    sprintf(buf, "%s;%s;%d;unavailable", slapi_rdn_get_rdn(agmt->rdn), 
    2751                            agmt->hostname, agmt->port); 
    2752                    agmt->maxcsn = slapi_ch_strdup(buf);

Just do this instead:

    agmt->maxcsn = slapi_ch_smprintf("%s;%s;%d;unavailable", slapi_rdn_get_rdn(agmt->rdn), 
                                 agmt->hostname, agmt->port);

same here


    2756                    sprintf(buf, "%s;%s;%d;%d;%s",slapi_rdn_get_rdn(agmt->rdn), 
    2757                            agmt->hostname, agmt->port, rid, maxcsn); 
    2758                    agmt->maxcsn = slapi_ch_strdup(buf);

Then you can get rid of char buf[BUFSIZ];

The other places in the code where you use sprintf e.g.

         sprintf(buf,"%s;%s;%d;%d;",slapi_rdn_get_rdn(ra->rdn),ra->hostname, ra->port, rid);

Please use PR_snprintf instead, which gives you protection against buffer overruns (yes, it is highly unlikely that the rdn+hostname+port+rid will be > 8192 bytes long, but . . .)


    2827            val.bv_val = agmt->maxcsn; 
    2828            PR_Unlock(agmt->lock); 
    2829            val.bv_len = strlen(val.bv_val); 
    2830            slapi_mod_add_value(smod, &val);

If the lock is to protect agmt->maxcsn from being freed or overwritten, then you will need to unlock after the value has been copied in slapi_mod_add_value.


    2911                        if(strstr(maxcsns[i], buf) || strstr(maxcsns[i], unavail_buf)){ 
    2912                            ra->maxcsn = slapi_ch_strdup(maxcsns[i]); 
    2913                        }

at this point in the code, is it possible that ra->maxcsn will already have been set?

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2013-10-28 18:59:46

Replying to [comment:10 richm]:


  2750                    sprintf(buf, "%s;%s;%d;unavailable", slapi_rdn_get_rdn(agmt->rdn), 
  2751                            agmt->hostname, agmt->port); 
  2752                    agmt->maxcsn = slapi_ch_strdup(buf);

Just do this instead:

    agmt->maxcsn = slapi_ch_smprintf("%s;%s;%d;unavailable", slapi_rdn_get_rdn(agmt->rdn), 
                               agmt->hostname, agmt->port);

same here


  2756                    sprintf(buf, "%s;%s;%d;%d;%s",slapi_rdn_get_rdn(agmt->rdn), 
  2757                            agmt->hostname, agmt->port, rid, maxcsn); 
  2758                    agmt->maxcsn = slapi_ch_strdup(buf);

Then you can get rid of char buf[BUFSIZ];

The other places in the code where you use sprintf e.g.

         sprintf(buf,"%s;%s;%d;%d;",slapi_rdn_get_rdn(ra->rdn),ra->hostname, ra->port, rid);

Please use PR_snprintf instead, which gives you protection against buffer overruns (yes, it is highly unlikely that the rdn+hostname+port+rid will be > 8192 bytes long, but . . .)


  2827            val.bv_val = agmt->maxcsn; 
  2828            PR_Unlock(agmt->lock); 
  2829            val.bv_len = strlen(val.bv_val); 
  2830            slapi_mod_add_value(smod, &val);

If the lock is to protect agmt->maxcsn from being freed or overwritten, then you will need to unlock after the value has been copied in slapi_mod_add_value.


  2911                        if(strstr(maxcsns[i], buf) || strstr(maxcsns[i], unavail_buf)){ 
  2912                            ra->maxcsn = slapi_ch_strdup(maxcsns[i]); 
  2913                        }

at this point in the code, is it possible that ra->maxcsn will already have been set?

Thanks Rich, I'll look into this. As I was working on repl-monitor.pl I realized I had to change part of the server fix too. So, I'm working on a new patch for the server and repl-monitor now...

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2013-10-30 00:39:30

text report output text-report.txt

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2013-10-30 00:39:45

html report report.html

389-ds-bot commented 4 years ago

Comment from rmeggins (@richm) at 2013-10-31 00:26:16

I would prefer to leave repl-monitor.pl unbranded, not designated as "389 Directory Server Replication Monitor".

Will repl-monitor.pl work as before, if the server doesn't have nsds5AgmtMaxCSN? It has to work in a mixed topology where nsds5AgmtMaxCSN isn't present. Of course, it won't "work" in that it will give results that are not entirely accurate, but it should at least give a best effort, no worse than the way it works now.

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2013-10-31 00:49:29

Replying to [comment:12 richm]:

I would prefer to leave repl-monitor.pl unbranded, not designated as "389 Directory Server Replication Monitor".

No problem.

Will repl-monitor.pl work as before, if the server doesn't have nsds5AgmtMaxCSN? It has to work in a mixed topology where nsds5AgmtMaxCSN isn't present. Of course, it won't "work" in that it will give results that are not entirely accurate, but it should at least give a best effort, no worse than the way it works now.

It will not work with older versions, but it should not take much to make it work across the board. The only real change in the script is how the supplier maxcsn is located.

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2013-10-31 21:37:47

Revision 2 0001-Ticket-47368-IPA-server-dirsrv-RUV-entry-data-exclud.patch

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2013-10-31 21:42:32

New patch attached. Removed "389" branding, and script is now backwards compatible with older versions of DS that do not use "nsds5AgmtMaxCSN".

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2013-10-31 23:01:58

git merge ticket47368 Updating 4c1dfaf..c30c897 Fast-forward ldap/admin/src/scripts/repl-monitor.pl.in | 271 ++++++++---- ldap/servers/plugins/replication/repl5.h | 13 +- ldap/servers/plugins/replication/repl5_agmt.c | 472 +++++++++++++++++++- ldap/servers/plugins/replication/repl5_agmtlist.c | 8 + ldap/servers/plugins/replication/repl5_plugins.c | 7 + ldap/servers/plugins/replication/repl5_protocol.c | 2 +- ldap/servers/plugins/replication/repl5_replica.c | 43 ++- .../plugins/replication/repl5_replica_config.c | 58 ++-- ldap/servers/plugins/replication/repl5_ruv.c | 1 + ldap/servers/plugins/replication/repl_globals.c | 1 + ldap/servers/slapd/entry.c | 6 +- ldap/servers/slapd/rdn.c | 10 + ldap/servers/slapd/slapi-plugin.h | 10 +- 13 files changed, 752 insertions(+), 150 deletions(-)

git push origin master Counting objects: 45, done. Delta compression using up to 4 threads. Compressing objects: 100% (22/22), done. Writing objects: 100% (23/23), 9.12 KiB, done. Total 23 (delta 20), reused 0 (delta 0) To ssh://git.fedorahosted.org/git/389/ds.git 4c1dfaf..c30c897 master -> master

commit c30c897acb80ae00ecdd3063145ddd716caa96f1 Author: Mark Reynolds mreynolds389@redhat.com Date: Thu Oct 31 12:36:14 2013 -0400

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2013-10-31 23:26:37

Summary:

To manually check if a consumer is up to date, you compare the agreements max csn, located in the local ruv tombstone entry, to the consumers RUV element for the suppliers rid.

Here is the format of the new attribute/value:

nsds5AgmtMaxCSN: <repl area>:<agmt name>:<consumer host>:<port>:<consumer rid>:<maxcsn>

nsds5AgmtMaxCSN: dc=example,dc=com;to replica 1;localhost.localdomain;389;111;5270095c0000014d0000

Example: Check if Replica B is caught up with Replica A

Replica A (rid 222):

ldapsearch -h localhost -D "cn=directory manager" -W -b "dc=example,dc=com" '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' -xLLL  nsds50ruv nsds5AgmtMaxCSN
dn: cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
nsds50ruv: {replica 222 ldap://localhost.localdomain:389} 527283720000006f0000 52729c8b0000006f0000
nsds5AgmtMaxCSN: dc=example,dc=com;to replica B;localhost.localdomain;22222;65535;52729c8b0000006f0000
nsds5AgmtMaxCSN: dc=example,dc=com;to replica C;localhost.localdomain;22222;444;52729c999900006f0000

Look at the nsds5AgmtMaxCSN attributes, and locate the agreement for replica B, and get its max csn value. Then compare this max csn to the max csn on Replica B(nsds50ruv) for replica 222(the supplier):

Replica B (rid 65535):

ldapsearch -h localhost -D "cn=directory manager" -W -b "dc=example,dc=com" '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' -p 22222 -xLLL  nsds50ruv
dn: cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
nsds50ruv: {replica 222 ldap://localhost.localdomain:389} 527283720000006f0000 52729c8b0000006f0000
nsds50ruv: {replica 444 ldap://localhost.localdomain:389} 527283af0000006f0000 527299990000006f0000

So now each agreement can be monitored, and this is also the only way to correctly tell if fractional replication agreements are in sync.

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2013-10-31 23:52:49

Fixed jenkins errors 0001-Ticket-47368-Fix-Jenkins-errors.patch

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2013-10-31 23:54:15

Fixed Jenkins errors:

git merge ticket47368 Updating c30c897..eb6a462 Fast-forward ldap/servers/plugins/replication/repl5_agmt.c | 13 +++++-------- 1 files changed, 5 insertions(+), 8 deletions(-)

git push origin master Counting objects: 13, done. Delta compression using up to 4 threads. Compressing objects: 100% (7/7), done. Writing objects: 100% (7/7), 887 bytes, done. Total 7 (delta 5), reused 0 (delta 0) To ssh://git.fedorahosted.org/git/389/ds.git c30c897..eb6a462 master -> master

commit eb6a462b8ed8ddffa29e01178e39d77112dcfe32 Author: Mark Reynolds mreynolds389@redhat.com Date: Thu Oct 31 14:49:07 2013 -0400

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2013-11-04 20:55:24

Fix coverity errors 0001-Ticket-47368-Fix-coverity-issues.patch

389-ds-bot commented 4 years ago

Comment from rmeggins (@richm) at 2013-11-04 21:05:30

ack

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2013-11-04 21:21:01

git merge ticket47368 Updating 8db3a1a..8d398a5 Fast-forward ldap/servers/plugins/replication/repl5_agmt.c | 6 +++--- ldap/servers/plugins/replication/repl5_replica.c | 8 ++++++-- 2 files changed, 9 insertions(+), 5 deletions(-)

git push origin master Counting objects: 15, done. Delta compression using up to 4 threads. Compressing objects: 100% (8/8), done. Writing objects: 100% (8/8), 929 bytes, done. Total 8 (delta 6), reused 0 (delta 0) To ssh://git.fedorahosted.org/git/389/ds.git 8db3a1a..8d398a5 master -> master

commit 8d398a555fc1632d678223e6547a5a4df4082775 Author: Mark Reynolds mreynolds389@redhat.com Date: Mon Nov 4 09:52:11 2013 -0500

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2013-12-05 23:30:47

Fix memory leaks 0001-Ticket-47368-fix-memory-leaks.patch

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2013-12-06 01:17:15

git merge ticket47368 Updating bae797c..f67e638 Fast-forward ldap/servers/plugins/replication/repl5_agmt.c | 15 ++++++++++----- ldap/servers/plugins/replication/repl5_replica.c | 4 +--- 2 files changed, 11 insertions(+), 8 deletions(-)

git push origin master Counting objects: 15, done. Delta compression using up to 4 threads. Compressing objects: 100% (8/8), done. Writing objects: 100% (8/8), 1016 bytes, done. Total 8 (delta 6), reused 0 (delta 0) To ssh://git.fedorahosted.org/git/389/ds.git bae797c..f67e638 master -> master

commit f67e638bada2b2377081a19dc2546622eaa6b055 Author: Mark Reynolds mreynolds389@redhat.com Date: Thu Dec 5 12:28:23 2013 -0500

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2017-02-11 22:54:45

Metadata Update from @mreynolds389: