Closed 389-ds-bot closed 4 years ago
Comment from mreynolds (@mreynolds389) at 2013-10-15 01:45:48
Simplified steps to reproduce the problem:
[1] Setup a master and a replica [2] Setup replication [3] Add these attributes to the replication agreement:
nsDS5ReplicatedAttributeList: (objectclass=*) $ EXCLUDE sn nsds5ReplicaStripAttrs: modifiersname modifytimestamp
[4] Create a new user. [5] modify the "sn" attribute in the user. [6] Look at the database ruvs: the consumer is missing the last update
master: nsds50ruv: {replica 1 ldap://localhost.localdomain:389} 52583d80000000010000 525c4644000000010000
replica: nsds50ruv: {replica 1 ldap://localhost.localdomain:389} 52583d80000000010000 525c4532000000010000
[7] However, if you look at the ruv in the replication agreement it is the same as the replica:
nsds50ruv: {replica 1 ldap://localhost.localdomain:389} 52583d80000000010000 525c4532000000010000
So maybe, the solution is to not look at the database ruv's, but at the agreement ruv's? This of course requires more work, as you would have to check each agreement on each server, instead of the server as a whole. Maybe revise repl-monitor.pl, or create a new task?
Still investigating...
Comment from mreynolds (@mreynolds389) at 2013-10-24 03:38:57
Basic overview:
Update arrives
Write update to change log and ruv
Event thread will write the agreements maxcsn to the local ruv:
nsds5AgmtMaxCSN:
Replica 1 dn: cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config nsds50ruv: {replica 111 ldap://localhost.localdomain:389} 5267db120000006f0000 52683d810000006f0000 nsds50ruv: {replica 222 ldap://localhost.localdomain:22222} 5267e90e000000de0000 52683d89000100de0000 nsds5AgmtMaxCSN: cn=to replica2;localhost.localdomain;22222;111;52683d810000006f0000
Replica 2 dn: cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config nsds50ruv: {replica 222 ldap://localhost.localdomain:22222} 5267e90e000000de0000 52683d89000100de0000 nsds50ruv: {replica 111 ldap://localhost.localdomain:389} 5267db120000006f0000 52683d810000006f0000 nsds5AgmtMaxCSN: cn=to replica1;localhost.localdomain;389;222;52683d89000100de0000
Check the agmt maxcsn against the replica's RUV(supplier element). So on Replica 1, look at the agmt maxcsn for Replica 2(maxcsn is 52683d810000006f0000)> Compare this on Replica 2 with its local ruv element for replica 111(replica 1 rid).
Attaching patch next...
Comment from lkrispen (@elkris) at 2013-10-24 18:23:04
It looks good to me, just one concern.
Checking all replication agreements for any update could be some overhead and might not always be needed (eg no fractional agreements) or wanted 8no requirement to fine grained repl monitoring) - so could this additional ruv be optional ?
Comment from mreynolds (@mreynolds389) at 2013-10-24 19:09:01
Replying to [comment:6 elkris]:
It looks good to me, just one concern.
Checking all replication agreements for any update could be some overhead and might not always be needed (eg no fractional agreements) or wanted 8no requirement to fine grained repl monitoring) - so could this additional ruv be optional ?
Originally I just did this for fractional agreements, but then I thought it would also be useful for non-fractional agreements. I could make it configurable: all agmts or just fractional agmts.
Looping over all the agreements I don't think is that bad - it's basically just a linked list so I wouldn't expect too much overhead. But, yes, it was also a concern of mine. So, today I plan on doing performance testing to see how much of an impact it has when there are many agreements.
Comment from mreynolds (@mreynolds389) at 2013-10-24 20:42:57
1st round of performance testing (precise etime):
11 fractional agmts(4 -6 stripped attributes per agmt). Simply performing a mod on the same entry.
Master: average: 0.056 Patched server: average: 0.055
There is no significant change between the two versions.
Going to test with "callgrind" next...
Comment from mreynolds (@mreynolds389) at 2013-10-25 22:13:45
callgrind results
write_changelog_and_ruv() - sysTime 0.018
-> agmt_update_maxcsn() - sysTime 0.0491
-> cl5WriteOperationTxn() - sysTime 24.272
0.0491 is insignificant.
Since there is basically no impact, I will not make it configurable for fractional vs. non-fractional agmts. So it will be global, and this will also make rewriting repl-monitor.pl easier and more consistent.
Comment from rmeggins (@richm) at 2013-10-26 05:01:18
2750 sprintf(buf, "%s;%s;%d;unavailable", slapi_rdn_get_rdn(agmt->rdn),
2751 agmt->hostname, agmt->port);
2752 agmt->maxcsn = slapi_ch_strdup(buf);
Just do this instead:
agmt->maxcsn = slapi_ch_smprintf("%s;%s;%d;unavailable", slapi_rdn_get_rdn(agmt->rdn),
agmt->hostname, agmt->port);
same here
2756 sprintf(buf, "%s;%s;%d;%d;%s",slapi_rdn_get_rdn(agmt->rdn),
2757 agmt->hostname, agmt->port, rid, maxcsn);
2758 agmt->maxcsn = slapi_ch_strdup(buf);
Then you can get rid of char buf[BUFSIZ];
The other places in the code where you use sprintf e.g.
sprintf(buf,"%s;%s;%d;%d;",slapi_rdn_get_rdn(ra->rdn),ra->hostname, ra->port, rid);
Please use PR_snprintf instead, which gives you protection against buffer overruns (yes, it is highly unlikely that the rdn+hostname+port+rid will be > 8192 bytes long, but . . .)
2827 val.bv_val = agmt->maxcsn;
2828 PR_Unlock(agmt->lock);
2829 val.bv_len = strlen(val.bv_val);
2830 slapi_mod_add_value(smod, &val);
If the lock is to protect agmt->maxcsn from being freed or overwritten, then you will need to unlock after the value has been copied in slapi_mod_add_value.
2911 if(strstr(maxcsns[i], buf) || strstr(maxcsns[i], unavail_buf)){
2912 ra->maxcsn = slapi_ch_strdup(maxcsns[i]);
2913 }
at this point in the code, is it possible that ra->maxcsn will already have been set?
Comment from mreynolds (@mreynolds389) at 2013-10-28 18:59:46
Replying to [comment:10 richm]:
2750 sprintf(buf, "%s;%s;%d;unavailable", slapi_rdn_get_rdn(agmt->rdn), 2751 agmt->hostname, agmt->port); 2752 agmt->maxcsn = slapi_ch_strdup(buf);
Just do this instead:
agmt->maxcsn = slapi_ch_smprintf("%s;%s;%d;unavailable", slapi_rdn_get_rdn(agmt->rdn), agmt->hostname, agmt->port);
same here
2756 sprintf(buf, "%s;%s;%d;%d;%s",slapi_rdn_get_rdn(agmt->rdn), 2757 agmt->hostname, agmt->port, rid, maxcsn); 2758 agmt->maxcsn = slapi_ch_strdup(buf);
Then you can get rid of char buf[BUFSIZ];
The other places in the code where you use sprintf e.g.
sprintf(buf,"%s;%s;%d;%d;",slapi_rdn_get_rdn(ra->rdn),ra->hostname, ra->port, rid);
Please use PR_snprintf instead, which gives you protection against buffer overruns (yes, it is highly unlikely that the rdn+hostname+port+rid will be > 8192 bytes long, but . . .)
2827 val.bv_val = agmt->maxcsn; 2828 PR_Unlock(agmt->lock); 2829 val.bv_len = strlen(val.bv_val); 2830 slapi_mod_add_value(smod, &val);
If the lock is to protect agmt->maxcsn from being freed or overwritten, then you will need to unlock after the value has been copied in slapi_mod_add_value.
2911 if(strstr(maxcsns[i], buf) || strstr(maxcsns[i], unavail_buf)){ 2912 ra->maxcsn = slapi_ch_strdup(maxcsns[i]); 2913 }
at this point in the code, is it possible that ra->maxcsn will already have been set?
Thanks Rich, I'll look into this. As I was working on repl-monitor.pl I realized I had to change part of the server fix too. So, I'm working on a new patch for the server and repl-monitor now...
Comment from mreynolds (@mreynolds389) at 2013-10-30 00:39:30
text report output text-report.txt
Comment from mreynolds (@mreynolds389) at 2013-10-30 00:39:45
html report report.html
Comment from rmeggins (@richm) at 2013-10-31 00:26:16
I would prefer to leave repl-monitor.pl unbranded, not designated as "389 Directory Server Replication Monitor".
Will repl-monitor.pl work as before, if the server doesn't have nsds5AgmtMaxCSN? It has to work in a mixed topology where nsds5AgmtMaxCSN isn't present. Of course, it won't "work" in that it will give results that are not entirely accurate, but it should at least give a best effort, no worse than the way it works now.
Comment from mreynolds (@mreynolds389) at 2013-10-31 00:49:29
Replying to [comment:12 richm]:
I would prefer to leave repl-monitor.pl unbranded, not designated as "389 Directory Server Replication Monitor".
No problem.
Will repl-monitor.pl work as before, if the server doesn't have nsds5AgmtMaxCSN? It has to work in a mixed topology where nsds5AgmtMaxCSN isn't present. Of course, it won't "work" in that it will give results that are not entirely accurate, but it should at least give a best effort, no worse than the way it works now.
It will not work with older versions, but it should not take much to make it work across the board. The only real change in the script is how the supplier maxcsn is located.
Comment from mreynolds (@mreynolds389) at 2013-10-31 21:37:47
Revision 2 0001-Ticket-47368-IPA-server-dirsrv-RUV-entry-data-exclud.patch
Comment from mreynolds (@mreynolds389) at 2013-10-31 21:42:32
New patch attached. Removed "389" branding, and script is now backwards compatible with older versions of DS that do not use "nsds5AgmtMaxCSN".
Comment from mreynolds (@mreynolds389) at 2013-10-31 23:01:58
git merge ticket47368 Updating 4c1dfaf..c30c897 Fast-forward ldap/admin/src/scripts/repl-monitor.pl.in | 271 ++++++++---- ldap/servers/plugins/replication/repl5.h | 13 +- ldap/servers/plugins/replication/repl5_agmt.c | 472 +++++++++++++++++++- ldap/servers/plugins/replication/repl5_agmtlist.c | 8 + ldap/servers/plugins/replication/repl5_plugins.c | 7 + ldap/servers/plugins/replication/repl5_protocol.c | 2 +- ldap/servers/plugins/replication/repl5_replica.c | 43 ++- .../plugins/replication/repl5_replica_config.c | 58 ++-- ldap/servers/plugins/replication/repl5_ruv.c | 1 + ldap/servers/plugins/replication/repl_globals.c | 1 + ldap/servers/slapd/entry.c | 6 +- ldap/servers/slapd/rdn.c | 10 + ldap/servers/slapd/slapi-plugin.h | 10 +- 13 files changed, 752 insertions(+), 150 deletions(-)
git push origin master Counting objects: 45, done. Delta compression using up to 4 threads. Compressing objects: 100% (22/22), done. Writing objects: 100% (23/23), 9.12 KiB, done. Total 23 (delta 20), reused 0 (delta 0) To ssh://git.fedorahosted.org/git/389/ds.git 4c1dfaf..c30c897 master -> master
commit c30c897acb80ae00ecdd3063145ddd716caa96f1 Author: Mark Reynolds mreynolds389@redhat.com Date: Thu Oct 31 12:36:14 2013 -0400
Comment from mreynolds (@mreynolds389) at 2013-10-31 23:26:37
Summary:
To manually check if a consumer is up to date, you compare the agreements max csn, located in the local ruv tombstone entry, to the consumers RUV element for the suppliers rid.
Here is the format of the new attribute/value:
nsds5AgmtMaxCSN: <repl area>:<agmt name>:<consumer host>:<port>:<consumer rid>:<maxcsn>
nsds5AgmtMaxCSN: dc=example,dc=com;to replica 1;localhost.localdomain;389;111;5270095c0000014d0000
Example: Check if Replica B is caught up with Replica A
Replica A (rid 222):
ldapsearch -h localhost -D "cn=directory manager" -W -b "dc=example,dc=com" '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' -xLLL nsds50ruv nsds5AgmtMaxCSN
dn: cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
nsds50ruv: {replica 222 ldap://localhost.localdomain:389} 527283720000006f0000 52729c8b0000006f0000
nsds5AgmtMaxCSN: dc=example,dc=com;to replica B;localhost.localdomain;22222;65535;52729c8b0000006f0000
nsds5AgmtMaxCSN: dc=example,dc=com;to replica C;localhost.localdomain;22222;444;52729c999900006f0000
Look at the nsds5AgmtMaxCSN attributes, and locate the agreement for replica B, and get its max csn value. Then compare this max csn to the max csn on Replica B(nsds50ruv) for replica 222(the supplier):
Replica B (rid 65535):
ldapsearch -h localhost -D "cn=directory manager" -W -b "dc=example,dc=com" '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' -p 22222 -xLLL nsds50ruv
dn: cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
nsds50ruv: {replica 222 ldap://localhost.localdomain:389} 527283720000006f0000 52729c8b0000006f0000
nsds50ruv: {replica 444 ldap://localhost.localdomain:389} 527283af0000006f0000 527299990000006f0000
So now each agreement can be monitored, and this is also the only way to correctly tell if fractional replication agreements are in sync.
Comment from mreynolds (@mreynolds389) at 2013-10-31 23:52:49
Fixed jenkins errors 0001-Ticket-47368-Fix-Jenkins-errors.patch
Comment from mreynolds (@mreynolds389) at 2013-10-31 23:54:15
Fixed Jenkins errors:
git merge ticket47368 Updating c30c897..eb6a462 Fast-forward ldap/servers/plugins/replication/repl5_agmt.c | 13 +++++-------- 1 files changed, 5 insertions(+), 8 deletions(-)
git push origin master Counting objects: 13, done. Delta compression using up to 4 threads. Compressing objects: 100% (7/7), done. Writing objects: 100% (7/7), 887 bytes, done. Total 7 (delta 5), reused 0 (delta 0) To ssh://git.fedorahosted.org/git/389/ds.git c30c897..eb6a462 master -> master
commit eb6a462b8ed8ddffa29e01178e39d77112dcfe32 Author: Mark Reynolds mreynolds389@redhat.com Date: Thu Oct 31 14:49:07 2013 -0400
Comment from mreynolds (@mreynolds389) at 2013-11-04 20:55:24
Fix coverity errors 0001-Ticket-47368-Fix-coverity-issues.patch
Comment from rmeggins (@richm) at 2013-11-04 21:05:30
ack
Comment from mreynolds (@mreynolds389) at 2013-11-04 21:21:01
git merge ticket47368 Updating 8db3a1a..8d398a5 Fast-forward ldap/servers/plugins/replication/repl5_agmt.c | 6 +++--- ldap/servers/plugins/replication/repl5_replica.c | 8 ++++++-- 2 files changed, 9 insertions(+), 5 deletions(-)
git push origin master Counting objects: 15, done. Delta compression using up to 4 threads. Compressing objects: 100% (8/8), done. Writing objects: 100% (8/8), 929 bytes, done. Total 8 (delta 6), reused 0 (delta 0) To ssh://git.fedorahosted.org/git/389/ds.git 8db3a1a..8d398a5 master -> master
commit 8d398a555fc1632d678223e6547a5a4df4082775 Author: Mark Reynolds mreynolds389@redhat.com Date: Mon Nov 4 09:52:11 2013 -0500
Comment from mreynolds (@mreynolds389) at 2013-12-05 23:30:47
Fix memory leaks 0001-Ticket-47368-fix-memory-leaks.patch
Comment from mreynolds (@mreynolds389) at 2013-12-06 01:17:15
git merge ticket47368 Updating bae797c..f67e638 Fast-forward ldap/servers/plugins/replication/repl5_agmt.c | 15 ++++++++++----- ldap/servers/plugins/replication/repl5_replica.c | 4 +--- 2 files changed, 11 insertions(+), 8 deletions(-)
git push origin master Counting objects: 15, done. Delta compression using up to 4 threads. Compressing objects: 100% (8/8), done. Writing objects: 100% (8/8), 1016 bytes, done. Total 8 (delta 6), reused 0 (delta 0) To ssh://git.fedorahosted.org/git/389/ds.git bae797c..f67e638 master -> master
commit f67e638bada2b2377081a19dc2546622eaa6b055 Author: Mark Reynolds mreynolds389@redhat.com Date: Thu Dec 5 12:28:23 2013 -0500
Comment from mreynolds (@mreynolds389) at 2017-02-11 22:54:45
Metadata Update from @mreynolds389:
Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/47368
Ticket was cloned from Red Hat Bugzilla (product Fedora): Bug 966562