389ds / 389-ds-base

The enterprise-class Open Source LDAP server for Linux
https://www.port389.org/

Freeipa nightly test failure: ipa_topo_util_cleanruv: failed to create cleanalltuv task #4442

Open flo-renaud opened 3 years ago

flo-renaud commented 3 years ago

Issue Description
FreeIPA nightly tests frequently fail in test_integration/test_topology.py::TestCASpecificRUVs::test_replica_uninstall_deletes_ruvs. See for instance PR #533, test testing-fedora/test_topology_TestCASpecificRUVs.

Package Version and Platform:

Steps to Reproduce
Steps to reproduce the behavior: launch the FreeIPA nightly test. The test scenario is the following:

The test fails at the last step.

Expected results
list-ruv on the last step should not display replica2.
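For reference, the RUVs can typically be listed on a remaining master with something along these lines (an illustrative invocation, not taken from the original report):

ipa-replica-manage list-ruv -p <Directory Manager password>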

Additional context
The logs are available here and the master's DS error log shows multiple errors, but one of them seems relevant to the issue:

[15/Nov/2020:05:06:10.672537871 +0000] - ERR - ipa-topology-plugin - ipa_topo_util_cleanruv: failed to create cleanalltuv task
mreynolds389 commented 3 years ago

Looking at the logs I can see the tasks were running and finished very fast, about 1 or 2 seconds per task.

The logs show:

[15/Nov/2020:05:05:57.726306784 +0000] - INFO - NSMMReplicationPlugin - CleanAllRUV Task (rid 7): Successfully cleaned rid(7)

So RID 7 was already cleaned.

[15/Nov/2020:05:06:02.649307414 +0000] - ERR - NSMMReplicationPlugin - CleanAllRUV Task (rid 7): Replica id (7) is already being cleaned
[15/Nov/2020:05:06:04.654267828 +0000] - WARN - modify_internal_entry - Can't modify task entry 'cn=clean 7,cn=cleanallruv,cn=tasks,cn=config'; No such object (32)
[15/Nov/2020:05:06:08.666102849 +0000] - ERR - NSMMReplicationPlugin - CleanAllRUV Task (rid 7): Task failed...(-1)

A second task was created to clean rid 7, but it was already cleaned, which is why that task entry was not found and you get an error when trying to clean it again.

[15/Nov/2020:05:06:10.672537871 +0000] - ERR - ipa-topology-plugin - ipa_topo_util_cleanruv: failed to create cleanalltuv task

This error is probably because IPA is trying to clean rid 7 again.
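For background, a CleanAllRUV task is started by adding an entry under cn=cleanallruv,cn=tasks,cn=config (e.g. via ldapadd as Directory Manager); a manually created task looks roughly like the following LDIF, with illustrative values that are not taken from the failing run:

dn: cn=clean 7,cn=cleanallruv,cn=tasks,cn=config
objectclass: extensibleObject
cn: clean 7
replica-base-dn: dc=ipa,dc=test
replica-id: 7
replica-force-cleaning: no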

The issue here is why the completed task entry is being removed so quickly from cn=config. I can see in my testing that the tasks are removed immediately after the task completes. Typically completed tasks are supposed to stick around for 1 hour. So the TTL is not being enforced anymore, which is odd since this code was not changed.
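One way to see how long completed task entries actually persist is to poll the tasks container directly, for example with an illustrative command like:

ldapsearch -xLLL -D "cn=directory manager" -W -b "cn=cleanallruv,cn=tasks,cn=config" "(objectclass=*)" cn nsTaskStatus nsTaskExitCode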

mreynolds389 commented 3 years ago

@flo-renaud are you still seeing this failure?

flo-renaud commented 3 years ago

@mreynolds389 yes, see for instance this run. It happened on 8 runs out of 11 in the last month.

mreynolds389 commented 2 years ago

@flo-renaud - sorry to check again but are you still seeing this random failure?

flo-renaud commented 2 years ago

@mreynolds389 yes, it still happens quite often. We don't note all the occurrences but you can find recent ones in https://pagure.io/freeipa/issue/8023.

mreynolds389 commented 1 year ago

Getting really odd results when I test this on beaker boxes:

In the logs I see RIDs 5 and 3 being cleaned (not sure which backend is being cleaned, though). The test fails because it finds the deleted replica's hostname in the RUV, but the RUV element for that host (muflon) is replica ID 7 (not 5 or 3):

assert(replica.hostname not in res2), ("Replica RUVs were not clean during replica uninstallation")
E AssertionError: Replica RUVs were not clean during replica uninstallation
E assert 'muflon-4.ipa.test' not in 'Replica Upd...est:389: 5\n'
E 'muflon-4.ipa.test' is contained here:
E   t:389: 3
E   muflon-4.ipa.test:389: 7
E   Certificate Server Replica Update Vectors:
E     opicee-7.ipa.test:389: 6
E     ibm-x3650m4-01-vm-13.ipa.test:389: 5

ldapsearch -xLLL -D "cn=directory manager" -w Secret.123 -b "o=ipaca" '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' nsds50ruv
dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
nsds50ruv: {replica 6 ldap://opicee-7.ipa.test:389} 63d81c60000000060000 63d82c25000000060000
nsds50ruv: {replica 5 ldap://ibm-x3650m4-01-vm-13.ipa.test:389} 63d81c67000000050000 63d82168000700050000

ldapsearch -xLLL -D "cn=directory manager" -w Secret.123 -b "dc=ipa,dc=test" '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' nsds50ruv
dn: cn=replica,cn=dc\3Dipa\2Cdc\3Dtest,cn=mapping tree,cn=config
nsds50ruv: {replica 4 ldap://opicee-7.ipa.test:389} 63d81c00000100040000 63d82a2e000000040000
nsds50ruv: {replica 3 ldap://ibm-x3650m4-01-vm-13.ipa.test:389} 63d81c08000200030000 63d8236e000000030000
nsds50ruv: {replica 7 ldap://muflon-4.ipa.test:389} 63d81eb2000000070000 63d82363000000070000

So we cleaned rid 5 (and 3), but expected rid 7 to be cleaned???

Going to rerun the tests. This time I will get the deleted replica's config before it is removed (as I cannot confirm what muflon's replica ID actually was). Also going to improve CleanAllRUV logging so I can confirm what backend/RID is actually being cleaned...
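Capturing the deleted replica's configuration before removal could be done with something along these lines, run against the host that is about to be uninstalled (an illustrative command, not the exact one used in the rerun):

ldapsearch -xLLL -D "cn=directory manager" -W -b "cn=mapping tree,cn=config" "(objectclass=nsds5replica)" nsDS5ReplicaRoot nsDS5ReplicaId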

mreynolds389 commented 1 year ago

Ok, so something seems really broken in this test. The test removes a replica host (muflon-4 in my case) which uses RIDs 7 & 8 for its two backends. But when that replica host is removed we are cleaning RIDs 3 and 5. Why? The test should be cleaning RIDs 7 & 8!

So from what I can tell the tests are running the CleanAllRUV tasks on the wrong replica IDs. What I cannot explain is that when the test passes it's still cleaning those exact same incorrect RIDs (3 & 5). IMO the test should be failing every time it's run. Either way, I'm not seeing a problem in DS when I run the tests...