freeipa / freeipa-container

FreeIPA server in containers — images at https://quay.io/repository/freeipa/freeipa-server?tab=tags
Apache License 2.0

setting up a new replica with latest freeipa-server: failed in "retrieving schema for SchemaCache" #150

Closed (kforner closed this issue 6 years ago)

kforner commented 7 years ago

I'm still stuck with my master and its replica running an old version of the freeipa-server docker (adelton/freeipa-server:latest-systemd). I was trying to setup a new replica using the latest freeipa/freeipa-server.

I had to add the --skip-conncheck option to ipa-data/ipa-replica-install-options because there is no ssh installed in the docker.
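For reference, if I read the container docs correctly, that options file takes one ipa-replica-install option per line, so with just that change it looks roughly like this:

# contents of ipa-data/ipa-replica-install-options
--skip-conncheck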

But it still fails at some point. Here are the IMHO relevant lines from /var/log/ipareplica-install.log, out of ~ 2000 lines:

2017-08-02T11:54:20Z DEBUG Successfully updated nsDS5ReplicaId.
2017-08-02T11:54:20Z DEBUG flushing ldapi://%2fvar%2frun%2fslapd-QUARTZBIO-COM.socket from SchemaCache
2017-08-02T11:54:20Z DEBUG retrieving schema for SchemaCache url=ldapi://%2fvar%2frun%2fslapd-QUARTZBIO-COM.socket conn=<ldap.ldapobject.SimpleLDAPObject inst
ance at 0x7fdb699aed88>
2017-08-02T11:54:37Z DEBUG Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ipaserver/install/service.py", line 449, in start_creation
    run_step(full_msg, method)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/service.py", line 439, in run_step
    method()
  File "/usr/lib/python2.7/site-packages/ipaserver/install/dsinstance.py", line 431, in __setup_replica
    r_bindpw=self.dm_password)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py", line 1068, in setup_replication
    raise RuntimeError("Failed to start replication")
RuntimeError: Failed to start replication

2017-08-02T11:54:37Z DEBUG   [error] RuntimeError: Failed to start replication
2017-08-02T11:54:37Z DEBUG Destroyed connection context.ldap2_140580366919440
2017-08-02T11:54:37Z DEBUG   File "/usr/lib/python2.7/site-packages/ipapython/admintool.py", line 172, in execute
    return_value = self.run()
  File "/usr/lib/python2.7/site-packages/ipapython/install/cli.py", line 318, in run
    cfgr.run()
  File "/usr/lib/python2.7/site-packages/ipapython/install/core.py", line 310, in run
    self.execute()
  File "/usr/lib/python2.7/site-packages/ipapython/install/core.py", line 334, in execute
    for nothing in self._executor():
  File "/usr/lib/python2.7/site-packages/ipapython/install/core.py", line 376, in __runner
    exc_handler(exc_info)
  File "/usr/lib/python2.7/site-packages/ipapython/install/core.py", line 405, in _handle_execute_exception
    self._handle_exception(exc_info)
  File "/usr/lib/python2.7/site-packages/ipapython/install/core.py", line 395, in _handle_exception
    six.reraise(*exc_info)
  File "/usr/lib/python2.7/site-packages/ipapython/install/core.py", line 366, in __runner
    step()
  File "/usr/lib/python2.7/site-packages/ipapython/install/core.py", line 363, in <lambda>
    step = lambda: next(self.__gen)
  File "/usr/lib/python2.7/site-packages/ipapython/install/util.py", line 81, in run_generator_with_yield_from
    six.reraise(*exc_info)
  File "/usr/lib/python2.7/site-packages/ipapython/install/util.py", line 59, in run_generator_with_yield_from
    value = gen.send(prev_value)
  File "/usr/lib/python2.7/site-packages/ipapython/install/core.py", line 597, in _configure
    next(executor)
  File "/usr/lib/python2.7/site-packages/ipapython/install/core.py", line 376, in __runner
    exc_handler(exc_info)
  File "/usr/lib/python2.7/site-packages/ipapython/install/core.py", line 405, in _handle_execute_exception
    self._handle_exception(exc_info)
  File "/usr/lib/python2.7/site-packages/ipapython/install/core.py", line 460, in _handle_exception
    self.__parent._handle_exception(exc_info)
  File "/usr/lib/python2.7/site-packages/ipapython/install/core.py", line 395, in _handle_exception
    six.reraise(*exc_info)
  File "/usr/lib/python2.7/site-packages/ipapython/install/core.py", line 457, in _handle_exception
    super(ComponentBase, self)._handle_exception(exc_info)
  File "/usr/lib/python2.7/site-packages/ipapython/install/core.py", line 395, in _handle_exception
    six.reraise(*exc_info)
  File "/usr/lib/python2.7/site-packages/ipapython/install/core.py", line 366, in __runner
    step()
  File "/usr/lib/python2.7/site-packages/ipapython/install/core.py", line 363, in <lambda>
    step = lambda: next(self.__gen)
  File "/usr/lib/python2.7/site-packages/ipapython/install/util.py", line 81, in run_generator_with_yield_from
    six.reraise(*exc_info)
  File "/usr/lib/python2.7/site-packages/ipapython/install/util.py", line 59, in run_generator_with_yield_from
    value = gen.send(prev_value)
  File "/usr/lib/python2.7/site-packages/ipapython/install/common.py", line 63, in _install
    for nothing in self._installer(self.parent):
  File "/usr/lib/python2.7/site-packages/ipaserver/install/server/replicainstall.py", line 1741, in main
    install(self)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/server/replicainstall.py", line 375, in decorated
    func(installer)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/server/replicainstall.py", line 818, in install
    ds = install_replica_ds(config, options, ca_enabled, remote_api)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/server/replicainstall.py", line 139, in install_replica_ds
    api=remote_api,
  File "/usr/lib/python2.7/site-packages/ipaserver/install/dsinstance.py", line 412, in create_replica
    self.start_creation(runtime=60)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/service.py", line 449, in start_creation
    run_step(full_msg, method)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/service.py", line 439, in run_step
    method()
  File "/usr/lib/python2.7/site-packages/ipaserver/install/dsinstance.py", line 431, in __setup_replica
    r_bindpw=self.dm_password)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py", line 1068, in setup_replication
    raise RuntimeError("Failed to start replication")

2017-08-02T11:54:37Z DEBUG The ipa-replica-install command failed, exception: RuntimeError: Failed to start replication
2017-08-02T11:54:37Z ERROR Failed to start replication
2017-08-02T11:54:37Z ERROR The ipa-replica-install command failed. See /var/log/ipareplica-install.log for more information
felipevolpone commented 7 years ago

Hi @kforner, have you exposed port 389 when running the container? You can see here the list of the other ports required by FreeIPA.

HTH

kforner commented 7 years ago

Hi @felipevolpone Yes, it is started like this:

    docker run --name $(REPLICA_NAME) -d  \
    --restart=always \
     -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
    -e IPA_SERVER_IP=$(REPLICA_IP) \
    --dns $(MASTER_IP) \
    -h $(REPLICA_HOST).$(DOMAIN) \
    -v $(PWD)/ipa-data:/data \
    -v /etc/localtime:/etc/localtime:ro \
    -p 53:53  -p 53:53/udp \
    -p 80:80 -p 443:443 \
    -p 389:389 -p 636:636 -p 88:88 -p 464:464 \
    -p 88:88/udp -p 464:464/udp -p 7389:7389 \
    -p 9443:9443 -p 9444:9444 -p 9445:9445  $(IMAGE)
felipevolpone commented 7 years ago

@kforner Sorry, I have not formulated my question properly. Have you also run the master with the ports exposed?

kforner commented 7 years ago

Yes:

9b9584c17685        docker.quartzbio.com/freeipa-server:fedora23-freeipa4.2.3-systemd   "/usr/sbin/init-data"   2 weeks ago         Up 2 weeks          0.0.0.0:53->53/tcp, 0.0.0.0:80->80/tcp, 0.0.0.0:53->53/udp, 0.0.0.0:88->88/tcp, 0.0.0.0:389->389/tcp, 0.0.0.0:88->88/udp, 0.0.0.0:443->443/tcp, 0.0.0.0:464->464/tcp, 0.0.0.0:636->636/tcp, 0.0.0.0:7389->7389/tcp, 0.0.0.0:9443-9445->9443-9445/tcp, 0.0.0.0:464->464/udp, 123/udp   freeipa

I have another replica that has been running for years, and I'm trying to set up an additional one.

felipevolpone commented 7 years ago

@kforner could you share ipaupgrade.log?

kforner commented 7 years ago

a better way to include the log: ipareplica-install.txt

felipevolpone commented 7 years ago

@kforner Hi, it seems that it is an issue with IPv6. So, try again with IPv6 enabled: https://docs.docker.com/engine/userguide/networking/default_network/ipv6/
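For example, a persistent way to enable it is via the daemon config plus a restart; a minimal sketch (the IPv6 subnet below is just a documentation example, adjust it to your network):

cat > /etc/docker/daemon.json <<'EOF'
{
  "ipv6": true,
  "fixed-cidr-v6": "2001:db8:1::/64"
}
EOF
systemctl restart docker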

kforner commented 7 years ago

@felipevolpone I manually tried running docker like this:

sudo dockerd --ipv6 --fixed-cidr-v6="2001:db8:1::/64"

I retried the replica install, with the same result:

ipa.ipapython.ipaldap.SchemaCache: DEBUG    flushing ldapi://%2fvar%2frun%2fslapd-QUARTZBIO-COM.socket from SchemaCache
ipa.ipapython.ipaldap.SchemaCache: DEBUG    retrieving schema for SchemaCache url=ldapi://%2fvar%2frun%2fslapd-QUARTZBIO-COM.socket conn=<ldap.ldapobject.SimpleLDAPObject instance at 0x7f804f98e518>
ipa         : DEBUG    Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ipaserver/install/service.py", line 449, in start_creation
    run_step(full_msg, method)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/service.py", line 439, in run_step
    method()
  File "/usr/lib/python2.7/site-packages/ipaserver/install/dsinstance.py", line 431, in __setup_replica
    r_bindpw=self.dm_password)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py", line 1068, in setup_replication
    raise RuntimeError("Failed to start replication")
RuntimeError: Failed to start replication
germanparente commented 7 years ago

Hi, it's not the same error. Before, it was failing at wait_for_open_ports with a timeout. Now you have a replication failure. We should inspect the Directory Server logs on both the master and the replica.

kforner commented 7 years ago

@germanparente oh, thank you.

On the master, at the exact same time as the above error (retrieving schema for SchemaCache), I found this in /var/log/dirsrv/slapd-QUARTZBIO-COM/errors:

[03/Aug/2017:18:52:15 +0200] get_dom_sid - [file ipa_sidgen_common.c, line 75]: Internal search failed.
[03/Aug/2017:18:52:15 +0200] get_dom_sid - [file ipa_sidgen_common.c, line 75]: Internal search failed.
[03/Aug/2017:18:52:15 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 0 (Success)
[03/Aug/2017:18:52:15 +0200] NSMMReplicationPlugin - agmt="cn=meToiparig.quartzbio.com" (iparig:389): Replication bind with SIMPLE auth failed: LDAP error -1 (Can't conta
ct LDAP server) ()
[03/Aug/2017:18:52:15 +0200] get_dom_sid - [file ipa_sidgen_common.c, line 75]: Internal search failed.
[03/Aug/2017:18:52:15 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 0 (Success)
[03/Aug/2017:18:52:15 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 0 (Success)
[03/Aug/2017:18:52:15 +0200] NSMMReplicationPlugin - Warning: unable to acquire replica for total update, error: -1, retrying in 1 seconds.
[03/Aug/2017:18:52:16 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 0 (Success)
[03/Aug/2017:18:52:16 +0200] NSMMReplicationPlugin - Warning: unable to acquire replica for total update, error: -1, retrying in 2 seconds.

On the replica server (after rerunning the replica install), I found this in the dirsrv logs about the database generation ID:

Aug 04 10:31:46 iparig.quartzbio.com ns-slapd[480]: [04/Aug/2017:10:31:46.464741513 +0200] ipaenrollment_start - [file ipa_enrollment.c, line 408]: Failed to get default realm?!
Aug 04 10:31:46 iparig.quartzbio.com ns-slapd[480]: [04/Aug/2017:10:31:46.494741725 +0200] slapd started.  Listening on All Interfaces port 389 for LDAP requests
Aug 04 10:31:46 iparig.quartzbio.com ns-slapd[480]: [04/Aug/2017:10:31:46.500172010 +0200] Listening on All Interfaces port 636 for LDAPS requests
Aug 04 10:31:46 iparig.quartzbio.com ns-slapd[480]: [04/Aug/2017:10:31:46.503584141 +0200] Listening on /var/run/slapd-QUARTZBIO-COM.socket for LDAPI requests
Aug 04 10:31:46 iparig.quartzbio.com systemd[1]: Started 389 Directory Server QUARTZBIO-COM..
Aug 04 10:31:47 iparig.quartzbio.com ns-slapd[480]: [04/Aug/2017:10:31:47.803910268 +0200] NSMMReplicationPlugin - agmt="cn=meToipa.quartzbio.com" (ipa:389): The remote replica has a dif
ferent database generation ID than the local database.  You may have to reinitialize the remote replica, or the local replica.
germanparente commented 7 years ago

Hi,

I have a guess here. You are re-installing the replica several times without completely cleaning it up on the master.

If you run "ipa server-find" on the master, do you see your replica?

If you see it, you need to run on the master: ipa server-del --force

Then, let's see also:

"ipa-replica-manage list-ruv"

If you see records showing your replica, you will need to do: ipa-replica-manage clean-ruv

These steps have to be done each time your replica install fails, unfortunately, because installing the replica also makes changes to the master-side configuration that have to be cleaned up (see the sketch below).
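Put together, the master-side cleanup would look roughly like this (the hostname and RUV id are only examples for the failed iparig replica; use whatever the find/list commands actually show):

ipa server-find
ipa server-del iparig.quartzbio.com --force    # only if the failed replica is still listed
ipa-replica-manage list-ruv
ipa-replica-manage clean-ruv 19                # repeat for each stale id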

Regards,

German

kforner commented 7 years ago

I don't see my replica in the ipa server-find output:

# ipa server-find
---------------------
2 IPA servers matched
---------------------
  Server name: ipa.quartzbio.com
  Min domain level: 0
  Max domain level: 0

  Server name: ipasif2.quartzbio.com
  Managed suffix: dc=quartzbio,dc=com, o=ipaca
  Min domain level: 0
  Max domain level: 1

But there are some zombie RUVs:

# ipa-replica-manage list-ruv
iparig.quartzbio.com:389: 16
ipa.quartzbio.com:389: 4
iparig.quartzbio.com:389: 17
iparig.quartzbio.com:389: 18
iparig.quartzbio.com:389: 12
iparig.quartzbio.com:389: 14
ipasif2.quartzbio.com:389: 11
iparig.quartzbio.com:389: 19
iparig.quartzbio.com:389: 15

I've run ipa-replica-manage clean-ruv xx for each xx from 12 to 19, but I don't know how long it takes for the background tasks to complete.
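For the record, that was essentially the following loop on the master, with the ids taken from the list-ruv output above (each invocation may prompt for the Directory Manager password):

for id in 12 14 15 16 17 18 19; do
    ipa-replica-manage clean-ruv "$id"
done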

germanparente commented 7 years ago

Hi,

You could tail -f /var/log/dirsrv/slapd-QUARTZBIO-COM/errors to see if the RUV has been cleaned.

Or retry the list-ruv command.

regards.

kforner commented 7 years ago

@germanparente tail -f /var/log/dirsrv/slapd-QUARTZBIO-COM/errors

[04/Aug/2017:14:58:23 +0200] NSMMReplicationPlugin - CleanAllRUV Task (rid 18): Replica not online (agmt="cn=meToiparig.quartzbio.com" (iparig:389)) 
[04/Aug/2017:14:58:23 +0200] NSMMReplicationPlugin - CleanAllRUV Task (rid 18): Not all replicas online, retrying in 640 seconds... 
[04/Aug/2017:14:58:24 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 0 (Success)
[04/Aug/2017:14:58:41 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 0 (Success)
[04/Aug/2017:14:58:41 +0200] NSMMReplicationPlugin - agmt="cn=meToiparig.quartzbio.com" (iparig:389): Replication bind with SIMPLE auth failed: LDAP error -1 (Can't contact LDAP server) ()

It seems that the RUV cleanup needs to contact the replica, so it never succeeds...

germanparente commented 7 years ago

Do you have both ipa.quartzbio.com and ipasif2.quartzbio.com running? The CLEANALLRUV operation is sent to all the nodes.

kforner commented 7 years ago

@germanparente Yes, both are running. ipa-replica-manage list-ruv still lists the same ruvs, i.e. nothing has been deleted.

The CLEANALLRUV operation is sent to all the nodes.

Yes, but this message: Not all replicas online, retrying in 640 seconds... makes me think that it will never happen.

germanparente commented 7 years ago

Ok. Let's try another method.

First, let's check that there are no clean-ruv tasks running:

ipa-replica-manage list-clean-ruv

If there are, please run: ipa-replica-manage abort-clean-ruv

Then, let's do the cleanallruv like this, in the master:

ldapmodify -a -D "cn=directory manager" -W << EOF
dn: cn=clean 19, cn=cleanallruv, cn=tasks, cn=config
objectclass: extensibleObject
replica-base-dn: dc=quartzbio,dc=com
replica-id: 19
replica-force-cleaning: yes
cn: clean 19
EOF

and check after a moment whether id 19 is gone from the "list-ruv" output. Then repeat for the other stale RUVs.

kforner commented 7 years ago

Yes there were some clean-ruv tasks:

# ipa-replica-manage list-clean-ruv
CLEANALLRUV tasks
RID 12: Not all replicas online, retrying in 14400 seconds...
RID 14: Not all replicas online, retrying in 1280 seconds...
RID 15: Not all replicas online, retrying in 1280 seconds...
RID 16: Not all replicas online, retrying in 14400 seconds...
RID 17: Not all replicas online, retrying in 1280 seconds...
RID 18: Not all replicas online, retrying in 1280 seconds...
RID 19: Not all replicas online, retrying in 1280 seconds...

No abort CLEANALLRUV tasks running

After running a bunch of ipa-replica-manage abort-clean-ruv xxx, and waiting, I finally got:

# ipa-replica-manage list-clean-ruv
No CLEANALLRUV tasks running

Abort CLEANALLRUV tasks
RID 12: Retrying in 320 seconds
RID 14: Retrying in 320 seconds
RID 15: Retrying in 160 seconds
RID 16: Retrying in 160 seconds
RID 17: Retrying in 160 seconds
RID 18: Retrying in 160 seconds
RID 19: Retrying in 160 seconds

I tried the ldapmodify command, and got this in the log:

[04/Aug/2017:15:29:05 +0200] NSMMReplicationPlugin - CleanAllRUV Task (rid 19): Initiating CleanAllRUV Task... 
[04/Aug/2017:15:29:05 +0200] NSMMReplicationPlugin - CleanAllRUV Task (rid 19): Retrieving maxcsn... 
[04/Aug/2017:15:29:05 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 0 (Success)
[04/Aug/2017:15:29:06 +0200] NSMMReplicationPlugin - agmt="cn=meToiparig.quartzbio.com" (iparig:389): Replication bind with SIMPLE auth failed: LDAP error -1 (Can't contact LDAP server) ()
[04/Aug/2017:15:29:06 +0200] NSMMReplicationPlugin - CleanAllRUV Task (rid 19): Found maxcsn (00000000000000000000) 
[04/Aug/2017:15:29:06 +0200] get_dom_sid - [file ipa_sidgen_common.c, line 75]: Internal search failed.
[04/Aug/2017:15:29:06 +0200] NSMMReplicationPlugin - CleanAllRUV Task (rid 19): Cleaning rid (19)... 
[04/Aug/2017:15:29:06 +0200] NSMMReplicationPlugin - CleanAllRUV Task (rid 19): Waiting to process all the updates from the deleted replica... 
[04/Aug/2017:15:29:06 +0200] NSMMReplicationPlugin - CleanAllRUV Task (rid 19): Waiting for all the replicas to be online... 
[04/Aug/2017:15:29:06 +0200] NSMMReplicationPlugin - CleanAllRUV Task (rid 19): Task aborted for rid(19). 

and the ruv 19 is still listed by ipa-replica-manage list-ruv.

germanparente commented 7 years ago

Are you sure that both replicas are running and that replication between them is working? If you add a user on one of them, can you see it on the other?
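For example, with a throwaway test user (the name is arbitrary; delete it afterwards):

# on the master, after kinit admin
ipa user-add replcheck --first=Repl --last=Check
# on the other server, a few seconds later
ipa user-show replcheck
# clean up on either side
ipa user-del replcheck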

We could try a plain CLEANRUV directly on both replicas. The operation would be like this:

ldapmodify -x -D "cn=directory manager" -W <<EOF
dn: cn=replica,cn=dc\3Dquartzbio\2Cdc\3Dcom,cn=mapping tree,cn=config
changetype: modify
replace: nsds5task
nsds5task: CLEANRUV19
EOF

In both replicas.

kforner commented 7 years ago

@germanparente

Are you sure that both replicas are running and that replication between them is working?

There's only one replica (ipasif2), or do you also count the master?

If you add a user on one of them, can you see it on the other?

Damnation: you're right, it's not working any more. I even tried ipa-replica-manage force-sync --from ipa.quartzbio.com on the replica, but it won't sync. It's a nightmare...

We could try a clean ruv only in both replicas.

On my replica (that does not replicate anymore),

# ipa-replica-manage list-ruv
ipa.quartzbio.com:389: 4
ipasif2.quartzbio.com:389: 11
kforner commented 7 years ago

What a nightmare... Isn't there a simple way to start from a clean state? I'd like to start with a fresh, up-to-date freeipa-server container, but of course keep my user data. Isn't there a way to export the data and reimport it?

germanparente commented 7 years ago

That's why. Let's re-initialize that replica from the master.

ipa-replica-manage re-initialize --from ipa.quartzbio.com

thanks for your patience.

germanparente commented 7 years ago

Yes, of course there is a way to export and import. But your master is working fine; re-initialize is similar to re-importing the db on your replica (online).

kforner commented 7 years ago

@germanparente

ipa-replica-manage re-initialize --from ipa.quartzbio.com

It worked! I can now see my new user, and it is working both ways. What a relief!! Thanks!!!

thanks for your patience.

You're kidding, right? Thanks a lot for your invaluable help!!!

germanparente commented 7 years ago

Nice! Thanks to you. Let's retry the CLEANALLRUV. I would use the ldapmodify version, and let's do it one by one.

kforner commented 7 years ago

So now, on the master:

# ipa-replica-manage list-clean-ruv
No CLEANALLRUV tasks running

No abort CLEANALLRUV tasks running

but

# ipa-replica-manage list-ruv
iparig.quartzbio.com:389: 16
iparig.quartzbio.com:389: 17
ipa.quartzbio.com:389: 4
iparig.quartzbio.com:389: 18
iparig.quartzbio.com:389: 12
iparig.quartzbio.com:389: 14
ipasif2.quartzbio.com:389: 11
iparig.quartzbio.com:389: 19
iparig.quartzbio.com:389: 15

and the same on the replica

kforner commented 7 years ago

I would use the ldapmodify version. And let's do it one by one.

OK, I could delete two RUVs, 19 and 18, from the master, but they still appear on the replica in ipa-replica-manage list-ruv.

Should I also run the ldapmodify on the replica?

germanparente commented 7 years ago

But have you used this one?

ldapmodify -a -D "cn=directory manager" -W << EOF
dn: cn=clean 19, cn=cleanallruv, cn=tasks, cn=config
objectclass: extensibleObject
replica-base-dn: dc=quartzbio,dc=com
replica-id: 19
replica-force-cleaning: yes
cn: clean 19
EOF

germanparente commented 7 years ago

If you have used that task, then it has to be run on all the nodes. But I advise using the former one, because the replica could send the RUVs back to the master.

kforner commented 7 years ago

Yes, I had used the wrong one. I then fixed it on the replica, then used the correct ldapmodify. Now both servers are clean:

# ipa-replica-manage list-ruv
ipa.quartzbio.com:389: 4
ipasif2.quartzbio.com:389: 11

Should I try again to add the new replica?

germanparente commented 7 years ago

Yes. But just as a remark, IPv6 should also be enabled on the master. Not for replication, but later in the install it could fail if it's not enabled.

kforner commented 7 years ago

I tried, and it failed again at the same step:

2017-08-04T15:17:55Z DEBUG retrieving schema for SchemaCache url=ldapi://%2fvar%2frun%2fslapd-QUARTZBIO-COM.socket conn=<ldap.ldapobject.SimpleLDAPObject instance at 0x7f4d79a0c638>
2017-08-04T15:18:12Z DEBUG Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ipaserver/install/service.py", line 449, in start_creation
    run_step(full_msg, method)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/service.py", line 439, in run_step
    method()
  File "/usr/lib/python2.7/site-packages/ipaserver/install/dsinstance.py", line 431, in __setup_replica
    r_bindpw=self.dm_password)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py", line 1068, in setup_replication
    raise RuntimeError("Failed to start replication")
RuntimeError: Failed to start replication

2017-08-04T15:18:12Z DEBUG   [error] RuntimeError: Failed to start replication
2017-08-04T15:18:12Z DEBUG Destroyed connection context.ldap2_139970749192016
2017-08-04T15:18:12Z DEBUG   File "/usr/lib/python2.7/site-packages/ipapython/admintool.py", line 172, in execute
    return_value = self.run()
  File "/usr/lib/python2.7/site-packages/ipapython/install/cli.py", line 318, in run
    cfgr.run()
  File "/usr/lib/python2.7/site-packages/ipapython/install/core.py", line 310, in run

At the same time, on the master, in /var/log/dirsrv/slapd-QUARTZBIO-COM/errors:

[04/Aug/2017:17:17:55 +0200] get_dom_sid - [file ipa_sidgen_common.c, line 75]: Internal search failed.
[04/Aug/2017:17:17:55 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 0 (Success)
[04/Aug/2017:17:17:55 +0200] NSMMReplicationPlugin - agmt="cn=meToiparig.quartzbio.com" (iparig:389): Replication bind with SIMPLE auth failed: LDAP error -1 (Can't contact LDAP server) ()
[04/Aug/2017:17:17:55 +0200] get_dom_sid - [file ipa_sidgen_common.c, line 75]: Internal search failed.
[04/Aug/2017:17:17:55 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 0 (Success)
[04/Aug/2017:17:17:55 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 0 (Success)
[04/Aug/2017:17:17:55 +0200] NSMMReplicationPlugin - Warning: unable to acquire replica for total update, error: -1, retrying in 1 seconds.
[04/Aug/2017:17:17:56 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 0 (Success)
[04/Aug/2017:17:17:56 +0200] NSMMReplicationPlugin - Warning: unable to acquire replica for total update, error: -1, retrying in 2 seconds.
[04/Aug/2017:17:17:58 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 0 (Success)
[04/Aug/2017:17:17:58 +0200] NSMMReplicationPlugin - Warning: unable to acquire replica for total update, error: -1, retrying in 3 seconds.
[04/Aug/2017:17:18:01 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 0 (Success)
[04/Aug/2017:17:18:01 +0200] NSMMReplicationPlugin - Warning: unable to acquire replica for total update, error: -1, retrying in 4 seconds.
[04/Aug/2017:17:18:05 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 0 (Success)
[04/Aug/2017:17:18:05 +0200] NSMMReplicationPlugin - Warning: unable to acquire replica for total update, error: -1, retrying in 5 seconds.
germanparente commented 7 years ago

I am sure you have checked that ports 389/636 are open.

If you try on the master:

kinit -kt /etc/dirsrv/ds.keytab ldap/hostname
ldapsearch -h -p 389 -Y GSSAPI -b "" -s base

Does this work? And the same from the replica to the master.

kforner commented 7 years ago

kinit:

on master:

# seems to work
kinit -kt /etc/dirsrv/ds.keytab ldap/ipa.quartzbio.com
# does not
kinit -kt /etc/dirsrv/ds.keytab ldap/ipasif2.quartzbio.com
kinit: Keytab contains no suitable keys for ldap/ipasif2.quartzbio.com@QUARTZBIO.COM while getting initial credentials

on replica (ipasif2) it is the opposite:

[root@ipasif2 /]# kinit -kt /etc/dirsrv/ds.keytab ldap/ipa.quartzbio.com    
kinit: Keytab contains no suitable keys for ldap/ipa.quartzbio.com@QUARTZBIO.COM while getting initial credentials

ldapsearch

I was not sure what you meant, so I tried this on the master:

%ldapsearch -h ipa -p 389 -Y GSSAPI -b "" -s base -W
->lots of output, last lines:
vendorVersion: 389-Directory/1.3.4.5 B2015.323.048
dataversion: 020170731153946020170731153946020170731153946
netscapemdsuffix: cn=ldap://dc=ipa,dc=quartzbio,dc=com:389
lastusn: 3719059
changeLog: cn=changelog
firstchangenumber: 879865
lastchangenumber: 881801

%ldapsearch -h ipasif2 -p 389 -Y GSSAPI -b "" -s base -W
...
vendorVersion: 389-Directory/1.3.4.5 B2015.323.048
dataversion: 020170804135304020170804135304020170804135304
netscapemdsuffix: cn=ldap://dc=ipasif2,dc=quartzbio,dc=com:389
lastusn: 3458032
changeLog: cn=changelog
firstchangenumber: 766007
lastchangenumber: 767549

Same results on the replica.

germanparente commented 7 years ago

And from the master to iparig?

-h iparig -p 389

kforner commented 7 years ago

And from the master to iparig ?

But iparig is the name of the FreeIPA server inside the container (running on rig), whose install failed, so how can it work?

 ldapsearch -h iparig -p 389 -Y GSSAPI -b "" -s base -W
Enter LDAP Password: 
ldap_sasl_interactive_bind_s: Can't contact LDAP server (-1)
germanparente commented 7 years ago

From master,

openssl s_client -connect iparig:389

germanparente commented 7 years ago

Hi, I thought iparig was the name of the new node that was going to be installed. I had seen iparig.quartzbio.com

kforner commented 7 years ago

@germanparente

I thought iparig was the name of the new node that was going to be installed.

Yes, it is. It is meant to run in a container on a computer named rig. The master runs in a container: ipa.quartzbio.com.

The existing replica, ipasif2, also runs in a container, on a computer named sif.

germanparente commented 7 years ago

Ok. Not sure I am following completely.

In "ipa-replica-manage list" I see ipasif2 and this is working fine.

So, from ipa node, we should be able to do:

openssl s_client -connect iparig:389

And from rig, we should be able to do the same against ipa:389

Regards

kforner commented 7 years ago

From inside the freeipa-server container running on rig (which has no services running since the replica failed):

%openssl s_client -connect ipa.quartzbio.com:389
CONNECTED(00000003)
140092481591160:error:140790E5:SSL routines:ssl23_write:ssl handshake failure:s23_lib.c:177:
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 0 bytes and written 201 bytes
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : 0000
    Session-ID: 
    Session-ID-ctx: 
    Master-Key: 
    Key-Arg   : None
    Krb5 Principal: None
    PSK identity: None
    PSK identity hint: None
    Start Time: 1502097998
    Timeout   : 300 (sec)
    Verify return code: 0 (ok)

But from the master (heimdall, running ipa.quartzbio.com in a container):

%openssl s_client -connect iparig.quartzbio.com:389
getaddrinfo: Name or service not known

From what I understand, this is expected, since the name "iparig.quartzbio.com" does not exist yet because the replica install failed.

openssl s_client -connect rig:389
socket: Bad file descriptor
connect:errno=9

This rig:389 is bound to port 389 of the Docker container, which does NOT run the dirsrv instance since the replica install failed.

germanparente commented 7 years ago

Hi,

I don't quite understand how the master ipa can see a replica iparig when iparig is not resolvable from it.

You have two nodes already installed that are working fine:

ipa and ipasif2

So, from ipa you can ping ipasif2 and connect through port 389, for instance, and the same holds in the opposite direction.

This same configuration should be followed between ipa and rig (or iparig).
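For example, something like this should succeed in both directions before retrying the replica install (using the hostnames from this thread):

# from the ipa (master) container: check name resolution, then the LDAP port
getent hosts iparig.quartzbio.com
openssl s_client -connect iparig.quartzbio.com:389 </dev/null
# and the same from the rig container towards ipa.quartzbio.com
getent hosts ipa.quartzbio.com
openssl s_client -connect ipa.quartzbio.com:389 </dev/null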

Thanks.

German.

kforner commented 7 years ago

Fixed by merging issue #151 into issue #150.

@germanparente @felipevolpone I just realized that I mixed up my replies from another issue: https://github.com/freeipa/freeipa-container/issues/151 . Let me clarify: I tried two different things,

  • upgrading my existing replica to a newer version by running it with the latest version of the freeipa-server docker container: issue https://github.com/freeipa/freeipa-container/issues/150
  • adding a new replica (also using the latest version of the freeipa-server container): issue #151.

So all the posts, command outputs, and log excerpts that I posted in this issue (#150) actually belong to issue #151. I am very sorry for the confusion, but I hope it also explains why this has been so hard to fix.

kforner commented 7 years ago

Any insights on how to move on? Would it be easier to just create a new master by exporting/importing the data? What would be the most reliable deployment: using Fedora and FreeIPA from RPMs?

felipevolpone commented 7 years ago

@kforner the situation is a little bit confusing now; is the log you provided about this error or about issue #151?

kforner commented 7 years ago

@felipevolpone

the situation is a little bit confusing now; is the log you provided about this error or about issue #151?

Yes, I know, and I'm sorry for that. All replies in this issue (#150) actually belong to issue #151. That's why there are three FreeIPA servers: the master, the existing replica (ipasif2), and the new replica (iparig) that I cannot manage to set up. Maybe I could close the existing issues and start with a new one?

felipevolpone commented 7 years ago

So, just change the title and the description of this issue. I'll close #151, ok?

Now, back to the problem. Have you tried to run docker run again after all the tests you and @germanparente have done? Could you share the output/log?

kforner commented 7 years ago

So, just change the title and the description of this issue. I'll close #151, ok?

done

Have you tried to run docker run again after all the tests you and @germanparente have done?

Yes.

It still fails at the same step:

2017-08-04T15:17:55Z DEBUG flushing ldapi://%2fvar%2frun%2fslapd-QUARTZBIO-COM.socket from SchemaCache
2017-08-04T15:17:55Z DEBUG retrieving schema for SchemaCache url=ldapi://%2fvar%2frun%2fslapd-QUARTZBIO-COM.socket conn=<ldap.ldapobject.SimpleLDAPObject instance 
at 0x7f4d79a0c638>
2017-08-04T15:18:12Z DEBUG Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ipaserver/install/service.py", line 449, in start_creation
    run_step(full_msg, method)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/service.py", line 439, in run_step
    method()
  File "/usr/lib/python2.7/site-packages/ipaserver/install/dsinstance.py", line 431, in __setup_replica
    r_bindpw=self.dm_password)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py", line 1068, in setup_replication
    raise RuntimeError("Failed to start replication")
RuntimeError: Failed to start replication

2017-08-04T15:18:12Z DEBUG   [error] RuntimeError: Failed to start replication

ipareplica-install.txt

germanparente commented 7 years ago

Correct me if I am wrong, but master ipa and replica ipasif2 are working fine.

Let's continue installing the third node. The problem here is that we don't manage to connect to port 389 from ipa to iparig. Could we check that again?

kforner commented 7 years ago

Correct me if I am wrong, but master ipa and replica ipasif2 are working fine.

Yes

The problem here is that we don't manage to connect to port 389 from ipa to iparig. Could we check that again?

This is the part I'm not sure I understand: since the replica install failed, there is no service running in the FreeIPA container on rig, hence nothing to contact on port 389, right?