pavelhoral opened 1 year ago
GitHub won't let me attach a YML file, so here it is:
```yaml
version: "3.4"

x-shared-config: &shared-config
  extra_hosts:
    - "wrends-test1:10.0.0.31"
    - "wrends-test2:10.0.0.32"

services:
  wrends-test1:
    image: wrensecurity/wrends:5.0.1
    container_name: wrends-test1
    environment:
      ADDITIONAL_SETUP_ARGS: "--sampleData 10"
      ROOT_USER_DN: cn=Directory Manager
      ROOT_USER_PASSWORD: password
    volumes:
      - wrends-data:/opt/wrends/instance
    networks:
      wrenam:
        ipv4_address: 10.0.0.31
    <<: *shared-config

  wrends-test2:
    image: wrensecurity/wrends:5.0.1
    container_name: wrends-test2
    environment:
      ROOT_USER_DN: cn=Directory Manager
      ROOT_USER_PASSWORD: password
    networks:
      wrenam:
        ipv4_address: 10.0.0.32
    <<: *shared-config

volumes:
  wrends-data:

networks:
  wrenam:
    name: wrends-test
    ipam:
      config:
        - subnet: 10.0.0.0/24
```
What is actually happening under the hood:

1. The DS tries to connect to the RS and perform handshake phase 1: https://github.com/WrenSecurity/wrends/blob/b9af6473dd85fb55f62aa4adb4cab5389ff2fa4a/opendj-server-legacy/src/main/java/org/opends/server/replication/service/ReplicationBroker.java#L748-L798
2. The RS accepts the connection and tries to perform the handshake as well: https://github.com/WrenSecurity/wrends/blob/b9af6473dd85fb55f62aa4adb4cab5389ff2fa4a/opendj-server-legacy/src/main/java/org/opends/server/replication/server/ReplicationServer.java#L239-L268
3. The RS tries to initialize the replication domain parameters: https://github.com/WrenSecurity/wrends/blob/b9af6473dd85fb55f62aa4adb4cab5389ff2fa4a/opendj-server-legacy/src/main/java/org/opends/server/replication/server/DataServerHandler.java#L337
4. The RS waits for connections to the other (non-responding) RSs to be established: https://github.com/WrenSecurity/wrends/blob/b9af6473dd85fb55f62aa4adb4cab5389ff2fa4a/opendj-server-legacy/src/main/java/org/opends/server/replication/server/MessageHandler.java#L176-L187
5. ...in the meantime, the DS's handshake attempt times out...
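The race in the steps above can be reproduced in isolation with plain `java.net` sockets (this is just a self-contained sketch, no WrenDS code; the class name and timeout values are made up for illustration). The "RS" thread accepts the connection but blocks on unrelated work before answering, so the "DS" side's read timeout fires first:

```java
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class HandshakeTimeoutDemo {

    // Returns true if the client-side ("DS") read timed out while the
    // accepting side ("RS") was busy with a blocking wait before replying.
    static boolean clientTimesOut(int clientTimeoutMs, int serverDelayMs) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            Thread rs = new Thread(() -> {
                try (Socket peer = server.accept()) {
                    // Simulate the RS waiting for connections to other,
                    // unreachable RSs before answering the handshake.
                    Thread.sleep(serverDelayMs);
                    peer.getOutputStream().write(1); // handshake reply, possibly too late
                } catch (Exception ignored) {
                }
            });
            rs.start();
            try (Socket ds = new Socket("localhost", server.getLocalPort())) {
                ds.setSoTimeout(clientTimeoutMs); // DS-side handshake timeout
                ds.getInputStream().read();       // wait for the handshake reply
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            } finally {
                rs.join();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // RS "connection check" (1000 ms) outlasts the DS timeout (200 ms),
        // so the DS gives up before the RS ever replies.
        System.out.println(clientTimesOut(200, 1000)); // prints true
    }
}
```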
To be honest, I don't understand why `DataServerHandler` has to wait for the replication server to actually perform the connection check cycle. From the point of view of `DataServerHandler`, the only thing that happens is that under the hood `ReplicationServer` tries to (re)connect to all other RSs; nothing changes in `ReplicationDomain`'s internal state if the connection fails. Nothing happens, the call just takes a bit longer, and as a side effect of waiting for the connection check, the original socket times out.
When I drop the requirement for the connection check, everything works as expected. But dropping something that is obviously there for a reason is a pretty big deal :/.
Digging through commit history...

Based on https://github.com/WrenSecurity/wrends/commit/51ef33bebdaa4f8df31131374fce8433c431c298 it seems that the wait was always mandatory when creating a new `ReplicationDomain` in the "old days". That commit made the wait mandatory only for DS connections.

The commit message of https://github.com/WrenSecurity/wrends/commit/4d90aff1b4e079be6e32e3f880e328883dd534ee reads like it could be the title of this issue:

> Fix issue OpenDJ-96: Replication server monitor data computation takes too long / blocks rest of server when another RS is cannot be reached

Still, I am none the wiser as to why the wait is there.
OK, even though the timeout is very strangely implemented, the server is behaving as intended. With `isolation-policy` set to `accept-all-updates`, the server starts with writes enabled.
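For reference, a sketch of how that property can be set with `dsconfig` (the provider name, domain name, port, and credentials below are assumptions for a typical setup; adjust them to your topology):

```shell
dsconfig set-replication-domain-prop \
  --hostname localhost --port 4444 \
  --bindDN "cn=Directory Manager" --bindPassword password \
  --provider-name "Multimaster Synchronization" \
  --domain-name "dc=example,dc=com" \
  --set isolation-policy:accept-all-updates \
  --trustAll --no-prompt
```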
Summary
When a single server in a multi-master replication topology is unreachable (i.e. the socket connection has to time out), the replication server cannot accept any new connection, because the data server's own socket times out first. This happens during the first DS-RS handshake phase and is accompanied by the following error in the log:
Steps To Reproduce ("Repro Steps")
1. `docker-compose up -d`
2. `docker-compose stop`
3. `docker-compose up -d wrends-test1`
4. `docker exec -it wrends-test1 ldapdelete -h localhost -p 1389`
Expected Result (Behavior You Expected to See)
Server deletes the requested LDAP entry.
Actual Result (Behavior You Saw)
The following error is returned:
Additional Notes
I have spent several hours debugging this. The underlying issue is that the replication server's connection listener tries to contact all other replication servers when accepting a new connection. As the other server does not exist, this attempt times out with the same timeout value that the data server uses for the connection handshake.

I am not sure why we need to wait for the replication domain to actually contact all servers. Simply increasing the handshake timeout wouldn't work if we had multiple servers in the domain.

Creating this issue to track the discussion about potential ways of solving this problem.
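One possible direction (purely a sketch, not WrenDS code; the class and method names are made up): instead of blocking the handshake for the full socket timeout per unreachable RS, bound the connection-check wait so it always returns before the DS-side handshake timeout can fire.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class BoundedConnectionCheck {

    // Wait for the RS-to-RS connection check, but never longer than a
    // fraction of the handshake timeout, so the DS socket cannot time out
    // while we are still probing unreachable peers.
    static boolean waitForConnectionCheck(CountDownLatch checkDone,
                                          long handshakeTimeoutMs) throws InterruptedException {
        long budget = handshakeTimeoutMs / 2; // leave headroom for the handshake reply
        return checkDone.await(budget, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch neverCompletes = new CountDownLatch(1); // unreachable RS
        // The check did not finish, but we return in time and the
        // handshake can still proceed (e.g. in degraded mode).
        System.out.println(waitForConnectionCheck(neverCompletes, 400)); // prints false
    }
}
```

Whether proceeding without a completed connection check is safe for `ReplicationDomain`'s state is exactly the open question above.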