pavelhoral opened 1 year ago
GitHub won't let me attach a YML file, so here it is:
```yaml
version: "3.4"

x-shared-config: &shared-config
  extra_hosts:
    - "wrends-test1:10.0.0.31"
    - "wrends-test2:10.0.0.32"

services:
  wrends-test1:
    image: wrensecurity/wrends:5.0.1
    container_name: wrends-test1
    environment:
      ADDITIONAL_SETUP_ARGS: "--sampleData 10"
      ROOT_USER_DN: cn=Directory Manager
      ROOT_USER_PASSWORD: password
    volumes:
      - wrends-data:/opt/wrends/instance
    networks:
      wrenam:
        ipv4_address: 10.0.0.31
    <<: *shared-config

  wrends-test2:
    image: wrensecurity/wrends:5.0.1
    container_name: wrends-test2
    environment:
      ROOT_USER_DN: cn=Directory Manager
      ROOT_USER_PASSWORD: password
    networks:
      wrenam:
        ipv4_address: 10.0.0.32
    <<: *shared-config

volumes:
  wrends-data:

networks:
  wrenam:
    name: wrends-test
    ipam:
      config:
        - subnet: 10.0.0.0/24
```
What is actually happening under the hood:

1. The DS tries to connect to the RS and perform handshake phase 1: https://github.com/WrenSecurity/wrends/blob/b9af6473dd85fb55f62aa4adb4cab5389ff2fa4a/opendj-server-legacy/src/main/java/org/opends/server/replication/service/ReplicationBroker.java#L748-L798
2. The RS accepts the connection and tries to perform the handshake as well: https://github.com/WrenSecurity/wrends/blob/b9af6473dd85fb55f62aa4adb4cab5389ff2fa4a/opendj-server-legacy/src/main/java/org/opends/server/replication/server/ReplicationServer.java#L239-L268
3. The RS tries to initialize the replication domain parameters: https://github.com/WrenSecurity/wrends/blob/b9af6473dd85fb55f62aa4adb4cab5389ff2fa4a/opendj-server-legacy/src/main/java/org/opends/server/replication/server/DataServerHandler.java#L337
4. The RS waits for connections to the other (non-responding) RSs to be established: https://github.com/WrenSecurity/wrends/blob/b9af6473dd85fb55f62aa4adb4cab5389ff2fa4a/opendj-server-legacy/src/main/java/org/opends/server/replication/server/MessageHandler.java#L176-L187
5. ...in the meantime, the DS's handshake attempt times out...
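The race in the steps above can be reproduced in isolation with plain `java.net` sockets (this is just a self-contained sketch, no WrenDS code; the class name and timeout values are made up for illustration). The "RS" thread accepts the connection but blocks on unrelated work before answering, so the "DS" side's read timeout fires first:

```java
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class HandshakeTimeoutDemo {

    // Returns true if the client-side ("DS") read timed out while the
    // accepting side ("RS") was busy with a blocking wait before replying.
    static boolean clientTimesOut(int clientTimeoutMs, int serverDelayMs) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            Thread rs = new Thread(() -> {
                try (Socket peer = server.accept()) {
                    // Simulate the RS waiting for connections to other,
                    // unreachable RSs before answering the handshake.
                    Thread.sleep(serverDelayMs);
                    peer.getOutputStream().write(1); // handshake reply, possibly too late
                } catch (Exception ignored) {
                }
            });
            rs.start();
            try (Socket ds = new Socket("localhost", server.getLocalPort())) {
                ds.setSoTimeout(clientTimeoutMs); // DS-side handshake timeout
                ds.getInputStream().read();       // wait for the handshake reply
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            } finally {
                rs.join();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // RS "connection check" (1000 ms) outlasts the DS timeout (200 ms),
        // so the DS gives up before the RS ever replies.
        System.out.println(clientTimesOut(200, 1000)); // prints true
    }
}
```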
To be honest, I don't understand why `DataServerHandler` has to wait for the replication server to actually perform the connection check cycle. From the point of view of `DataServerHandler`, the only thing that happens is that under the hood `ReplicationServer` tries to (re)connect to all other RSs; nothing changes in `ReplicationDomain`'s internal state if the connection fails. Nothing happens, the call just takes a bit longer, and as a side effect of waiting for the connection check, the original socket times out.
When I drop the requirement for the connection check, everything works as expected. But dropping something that is obviously there for a reason is a pretty big deal :/.
Digging through commit history...

Based on https://github.com/WrenSecurity/wrends/commit/51ef33bebdaa4f8df31131374fce8433c431c298 it seems that the wait was always mandatory when creating a new `ReplicationDomain` in the "old days". That commit made the wait mandatory only for DS connections.

The commit message of https://github.com/WrenSecurity/wrends/commit/4d90aff1b4e079be6e32e3f880e328883dd534ee reads like it could be the title of this issue:

> Fix issue OpenDJ-96: Replication server monitor data computation takes too long / blocks rest of server when another RS is cannot be reached

Still, I am none the wiser as to why the wait is there.
OK, even though the timeout is very strangely implemented, the server is behaving as intended. With `isolation-policy` set to `accept-all-updates`, the server starts with writes enabled.
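For reference, a sketch of how that property can be set with `dsconfig` (the provider name, domain name, port, and credentials below are assumptions for a typical setup; adjust them to your topology):

```shell
dsconfig set-replication-domain-prop \
  --hostname localhost --port 4444 \
  --bindDN "cn=Directory Manager" --bindPassword password \
  --provider-name "Multimaster Synchronization" \
  --domain-name "dc=example,dc=com" \
  --set isolation-policy:accept-all-updates \
  --trustAll --no-prompt
```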
Summary
When a single server in a multi-master replication topology is unreachable (i.e. the socket connection has to time out), the replication server cannot accept any new connection, because the data server's own socket times out first. This happens during the first DS-RS handshake phase and is accompanied by the following error in the log:
Steps To Reproduce ("Repro Steps")
1. `docker-compose up -d`
2. `docker-compose stop`
3. `docker-compose up -d wrends-test1`
4. `docker exec -it wrends-test1 ldapdelete -h localhost -p 1389`
Expected Result (Behavior You Expected to See)
Server deletes the requested LDAP entry.
Actual Result (Behavior You Saw)
The following error is returned:
Additional Notes
I have spent several hours debugging this. The underlying issue is that the replication server's connection listener tries to contact all other replication servers when accepting a new connection. As the other server does not exist, this attempt times out with the same timeout value that the data server uses for the connection handshake.

I am not sure why we need to wait for the replication domain to actually contact all servers. Simply increasing the handshake timeout wouldn't work if we had multiple servers in the domain.

Creating this issue to track the discussion about potential ways of solving this problem.
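One possible direction (purely a sketch, not WrenDS code; the class and method names are made up): instead of blocking the handshake for the full socket timeout per unreachable RS, bound the connection-check wait so it always returns before the DS-side handshake timeout can fire.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class BoundedConnectionCheck {

    // Wait for the RS-to-RS connection check, but never longer than a
    // fraction of the handshake timeout, so the DS socket cannot time out
    // while we are still probing unreachable peers.
    static boolean waitForConnectionCheck(CountDownLatch checkDone,
                                          long handshakeTimeoutMs) throws InterruptedException {
        long budget = handshakeTimeoutMs / 2; // leave headroom for the handshake reply
        return checkDone.await(budget, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch neverCompletes = new CountDownLatch(1); // unreachable RS
        // The check did not finish, but we return in time and the
        // handshake can still proceed (e.g. in degraded mode).
        System.out.println(waitForConnectionCheck(neverCompletes, 400)); // prints false
    }
}
```

Whether proceeding without a completed connection check is safe for `ReplicationDomain`'s state is exactly the open question above.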