Are we sure that the emulator is not interfering with the newly promoted active and the client?
Found the cause of the issue. It's a minor flaw in the way inflight invalidations are handled in Ehcache, and the fix is not going to be trivial, as a proper fix might require more API support from the platform.
Currently, in Ehcache the client keeps track of the invalidations that it is waiting for. So if the active crashes while invalidations are in progress and the passive takes over, the client will send information about the invalidations it is waiting for in a ReconnectMessage. The newly promoted active, in its handleReconnect phase, gets this information and records that the client is waiting on certain invalidations to be completed. But the server does not try to complete those invalidations during this reconnect phase; it delays them until the next server store operation invocation happens.
This waiting until the next operation is what causes the problem in this test case, as no other client is doing any operation. So the only put operation that happened waits forever.
Now, waiting to send invalidations until the next invocation is clearly not ideal, and we cannot fire invalidations during the reconnect phase either. So there must be a phase in between, where all the clients are connected and ready but before any invocations are allowed, in which we can fire those pending invalidations. This would mean we'd need such an additional transition phase on the server side from the platform.
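A minimal sketch of the current flow described above (all names here are invented, not taken from the actual Ehcache server-side code), just to make the gap concrete:

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: invalidations reported during reconnect are merely recorded,
// and are completed lazily on the next invoke.
class ActiveStoreSketch {

  private final Map<String, Set<Long>> pendingInvalidations = new ConcurrentHashMap<>();

  // Reconnect phase: the reconnecting client reports the invalidation ids it is
  // still waiting on; the server records them but must not act on them here.
  void handleReconnect(String clientId, Set<Long> awaitedInvalidationIds) {
    pendingInvalidations.put(clientId, awaitedInvalidationIds);
  }

  // Invoke path: pending invalidations are only completed when some client
  // performs a server store operation. If no client ever invokes anything
  // (as in this test), the waiting put blocks forever.
  void invoke(String clientId, Runnable serverStoreOperation) {
    completePendingInvalidations();
    serverStoreOperation.run();
  }

  private void completePendingInvalidations() {
    pendingInvalidations.values().forEach(Set::clear);
  }
}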
This is a known issue: the entity tries to ascertain that the reconnect window is closed, which is currently marked by the very first invoke across concurrency keys for that entity. This requires API support from the platform, so that the closure of the reconnect window is pushed up to the entity layer, which can then perform the invalidations and clean up after clients that failed to reconnect.
Adding to the above comment: if the cache is reading or writing on other threads, this problem will not occur, as those invokes will trigger the invalidations.
I have filed Terracotta-OSS/terracotta-apis#183 to request this.
What about issuing an invoke on the client whenever the platform asks the EndpointDelegate to create extended reconnect data? Would that solve this problem more directly? The call sequence should go like this:
<<Client>>
EndpointDelegate.createExtendedReconnectData()
EntityClientEndpoint.beginInvoke().message(<<send invalidations to this client>>).invoke()
<<Server>>
ActiveServerEntity.handleReconnectData()
<<all clients reconnected or reconnect window times out>>
ActiveServerEntity.<<handle resent transactions>>
<<new invokes allowed to process>>
ActiveServerEntity.<<handle send invalidations message>>
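A rough client-side sketch of that suggestion, using simplified stand-in types rather than the real EndpointDelegate / EntityClientEndpoint API:

import java.nio.charset.StandardCharsets;

// Sketch only: while producing the extended reconnect data, the client also
// queues an invoke asking the new active to resend the invalidations this
// client is still waiting on. The platform would only deliver that invoke
// after the reconnect window closes and new invokes are allowed.
class ReconnectingDelegateSketch {

  interface InvokeSender {
    // Stand-in for EntityClientEndpoint.beginInvoke().message(...).invoke()
    void invoke(byte[] message);
  }

  private final InvokeSender endpoint;

  ReconnectingDelegateSketch(InvokeSender endpoint) {
    this.endpoint = endpoint;
  }

  // Stand-in for EndpointDelegate.createExtendedReconnectData()
  byte[] createExtendedReconnectData() {
    byte[] reconnectData = encodeAwaitedInvalidations();
    endpoint.invoke("send invalidations to this client".getBytes(StandardCharsets.UTF_8));
    return reconnectData;
  }

  private byte[] encodeAwaitedInvalidations() {
    return new byte[0]; // placeholder for encoding the awaited invalidation ids
  }
}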
@albinsuresh is this still an issue?
@albinsuresh what about now?
Had missed the last two comments. Will test it in the next couple of days and update.
@chrisdennis The code hasn't been changed in the way @myronkscott suggested. Inflight invalidations are still handled in the invoke path, and there are no dummy invokes added in the reconnect path.
There is now the ability in the ActiveServerEntity to know when the reconnect window closes.
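A small sketch of how an entity could use that notification (the callback name here is invented; whatever hook the platform API actually exposes for reconnect-window closure would take its place):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: flush invalidations recorded during reconnect once the platform
// signals that the reconnect window has closed.
class StoreEntitySketch {

  private final Set<Long> recordedDuringReconnect = ConcurrentHashMap.newKeySet();

  void handleReconnect(Set<Long> awaitedInvalidationIds) {
    recordedDuringReconnect.addAll(awaitedInvalidationIds);
  }

  // Invented name standing in for the platform's reconnect-window-closed signal.
  void reconnectWindowClosed() {
    recordedDuringReconnect.forEach(this::resendInvalidation);
    recordedDuringReconnect.clear();
  }

  private void resendInvalidation(long invalidationId) {
    // send the invalidation to the waiting client via the client communicator
  }
}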
@nnares this should be fixed. If you could validate that the original test case that Eugene submitted now passes, and then hopefully close this, that would be appreciated.
@chrisdennis I have validated this use case. Here is my observation:
/*
 *
 * 1. Created a clustered cache with 1 active and 1 passive server
 * 2. Created 2 cache manager clients
 * 3. Simulated a network partition to shut down traffic originating from the second client to the active (using Angela's disrupt())
 * 4. Started a put operation from the first client
 * 5. Shut down the active while the first client's invalidation request was still pending due to the partition
 * 6. Verified that the undelivered invalidation request is processed after failover
 *
 * */
Code:
// client-1
Cache<String, String> normalCache = cacheManagers.get(0).getCache("cache-1", String.class, String.class);
// client-2
Cache<String, String> disruptedCache = cacheManagers.get(1).getCache("cache-1", String.class, String.class);

// disrupt client-2
clientToServerDisruptor.disrupt();

// do a put from client-1 while client-2 is disrupted
ExecutorService executorService = Executors.newFixedThreadPool(1);
Future<Void> future = executorService.submit(() -> {
  normalCache.put("keyPostDisrupt", "valPostDisrupt");
  return null;
});

// fail over the server and un-disrupt client-2
tsa.stop(activeServer);
clientToServerDisruptor.undisrupt();

// wait for the async put to complete, then validate it from both clients
future.get();
MatcherAssert.assertThat(normalCache.get("keyPostDisrupt"), is("valPostDisrupt"));
MatcherAssert.assertThat(disruptedCache.get("keyPostDisrupt"), is("valPostDisrupt"));
Here I am providing a timeout of 20s (.write(Duration.ofSeconds(20))) for client-1's write operation, because server failover does not complete within the default timeout of 5s; with the default it failed with the exception below:
java.util.concurrent.TimeoutException: Timeout exceeded for GET_AND_APPEND message; PT5S
With the 20s timeout, server failover completes gracefully and the invalidations get processed.
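For reference, a minimal sketch of how that write timeout is supplied through the Ehcache 3 clustered builders (the cluster URI and the rest of the test configuration are assumed here and should be adapted to the actual test setup):

import java.net.URI;
import java.time.Duration;

import org.ehcache.CacheManager;
import org.ehcache.config.builders.CacheManagerBuilder;
import org.ehcache.clustered.client.config.builders.ClusteringServiceConfigurationBuilder;
import org.ehcache.clustered.client.config.builders.TimeoutsBuilder;

public class TimeoutConfigSketch {
  public static void main(String[] args) {
    CacheManager cacheManager = CacheManagerBuilder.newCacheManagerBuilder()
        .with(ClusteringServiceConfigurationBuilder
            .cluster(URI.create("terracotta://localhost:9410/clustered-cm")) // assumed URI
            .timeouts(TimeoutsBuilder.timeouts()
                .write(Duration.ofSeconds(20)) // allow the put to survive the failover window
                .build())
            .autoCreate())
        .build(true);
    // ... create the clustered cache and run the disrupt/failover scenario ...
    cacheManager.close();
  }
}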
The issue has been partially fixed; it works well with an extended timeout duration.
Well, the original bug reported, resending invalidations during reconnect after failover, is fixed. I will open a new issue for another suspect behavior with invalidations found during this testing.
It seems that any invalidation requests that have not yet been delivered to clients through the client communicator when the active goes down are forgotten. This causes mutation operations to hang even after failover to the passive. Here is the test case to reproduce it using a network partition.