ehcache / ehcache3

Ehcache 3.x line
http://www.ehcache.org
Apache License 2.0

Invalidation doesn't work when active goes down with undelivered invalidation requests. #1626

Closed: esebasti closed this issue 1 year ago

esebasti commented 7 years ago

It seems that any invalidation requests not yet delivered to clients through the client communicator when the active goes down are forgotten. This causes mutation operations to hang even after failover to the passive. Here is a test case that reproduces it using a network partition.

CLUSTER.getClusterControl().waitForActive();
CLUSTER.getClusterControl().waitForRunningPassivesInStandby();

//Create two cache manager clients connecting to same entity
int numClients = 2;
List<PersistentCacheManager> cacheManagers = new ArrayList<>();
List<Integer> clientPorts = new ArrayList<>();
int activePort = getActiveTsaPort();
for (int i = 0; i < numClients; ++i) {
  CacheManagerBuilder<PersistentCacheManager> builder = CacheManagerBuilder.newCacheManagerBuilder()
    .with(ClusteringServiceConfigurationBuilder.cluster(CLUSTER.getConnectionURI()
      .resolve("/crud-cm-replication"))
      .autoCreate()
      .defaultServerResource("primary-server-resource")
      .resourcePool("resource-pool-a", 128, MemoryUnit.MB))
    .withCache("cache1", CacheConfigurationBuilder.newCacheConfigurationBuilder(Integer.class, Integer.class,
      ResourcePoolsBuilder.newResourcePoolsBuilder()
        .with(ClusteredResourcePoolBuilder.clusteredDedicated("primary-server-resource", 128, MemoryUnit.MB)))
      .add(ClusteredStoreConfigurationBuilder.withConsistency(Consistency.STRONG)));
  cacheManagers.add(builder.build(true));
  //Find client port using netstat command
  clientPorts.add(getNewClientPort(activePort, clientPorts));
}

//this shuts down traffic flowing from current active to first client
IEmulator emulator = startUniDirectionalPartition(activePort, clientPorts.get(0));

//Initiate put from second client. Invalidation request to first client is queued up in server due to partition.
ExecutorService executorService = Executors.newFixedThreadPool(1);
Future<Void> future = executorService.submit(new Callable<Void>() {
  @Override
  public Void call() {
    cacheManagers.get(1).getCache("cache1", Integer.class, Integer.class).put(1, 1);
    return null;
  }
});

Thread.sleep(5000);
//Stop active with undelivered invalidation request to first client.
CLUSTER.getClusterControl().terminateActive();
//put hangs as invalidation request never reached first client even after failover.
future.get();
emulator.stop();

AbfrmBlr commented 7 years ago

Are we sure that the emulator is not interfering with the newly promoted active and the client?

albinsuresh commented 7 years ago

Found the cause of the issue. It is a minor flaw in the way inflight invalidations are handled in Ehcache, but the fix is not going to be trivial, as a proper fix might require more API support from the platform.

Currently in Ehcache, each client keeps track of the invalidations it is waiting for. So if the active crashes while invalidations are in progress and the passive takes over, the client sends the set of invalidations it is still waiting for in a ReconnectMessage. The newly promoted active, in its handleReconnect phase, receives this information and records that the client is waiting on certain invalidations to complete. But the server does not try to complete those invalidations during the reconnect phase; it delays them until the next server store operation invocation happens.
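
To make that flow concrete, here is a minimal, self-contained sketch of the client-side bookkeeping described above; the class and field names are illustrative only, not the actual Ehcache implementation:

import java.nio.ByteBuffer;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only -- not the real Ehcache classes.
class InflightInvalidationTracker {
  // invalidation ids this client is still waiting to see completed
  private final Set<Long> pendingInvalidationIds = ConcurrentHashMap.newKeySet();

  void awaitingInvalidation(long invalidationId) {
    pendingInvalidationIds.add(invalidationId);
  }

  void invalidationCompleted(long invalidationId) {
    pendingInvalidationIds.remove(invalidationId);
  }

  // encoded into the ReconnectMessage after failover, so the new active
  // learns which invalidations this client is still waiting on
  byte[] encodeIntoReconnectData() {
    ByteBuffer buffer = ByteBuffer.allocate(Long.BYTES * pendingInvalidationIds.size());
    pendingInvalidationIds.forEach(buffer::putLong);
    return buffer.array();
  }
}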

This waiting until the next operation is what causes the problem in this test case, since no other client performs any operation. So the single put operation that happened waits forever.

Now, waiting until the next invocation to send invalidations is clearly not ideal, and we cannot fire invalidations during the reconnect phase either. So there must be a phase in between, where all the clients are connected and ready but before any invocations are allowed, in which we can fire those pending invalidations. This means we would need the platform to expose such an additional transition phase on the server side.
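
As a rough sketch of that idea, assuming a hypothetical reconnectComplete() callback from the platform (exactly the API support this comment says is missing), the server entity could flush the recorded invalidations there instead of on the next invoke:

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical server-side flow; reconnectComplete() stands in for the
// platform callback requested above and is not a real API.
class ServerStoreEntitySketch {
  private final Map<String, Set<Long>> invalidationsToResend = new ConcurrentHashMap<>();

  // reconnect phase: record what each returning client is still waiting on
  void handleReconnect(String clientId, Set<Long> pendingInvalidationIds) {
    invalidationsToResend.put(clientId, pendingInvalidationIds);
  }

  // desired intermediate phase: all clients reconnected, invokes not yet allowed
  void reconnectComplete() {
    invalidationsToResend.forEach((clientId, ids) ->
        ids.forEach(id -> sendInvalidation(clientId, id)));
    invalidationsToResend.clear();
  }

  private void sendInvalidation(String clientId, long invalidationId) {
    // would go through the client communicator in the real entity
  }
}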

AbfrmBlr commented 7 years ago

This is a known issue: the entity tries to ascertain that the reconnect window is closed, which is marked by the very first invoke across the concurrency keys for that entity. This requires API support from the platform, so that the information about the closure of the reconnect window is pushed up to the entity layer and it can perform invalidations and clean up after clients that failed to reconnect.

AbfrmBlr commented 7 years ago

Adding to the above comment: even if the cache is being read or written on other threads, this problem will not occur, as those invokes will trigger the invalidations.

ljacomet commented 7 years ago

I have filed Terracotta-OSS/terracotta-apis#183 to request this

myronkscott commented 7 years ago

What about issuing an invoke on the client whenever the platform asks the EndpointDelegate to create extended reconnect data? Would that solve this problem more directly? The call sequence should go like this

<<Client>>
EndpointDelegate.createExtendedReconnectData()
EntityClientEndpoint.beginInvoke().message(<<send invalidations to this client>>).invoke()  

<<Server>>
ActiveServerEntity.handleReconnectData()
<<all clients reconnected or reconnect window times out>>
ActiveServerEntity.<<handle resent transactions>>
<<new invokes allowed to process>>
ActiveServerEntity.<<handle send invalidations message>>
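
A hedged, self-contained sketch of the client half of that sequence; the interfaces below are simplified stand-ins for the org.terracotta.entity types named above, with generics and the real Ehcache message types elided:

// Simplified stand-ins for the real platform interfaces named in the sequence above.
interface InvocationBuilder { InvocationBuilder message(byte[] message); void invoke(); }
interface EntityClientEndpoint { InvocationBuilder beginInvoke(); }
interface EndpointDelegate { byte[] createExtendedReconnectData(); }

class ReconnectingDelegate implements EndpointDelegate {
  private final EntityClientEndpoint endpoint;
  private final PendingInvalidations pending; // hypothetical holder of pending invalidation ids

  ReconnectingDelegate(EntityClientEndpoint endpoint, PendingInvalidations pending) {
    this.endpoint = endpoint;
    this.pending = pending;
  }

  @Override
  public byte[] createExtendedReconnectData() {
    // 1. hand the pending invalidation ids to the new active as reconnect data
    byte[] reconnectData = pending.encode();
    // 2. per the suggestion above, also queue an invoke; the server processes it
    //    only after the reconnect window closes, and can flush invalidations then
    endpoint.beginInvoke().message(reconnectData).invoke();
    return reconnectData;
  }

  interface PendingInvalidations { byte[] encode(); }
}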

chrisdennis commented 6 years ago

@albinsuresh is this still an issue?

chrisdennis commented 5 years ago

@albinsuresh what about now?

albinsuresh commented 5 years ago

Had missed the last two comments. Will test it in the next couple of days and update.

albinsuresh commented 5 years ago

@chrisdennis The code hasn't been changed in the way @myronkscott suggested. Inflight invalidations are still handled in the invoke path, and no dummy invokes were added in the reconnect path.

myronkscott commented 5 years ago

There is now the ability in ActiveServerEntity to know when the reconnect window closes:

https://github.com/Terracotta-OSS/terracotta-apis/blob/master/entity-server-api/src/main/java/org/terracotta/entity/ActiveServerEntity.java#L121

chrisdennis commented 2 years ago

@nnares this should be fixed. If you could validate that the original test case that Eugene submitted now passes, and then hopefully close this, that would be appreciated.

nnares commented 1 year ago

@chrisdennis I have validated this use case. Here are my observations:

    /*
     *
     *  1. Created a clustered cache with 1 active, 1 passive server
     *  2. Created 2 cache manager clients
     *  3. Simulated a network partition to shut down traffic originating from the second client to the active (using angela's disrupt())
     *  4. Started a put operation from the first client
     *  5. Shut down the active while the invalidation request triggered by the first client's put was still pending (undeliverable to the disrupted second client) due to the partition
     *  6. Verified the undelivered invalidation request is processed after failover
     *
     * */

Code:
// client-1
Cache<String, String> normalCache = cacheManagers.get(0).getCache("cache-1", String.class, String.class);
// client-2
Cache<String, String> disruptedCache = cacheManagers.get(1).getCache("cache-1", String.class, String.class);

// client-2 got disrupted
clientToServerDisruptor.disrupt();

// doing a put, when one client is disrupted
ExecutorService executorService = Executors.newFixedThreadPool(1);
Future<Void> future = executorService.submit(() -> {
    normalCache.put("keyPostDisrupt", "valPostDisrupt");
    return null;
});

// doing server failover & client-2 undisruption
tsa.stop(activeServer);
clientToServerDisruptor.undisrupt();

// validating the put operation completion
MatcherAssert.assertThat(normalCache.get("keyPostDisrupt"), is("valPostDisrupt"));
MatcherAssert.assertThat(disruptedCache.get("keyPostDisrupt"), is("valPostDisrupt"));

Here I am providing a timeout of 20s (.write(Duration.ofSeconds(20))) for client-1's write operation, because server failover does not complete within the default timeout of 5s; with the default it fails with the exception below:

java.util.concurrent.TimeoutException: Timeout exceeded for GET_AND_APPEND message; PT5S
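
For reference, a minimal sketch of how that 20s write timeout would be configured, assuming the TimeoutsBuilder from the clustered client config builders; the URI and resource names are placeholders taken from the earlier test code, and imports are elided as in the other snippets in this thread:

CacheManagerBuilder<PersistentCacheManager> builder = CacheManagerBuilder.newCacheManagerBuilder()
  .with(ClusteringServiceConfigurationBuilder.cluster(CLUSTER.getConnectionURI().resolve("/crud-cm-replication"))
    // give the put enough time to ride out the failover (default write timeout is 5s)
    .timeouts(TimeoutsBuilder.timeouts().write(Duration.ofSeconds(20)).build())
    .autoCreate()
    .defaultServerResource("primary-server-resource"));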

With the 20s timeout, server failover completes gracefully and the invalidations are processed.

The issue has been partially fixed; it works well with an extended timeout duration.

AbfrmBlr commented 1 year ago

Well, the original bug reported here, resending invalidations after failover during reconnect, is fixed. I will open a new issue for another suspected invalidation behavior found during this testing.