The listener may be called quite some time later, but it retains the original ClusterInfo (potentially 10s of MiBs) for its computations. Could we discard that and use a more recent ClusterInfo?
That's not all that bad, because we only ever have one such task pending. The worst bit is that this listener takes the reroutedClusterState, which means the BatchedRerouteService captures the newState here ...
... and keeps hold of it until the reroute completes (i.e. we converge on a new balance and start to reconcile it) which could be quite some time later. That means that every other batched reroute also retains the state until reconciliation starts, which could represent a rather large amount of heap.
The
DiskThresholdMonitor
enqueues a batched reroute if the watermarks situation in the cluster changes:https://github.com/elastic/elasticsearch/blob/b4712a5c705c9aa44ca32e41e9e0ba513548ddad/server/src/main/java/org/elasticsearch/cluster/routing/allocation/DiskThresholdMonitor.java#L317-L363
The listener may be called quite some time later, but it retains the original
ClusterInfo
(potentially 10s of MiBs) for its computations. Could we discard that and use a more recentClusterInfo
?That's not all that bad, because we only ever have one such task pending. The worst bit is that this listener takes the
reroutedClusterState
, which means theBatchedRerouteService
captures thenewState
here ...https://github.com/elastic/elasticsearch/blob/405a53b3c37699ac53f4eada855a5a3a14783bbe/server/src/main/java/org/elasticsearch/cluster/routing/BatchedRerouteService.java#L154-L156
... and keeps hold of it until the reroute completes (i.e. we converge on a new balance and start to reconcile it) which could be quite some time later. That means that every other batched reroute also retains the state until reconciliation starts, which could represent a rather large amount of heap.