Autoscaling overreacts if a shard gets larger than the disk watermark

Elasticsearch Version

8.10.4

Installed Plugins

none

Java Version

bundled

OS Version

N/A

Problem Description

We encountered an unusual situation in which autoscaling kept adding nodes to the hot tier until it reached the configured limit, even though adding nodes could not help. The cause was two extremely large shards, each of which eventually grew larger than the disk space available below the low watermark on a hot-tier node. These nodes were already at the maximum size, so they couldn't be scaled up any further.

Here’s what happened:

1. There were a couple of shards, each around 700GB in size, which had ballooned to that size because rollover failed: a custom role did not have the correct permissions for the ILM rollover action.
2. The node holding one of these shards filled up and breached the high watermark.
3. Autoscaling didn’t take action prior to this point because, overall, the cluster had space.
4. After the high watermark breach, the cluster attempted to move the huge shard off the node, because that’s what it does beyond the high watermark.
5. There was no place in the cluster that the huge shard could go, because it was almost as big as each hot node.
6. Because the shard couldn’t go anywhere, this finally did trigger autoscaling, which added an empty node.
7. The huge shard was moved over to the empty node.
8. The other huge shard then had the same problem on another node, so autoscaling triggered again.
9. The second huge shard was moved to its new empty node.
10. By now, however, these huge shards had grown so big that they couldn’t even be allocated to new empty nodes, because doing so would breach the low watermark.
11. Autoscaling added nodes again and again until the autoscaling maximums were reached.
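
The runaway described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Elasticsearch's actual decider logic: the 85% figure is the default low watermark, while the 1000GB node disk size, the 900GB shard size, the node counts, and the function names are all hypothetical values chosen for illustration.

```python
# Simplified model: a shard can be allocated to an EMPTY node only if the
# node would still be at or below the low watermark afterwards.
LOW_WATERMARK = 0.85  # Elasticsearch default; everything else here is hypothetical


def fits_on_empty_node(shard_gb: float, node_disk_gb: float,
                       low_watermark: float = LOW_WATERMARK) -> bool:
    """Return True if an empty node of this size could accept the shard."""
    return shard_gb <= node_disk_gb * low_watermark


def simulate_scale_out(shard_gb: float, node_disk_gb: float,
                       current_nodes: int, max_nodes: int) -> int:
    """Naive reaction: add one empty node each time the shard is unassignable.

    Returns the node count once the shard has found a home or the
    autoscaling maximum has been reached.
    """
    nodes = current_nodes
    while nodes < max_nodes:
        nodes += 1  # autoscaling adds one empty node
        if fits_on_empty_node(shard_gb, node_disk_gb):
            break  # the shard relocates and the pressure is relieved
    return nodes


if __name__ == "__main__":
    # A 700GB shard fits on an empty (hypothetical) 1000GB node: one node is added.
    print(simulate_scale_out(700, 1000, current_nodes=3, max_nodes=10))
    # A 900GB shard never fits, so nodes are added until the maximum is hit.
    print(simulate_scale_out(900, 1000, current_nodes=3, max_nodes=10))
```

In this model, once the shard exceeds what an empty node can hold below the low watermark, no number of additional nodes changes the outcome.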

Although autoscaling behaved as designed given the circumstances, we consider this a bug because a customer would not expect autoscaling to behave in this manner. In a situation where the consequence is a scale-up limited only by the configured maximum, the system should notice the trap and deliberately not scale.
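
One possible shape for such a guard is sketched below. It is a hypothetical illustration, not a proposal for the actual Elasticsearch implementation: `ScaleDecision`, `decide_scale_out`, and their parameters are invented names, and the check assumes the tier is already at its maximum node size and that 85% is the low watermark in effect.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScaleDecision:
    scale_out: bool
    reason: str


def decide_scale_out(unassignable_shard_sizes_gb: List[float],
                     max_node_disk_gb: float,
                     low_watermark: float = 0.85) -> ScaleDecision:
    """Refuse to add nodes when no empty node could ever hold the shard.

    Hypothetical guard: if any shard that cannot be placed is larger than
    the space an empty, maximum-size node offers below the low watermark,
    adding nodes cannot help, so the decision is to not scale.
    """
    usable_gb = max_node_disk_gb * low_watermark
    oversized = [s for s in unassignable_shard_sizes_gb if s > usable_gb]
    if oversized:
        return ScaleDecision(
            False,
            f"{len(oversized)} shard(s) exceed the {usable_gb:.0f}GB usable "
            "on an empty node; adding nodes cannot help",
        )
    if unassignable_shard_sizes_gb:
        return ScaleDecision(True, "unassignable shards would fit on a new node")
    return ScaleDecision(False, "no unassignable shards")


if __name__ == "__main__":
    print(decide_scale_out([900.0], max_node_disk_gb=1000.0))
    print(decide_scale_out([700.0], max_node_disk_gb=1000.0))
```

A guard like this would ideally also surface the reason to the operator, since the underlying problem (here, the failed rollover) still needs fixing.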

Steps to Reproduce

This bug is reproducible in ESS as described in the steps above.

Logs (if relevant)

N/A
