elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Force Merge should check disk space before attempting a merge to a single segment #31924

Open · markharwood opened 6 years ago

markharwood commented 6 years ago

We know that a force merge down to a single segment will cost roughly 2x the current index size in disk space, because it requires the old and new index to co-exist on disk (briefly) before swapping over to the new index. We should introduce a disk space check before attempting such a merge and return an error if the operation is estimated to tip us over the high watermark threshold for allocating shards to the current node. This would save the heavy cost of attempting an impossible task and prevent the dangerous scenario of running out of disk space.
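
A minimal sketch of the kind of check being proposed might look like the following. This is illustrative only, not the actual Elasticsearch implementation; the class, method, and parameter names are made up, and the assumption that the merge temporarily needs roughly one extra copy of the shard on disk follows the 2x estimate above.

```java
/** Illustrative sketch only; not actual Elasticsearch code. */
final class ForceMergeDiskCheck {

    /**
     * Fail fast if merging a shard copy down to a single segment is likely to push its node
     * past the high disk watermark. All names and the sizing assumption are hypothetical.
     */
    static void ensureEnoughDisk(long shardSizeBytes, long freeBytes, long totalBytes,
                                 double highWatermarkRatio) {
        // While the merge runs, the old and new segments briefly co-exist, so assume the shard
        // temporarily needs up to roughly its current on-disk size in extra space.
        long estimatedFreeAfter = freeBytes - shardSizeBytes;
        double usedRatioAfter = 1.0 - ((double) estimatedFreeAfter / totalBytes);
        if (estimatedFreeAfter <= 0 || usedRatioAfter >= highWatermarkRatio) {
            throw new IllegalStateException(
                "force merge to a single segment is estimated to exceed the high disk watermark");
        }
    }
}
```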

elasticmachine commented 6 years ago

Pinging @elastic/es-core-infra

geekpete commented 4 years ago

It would probably be good to check continually throughout the force merge process if possible, rather than only at the start, in case conditions change while it's merging after the initial calculation has been done. This is probably only useful if there's a mechanism to pause or cancel the force merge when the disk space check fails part way through; I'm unsure whether we have any existing internal way to pause/cancel a force merge.
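
For illustration, a periodic re-check along those lines could be sketched roughly as below. The types and the cancel hook are hypothetical, since (as noted) it isn't clear Elasticsearch currently exposes a way to pause or cancel an in-flight force merge.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Illustrative sketch only; not actual Elasticsearch code. */
final class PeriodicDiskCheck {

    /** Re-run a disk space check every 30 seconds and invoke a cancel hook if it fails. */
    static ScheduledExecutorService watch(BooleanSupplier diskCheckPasses, Runnable cancelMerge) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            if (!diskCheckPasses.getAsBoolean()) {
                cancelMerge.run();     // assumes some mechanism to abort the merge exists
                scheduler.shutdown();
            }
        }, 30, 30, TimeUnit.SECONDS);
        return scheduler;
    }
}
```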

remd commented 2 years ago

We know that a force merge down to a single segment will cost roughly 2x the current index size in disk space, because it requires the old and new index to co-exist on disk (briefly) before swapping over to the new index.

Does that 2x disk space cost include store.size (primary shards + replica shards) or just the pri.store.size (primary shards) of the index being forcemerged?

elasticsearchmachine commented 2 years ago

Pinging @elastic/es-data-management (Team:Data Management)

kunisen commented 1 year ago

Seems to be related: https://github.com/elastic/elasticsearch/issues/88606

geekpete commented 1 year ago

@remd A force merge will occur on replicas as well, which means the 2x cost includes replicas, not only the primary store size. There's another issue about whether/how force merge on replicas might be better optimized: https://github.com/elastic/elasticsearch/issues/75478

srri commented 1 month ago

Just writing in as a customer here (Elastic cloud case 01673285).

I believe we just ran into this case in production (running v8.13.3).

We have a cluster with 3 hot nodes and 1 frozen node. Each node was configured with a max capacity of 640 GB. Before the incident started, the utilization was as follows:

instance-021: 59% (377.6 GB used of 640 GB)
instance-024: 69% (441.6 GB used of 640 GB)
instance-023: 87% (556.8 GB used of 640 GB)

At 17:01 UTC we see:

Aug 13, 2024, 5:01:05 PM UTC | INFO | i23@eu-central-1a | [instance-0000000023] moving index [.ds-logs-events-<redacted>-2024.07.31-000271] from [{"phase":"frozen","action":"searchable_snapshot","name":"branch-check-existing-snapshot"}] to [{"phase":"frozen","action":"searchable_snapshot","name":"forcemerge"}] in policy [lifecycle-policy-events]

According to the index API, the index in question at the time had a size of 160.25 GB. The index has 3 shards w/ 1 replica.

        "total_data_set_size_in_bytes": 160259664406,

At 17:07 UTC we see:

[instance-0000000023] high disk watermark [90%] exceeded on [HQQozT-eRSWxR5IlO0Xexw][instance-0000000023][/app/data] free: 63.2gb[9.8%], shards will be relocated away from this node; currently relocating away shards totalling [53759210748] bytes; the node is expected to be below the high disk watermark when these relocations are complete

And we see further log lines showing disk space still being eaten up despite the change to read-only indices (kaboom for our service). Auto-scaling kicked in when instance-23 hit 90% or so utilization, but it wasn't fast enough:

Aug 13, 2024, 5:21:18 PM UTC | WARN | i23@eu-central-1a | [instance-0000000023] flood stage disk watermark [95%] exceeded on [HQQozT-eRSWxR5IlO0Xexw][instance-0000000023][/app/data] free: 30.9gb[4.8%], all indices on this node will be marked read-only
Aug 13, 2024, 5:21:48 PM UTC | WARN | i23@eu-central-1a | [instance-0000000023] flood stage disk watermark [95%] exceeded on [HQQozT-eRSWxR5IlO0Xexw][instance-0000000023][/app/data] free: 29.3gb[4.5%], all indices on this node will be marked read-only
Aug 13, 2024, 5:22:18 PM UTC | WARN | i23@eu-central-1a | [instance-0000000023] flood stage disk watermark [95%] exceeded on [HQQozT-eRSWxR5IlO0Xexw][instance-0000000023][/app/data] free: 27.7gb[4.3%], all indices on this node will be marked read-only
Aug 13, 2024, 5:22:48 PM UTC | WARN | i23@eu-central-1a | [instance-0000000023] flood stage disk watermark [95%] exceeded on [HQQozT-eRSWxR5IlO0Xexw][instance-0000000023][/app/data] free: 26.2gb[4%], all indices on this node will be marked read-only

These logs continue until it fails:

Aug 13, 2024, 5:28:49 PM UTC | WARN | i23@eu-central-1a | [instance-0000000023] flood stage disk watermark [95%] exceeded on [HQQozT-eRSWxR5IlO0Xexw][instance-0000000023][/app/data] free: 7.3gb[1.1%], all indices on this node will be marked read-only

This is followed by the error below, though I suspect it may be due to auto-scaling killing nodes at this point:

[instance-0000000023] policy [lifecycle-policy-events] for index [.ds-logs-events-<redacted>-2024.07.31-000271] failed on step [{"phase":"frozen","action":"searchable_snapshot","name":"forcemerge"}]. Moving to ERROR step org.elasticsearch.ElasticsearchException: index [.ds-logs-events-<redacted>-2024.07.31-000271] in policy [lifecycle-policy-events] encountered failures [{"shard":0,"index":".ds-logs-events-<redacted>-2024.07.31-000271","status":"INTERNAL_SERVER_ERROR","reason":{"type":"failed_node_exception","reason":"Failed node [T1hJY3GeQAWi4T_IMBcDxw]","node_id":"T1hJY3GeQAWi4T_IMBcDxw","caused_by":{"type":"node_disconnected_exception","reason":"[instance-0000000021][10.11.160.14:19144][indices:admin/forcemerge[n]] disconnected"}}},{"shard":2,"index":".ds-logs-events-<redacted>-2024.07.31-000271","status":"INTERNAL_SERVER_ERROR","reason":{"type":"failed_node_exception","reason":"Failed node [T1hJY3GeQAWi4T_IMBcDxw]","node_id":"T1hJY3GeQAWi4T_IMBcDxw","caused_by":{"type":"node_disconnected_exception","reason":"[instance-0000000021][10.11.160.14:19144][indices:admin/forcemerge[n]] disconnected"}}},{"shard":1,"index":".ds-logs-events-<redacted>-2024.07.31-000271","status":"INTERNAL_SERVER_ERROR","reason":{"type":"failed_node_exception","reason":"Failed node [qejCwzcXSLCQKgYTcJ23Zw]","node_id":"qejCwzcXSLCQKgYTcJ23Zw","caused_by":{"type":"node_disconnected_exception","reason":"[instance-0000000024][10.11.156.235:19957][indices:admin/forcemerge[n]] disconnected"}}},{"shard":2,"index":".ds-logs-events-<redacted>-2024.07.31-000271","status":"INTERNAL_SERVER_ERROR","reason":{"type":"failed_node_exception","reason":"Failed node [qejCwzcXSLCQKgYTcJ23Zw]","node_id":"qejCwzcXSLCQKgYTcJ23Zw","caused_by":{"type":"node_disconnected_exception","reason":"[instance-0000000024][10.11.156.235:19957][indices:admin/forcemerge[n]] disconnected"}}},{"shard":0,"index":".ds-logs-events-<redacted>-2024.07.31-000271","status":"INTERNAL_SERVER_ERROR","reason":{"type":"failed_node_exception","reason":"Failed node [HQQozT-eRSWxR5IlO0Xexw]","node_id":"HQQozT-eRSWxR5IlO0Xexw","caused_by":{"type":"node_closed_exception","reason":"node closed {instance-0000000023}{HQQozT-eRSWxR5IlO0Xexw}{0SK8PCV0R1ig97aiPbgYKQ}{instance-0000000023}{10.11.137.237}{10.11.137.237:19433}{himrst}{8.13.3}{7000099-8503000}{region=eu-central-1, instance_configuration=aws.es.datahot.i3en, availability_zone=eu-central-1a, logical_availability_zone=zone-1, data=hot, ml.config_version=12.0.0, xpack.installed=true, transform.config_version=10.0.0, server_name=instance-0000000023.e736726bf17743dc9b58f40a26f893ec}"}}},{"shard":1,"index":".ds-logs-events-<redacted>-2024.07.31-000271","status":"INTERNAL_SERVER_ERROR","reason":{"type":"failed_node_exception","reason":"Failed node [HQQozT-eRSWxR5IlO0Xexw]","node_id":"HQQozT-eRSWxR5IlO0Xexw","caused_by":{"type":"node_closed_exception","reason":"node closed {instance-0000000023}{HQQozT-eRSWxR5IlO0Xexw}{0SK8PCV0R1ig97aiPbgYKQ}{instance-0000000023}{10.11.137.237}{10.11.137.237:19433}{himrst}{8.13.3}{7000099-8503000}{region=eu-central-1, instance_configuration=aws.es.datahot.i3en, availability_zone=eu-central-1a, logical_availability_zone=zone-1, data=hot, ml.config_version=12.0.0, xpack.installed=true, transform.config_version=10.0.0, server_name=instance-0000000023.e736726bf17743dc9b58f40a26f893ec}"}}}] on step [forcemerge] at org.elasticsearch.xpack.core.ilm.ForceMergeStep.lambda$performAction$0(ForceMergeStep.java:80) ~[?:?] 
at org.elasticsearch.action.ActionListenerImplementations$ResponseWrappingActionListener.onResponse(ActionListenerImplementations.java:245) ~[elasticsearch-8.13.3.jar:?]
at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:32) ~[elasticsearch-8.13.3.jar:?]
at org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:202) ~[elasticsearch-8.13.3.jar:?]
at org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:196) ~[elasticsearch-8.13.3.jar:?]
at org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:307) ~[elasticsearch-8.13.3.jar:?]
at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:32) ~[elasticsearch-8.13.3.jar:?]
at org.elasticsearch.action.ActionListenerImplementations$MappedActionListener.onResponse(ActionListenerImplementations.java:95) ~[elasticsearch-8.13.3.jar:?]
at org.elasticsearch.action.support.SubscribableListener$SuccessResult.complete(SubscribableListener.java:366) ~[elasticsearch-8.13.3.jar:?]
at org.elasticsearch.action.support.SubscribableListener.tryComplete(SubscribableListener.java:286) ~[elasticsearch-8.13.3.jar:?]
at org.elasticsearch.action.support.SubscribableListener.addListener(SubscribableListener.java:189) ~[elasticsearch-8.13.3.jar:?]
at org.elasticsearch.action.support.SubscribableListener.addListener(SubscribableListener.java:165) ~[elasticsearch-8.13.3.jar:?]
at org.elasticsearch.action.support.CancellableFanOut$SubtasksCompletionHandler.run(CancellableFanOut.java:201) ~[elasticsearch-8.13.3.jar:?]
at org.elasticsearch.core.AbstractRefCounted$1.closeInternal(AbstractRefCounted.java:118) ~[elasticsearch-core-8.13.3.jar:?]
at org.elasticsearch.core.AbstractRefCounted.decRef(AbstractRefCounted.java:70) ~[elasticsearch-core-8.13.3.jar:?]
at org.elasticsearch.action.support.RefCountingRunnable.close(RefCountingRunnable.java:112) ~[elasticsearch-8.13.3.jar:?]
at org.elasticsearch.core.Releasables.close(Releasables.java:34) ~[elasticsearch-core-8.13.3.jar:?]
at org.elasticsearch.core.Releasables.closeExpectNoException(Releasables.java:56) ~[elasticsearch-core-8.13.3.jar:?]
at org.elasticsearch.action.ActionListenerImplementations$2.run(ActionListenerImplementations.java:49) ~[elasticsearch-8.13.3.jar:?]
at org.elasticsearch.action.ActionListenerImplementations$RunAfterActionListener.onFailure(ActionListenerImplementations.java:280) ~[elasticsearch-8.13.3.jar:?]
at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:53) ~[elasticsearch-8.13.3.jar:?]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1475) ~[elasticsearch-8.13.3.jar:?]
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:393) ~[elasticsearch-8.13.3.jar:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984) ~[elasticsearch-8.13.3.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.13.3.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.lang.Thread.run(Thread.java:1583) ~[?:?]

And the exit:

CODE=143
++ date
+ WHEN='Tue Aug 13 17:29:43 UTC 2024'
+ echo 'Error exitcode 143 at Tue Aug 13 17:29:43 UTC 2024'
Error exitcode 143 at Tue Aug 13 17:29:43 UTC 2024
+ '[' 1 -eq 0 ']'
+ case ${CODE} in
+ echo 'No idea what to do with error code 143'
No idea what to do with error code 143
++ find /app/heap_dumps/ -maxdepth 1 -type f -regex '.*.hprof'
+ HEAP_DUMPS=
+ '[' -n '' ']'
++ date
+ echo 'exiting on-error-exitcode.sh at Tue Aug 13 17:29:43 UTC 2024'
exiting on-error-exitcode.sh at Tue Aug 13 17:29:43 UTC 2024
+ exit 0
Aug 13, 2024, 5:29:43 PM UTC

Also attached is a graph of our disk usage from the console (10:00 Pacific = 17:00 UTC); you can see the sharp uptick followed by the chaos of auto-scaling.

[Screenshot: disk usage graph from the console, 2024-08-13 14:36:56]

srri commented 1 month ago

cc @DaveCTurner, would appreciate any input or updates re: https://github.com/elastic/elasticsearch/issues/88606. Thanks!