elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[CI] DiskThresholdDeciderIT testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards failing #105331

Open · iverase opened 9 months ago

iverase commented 9 months ago

Build scan: https://gradle-enterprise.elastic.co/s/kxdqmnytyenuq/tests/:server:internalClusterTest/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT/testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards

Reproduction line:

./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards" -Dtests.seed=ED5C019BB52E4B53 -Dtests.locale=is -Dtests.timezone=America/Anchorage -Druntime.java=21

Applicable branches: main

Reproduces locally?: No

Failure history: Failure dashboard for org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT#testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards

Failure excerpt:

java.lang.Exception: Test abandoned because suite timeout was reached.

  at __randomizedtesting.SeedInfo.seed([ED5C019BB52E4B53]:0)
elasticsearchmachine commented 9 months ago

Pinging @elastic/es-distributed (Team:Distributed)

idegtiarenko commented 9 months ago

Oh, interesting, according to:

"TEST-DiskThresholdDeciderIT.testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards-seed#[ED5C019BB52E4B53]" ID=2402 WAITING on org.elasticsearch.action.support.PlainActionFuture$Sync@dcadc57
    at java.base@21.0.2/jdk.internal.misc.Unsafe.park(Native Method)
    - waiting on org.elasticsearch.action.support.PlainActionFuture$Sync@dcadc57
    at java.base@21.0.2/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)
    at java.base@21.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
    at java.base@21.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1099)
    at app//org.elasticsearch.action.support.PlainActionFuture$Sync.get(PlainActionFuture.java:273)
    at app//org.elasticsearch.action.support.PlainActionFuture.get(PlainActionFuture.java:93)
    at app//org.elasticsearch.client.internal.support.AbstractClient$RefCountedFuture.get(AbstractClient.java:1534)
    at app//org.elasticsearch.client.internal.support.AbstractClient$RefCountedFuture.get(AbstractClient.java:1514)
    at app//org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:45)
    at app//org.elasticsearch.action.support.PlainActionFuture.actionGet(PlainActionFuture.java:152)
    at app//org.elasticsearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:43)
    at app//org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards(DiskThresholdDeciderIT.java:204)
    at java.base@21.0.2/java.lang.invoke.LambdaForm$DMH/0x00007f2a3c300000.invokeVirtual(LambdaForm$DMH)
    at java.base@21.0.2/java.lang.invoke.LambdaForm$MH/0x00007f2a3c0a0800.invoke(LambdaForm$MH)

it was stuck waiting at this line:

https://github.com/elastic/elasticsearch/blob/a7f2e2daaf03ca8253bdd6bda9163ec03fb2b2d2/server/src/internalClusterTest/java/org/elasticsearch/cluster/routing/allocation/decider/DiskThresholdDeciderIT.java#L202-L204
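For context, the thread dump above bottoms out in ActionRequestBuilder.get(), i.e. the test thread is parked on a PlainActionFuture waiting for a response. A minimal sketch of that blocking pattern (hypothetical statement and builder names, not the exact code at the linked lines):

// The builder's get() blocks the calling thread on a PlainActionFuture until
// the response arrives; if the awaited condition never materializes, the suite
// timeout fires first, producing the "Test abandoned" failure above.
RestoreSnapshotResponse response = clusterAdmin()
    .prepareRestoreSnapshot("repo", "snap")   // assumed call shape for this test
    .setWaitForCompletion(true)
    .get();                                   // parks in PlainActionFuture$Sync.get()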

idegtiarenko commented 9 months ago

Output also contains:

WARNING: Uncaught exception in thread: Thread[#2574,elasticsearch[node_t0][clusterApplierService#updateTask][T#1],5,TGRP-DiskThresholdDeciderIT]
java.lang.AssertionError: 
Expected: <0>
     but: was <1>
    at __randomizedtesting.SeedInfo.seed([ED5C019BB52E4B53]:0)
    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
    at org.elasticsearch.test.ESTestCase.assertThat(ESTestCase.java:2119)
    at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.lambda$testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards$2(DiskThresholdDeciderIT.java:176)
    at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListener(ClusterApplierService.java:560)
    at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListeners(ClusterApplierService.java:546)
    at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:505)
    at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:429)
    at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:154)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:217)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:183)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)

so the actual failure happened at:

https://github.com/elastic/elasticsearch/blob/a7f2e2daaf03ca8253bdd6bda9163ec03fb2b2d2/server/src/internalClusterTest/java/org/elasticsearch/cluster/routing/allocation/decider/DiskThresholdDeciderIT.java#L173-L178
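That lambda runs as a cluster state listener (see callClusterStateListener in the trace). A rough sketch of its likely shape, assuming the assertion message shown in a later comment ("Expects no relocating shards"); this is illustrative, not the exact test code:

// Assumed shape: the test registers a ClusterStateListener that asserts, on
// every applied cluster state, that no shard is ever RELOCATING.
// "Expected: <0> but: was <1>" means one shard was relocating when it fired.
// clusterService: an assumed handle to a node's ClusterService.
clusterService.addListener(event -> {
    var routingNodes = event.state().getRoutingNodes();
    assertThat(
        "Expects no relocating shards but got: " + routingNodes,
        routingNodes.shardsWithState(ShardRoutingState.RELOCATING).size(),
        equalTo(0)
    );
});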

idegtiarenko commented 8 months ago

According to the recent failure info, the test created the following shards:

ShardSize[shardId=[btydusdduwrc][3], size=16919]
ShardSize[shardId=[btydusdduwrc][0], size=17886]
ShardSize[shardId=[btydusdduwrc][5], size=17972]
ShardSize[shardId=[btydusdduwrc][1], size=18195]
ShardSize[shardId=[btydusdduwrc][4], size=18198]
ShardSize[shardId=[btydusdduwrc][2], size=18472]

During the first allocation, only 5 shards had a computed size and were allocated accordingly:

[2024-03-03T17:22:30,358][TRACE][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Reconciling desired balance: DesiredBalance[lastConvergedIndex=13, assignments={
[btydusdduwrc][0]=ShardAssignment[nodeIds=[CTSLnDyTQj2xDmatjmYTtQ], total=1, unassigned=0, ignored=0],
[btydusdduwrc][1]=ShardAssignment[nodeIds=[suT6FRFCTBiMDfiBCAU_nQ], total=1, unassigned=0, ignored=0],
[btydusdduwrc][4]=ShardAssignment[nodeIds=[suT6FRFCTBiMDfiBCAU_nQ], total=1, unassigned=0, ignored=0],
[btydusdduwrc][5]=ShardAssignment[nodeIds=[suT6FRFCTBiMDfiBCAU_nQ], total=1, unassigned=0, ignored=0],
[btydusdduwrc][2]=ShardAssignment[nodeIds=[suT6FRFCTBiMDfiBCAU_nQ], total=1, unassigned=0, ignored=0],
[btydusdduwrc][3]=ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1]
}]

Notice that the smallest one is ignored, possibly because its size had not yet been fetched from the repository.

Later, the shard balance was computed as follows:

[2024-03-03T17:22:30,400][TRACE][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Desired balance updated: DesiredBalance[lastConvergedIndex=14, assignments={[btydusdduwrc][0]=ShardAssignment[nodeIds=[CTSLnDyTQj2xDmatjmYTtQ], total=1, unassigned=0, ignored=0], [btydusdduwrc][1]=ShardAssignment[nodeIds=[suT6FRFCTBiMDfiBCAU_nQ], total=1, unassigned=0, ignored=0], [btydusdduwrc][4]=ShardAssignment[nodeIds=[suT6FRFCTBiMDfiBCAU_nQ], total=1, unassigned=0, ignored=0], [btydusdduwrc][5]=ShardAssignment[nodeIds=[suT6FRFCTBiMDfiBCAU_nQ], total=1, unassigned=0, ignored=0], [btydusdduwrc][2]=ShardAssignment[nodeIds=[suT6FRFCTBiMDfiBCAU_nQ], total=1, unassigned=0, ignored=0], [btydusdduwrc][3]=ShardAssignment[nodeIds=[CTSLnDyTQj2xDmatjmYTtQ], total=1, unassigned=0, ignored=0]}]. Diff: 
[btydusdduwrc][3]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[CTSLnDyTQj2xDmatjmYTtQ], total=1, unassigned=0, ignored=0]

This suggests that the computation failed to take into account another non-empty shard (still?) initializing on the same node.
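For scale: [btydusdduwrc][0] and [btydusdduwrc][3] together need 17886 + 16919 = 34805 bytes, while the test sizes one node's disk to hold only roughly one shard plus watermark headroom (the comparable run below shows 21632 total bytes). If CTSLnDyTQj2xDmatjmYTtQ is that small node, placing both shards on it cannot stay under the watermark, which is exactly what the listener assertion then catches.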

idegtiarenko commented 8 months ago

I wonder if, in this case, the shard had started but the corresponding information was not yet available in ClusterInfo, so the second allocation round happened as if CTSLnDyTQj2xDmatjmYTtQ were empty.

idegtiarenko commented 8 months ago

I checked the latest failure logs and they contain:

[2024-03-11T18:25:16,539][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] Recomputing desired balance for [15]: DesiredBalance[lastConvergedIndex=14, assignments={[wggatuinx][2]=ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1], [wggatuinx][3]=ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1], [wggatuinx][4]=ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1], [wggatuinx][5]=ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1], [wggatuinx][0]=ShardAssignment[nodeIds=[CwlFGBJYSXywcA9Qqt25xg], total=1, unassigned=0, ignored=0], [wggatuinx][1]=ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1]}], routing_nodes:
-----node_id[CwlFGBJYSXywcA9Qqt25xg][V]
--------[wggatuinx][0], node[CwlFGBJYSXywcA9Qqt25xg], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[INITIALIZING], a[id=lHmaALapQ1qElJRtpQAgug], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0], expected_shard_size[11386]
-----node_id[B-MnLxupT2qkXpWstJb6mA][V]
---- unassigned
--------[wggatuinx][1], node[null], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0]
--------[wggatuinx][2], node[null], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0]
--------[wggatuinx][3], node[null], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0]
--------[wggatuinx][4], node[null], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0]
--------[wggatuinx][5], node[null], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0]
, 

{
  "nodes" : {
    "CwlFGBJYSXywcA9Qqt25xg" : {
      "node_name" : "node_t2",
      "least_available" : {
        "path" : "/dev/shm/bk/bk-agent-prod-gcp-1710158680974142795/elastic/elasticsearch-periodic/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_7FA2BD7AB6A11F7E-001/tempDir-003/node-2",
        "total_bytes" : 21632,
        "used_bytes" : 0,
        "free_bytes" : 21632,
        "free_disk_percent" : 100.0,
        "used_disk_percent" : 0.0
      },
      "most_available" : {
        "path" : "/dev/shm/bk/bk-agent-prod-gcp-1710158680974142795/elastic/elasticsearch-periodic/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_7FA2BD7AB6A11F7E-001/tempDir-003/node-2",
        "total_bytes" : 21632,
        "used_bytes" : 0,
        "free_bytes" : 21632,
        "free_disk_percent" : 100.0,
        "used_disk_percent" : 0.0
      }
    },
    "B-MnLxupT2qkXpWstJb6mA" : {
      "node_name" : "node_t1",
      "least_available" : {
        "path" : "/dev/shm/bk/bk-agent-prod-gcp-1710158680974142795/elastic/elasticsearch-periodic/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_7FA2BD7AB6A11F7E-001/tempDir-003/node-1",
        "total_bytes" : 63302156288,
        "used_bytes" : 9738104832,
        "free_bytes" : 53564051456,
        "free_disk_percent" : 84.6,
        "used_disk_percent" : 15.4
      },
      "most_available" : {
        "path" : "/dev/shm/bk/bk-agent-prod-gcp-1710158680974142795/elastic/elasticsearch-periodic/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_7FA2BD7AB6A11F7E-001/tempDir-003/node-1",
        "total_bytes" : 63302156288,
        "used_bytes" : 9738104832,
        "free_bytes" : 53564051456,
        "free_disk_percent" : 84.6,
        "used_disk_percent" : 15.4
      }
    }
  },
  "shard_sizes" : {
    "[wggatuinx][0][p]_bytes" : 11386
  },
  "shard_data_set_sizes" : {
    "[wggatuinx][0]_bytes" : 11386
  },
  "shard_paths" : {
    "NodeAndShard[nodeId=CwlFGBJYSXywcA9Qqt25xg, shardId=[wggatuinx][0]]" : "/dev/shm/bk/bk-agent-prod-gcp-1710158680974142795/elastic/elasticsearch-periodic/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_7FA2BD7AB6A11F7E-001/tempDir-003/node-2"
  },
  "reserved_sizes" : [
    {
      "node_id" : "CwlFGBJYSXywcA9Qqt25xg",
      "path" : "/dev/shm/bk/bk-agent-prod-gcp-1710158680974142795/elastic/elasticsearch-periodic/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_7FA2BD7AB6A11F7E-001/tempDir-003/node-2",
      "total" : 0,
      "shards" : [
        "[wggatuinx][0]"
      ]
    }
  ]
}, 
^^^ cluster info is complete and correct

SnapshotShardSizeInfo{snapshotShardSizes=[[snapshot=repo:snap/gLsztBgmR1G2j8MufCwrSQ, index=[wggatuinx/neUKsrfBR2iP0GAWlUpV2g], shard=[wggatuinx][2]]=>11469, [snapshot=repo:snap/gLsztBgmR1G2j8MufCwrSQ, index=[wggatuinx/neUKsrfBR2iP0GAWlUpV2g], shard=[wggatuinx][3]]=>11869, [snapshot=repo:snap/gLsztBgmR1G2j8MufCwrSQ, index=[wggatuinx/neUKsrfBR2iP0GAWlUpV2g], shard=[wggatuinx][4]]=>11392, [snapshot=repo:snap/gLsztBgmR1G2j8MufCwrSQ, index=[wggatuinx/neUKsrfBR2iP0GAWlUpV2g], shard=[wggatuinx][5]]=>12075, [snapshot=repo:snap/gLsztBgmR1G2j8MufCwrSQ, index=[wggatuinx/neUKsrfBR2iP0GAWlUpV2g], shard=[wggatuinx][1]]=>12389]}
^^^ all snapshot sizes are known and available. They were also available during the prior allocation round

[2024-03-11T18:25:16,540][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] [wggatuinx][0] marked shard as started (routing: [wggatuinx][0], node[CwlFGBJYSXywcA9Qqt25xg], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[INITIALIZING], a[id=lHmaALapQ1qElJRtpQAgug], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0], expected_shard_size[11386])
^^^ allocated correctly in the prior round

[2024-03-11T18:25:16,540][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] running delegate allocator
[2024-03-11T18:25:16,542][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] [wggatuinx][4] marked shard as started (routing: [wggatuinx][4], node[CwlFGBJYSXywcA9Qqt25xg], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[INITIALIZING], a[id=-gFVMSNgRoeo1Fv53EIBhQ], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0], expected_shard_size[11392])
^^^ allocated incorrectly by the balanced allocator, despite all the information above being available

[2024-03-11T18:25:16,542][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] [wggatuinx][1] marked shard as started (routing: [wggatuinx][1], node[B-MnLxupT2qkXpWstJb6mA], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[INITIALIZING], a[id=Ehs2RXxxTaaYu-puiIjWOg], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0], expected_shard_size[12389])
[2024-03-11T18:25:16,543][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] [wggatuinx][2] marked shard as started (routing: [wggatuinx][2], node[B-MnLxupT2qkXpWstJb6mA], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[INITIALIZING], a[id=v3Yh-Ie1SEWjezhmdIiwfg], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0], expected_shard_size[11469])
[2024-03-11T18:25:16,543][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] [wggatuinx][3] marked shard as started (routing: [wggatuinx][3], node[B-MnLxupT2qkXpWstJb6mA], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[INITIALIZING], a[id=wz8am3sZSGm7LteXF6GN2Q], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0], expected_shard_size[11869])
[2024-03-11T18:25:16,543][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] [wggatuinx][5] marked shard as started (routing: [wggatuinx][5], node[B-MnLxupT2qkXpWstJb6mA], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[INITIALIZING], a[id=BJ7H_lu6QKyV8WvVZsaHRA], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0], expected_shard_size[12075])
[2024-03-11T18:25:16,543][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] Desired balance computation for [15] is still not converged after [0s] and [1] iterations
[2024-03-11T18:25:16,543][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] running delegate allocator
[2024-03-11T18:25:16,549][DEBUG][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] Desired balance computation for [15] converged after [0s] and [2] iterations
[2024-03-11T18:25:16,550][TRACE][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Desired balance updated: DesiredBalance[lastConvergedIndex=15, assignments={[wggatuinx][2]=ShardAssignment[nodeIds=[B-MnLxupT2qkXpWstJb6mA], total=1, unassigned=0, ignored=0], [wggatuinx][3]=ShardAssignment[nodeIds=[B-MnLxupT2qkXpWstJb6mA], total=1, unassigned=0, ignored=0], [wggatuinx][4]=ShardAssignment[nodeIds=[CwlFGBJYSXywcA9Qqt25xg], total=1, unassigned=0, ignored=0], [wggatuinx][5]=ShardAssignment[nodeIds=[B-MnLxupT2qkXpWstJb6mA], total=1, unassigned=0, ignored=0], [wggatuinx][0]=ShardAssignment[nodeIds=[CwlFGBJYSXywcA9Qqt25xg], total=1, unassigned=0, ignored=0], [wggatuinx][1]=ShardAssignment[nodeIds=[B-MnLxupT2qkXpWstJb6mA], total=1, unassigned=0, ignored=0]}]. Diff: 
[wggatuinx][2]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[B-MnLxupT2qkXpWstJb6mA], total=1, unassigned=0, ignored=0]
[wggatuinx][3]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[B-MnLxupT2qkXpWstJb6mA], total=1, unassigned=0, ignored=0]
[wggatuinx][4]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[CwlFGBJYSXywcA9Qqt25xg], total=1, unassigned=0, ignored=0]
[wggatuinx][5]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[B-MnLxupT2qkXpWstJb6mA], total=1, unassigned=0, ignored=0]
[wggatuinx][1]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[B-MnLxupT2qkXpWstJb6mA], total=1, unassigned=0, ignored=0]
idegtiarenko commented 7 months ago

According to the logs from the latest failure:

Shard sizes
shard=[zoxdkcqpuouh][0]=>12166
shard=[zoxdkcqpuouh][3]=>12451
shard=[zoxdkcqpuouh][2]=>12535
shard=[zoxdkcqpuouh][4]=>12594
shard=[zoxdkcqpuouh][5]=>13006

Nodes
"r4ludfGvTh2wZ7UGP-SjWA" / "node_t1" <-- big node
"zIF6uDFXRYKPVDXFSMTRQQ" / "node_t2" <-- small node

Allocation of all shards happens in 2 rounds. Round 1:

routing_nodes:
-----node_id[zIF6uDFXRYKPVDXFSMTRQQ][V]
-----node_id[r4ludfGvTh2wZ7UGP-SjWA][V]
---- unassigned
--------[zoxdkcqpuouh][0], node[null], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[fetching_shard_data]], failed_attempts[0]
--------[zoxdkcqpuouh][1], node[null], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[fetching_shard_data]], failed_attempts[0]
--------[zoxdkcqpuouh][2], node[null], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[fetching_shard_data]], failed_attempts[0]
--------[zoxdkcqpuouh][3], node[null], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[fetching_shard_data]], failed_attempts[0]
--------[zoxdkcqpuouh][4], node[null], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[fetching_shard_data]], failed_attempts[0]
--------[zoxdkcqpuouh][5], node[null], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[fetching_shard_data]], failed_attempts[0]

  "shard_sizes" : { },
  "reserved_sizes" : [ ]

SnapshotShardSizeInfo{snapshotShardSizes=[
[snapshot=repo:snap/qM6kXprETqqT0R_Qb0Wx-g, index=[zoxdkcqpuouh/ZsO0w7BsQRy3hAFwsIpZ2g], shard=[zoxdkcqpuouh][0]]=>12166, 
[snapshot=repo:snap/qM6kXprETqqT0R_Qb0Wx-g, index=[zoxdkcqpuouh/ZsO0w7BsQRy3hAFwsIpZ2g], shard=[zoxdkcqpuouh][1]]=>13187, 
[snapshot=repo:snap/qM6kXprETqqT0R_Qb0Wx-g, index=[zoxdkcqpuouh/ZsO0w7BsQRy3hAFwsIpZ2g], shard=[zoxdkcqpuouh][2]]=>12535, 
[snapshot=repo:snap/qM6kXprETqqT0R_Qb0Wx-g, index=[zoxdkcqpuouh/ZsO0w7BsQRy3hAFwsIpZ2g], shard=[zoxdkcqpuouh][5]]=>13006, 
[snapshot=repo:snap/qM6kXprETqqT0R_Qb0Wx-g, index=[zoxdkcqpuouh/ZsO0w7BsQRy3hAFwsIpZ2g], shard=[zoxdkcqpuouh][4]]=>12594, 
[snapshot=repo:snap/qM6kXprETqqT0R_Qb0Wx-g, index=[zoxdkcqpuouh/ZsO0w7BsQRy3hAFwsIpZ2g], shard=[zoxdkcqpuouh][3]]=>12451
]}

[2024-04-06T16:17:35,503][TRACE][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Desired balance updated: DesiredBalance[lastConvergedIndex=16, assignments={
[zoxdkcqpuouh][1]=ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1],
[zoxdkcqpuouh][0]=ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1],
[zoxdkcqpuouh][5]=ShardAssignment[nodeIds=[r4ludfGvTh2wZ7UGP-SjWA], total=1, unassigned=0, ignored=0],
[zoxdkcqpuouh][4]=ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1],
[zoxdkcqpuouh][3]=ShardAssignment[nodeIds=[zIF6uDFXRYKPVDXFSMTRQQ], total=1, unassigned=0, ignored=0],
[zoxdkcqpuouh][2]=ShardAssignment[nodeIds=[r4ludfGvTh2wZ7UGP-SjWA], total=1, unassigned=0, ignored=0]
}]. Diff:
[zoxdkcqpuouh][5]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[r4ludfGvTh2wZ7UGP-SjWA], total=1, unassigned=0, ignored=0]
[zoxdkcqpuouh][3]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[zIF6uDFXRYKPVDXFSMTRQQ], total=1, unassigned=0, ignored=0]
[zoxdkcqpuouh][2]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[r4ludfGvTh2wZ7UGP-SjWA], total=1, unassigned=0, ignored=0]

Nothing out of the ordinary here.

Round 2:

routing_nodes:
-----node_id[zIF6uDFXRYKPVDXFSMTRQQ][V]
--------[zoxdkcqpuouh][3], node[zIF6uDFXRYKPVDXFSMTRQQ], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[INITIALIZING], a[id=092Hbr2cSEakrjwApnkZow], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0], expected_shard_size[12451]
-----node_id[r4ludfGvTh2wZ7UGP-SjWA][V]
--------[zoxdkcqpuouh][2], node[r4ludfGvTh2wZ7UGP-SjWA], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[INITIALIZING], a[id=TpE9Bw8VTAy1CJz-pDXquQ], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0], expected_shard_size[12535]
--------[zoxdkcqpuouh][5], node[r4ludfGvTh2wZ7UGP-SjWA], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[INITIALIZING], a[id=2n_rRR9XQteq_585F1FliQ], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0], expected_shard_size[13006]
---- unassigned
--------[zoxdkcqpuouh][0], node[null], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0]
--------[zoxdkcqpuouh][1], node[null], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0]
--------[zoxdkcqpuouh][4], node[null], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0]

  "shard_sizes" : {
    "[zoxdkcqpuouh][2][p]_bytes" : 0, <=== has reserved 12535 bytes
    "[zoxdkcqpuouh][3][p]_bytes" : 12451, <=== has reserved 0 bytes
    "[zoxdkcqpuouh][5][p]_bytes" : 0 <=== has no reserved entry
  },
  "reserved_sizes" : [
    {
      "node_id" : "zIF6uDFXRYKPVDXFSMTRQQ",
      "path" : "/dev/shm/bk/bk-agent-prod-gcp-1712433772034280571/elastic/elasticsearch-periodic/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_E34CC9AAB7634065-001/tempDir-005/node-2",
      "total" : 0,
      "shards" : [
        "[zoxdkcqpuouh][3]"
      ]
    },
    {
      "node_id" : "r4ludfGvTh2wZ7UGP-SjWA",
      "path" : "/dev/shm/bk/bk-agent-prod-gcp-1712433772034280571/elastic/elasticsearch-periodic/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_E34CC9AAB7634065-001/tempDir-005/node-1",
      "total" : 12535,
      "shards" : [
        "[zoxdkcqpuouh][2]"
      ]
    }
  ]
    "zIF6uDFXRYKPVDXFSMTRQQ" : {
      "node_name" : "node_t2",
      "least_available" : {
        "path" : "/dev/shm/bk/bk-agent-prod-gcp-1712433772034280571/elastic/elasticsearch-periodic/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_E34CC9AAB7634065-001/tempDir-005/node-2",
        "total_bytes" : 22691,
        "used_bytes" : 0,
        "free_bytes" : 22691,
        "free_disk_percent" : 100.0,
        "used_disk_percent" : 0.0
      }
    },
    "r4ludfGvTh2wZ7UGP-SjWA" : {
      "node_name" : "node_t1",
      "least_available" : {
        "path" : "/dev/shm/bk/bk-agent-prod-gcp-1712433772034280571/elastic/elasticsearch-periodic/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_E34CC9AAB7634065-001/tempDir-005/node-1",
        "total_bytes" : 63302148096,
        "used_bytes" : 9228197888,
        "free_bytes" : 54073950208,
        "free_disk_percent" : 85.4,
        "used_disk_percent" : 14.6
      },

SnapshotShardSizeInfo{snapshotShardSizes=[
[snapshot=repo:snap/qM6kXprETqqT0R_Qb0Wx-g, index=[zoxdkcqpuouh/ZsO0w7BsQRy3hAFwsIpZ2g], shard=[zoxdkcqpuouh][1]]=>13187, 
[snapshot=repo:snap/qM6kXprETqqT0R_Qb0Wx-g, index=[zoxdkcqpuouh/ZsO0w7BsQRy3hAFwsIpZ2g], shard=[zoxdkcqpuouh][4]]=>12594,
[snapshot=repo:snap/qM6kXprETqqT0R_Qb0Wx-g, index=[zoxdkcqpuouh/ZsO0w7BsQRy3hAFwsIpZ2g], shard=[zoxdkcqpuouh][0]]=>12166
]}

Desired balance updated: DesiredBalance[lastConvergedIndex=17, assignments={
[zoxdkcqpuouh][5]=ShardAssignment[nodeIds=[r4ludfGvTh2wZ7UGP-SjWA], total=1, unassigned=0, ignored=0], 
[zoxdkcqpuouh][4]=ShardAssignment[nodeIds=[r4ludfGvTh2wZ7UGP-SjWA], total=1, unassigned=0, ignored=0], 
[zoxdkcqpuouh][3]=ShardAssignment[nodeIds=[zIF6uDFXRYKPVDXFSMTRQQ], total=1, unassigned=0, ignored=0], 
[zoxdkcqpuouh][2]=ShardAssignment[nodeIds=[r4ludfGvTh2wZ7UGP-SjWA], total=1, unassigned=0, ignored=0], 
[zoxdkcqpuouh][1]=ShardAssignment[nodeIds=[r4ludfGvTh2wZ7UGP-SjWA], total=1, unassigned=0, ignored=0], 
[zoxdkcqpuouh][0]=ShardAssignment[nodeIds=[zIF6uDFXRYKPVDXFSMTRQQ], total=1, unassigned=0, ignored=0]
}]. Diff: 
[zoxdkcqpuouh][4]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[r4ludfGvTh2wZ7UGP-SjWA], total=1, unassigned=0, ignored=0]
[zoxdkcqpuouh][1]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[r4ludfGvTh2wZ7UGP-SjWA], total=1, unassigned=0, ignored=0]
[zoxdkcqpuouh][0]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[zIF6uDFXRYKPVDXFSMTRQQ], total=1, unassigned=0, ignored=0]

The round 2 balance is calculated incorrectly because it is based on incorrect cluster info:
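Concretely: zIF6uDFXRYKPVDXFSMTRQQ (the small node) has 22691 bytes in total and already hosts [zoxdkcqpuouh][3] (12451 bytes), yet the round 2 ClusterInfo reports used_bytes 0 and a reserved total of 0 for it. Seeing 22691 bytes free, the allocator adds [zoxdkcqpuouh][0] (12166 bytes); physically 12451 + 12166 = 24617 > 22691, so the two shards cannot even fit on the disk, let alone respect the test's 10240-byte watermark.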

idegtiarenko commented 7 months ago

I have not found any particular reason why the balance was calculated in two rounds here; however, I consider this a valid scenario. I also do not think we have any guarantee of a point-in-time snapshot of ClusterInfo that could explain the values the test observes.

cbuescher commented 5 months ago

This one looks related https://gradle-enterprise.elastic.co/s/kxx7lbvti6dii/tests/task/:server:internalClusterTest/details/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT/testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards?top-execution=1

DiannaHohensee commented 2 months ago

The above build scan failure shows the following in its logs:

Maj 17, 2024 5:37:20 PD com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[#2582,elasticsearch[node_t0][clusterApplierService#updateTask][T#1],5,TGRP-DiskThresholdDeciderIT]
java.lang.AssertionError: Expects no relocating shards but got: routing_nodes:
-----node_id[2_l39Q7SSiWyhgjDJv2lBg][V]
--------[fvyruaxlr][0], node[2_l39Q7SSiWyhgjDJv2lBg], [P], s[STARTED], a[id=SGQo56cyQOuyPEmhwMgeuA], failed_attempts[0]
--------[fvyruaxlr][1], node[2_l39Q7SSiWyhgjDJv2lBg], relocating [s_UbSUETQ-GejfrGhwmCfw], [P], recovery_source[peer recovery], s[INITIALIZING], a[id=5-Mwd2VLSoWPJ5EcXzlrHA, rId=x11KnGalToKb1w59JaOxcQ], failed_attempts[0], expected_shard_size[21221]
--------[fvyruaxlr][3], node[2_l39Q7SSiWyhgjDJv2lBg], [P], s[STARTED], a[id=N1UYVQ_aThyoOxBMlQdyGQ], failed_attempts[0]
--------[fvyruaxlr][4], node[2_l39Q7SSiWyhgjDJv2lBg], [P], s[STARTED], a[id=09TJ-r24QQK5MTD-GaJMcQ], failed_attempts[0]
--------[fvyruaxlr][5], node[2_l39Q7SSiWyhgjDJv2lBg], [P], s[STARTED], a[id=7sbAsKiUQBWSwlk2fdFDlQ], failed_attempts[0]
-----node_id[s_UbSUETQ-GejfrGhwmCfw][V]
--------[fvyruaxlr][1], node[s_UbSUETQ-GejfrGhwmCfw], relocating [2_l39Q7SSiWyhgjDJv2lBg], [P], s[RELOCATING], a[id=x11KnGalToKb1w59JaOxcQ, rId=5-Mwd2VLSoWPJ5EcXzlrHA], failed_attempts[0], expected_shard_size[21221]
--------[fvyruaxlr][2], node[s_UbSUETQ-GejfrGhwmCfw], [P], s[STARTED], a[id=4Htw-f0yStK2qMjoNC7LNA], failed_attempts[0]
---- unassigned

Expected: <0>
     but: was <1>
    at __randomizedtesting.SeedInfo.seed([360E9317DD5AD10]:0)
    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
    at org.elasticsearch.test.ESTestCase.assertThat(ESTestCase.java:2123)
    at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.lambda$testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards$2(DiskThresholdDeciderIT.java:186)
    at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListener(ClusterApplierService.java:560)
    at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListeners(ClusterApplierService.java:546)
    at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:505)
    at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:429)
    at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:154)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:217)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:183)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)
DiannaHohensee commented 2 months ago

My debug logging...


REPRODUCE WITH: ./gradlew ":server:internalClusterTest" --tests "org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards {seed=[360E9317DD5AD10:9C6621EC0B9A256A]}" -Dtests.seed=360E9317DD5AD10 -Dtests.filter=@awaitsfix -Dtests.locale=sq -Dtests.timezone=America/Phoenix -Druntime.java=21

DiskThresholdDeciderIT > testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards {seed=[360E9317DD5AD10:9C6621EC0B9A256A]} FAILED
    java.lang.AssertionError: Mismatching shard routings: []
    Expected: iterable containing [one of {<[sqikodzgne][3]>, <[sqikodzgne][2]>}]
         but: no item was one of {<[sqikodzgne][3]>, <[sqikodzgne][2]>}
        at __randomizedtesting.SeedInfo.seed([360E9317DD5AD10:9C6621EC0B9A256A]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.elasticsearch.test.ESTestCase.assertThat(ESTestCase.java:2470)
        at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.lambda$assertBusyWithDiskUsageRefresh$7(DiskThresholdDeciderIT.java:360)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1408)
        at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.assertBusyWithDiskUsageRefresh(DiskThresholdDeciderIT.java:350)
        at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards(DiskThresholdDeciderIT.java:243)

  1> [2024-08-29T08:44:58,238][INFO ][o.e.c.r.a.d.DiskThresholdDeciderIT] [testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards] ~~~test done creating data for shards, shardSizes: [
  ShardSize[shardId=[uaxlrfmucfxo][0], size=22976],
  ShardSize[shardId=[uaxlrfmucfxo][2], size=23049],
  ShardSize[shardId=[uaxlrfmucfxo][4], size=23122],
  ShardSize[shardId=[uaxlrfmucfxo][3], size=23271],
  ShardSize[shardId=[uaxlrfmucfxo][1], size=23948],
  ShardSize[shardId=[uaxlrfmucfxo][5], size=24030]]

  1> [2024-08-29T08:44:58,462][INFO ][o.e.c.r.a.d.DiskThresholdDeciderIT] [testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards] ~~~test decreased disk space and refreshed it, WATERMARK_BYTES: 10240, usableSpace: 23049

  1> [2024-08-29T08:44:58,658][INFO ][o.e.c.r.a.d.DiskThresholdDeciderIT] [testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards] ~~~test matcher shard set: [[uaxlrfmucfxo][0], [uaxlrfmucfxo][2]]

  1> [2024-08-29T08:44:58,666][INFO ][o.e.c.r.a.d.DiskThresholdDeciderIT] [testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards] ~~~assertBusyWithDiskUsageRefresh refreshDiskUsage complete, checking, shardRouting: [[uaxlrfmucfxo][2]]

The matcher has what we want, and the shard is on the node that we want.

Maybe this 🤔

DiannaHohensee commented 2 months ago

Okay, so I fixed the Matcher problem. Replace this line with the following:

assertBusyWithDiskUsageRefresh(dataNode0Id, indexName, hasItem(in(shardSizes.getShardIdsWithSizeSmallerOrEqual(usableSpace))));
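For reference, the failure description above ("iterable containing [one of {...}]") is what a contains(in(...)) matcher produces: it requires the iterable to consist of exactly one element drawn from the set, whereas hasItem(in(...)) only requires that at least one element be in the set. An illustrative sketch of the difference (hypothetical shard ids, not the test code):

import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.hasItem;
import static org.hamcrest.Matchers.in;

import java.util.List;
import java.util.Set;

public class MatcherFixDemo {
    public static void main(String[] args) {
        Set<String> allowed = Set.of("[sqikodzgne][2]", "[sqikodzgne][3]");
        List<String> shardRoutings = List.of("[sqikodzgne][2]", "[sqikodzgne][5]");

        // Passes: at least one element of the iterable is in the allowed set.
        assertThat(shardRoutings, hasItem(in(allowed)));

        // Would fail: contains(in(allowed)) demands the iterable be exactly one
        // element drawn from the set, so the extra "[sqikodzgne][5]" breaks it.
        // assertThat(shardRoutings, contains(in(allowed)));
    }
}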

But now I've uncovered something else, either a test bug or a code bug. These are my debug logs, limited to this line of code where the snapshot restore occurs and allocates shards out to the nodes. Note that rebalancing is otherwise disabled (see the sketch below); we won't rebalance after the initial allocation, IIUC.
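(As context for "rebalancing is otherwise disabled", a hedged sketch of the kind of setting involved; the exact settings this test applies are not shown in this thread:)

// Assumed: cluster-level rebalancing is switched off, so shards stay wherever
// the initial allocation placed them.
updateClusterSettings(
    Settings.builder()
        .put(EnableAllocationDecider.CLUSTER_ROUTING_REBALANCE_ENABLE_SETTING.getKey(), "none")
);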

A potentially interesting bit, before allocation occurs: somehow allocation thinks the available space on the node is zero?

1> [2024-08-29T12:07:08,531][WARN ][o.e.c.r.a.DiskThresholdMonitor] [node_t0] high disk watermark [10240b] exceeded on [q24VcceoTY-cq6vgafXpOQ][node_t2][/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-2] free: 0b[0%], shards will be relocated away from this node; currently relocating away shards totalling [0] bytes; the node is expected to continue to exceed the high disk watermark when these relocations are complete

Again later,

  1> [2024-08-29T12:07:08,538][TRACE][o.e.c.r.a.d.DiskThresholdDecider] [node_t0] getDiskUsage(subtractLeavingShards=false) returning DiskUsageWithRelocations[diskUsage=[q24VcceoTY-cq6vgafXpOQ][node_t2][/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-2] free: 0b[0%], relocatingShardSize=0]
  1> [2024-08-29T12:07:08,538][TRACE][o.e.c.r.a.d.DiskThresholdDecider] [node_t0] node [q24VcceoTY-cq6vgafXpOQ] has 100.0% used disk
  1> [2024-08-29T12:07:08,538][DEBUG][o.e.c.r.a.d.DiskThresholdDecider] [node_t0] less than the required 10240 free bytes threshold (0b free) on node q24VcceoTY-cq6vgafXpOQ, preventing allocation

Disk usage stays at 100% until the end (last occurrence of the log message):

  1> [2024-08-29T12:07:08,559][TRACE][o.e.c.r.a.d.DiskThresholdDecider] [node_t0] node [q24VcceoTY-cq6vgafXpOQ] has 100.0% used disk
DiannaHohensee commented 2 months ago

These are the logs written while these lines of code executed, which change the disk size. A sketch of that step is below, followed by the logs.
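The disk-size change presumably looks like this (hedged sketch; getTestFileStore, setTotalSpace, and refreshDiskUsage are the helpers DiskUsageIntegTestCase-style tests use, assumed here since the exact lines are behind the elided link):

// Assumed shape: shrink the mocked file store so only ~usableSpace remains
// above the watermark headroom, then force a ClusterInfo refresh so the
// allocation deciders see the new numbers.
getTestFileStore(dataNodeName).setTotalSpace(usableSpace + WATERMARK_BYTES);
refreshDiskUsage();

The logs: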

  1> [2024-08-29T12:07:08,514][DEBUG][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Executing allocate for [11]
  1> [2024-08-29T12:07:08,514][INFO ][o.e.c.r.a.d.DiskThresholdDeciderIT] [testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards] Refreshed cluster info: {
  1>   "nodes" : {
  1>     "Hh_cFSDBQxSRw0flPCpXYw" : {
  1>       "node_name" : "node_t1",
  1>       "least_available" : {
  1>         "path" : "/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-1",
  1>         "total_bytes" : 364024733696,
  1>         "used_bytes" : 78748274688,
  1>         "free_bytes" : 285276459008,
  1>         "free_disk_percent" : 78.4,
  1>         "used_disk_percent" : 21.6
  1>       },
  1>       "most_available" : {
  1>         "path" : "/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-1",
  1>         "total_bytes" : 364024733696,
  1>         "used_bytes" : 78748274688,
  1>         "free_bytes" : 285276459008,
  1>         "free_disk_percent" : 78.4,
  1>         "used_disk_percent" : 21.6
  1>       }
  1>     },
  1>     "q24VcceoTY-cq6vgafXpOQ" : {
  1>       "node_name" : "node_t2",
  1>       "least_available" : {
  1>         "path" : "/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-2",
  1>         "total_bytes" : 29975,
  1>         "used_bytes" : 29975,
  1>         "free_bytes" : 0,
  1>         "free_disk_percent" : 0.0,
  1>         "used_disk_percent" : 100.0
  1>       },
  1>       "most_available" : {
  1>         "path" : "/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-2",
  1>         "total_bytes" : 29975,
  1>         "used_bytes" : 29975,
  1>         "free_bytes" : 0,
  1>         "free_disk_percent" : 0.0,
  1>         "used_disk_percent" : 100.0
  1>       }
  1>     }
  1>   },
  1>   "shard_sizes" : { },
  1>   "shard_data_set_sizes" : { },
  1>   "shard_paths" : { },
  1>   "reserved_sizes" : [ ]
  1> }
  1> [2024-08-29T12:07:08,514][DEBUG][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Starting desired balance computation for [11]
  1> [2024-08-29T12:07:08,514][TRACE][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Reconciling desired balance: DesiredBalance[lastConvergedIndex=10, assignments={}]
  1> [2024-08-29T12:07:08,514][DEBUG][o.e.c.r.a.a.DesiredBalanceReconciler] [node_t0] Reconciling desired balance for [10]
  1> [2024-08-29T12:07:08,514][TRACE][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Reconciled desired balance: DesiredBalance[lastConvergedIndex=10, assignments={}]
  1> [2024-08-29T12:07:08,514][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] Recomputing desired balance for [11]: DesiredBalance[lastConvergedIndex=10, assignments={}], routing_nodes:
  1> -----node_id[q24VcceoTY-cq6vgafXpOQ][V]
  1> -----node_id[Hh_cFSDBQxSRw0flPCpXYw][V]
  1> ---- unassigned
  1> , {
  1>   "nodes" : {
  1>     "Hh_cFSDBQxSRw0flPCpXYw" : {
  1>       "node_name" : "node_t1",
  1>       "least_available" : {
  1>         "path" : "/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-1",
  1>         "total_bytes" : 364024733696,
  1>         "used_bytes" : 78748274688,
  1>         "free_bytes" : 285276459008,
  1>         "free_disk_percent" : 78.4,
  1>         "used_disk_percent" : 21.6
  1>       },
  1>       "most_available" : {
  1>         "path" : "/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-1",
  1>         "total_bytes" : 364024733696,
  1>         "used_bytes" : 78748274688,
  1>         "free_bytes" : 285276459008,
  1>         "free_disk_percent" : 78.4,
  1>         "used_disk_percent" : 21.6
  1>       }
  1>     },
  1>     "q24VcceoTY-cq6vgafXpOQ" : {
  1>       "node_name" : "node_t2",
  1>       "least_available" : {
  1>         "path" : "/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-2",
  1>         "total_bytes" : 29975,
  1>         "used_bytes" : 29975,
  1>         "free_bytes" : 0,
  1>         "free_disk_percent" : 0.0,
  1>         "used_disk_percent" : 100.0
  1>       },
  1>       "most_available" : {
  1>         "path" : "/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-2",
  1>         "total_bytes" : 29975,
  1>         "used_bytes" : 29975,
  1>         "free_bytes" : 0,
  1>         "free_disk_percent" : 0.0,
  1>         "used_disk_percent" : 100.0
  1>       }
  1>     }
  1>   },
  1>   "shard_sizes" : { },
  1>   "shard_data_set_sizes" : { },
  1>   "shard_paths" : { },
  1>   "reserved_sizes" : [ ]
  1> }, SnapshotShardSizeInfo{snapshotShardSizes=[]}
  1> [2024-08-29T12:07:08,514][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] running delegate allocator
  1> [2024-08-29T12:07:08,514][TRACE][o.e.c.r.a.a.BalancedShardsAllocator] [node_t0] Start allocating unassigned shards
  1> [2024-08-29T12:07:08,515][TRACE][o.e.c.r.a.a.BalancedShardsAllocator] [node_t0] Start balancing cluster
  1> [2024-08-29T12:07:08,515][DEBUG][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] Desired balance computation for [11] converged after [0s] and [1] iterations
  1> [2024-08-29T12:07:08,515][TRACE][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Desired balance updated: DesiredBalance[lastConvergedIndex=11, assignments={}]. No changes
  1> [2024-08-29T12:07:08,515][DEBUG][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Desired balance computation for [11] is completed, scheduling reconciliation
  1> [2024-08-29T12:07:08,515][TRACE][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Reconciling desired balance: DesiredBalance[lastConvergedIndex=11, assignments={}]
  1> [2024-08-29T12:07:08,515][DEBUG][o.e.c.r.a.a.DesiredBalanceReconciler] [node_t0] Reconciling desired balance for [11]
  1> [2024-08-29T12:07:08,515][TRACE][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Reconciled desired balance: DesiredBalance[lastConvergedIndex=11, assignments={}]
  1> [2024-08-29T12:07:08,515][WARN ][o.e.c.r.a.DiskThresholdMonitor] [node_t0] high disk watermark [10240b] exceeded on [q24VcceoTY-cq6vgafXpOQ][node_t2][/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-2] free: 0b[0%], shards will be relocated away from this node; currently relocating away shards totalling [0] bytes; the node is expected to continue to exceed the high disk watermark when these relocations are complete

The interesting bit is

  1>     "q24VcceoTY-cq6vgafXpOQ" : {
  1>       "node_name" : "node_t2",
  1>       "least_available" : {
  1>         "path" : "/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-2",
  1>         "total_bytes" : 29975,
  1>         "used_bytes" : 29975,
  1>         "free_bytes" : 0,
  1>         "free_disk_percent" : 0.0,
  1>         "used_disk_percent" : 100.0
  1>       },
  1>       "most_available" : {
  1>         "path" : "/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-2",
  1>         "total_bytes" : 29975,
  1>         "used_bytes" : 29975,
  1>         "free_bytes" : 0,
  1>         "free_disk_percent" : 0.0,
  1>         "used_disk_percent" : 100.0
  1>       }
  1>     }

We changed the total_bytes, but the used_bytes matches it exactly, even though there are no shards allocated.
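That is, free_bytes = total_bytes - used_bytes = 29975 - 29975 = 0, so the DiskThresholdDecider sees zero free bytes against a required 10240-byte headroom and vetoes every allocation, matching the "preventing allocation" log above. The suspicious part is an empty node reporting used_bytes exactly equal to the shrunken total.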

elasticsearchmachine commented 5 days ago

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)