elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[CI] PrevalidateShardPathIT testCheckShards failing #111134

Open elasticsearchmachine opened 2 months ago

elasticsearchmachine commented 2 months ago

Build Scans:

Reproduction Line:

./gradlew ":server:internalClusterTest" --tests "org.elasticsearch.cluster.PrevalidateShardPathIT.testCheckShards" -Dtests.seed=76EC23E6FF7DCE4C -Dtests.locale=sl-SI -Dtests.timezone=America/Argentina/Rio_Gallegos -Druntime.java=22

Applicable branches: main

Reproduces locally?: N/A

Failure History: See dashboard

Failure Message:

java.lang.AssertionError: The relocation source node should have removed the shard(s)

Issue Reasons:

Note: This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

elasticsearchmachine commented 2 months ago

This has been muted on branch main

Mute Reasons:

Build Scans:

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-distributed (Team:Distributed)

pxsalehi commented 1 week ago

It seems that the shard really does stay on the node after the relocation succeeds. We get as far as receiving the shard active response from all the other nodes, but we don't go ahead with the deletion because of this check:

 not deleting shard [index1][0], the latest cluster state version[23] is not equal to cluster state before shard active api call [22]

and that seems to be it, we won't try ever again! I'm not sure what else usually triggers a cleanup that would potentially remove the shard much later (in the test, it is the after-test cleanup that triggers org.elasticsearch.indices.IndicesService#processPendingDeletes). We could either try to facilitate that extra cleanup in the test or, what might be more reasonable, look into why that cs version check is so strict! We should probably at least retry it.

pxsalehi commented 3 days ago

> we won't try ever again!

This is not true. It seems we do retry on every cluster state update in IndicesStore. However, the problem is that we quickly trigger another cluster state update, which can lead to a cycle in which each newer cluster state update causes the check above to fail, so the shard is not deleted until we time out.
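
To make the cycle concrete, here is a minimal, self-contained sketch. It is not the actual IndicesStore code: the class and method names are hypothetical, and the only things taken from this issue are the version comparison in the log line above (22 vs. 23) and the assumption that each new cluster state update triggers another attempt.

```java
public class ShardCleanupSketch {

    /**
     * Hypothetical stand-in for the check in the log line: deletion is only
     * allowed if no newer cluster state was applied while the shard active
     * requests were in flight.
     */
    public static boolean canDeleteShard(long versionBeforeShardActiveCall, long latestVersion) {
        return versionBeforeShardActiveCall == latestVersion;
    }

    /**
     * Simulates repeated deletion attempts. updatesDuringEachCall[i] is the
     * number of cluster state updates applied while attempt i waits for the
     * shard active responses. Returns the index of the first attempt that
     * passes the check, or -1 if every attempt is invalidated.
     */
    public static int firstSuccessfulAttempt(long startVersion, int[] updatesDuringEachCall) {
        long version = startVersion;
        for (int attempt = 0; attempt < updatesDuringEachCall.length; attempt++) {
            long versionBeforeCall = version;
            version += updatesDuringEachCall[attempt];
            if (canDeleteShard(versionBeforeCall, version)) {
                return attempt;
            }
            // The newer cluster state that invalidated this attempt is also
            // what triggers the next one (assumption of this sketch).
        }
        return -1;
    }

    public static void main(String[] args) {
        // The situation from the log line: version 22 before the call, 23 after.
        System.out.println(canDeleteShard(22, 23)); // false
        // Every attempt sees one more update, so the shard is never deleted.
        System.out.println(firstSuccessfulAttempt(22, new int[] {1, 1, 1})); // -1
        // Once the cluster state is quiet for one attempt, the deletion goes through.
        System.out.println(firstSuccessfulAttempt(22, new int[] {1, 1, 0})); // 2
    }
}
```

In this model the deletion only succeeds once an attempt completes without a newer cluster state arriving, which matches the behaviour described above: retries do happen, but a steady stream of updates keeps invalidating them.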

> why that cs version check is so strict

I think the check above needs to be that strict to make sure the shards have not moved and are still active where they are, since that is the precondition for deleting the local shard store.