elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[CI] UpgradeClusterClientYamlTestSuiteIT test {p0=mixed_cluster/100_analytics_usage/Basic test for usage stats on analytics indices} failing #117204

elasticsearchmachine opened 6 days ago

elasticsearchmachine commented 6 days ago

Build Scans:

Reproduction Line:

./gradlew ":x-pack:qa:rolling-upgrade:v8.1.3#twoThirdsUpgradedTest" -Dtests.class="org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT" -Dtests.method="test {p0=mixed_cluster/100_analytics_usage/Basic test for usage stats on analytics indices}" -Dtests.seed=31D5B83898963B81 -Dtests.bwc=true -Dtests.locale=en-SD -Dtests.timezone=Australia/Currie -Druntime.java=23

Applicable branches: 8.16

Reproduces locally?: N/A

Failure History: See dashboard

Failure Message:

java.lang.AssertionError: Failure at [mixed_cluster/100_analytics_usage:64]: field [analytics.stats.cumulative_cardinality_usage] is not greater than [$cumulative_cardinality_usage]
Expected: a value greater than <1>
     but: <1> was equal to <1>
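For readers unfamiliar with the YAML runner's assertion output: the failed comparison is essentially a Hamcrest greaterThan check. Below is a minimal recreation with the values from the message above hard-coded; the class name is illustrative, and Hamcrest must be on the classpath.

import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.greaterThan;

// Minimal recreation of the failing comparison. The stashed value
// ($cumulative_cardinality_usage) was read before the agg ran, and the
// usage field was re-read afterwards; both came back as 1.
public class UsageStatAssertionSketch {
    public static void main(String[] args) {
        int stashed = 1; // analytics.stats.cumulative_cardinality_usage before the agg
        int current = 1; // the same field after the agg ran
        assertThat(current, greaterThan(stashed)); // AssertionError: <1> was equal to <1>
    }
}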

Issue Reasons:

Note: This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

elasticsearchmachine commented 6 days ago

Pinging @elastic/es-analytical-engine (Team:Analytics)

alex-spies commented 5 days ago

This looks unrelated to the feature under test. It looks like the node setup didn't work properly.

~Both build scans above have logs for nodes 8.1.3-0, 8.1.3-1 and 8.1.3-2 - of which I assume two are supposed to be upgraded to 8.16 in this twoThirdsUpgradedTest scenario. In both build scans, node 8.1.3-0 encounters an error related to downloading the geoip database:~

Update: The build scans only have logs for nodes on 8.1.3 - but these nodes are, I think, supposed to be replaced by 8.16 nodes during the test, so not all of their logs are relevant.

But I found this on node 8.1.3-0 in build scan 4900:

» [2024-11-16T21:41:41,929][WARN ][o.e.c.InternalClusterInfoService] [v8.1.3-0] failed to retrieve stats for node [XytpJibCRZGh6U1o68qI5A] org.elasticsearch.transport.NodeNotConnectedException: [v8.1.3-1][127.0.0.1:40535] Node not connected
»  
» [2024-11-16T21:41:41,938][WARN ][o.e.c.InternalClusterInfoService] [v8.1.3-0] failed to retrieve shard stats from node [XytpJibCRZGh6U1o68qI5A] org.elasticsearch.transport.NodeNotConnectedException: [v8.1.3-1][127.0.0.1:40535] Node not connected
»  
» [2024-11-16T21:41:42,774][WARN ][o.e.c.r.a.AllocationService] [v8.1.3-0] [empty_index][0] marking unavailable shards as stale: [8ZILEg6MSRCPkA_-J5t6xg]
» [2024-11-16T21:41:42,916][WARN ][o.e.c.r.a.AllocationService] [v8.1.3-0] [index_with_replicas][0] marking unavailable shards as stale: [uSR4NDT3S-mFO5gLcHHCuw]
» [2024-11-16T21:42:18,707][WARN ][o.e.c.r.a.AllocationService] [v8.1.3-0] [.ml-notifications-000002][0] marking unavailable shards as stale: [Bq5rEk-LSk6R2J72fFkzYQ]
» [2024-11-16T21:42:21,332][WARN ][o.e.c.r.a.AllocationService] [v8.1.3-0] [analytics_usage][0] marking unavailable shards as stale: [-kqHUVyFRxeQ1IDs28KsKg]

...

» [2024-11-16T21:42:54,978][ERROR][o.e.x.w.Watcher          ] [v8.1.3-0] triggered watches could not be deleted [my_watch_f72ce2e2-2418-41d5-8112-f778507e2351-2024-11-16T21:42:54.954659529Z], failure [[.triggered_watches] org.elasticsearch.index.IndexNotFoundException: no such index [.triggered_watches]]

So, if I read this correctly, somehow during the rolling upgrade we lose some analytics_usage data and, additionally, the watchers somehow get borked.

I am not entirely sure how, or even whether, this actually causes the test failure, where an agg is run but then not registered in the usage stats. I think I need help from @elastic/es-distributed . Could you folks take a look and see if the logs make more sense to you than they do to me, please?

alex-spies commented 5 days ago

Weirdly, I am having trouble running this test locally to reproduce. The repro line from above leads to > No tests found for given includes: [**/*$*.class](exclude rules) - and when I remove the -Dtests.method="test {p0=mixed_cluster/100_analytics_usage/Basic test for usage stats on analytics indices}" to run the whole suite, it does run some tests but hangs after about 12 of them.

Additionally, when running the reproducer, I do run into

» [2024-11-21T13:16:11,506][ERROR][o.e.i.g.GeoIpDownloader  ] [v8.1.3-2] exception during geoip databases update
» java.net.UnknownHostException: invalid.endpoint

on my machine as well.

elasticsearchmachine commented 5 days ago

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

ywangd commented 5 days ago

The build scans only have logs for nodes on 8.1.3 - but these nodes are, I think, supposed to be replaced by 8.16 nodes during the test, so not all of their logs are relevant.

The directories and files keep their old names during the upgrade. If you read through the logs, e.g. the one from v8.1.3-0, the messages show that the upgrade completed successfully for this node.

...
[2024-11-16T21:40:29,938][INFO ][o.e.n.NativeAccess       ] [v8.1.3-0] Using [jdk] native provider and native methods for [Linux]
[2024-11-16T21:40:30,038][INFO ][o.a.l.i.v.PanamaVectorizationProvider] [v8.1.3-0] Java vector incubator API enabled; uses preferredBitSize=256; FMA enabled
[2024-11-16T21:40:31,616][INFO ][o.e.n.Node               ] [v8.1.3-0] version[8.16.1-SNAPSHOT], pid[97732], build[tar/68337ff66ed0e7cf8141bb2557e9ae693d401b9d/2024-11-16T21:15:34.222928356Z], OS[Linux/5.15.0-1070-gcp/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/23/23+37-2369]
...

I think the watcher error is likely a red herring. In the logs of v8.1.3-2 (the last node that had not yet been upgraded), there was an AssertionError that killed the connection.

[2024-11-16T21:42:59,606][WARN ][o.e.t.TcpTransport       ] [v8.1.3-2] exception caught on transport layer [Netty4TcpChannel{localAddress=/127.0.0.1:35211, remoteAddress=/127.0.0.1:52880, profile=default}], closing connection
java.lang.Exception: java.lang.AssertionError: cluster:internal/xpack/ml/trained_models/cache/info[n]
    at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:86) [transport-netty4-8.1.3.jar:8.1.3]
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:381) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
...
Caused by: java.lang.AssertionError: cluster:internal/xpack/ml/trained_models/cache/info[n]
    at org.elasticsearch.transport.InboundAggregator.lambda$new$0(InboundAggregator.java:46) ~[elasticsearch-8.1.3.jar:8.1.3]
    at org.elasticsearch.transport.InboundAggregator.initializeRequestState(InboundAggregator.java:197) ~[elasticsearch-8.1.3.jar:8.1.3]
    at org.elasticsearch.transport.InboundAggregator.headerReceived(InboundAggregator.java:66) ~[elasticsearch-8.1.3.jar:8.1

This ML action has only been available since v8.2.0 (https://github.com/elastic/elasticsearch/pull/83802). The AssertionError complains that v8.1.3 does not know how to handle this action. In a mixed cluster during upgrade, this action should not have been sent to an old node. I am routing this to the ML team for further investigation.
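To make the mechanism concrete: the stack trace above points at the handler lookup in InboundAggregator. A hedged sketch of that check follows (not the verbatim Elasticsearch source; the class name is illustrative). The receiving node looks up a handler for the incoming action name, and with JVM assertions enabled, as in CI test clusters, an unknown action name trips an AssertionError carrying that name.

import java.util.Map;

// Hedged sketch of the action-name check; run with `java -ea` so the
// assertion fires, as it does in test clusters.
public class UnknownActionSketch {
    // A v8.1.3 node's registry has no entry for the ML cache-info action,
    // which only exists from v8.2.0 onwards.
    static final Map<String, Object> HANDLERS = Map.of();

    static void onRequestHeader(String actionName) {
        Object handler = HANDLERS.get(actionName);
        assert handler != null : actionName; // the AssertionError in the log above
    }

    public static void main(String[] args) {
        onRequestHeader("cluster:internal/xpack/ml/trained_models/cache/info[n]");
    }
}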

elasticsearchmachine commented 5 days ago

Pinging @elastic/ml-core (Team:ML)

benwtrent commented 5 days ago

Related: https://github.com/elastic/elasticsearch/pull/116236

alex-spies commented 5 days ago

Thanks for the help @ywangd , you have keen eyes!

alex-spies commented 5 days ago

Thanks @benwtrent - I can see that the linked PR should have avoided this, but it didn't get backported to 8.16. Probably that backport is all we need.
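For context, the usual shape of such a fix is a BWC guard on the sending side: consult each node's version and skip nodes that predate the action. A minimal sketch under that assumption; the Node type here is a stand-in for Elasticsearch's DiscoveryNode, not the actual diff of the linked PR, and only the action name and versions come from the discussion above.

import java.util.List;

// Hedged sketch of the BWC-guard pattern for a mixed-version cluster.
public class BwcGuardSketch {
    record Node(String name, int major, int minor) {
        boolean onOrAfter(int maj, int min) {
            return major > maj || (major == maj && minor >= min);
        }
    }

    static final String ACTION = "cluster:internal/xpack/ml/trained_models/cache/info";

    public static void main(String[] args) {
        List<Node> nodes = List.of(new Node("v8.1.3-2", 8, 1), new Node("upgraded-0", 8, 16));
        for (Node node : nodes) {
            if (node.onOrAfter(8, 2)) { // the action was added in v8.2.0
                System.out.println("send " + ACTION + " to " + node.name());
            } else {
                System.out.println("skip " + node.name() + " (pre-8.2.0)");
            }
        }
    }
}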

Also, this looks like a duplicate of https://github.com/elastic/elasticsearch/issues/115170.

davidkyle commented 5 days ago

Fixed by https://github.com/elastic/elasticsearch/pull/117269