Open elasticsearchmachine opened 6 days ago
Pinging @elastic/es-analytical-engine (Team:Analytics)
This looks unrelated to the feature under test. It looks like the node setup didn't work properly.
~Both build scans above have logs for nodes 8.1.3-0, 8.1.3-1 and 8.1.3-2 - of which I assume two are supposed to be upgraded to 8.16 in this twoThirdsUpgradedTest scenario. In both build scans, node 8.1.3-0 encounters an error related to downloading the geoip database:~
Update: The build scans only have logs for nodes on 8.1.3 - but these nodes are, I think, supposed to be replaced by 8.16 nodes during the test, so not all of their logs are relevant.
But I found this on node 8.1.3-0 in build scan 4900:
» [2024-11-16T21:41:41,929][WARN ][o.e.c.InternalClusterInfoService] [v8.1.3-0] failed to retrieve stats for node [XytpJibCRZGh6U1o68qI5A] org.elasticsearch.transport.NodeNotConnectedException: [v8.1.3-1][127.0.0.1:40535] Node not connected
»
» [2024-11-16T21:41:41,938][WARN ][o.e.c.InternalClusterInfoService] [v8.1.3-0] failed to retrieve shard stats from node [XytpJibCRZGh6U1o68qI5A] org.elasticsearch.transport.NodeNotConnectedException: [v8.1.3-1][127.0.0.1:40535] Node not connected
»
» [2024-11-16T21:41:42,774][WARN ][o.e.c.r.a.AllocationService] [v8.1.3-0] [empty_index][0] marking unavailable shards as stale: [8ZILEg6MSRCPkA_-J5t6xg]
» [2024-11-16T21:41:42,916][WARN ][o.e.c.r.a.AllocationService] [v8.1.3-0] [index_with_replicas][0] marking unavailable shards as stale: [uSR4NDT3S-mFO5gLcHHCuw]
» [2024-11-16T21:42:18,707][WARN ][o.e.c.r.a.AllocationService] [v8.1.3-0] [.ml-notifications-000002][0] marking unavailable shards as stale: [Bq5rEk-LSk6R2J72fFkzYQ]
» [2024-11-16T21:42:21,332][WARN ][o.e.c.r.a.AllocationService] [v8.1.3-0] [analytics_usage][0] marking unavailable shards as stale: [-kqHUVyFRxeQ1IDs28KsKg]
...
» [2024-11-16T21:42:54,978][ERROR][o.e.x.w.Watcher ] [v8.1.3-0] triggered watches could not be deleted [my_watch_f72ce2e2-2418-41d5-8112-f778507e2351-2024-11-16T21:42:54.954659529Z], failure [[.triggered_watches] org.elasticsearch.index.IndexNotFoundException: no such index [.triggered_watches]]
So, if I read this correctly, we somehow lose some analytics_usage data during the rolling upgrade and, additionally, the watchers get borked.
I am not entirely sure how, or even if, this actually causes the test failure (an agg is run but then does not show up in the usage stats). I think I need help from @elastic/es-distributed. Could you folks take a look and see if the logs make more sense to you than they do to me, please?
Weirdly, I have issues running this test locally to attempt to reproduce. The repro line from above leads to
> No tests found for given includes: [**/*$*.class](exclude rules)
and when I remove the -Dtests.method="test {p0=mixed_cluster/100_analytics_usage/Basic test for usage stats on analytics indices}" to run the whole suite, it does run some tests but hangs after about 12 of them.
Additionally, when running the reproducer, I run into
» [2024-11-21T13:16:11,506][ERROR][o.e.i.g.GeoIpDownloader ] [v8.1.3-2] exception during geoip databases update
» java.net.UnknownHostException: invalid.endpoint
on my machine as well.
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
» The build scans only have logs for nodes on 8.1.3 - but these nodes are, I think, supposed to be replaced by 8.16 nodes during the test, so not all of their logs are relevant.
The directories and files kept their old names during the upgrade. If you read through the logs, e.g. the one from v8.1.3-0, the messages show that the upgrade was successful for this node.
...
[2024-11-16T21:40:29,938][INFO ][o.e.n.NativeAccess ] [v8.1.3-0] Using [jdk] native provider and native methods for [Linux]
[2024-11-16T21:40:30,038][INFO ][o.a.l.i.v.PanamaVectorizationProvider] [v8.1.3-0] Java vector incubator API enabled; uses preferredBitSize=256; FMA enabled
[2024-11-16T21:40:31,616][INFO ][o.e.n.Node ] [v8.1.3-0] version[8.16.1-SNAPSHOT], pid[97732], build[tar/68337ff66ed0e7cf8141bb2557e9ae693d401b9d/2024-11-16T21:15:34.222928356Z], OS[Linux/5.15.0-1070-gcp/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/23/23+37-2369]
...
I think the watcher error is likely a red herring. In the logs of v8.1.3-2 (the last node that has not been upgraded), there was an AssertionError that killed the connection.
[2024-11-16T21:42:59,606][WARN ][o.e.t.TcpTransport ] [v8.1.3-2] exception caught on transport layer [Netty4TcpChannel{localAddress=/127.0.0.1:35211, remoteAddress=/127.0.0.1:52880, profile=default}], closing connection
java.lang.Exception: java.lang.AssertionError: cluster:internal/xpack/ml/trained_models/cache/info[n]
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:86) [transport-netty4-8.1.3.jar:8.1.3]
at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:381) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
...
Caused by: java.lang.AssertionError: cluster:internal/xpack/ml/trained_models/cache/info[n]
at org.elasticsearch.transport.InboundAggregator.lambda$new$0(InboundAggregator.java:46) ~[elasticsearch-8.1.3.jar:8.1.3]
at org.elasticsearch.transport.InboundAggregator.initializeRequestState(InboundAggregator.java:197) ~[elasticsearch-8.1.3.jar:8.1.3]
at org.elasticsearch.transport.InboundAggregator.headerReceived(InboundAggregator.java:66) ~[elasticsearch-8.1.3.jar:8.1
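To make the failure mode concrete, here is a toy model (hypothetical class and method names, not the real InboundAggregator code): the receiving node asserts that the action named in an incoming request header is one it has registered a handler for, and because the test clusters run with assertions enabled, an unknown action name throws instead of being rejected gracefully.

```java
import java.util.Set;

// Toy model of the failure (hypothetical names, not the real InboundAggregator):
// the node asserts that an incoming request header names an action it knows;
// with -ea (as in the test clusters) an unknown action name throws.
final class ActionRegistryModel {

    private final Set<String> registeredActions;

    ActionRegistryModel(Set<String> registeredActions) {
        this.registeredActions = registeredActions;
    }

    void headerReceived(String action) {
        // Throws java.lang.AssertionError with the action name as the message,
        // matching the log line above.
        assert registeredActions.contains(action) : action;
        // ... normal request dispatch would continue here
    }

    public static void main(String[] args) {
        // An 8.1.3 node never registered the ML cache-info action (added later).
        ActionRegistryModel oldNode = new ActionRegistryModel(Set.of("indices:data/read/search"));
        oldNode.headerReceived("cluster:internal/xpack/ml/trained_models/cache/info[n]");
    }
}
```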
This ML action has only been available since v8.2.0 (https://github.com/elastic/elasticsearch/pull/83802). The AssertionError complains that the v8.1.3 node does not know how to handle this action. In a mixed cluster during an upgrade, this action should not be sent to an old node. I am routing this to the ML team for further investigation.
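For context, a minimal sketch (hypothetical helper, not the actual code from the linked PR) of the kind of minimum-node-version gate that avoids sending the action while older nodes are still in the cluster:

```java
import org.elasticsearch.Version;
import org.elasticsearch.cluster.ClusterState;

// Sketch only: gate the cache-info request on the minimum node version in the
// cluster, so it is never sent while an 8.1.x node is still a cluster member.
final class TrainedModelCacheInfoGate {

    // The action was introduced in 8.2.0 (see the linked PR).
    private static final Version CACHE_INFO_INTRODUCED = Version.V_8_2_0;

    static boolean canSendCacheInfoRequest(ClusterState state) {
        // getMinNodeVersion() is the lowest version among current cluster
        // members; during a rolling upgrade from 8.1.3 it stays at 8.1.3
        // until the last old node has been replaced.
        return state.nodes().getMinNodeVersion().onOrAfter(CACHE_INFO_INTRODUCED);
    }
}
```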
Pinging @elastic/ml-core (Team:ML)
Thanks for the help @ywangd, you have keen eyes!
Thanks @benwtrent - I can see that the linked PR should have avoided this, but it didn't get backported to 8.16. Probably that backport is all we need.
Also, this looks like a duplicate of https://github.com/elastic/elasticsearch/issues/115170.
Build Scans:
Reproduction Line:
Applicable branches: 8.16
Reproduces locally?: N/A
Failure History: See dashboard (filtered to suite org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT, test "mixed_cluster/100_analytics_usage/Basic test for usage stats on analytics indices")
Failure Message:
Issue Reasons:
Note: This issue was created using new test triage automation. Please report issues or feedback to es-delivery.