Closed davidkyle closed 3 years ago
Pinging @elastic/es-distributed (Team:Distributed)
Yes there's actually quite a number of different failures here. Going through them one-by-one:
https://gradle-enterprise.elastic.co/s/gervfulv6lrku
1> [2021-04-26T04:17:14,386][INFO ][o.e.r.b.t.RepositoryAnalyzeAction] [testRepositoryAnalysis] running analysis of repository [test-repo] using path [temp-analysis-9_zJzWeWQ4a320bWiQRD8w]
1> [2021-04-26T04:17:19,399][WARN ][o.e.r.b.t.RepositoryAnalyzeAction] [node_s4] analysis of repository [test-repo] failed before cleanup phase, attempting best-effort cleanup but you may need to manually remove [temp-analysis-9_zJzWeWQ4a320bWiQRD8w]
1> [2021-04-26T04:17:19,405][INFO ][o.e.r.b.t.RepositoryAnalysisSuccessIT] [testRepositoryAnalysis] --> skipped repo consistency checks because [repository is not used for snapshots]
1> [2021-04-26T04:17:19,406][INFO ][o.e.r.b.t.RepositoryAnalysisSuccessIT] [testRepositoryAnalysis] [RepositoryAnalysisSuccessIT#testRepositoryAnalysis]: cleaning up after test
1> [2021-04-26T04:17:19,501][INFO ][o.e.c.m.MetadataIndexTemplateService] [node_s1] removing template [random_index_template]
1> [2021-04-26T04:17:19,584][INFO ][o.e.r.RepositoriesService] [node_s1] deleted repositories [[test-repo]]
1> [2021-04-26T04:17:19,636][WARN ][o.e.t.TransportService ] [node_s4] Received response for a request that has timed out, sent [5.2s/5207ms] ago, timed out [200ms/200ms] ago, action [cluster:admin/repository/analyze/blob], node [{node_s0}{nC6fsqDaTWCCfWW-vwz03w}{j9tgqpJTSuqh8dCh7HqZcg}{127.0.0.1}{127.0.0.1:49320}{imr}{xpack.installed=true}], id [25]
1> [2021-04-26T04:17:19,640][INFO ][o.e.r.b.t.RepositoryAnalysisSuccessIT] [testRepositoryAnalysis] [RepositoryAnalysisSuccessIT#testRepositoryAnalysis]: cleaned up after test
1> [2021-04-26T04:17:19,641][INFO ][o.e.r.b.t.RepositoryAnalysisSuccessIT] [testRepositoryAnalysis] after test
2> REPRODUCE WITH: ./gradlew ':x-pack:plugin:snapshot-repo-test-kit:internalClusterTest' --tests "org.elasticsearch.repositories.blobstore.testkit.RepositoryAnalysisSuccessIT.testRepositoryAnalysis" -Dtests.seed=4C5D2CBA58FFB44A -Dtests.locale=ms-MY -Dtests.timezone=MST -Druntime.java=8
2> RepositoryVerificationException[[test-repo] analysis failed, you may need to manually remove [temp-analysis-9_zJzWeWQ4a320bWiQRD8w]]; nested: ReceiveTimeoutTransportException[[node_s0][127.0.0.1:49320][cluster:admin/repository/analyze/blob] request_id [25] timed out after [5007ms]];
at __randomizedtesting.SeedInfo.seed([4C5D2CBA58FFB44A:B5608DD4866487A3]:0)
Caused by:
ReceiveTimeoutTransportException[[node_s0][127.0.0.1:49320][cluster:admin/repository/analyze/blob] request_id [25] timed out after [5007ms]]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1185)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I'm surprised that we exceeded the 5-second timeout here but maybe it's possible if this CI worker was running quite slowly. It completed shortly after the timeout, but I don't see an obvious deadlock that could otherwise explain this, so I opened #72314 to just give this a more generous timeout.
https://gradle-enterprise.elastic.co/s/m2kldzfhmictm https://gradle-enterprise.elastic.co/s/o4qzst6pglg36
1> [2021-04-26T00:13:31,036][INFO ][o.e.r.b.t.S3SnapshotRepoTestKitIT] [testRepositoryAnalysis] creating repository [repository] of type [s3]
1> [2021-04-26T00:14:48,556][INFO ][o.e.r.b.t.S3SnapshotRepoTestKitIT] [testRepositoryAnalysis] after test
2> REPRODUCE WITH: ./gradlew ':x-pack:plugin:snapshot-repo-test-kit:qa:s3:integTest' --tests "org.elasticsearch.repositories.blobstore.testkit.S3SnapshotRepoTestKitIT.testRepositoryAnalysis" -Dtests.seed=2A6F99B8C383414F -Dtests.locale=da-DK -Dtests.timezone=America/Regina -Druntime.java=15
2> org.elasticsearch.client.ResponseException: method [PUT], host [http://127.0.0.1:43133], URI [_snapshot/repository?verify=true], status line [HTTP/1.1 503 Service Unavailable]
{"error":{"root_cause":[{"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (put_repository [repository]) within 30s"}],"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (put_repository [repository]) within 30s"},"status":503}
at __randomizedtesting.SeedInfo.seed([2A6F99B8C383414F:D35238D61D1872A6]:0)
These two were just processing cluster state updates excruciatingly slowly, even before the test started running.:
[2021-04-26T06:13:23,614][INFO ][o.e.n.Node ] [integTest-0] started
[2021-04-26T06:13:23,756][INFO ][o.e.c.s.MasterService ] [integTest-0] elected-as-master ([1] nodes joined)[{integTest-0}{cVjFb4J7QxCd567sAErqRQ}{LndVCvfnTaOZMeSZGmozzQ}{127.0.0.1}{127.0.0.1:37333}{cdfhilmrstw} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 1, version: 1, delta: master node
changed {previous [], current [{integTest-0}{cVjFb4J7QxCd567sAErqRQ}{LndVCvfnTaOZMeSZGmozzQ}{127.0.0.1}{127.0.0.1:37333}{cdfhilmrstw}]}
[2021-04-26T06:13:23,810][INFO ][o.e.c.c.CoordinationState] [integTest-0] cluster UUID set to [uETKElK5Q8Od6WM8cKDEdg]
[2021-04-26T06:13:26,258][INFO ][o.e.c.s.ClusterApplierService] [integTest-0] master node changed {previous [], current [{integTest-0}{cVjFb4J7QxCd567sAErqRQ}{LndVCvfnTaOZMeSZGmozzQ}{127.0.0.1}{127.0.0.1:37333}{cdfhilmrstw}]}, term: 1, version: 1, reason: Publication{term=1, version=1}
[2021-04-26T06:13:27,905][INFO ][o.e.g.GatewayService ] [integTest-0] recovered [0] indices into cluster_state
[2021-04-26T06:13:28,244][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] adding template [.ml-anomalies-] for index patterns [.ml-anomalies-*]
[2021-04-26T06:13:31,079][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] adding template [.ml-notifications-000001] for index patterns [.ml-notifications-000001]
[2021-04-26T06:13:34,819][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] adding template [.ml-state] for index patterns [.ml-state*]
[2021-04-26T06:13:38,311][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] adding template [.ml-stats] for index patterns [.ml-stats-*]
[2021-04-26T06:13:41,703][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] adding component template [logs-settings]
[2021-04-26T06:13:45,009][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] adding component template [synthetics-mappings]
[2021-04-26T06:13:48,208][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] adding component template [logs-mappings]
[2021-04-26T06:13:49,871][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] adding component template [metrics-mappings]
[2021-04-26T06:13:51,669][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] adding component template [metrics-settings]
[2021-04-26T06:13:55,051][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] adding component template [synthetics-settings]
[2021-04-26T06:13:57,883][ERROR][o.e.x.m.e.l.LocalExporter] [integTest-0] failed to set monitoring template [.monitoring-alerts-7]
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (create-index-template [.monitoring-alerts-7], cause [api]) within 30s
at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:132) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1(MasterService.java:131) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:669) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
at java.lang.Thread.run(Thread.java:832) [?:?]
...
We require the repo registration to complete in <30s but it didn't because of other cluster state updates still not having completed. Not sure what to do about this, there's definitely something wrong with how we're running these clusters if they take ≥3s for each cluster state update. TBD.
https://gradle-enterprise.elastic.co/s/u56mg6brfudq6
This one fails because the MD5 checksum we compute on upload doesn't match the checksum the server computes. I opened https://github.com/elastic/elasticsearch/issues/72358 to track this separately.
I opened https://github.com/elastic/elasticsearch/issues/72358 specifically for the checksum failures because they're definitely unrelated to the other failures.
I opened https://github.com/elastic/elasticsearch/issues/72404 to discuss the cases where cluster state updates were just desperately slow, as I think this needs help from the delivery folks.
Since these failures are all either addressed or tracked elsewhere, I am closing this.
Build scan: https://gradle-enterprise.elastic.co/s/gervfulv6lrku https://gradle-enterprise.elastic.co/s/o4qzst6pglg36 https://gradle-enterprise.elastic.co/s/u56mg6brfudq6 https://gradle-enterprise.elastic.co/s/m2kldzfhmictm
Repro line:
Reproduces locally?: NO
Applicable branches: master, 7.x, 7.13
Failure history: Regularly although I cannot say the cause is the same for all of these https://build-stats.elastic.co/goto/5c313f0165b5e7db32c26bec167ef280
Failure excerpt: The actual tests and errors are vary:
https://gradle-enterprise.elastic.co/s/gervfulv6lrku/tests/:x-pack:plugin:snapshot-repo-test-kit:internalClusterTest/org.elasticsearch.repositories.blobstore.testkit.RepositoryAnalysisSuccessIT/testRepositoryAnalysis?expanded-stacktrace=WyIwLTEiXQ
https://gradle-enterprise.elastic.co/s/o4qzst6pglg36/tests/:x-pack:plugin:snapshot-repo-test-kit:qa:s3:integTest/org.elasticsearch.repositories.blobstore.testkit.S3SnapshotRepoTestKitIT/testRepositoryAnalysis?top-execution=1
https://gradle-enterprise.elastic.co/s/u56mg6brfudq6/tests/:x-pack:plugin:snapshot-repo-test-kit:qa:s3:integTest/org.elasticsearch.repositories.blobstore.testkit.S3SnapshotRepoTestKitIT/testRepositoryAnalysis?top-execution=1