elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[CI] Packaging tests fail with connection failure on opensuse and sles #30295

Closed DaveCTurner closed 5 years ago

DaveCTurner commented 6 years ago

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+packaging-tests/2645/console

The REPRODUCE WITH line says:

./gradlew :x-pack:qa:vagrant:vagrantOpensuse42#batsPackagingTest -Dtests.seed=7E7AE0563BF3B4A3

The main problem seems to be:

07:47:13 # Connection failure to: http://127.0.0.1:9200/_xpack/security/user/kibana/_password?pretty failed: Read timed out

Unfortunately I can't find the server logs, so I can't see any more detail than this. Here's the relevant bit of the CI log:

07:46:01 ok 15 [RPM BOOTSTRAP PASSWORD] stop Elasticsearch
07:47:12 not ok 16 [TAR SETUP PASSWORD] test auto generated passwords
07:47:12 # (in test file /project/build/packaging/bats/tests/30_tar_setup_passwords.bats, line 60)
07:47:12 #   `false' failed
07:47:12 # Unpacking tarball to /tmp/elasticsearch
07:47:12 # checking for bin file elasticsearch-certgen
07:47:12 # checking for bin file elasticsearch-certgen.bat
07:47:12 # checking for bin file elasticsearch-certutil
07:47:12 # checking for bin file elasticsearch-certutil.bat
07:47:12 # checking for bin file elasticsearch-croneval
07:47:12 # checking for bin file elasticsearch-croneval.bat
07:47:12 # checking for bin file elasticsearch-migrate
07:47:12 # checking for bin file elasticsearch-migrate.bat
07:47:12 # checking for bin file elasticsearch-saml-metadata
07:47:12 # checking for bin file elasticsearch-saml-metadata.bat
07:47:12 # checking for bin file elasticsearch-setup-passwords
07:47:12 # checking for bin file elasticsearch-setup-passwords.bat
07:47:12 # checking for bin file elasticsearch-sql-cli
07:47:12 # checking for bin file elasticsearch-sql-cli.bat
07:47:13 # checking for bin file elasticsearch-sql-cli-7.0.0-alpha1-SNAPSHOT.jar
07:47:13 # checking for bin file elasticsearch-syskeygen
07:47:13 # checking for bin file elasticsearch-syskeygen.bat
07:47:13 # checking for bin file elasticsearch-users
07:47:13 # checking for bin file elasticsearch-users.bat
07:47:13 # checking for bin file x-pack-env
07:47:13 # checking for bin file x-pack-env.bat
07:47:13 # checking for bin file x-pack-security-env
07:47:13 # checking for bin file x-pack-security-env.bat
07:47:13 # checking for bin file x-pack-watcher-env
07:47:13 # checking for bin file x-pack-watcher-env.bat
07:47:13 # elasticsearch
07:47:13 # elasticsearch.bat
07:47:13 # elasticsearch-certgen
07:47:13 # elasticsearch-certgen.bat
07:47:13 # elasticsearch-certutil
07:47:13 # elasticsearch-certutil.bat
07:47:13 # elasticsearch-croneval
07:47:13 # elasticsearch-croneval.bat
07:47:13 # elasticsearch-env
07:47:13 # elasticsearch-env.bat
07:47:13 # elasticsearch-keystore
07:47:13 # elasticsearch-keystore.bat
07:47:13 # elasticsearch-migrate
07:47:13 # elasticsearch-migrate.bat
07:47:13 # elasticsearch-plugin
07:47:13 # elasticsearch-plugin.bat
07:47:13 # elasticsearch-saml-metadata
07:47:13 # elasticsearch-saml-metadata.bat
07:47:13 # elasticsearch-service.bat
07:47:13 # elasticsearch-service-mgr.exe
07:47:13 # elasticsearch-service-x64.exe
07:47:13 # elasticsearch-setup-passwords
07:47:13 # elasticsearch-setup-passwords.bat
07:47:13 # elasticsearch-sql-cli
07:47:13 # elasticsearch-sql-cli-7.0.0-alpha1-SNAPSHOT.jar
07:47:13 # elasticsearch-sql-cli.bat
07:47:13 # elasticsearch-syskeygen
07:47:13 # elasticsearch-syskeygen.bat
07:47:13 # elasticsearch-translog
07:47:13 # elasticsearch-translog.bat
07:47:13 # elasticsearch-users
07:47:13 # elasticsearch-users.bat
07:47:13 # x-pack-env
07:47:13 # x-pack-env.bat
07:47:13 # x-pack-security-env
07:47:13 # x-pack-security-env.bat
07:47:13 # x-pack-watcher-env
07:47:13 # x-pack-watcher-env.bat
07:47:13 # Expected x-pack elasticsearch-setup-passwords tool exit code to be zero
07:47:13 # Initiating the setup of passwords for reserved users elastic,kibana,logstash_system,beats_system.
07:47:13 # The passwords will be randomly generated and printed to the console.
07:47:13 # Please confirm that you would like to continue [y/N]
07:47:13 # 
07:47:13 # 
07:47:13 # Connection failure to: http://127.0.0.1:9200/_xpack/security/user/kibana/_password?pretty failed: Read timed out
07:47:13 # 
07:47:13 # ERROR: Failed to set password for user [kibana].
elasticmachine commented 6 years ago

Pinging @elastic/es-core-infra

DaveCTurner commented 6 years ago

@andyb-elastic saw a similar failure a while back, and that failure included the Elasticsearch logs, which said this:

# [2018-02-27T10:45:38,258][INFO ][o.e.n.Node               ] initialized
# [2018-02-27T10:45:38,258][INFO ][o.e.n.Node               ] [OpH_qGq] starting ...
# [2018-02-27T10:45:38,409][INFO ][o.e.t.TransportService   ] [OpH_qGq] publish_address {127.0.0.1:9300}, bound_addresses {127.0.0.1:9300}
# [2018-02-27T10:45:38,422][WARN ][o.e.b.BootstrapChecks    ] [OpH_qGq] max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536]
# [2018-02-27T10:45:41,463][INFO ][o.e.c.s.MasterService    ] [OpH_qGq] zen-disco-elected-as-master ([0] nodes joined), reason: master node changed {previous [], current [{OpH_qGq}{OpH_qGqbSjOWzlsc2tPxfg}{3UZwOqnAT5KXgUcXlqfCxA}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=8370659328, ml.max_open_jobs=20, ml.enabled=true}]}
# [2018-02-27T10:45:41,467][INFO ][o.e.c.s.ClusterApplierService] [OpH_qGq] master node changed {previous [], current [{OpH_qGq}{OpH_qGqbSjOWzlsc2tPxfg}{3UZwOqnAT5KXgUcXlqfCxA}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=8370659328, ml.max_open_jobs=20, ml.enabled=true}]}, reason: apply cluster state (from master [master {OpH_qGq}{OpH_qGqbSjOWzlsc2tPxfg}{3UZwOqnAT5KXgUcXlqfCxA}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=8370659328, ml.max_open_jobs=20, ml.enabled=true} committed version [1] source [zen-disco-elected-as-master ([0] nodes joined)]])
# [2018-02-27T10:45:41,477][INFO ][o.e.x.s.t.n.SecurityNetty4HttpServerTransport] [OpH_qGq] publish_address {127.0.0.1:9200}, bound_addresses {127.0.0.1:9200}
# [2018-02-27T10:45:41,478][INFO ][o.e.n.Node               ] [OpH_qGq] started
# [2018-02-27T10:45:41,652][INFO ][o.e.g.GatewayService     ] [OpH_qGq] recovered [0] indices into cluster_state
# [2018-02-27T10:46:11,661][DEBUG][o.e.a.a.i.t.p.TransportPutIndexTemplateAction] [OpH_qGq] failed to put template [.monitoring-beats]
# org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (create-index-template [.monitoring-beats], cause [api]) within 30s
#   at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:125) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
#   at java.util.ArrayList.forEach(ArrayList.java:1249) ~[?:1.8.0_144]
#   at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1(MasterService.java:124) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
#   at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
#   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
#   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
#   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
...

Not sure why the Elasticsearch log wasn't included in the build failure above, but it'd be mighty useful if it could appear in future.

If it is the same situation then this is failing to put a bunch of index templates within the 30-second timeout. It's not a network issue, since this is a one-node cluster. It'd be useful to see DEBUG logging from:

org.elasticsearch.action.support.master.TransportMasterNodeAction
org.elasticsearch.action.admin.indices.template.put.TransportPutIndexTemplateAction

(edited - I originally mentioned mappings, but this is not that)
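For reference, a minimal sketch of how those two loggers could be switched on as static logger settings before the node starts - the config path below assumes the RPM/DEB layout and is only a placeholder, not what the bats helpers actually do:

# Sketch only: append logger settings to elasticsearch.yml (path is an assumption;
# the tarball install keeps its config under its own config/ directory instead).
cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
logger.org.elasticsearch.action.support.master.TransportMasterNodeAction: DEBUG
logger.org.elasticsearch.action.admin.indices.template.put.TransportPutIndexTemplateAction: DEBUG
EOF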

andyb-elastic commented 6 years ago

Another similar one

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+packaging-tests/2658/console
build-2658-opensuse.txt

andyb-elastic commented 6 years ago

And another

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+packaging-tests/2660/console
build-2660-opensuse.txt

@DaveCTurner do you have a recommendation for any logging I can add to get more information about why the node is failing to process these cluster state events?

DaveCTurner commented 6 years ago

The logs seem somewhat garbled - messages seem to be duplicated in a funny order. Nonetheless, this bit seems unusually slow:

# [2018-05-06T06:46:18,995][INFO ][o.e.x.s.t.n.SecurityNetty4HttpServerTransport] [I3r5omj] publish_address {127.0.0.1:9200}, bound_addresses {127.0.0.1:9200}
# [2018-05-06T06:46:18,995][INFO ][o.e.n.Node               ] [I3r5omj] started
# [2018-05-06T06:46:23,813][INFO ][o.e.g.GatewayService     ] [I3r5omj] recovered [0] indices into cluster_state
# [2018-05-06T06:46:23,931][INFO ][o.e.c.m.MetaDataIndexTemplateService] [I3r5omj] adding template [.watch-history-7] for index patterns [.watcher-history-7*]
# [2018-05-06T06:46:28,449][INFO ][o.e.c.m.MetaDataIndexTemplateService] [I3r5omj] adding template [.watches] for index patterns [.watches*]
# [2018-05-06T06:46:32,919][INFO ][o.e.c.m.MetaDataIndexTemplateService] [I3r5omj] adding template [.triggered_watches] for index patterns [.triggered_watches*]
# [2018-05-06T06:46:35,044][INFO ][o.e.c.m.MetaDataIndexTemplateService] [I3r5omj] adding template [.monitoring-logstash] for index patterns [.monitoring-logstash-6-*]
# [2018-05-06T06:46:39,483][INFO ][o.e.c.m.MetaDataIndexTemplateService] [I3r5omj] adding template [.monitoring-es] for index patterns [.monitoring-es-6-*]
# [2018-05-06T06:46:43,843][INFO ][o.e.c.m.MetaDataIndexTemplateService] [I3r5omj] adding template [.monitoring-beats] for index patterns [.monitoring-beats-6-*]
# [2018-05-06T06:46:48,220][INFO ][o.e.c.m.MetaDataIndexTemplateService] [I3r5omj] adding template [.monitoring-alerts] for index patterns [.monitoring-alerts-6]
# [2018-05-06T06:46:52,596][INFO ][o.e.c.m.MetaDataIndexTemplateService] [I3r5omj] adding template [.monitoring-kibana] for index patterns [.monitoring-kibana-6-*]
# [2018-05-06T06:46:53,834][ERROR][o.e.x.m.e.l.LocalExporter] failed to set monitoring pipeline [xpack_monitoring_6]
# org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-pipeline-xpack_monitoring_6) within 30s

On my laptop it takes <1s to make all the templates:

[2018-05-07T18:25:36,684][WARN ][o.e.x.s.a.s.m.NativeRoleMappingStore] [0MEKNSQ] Failed to clear cache for realms [[]]
[2018-05-07T18:25:36,726][INFO ][o.e.g.GatewayService     ] [0MEKNSQ] recovered [0] indices into cluster_state
[2018-05-07T18:25:36,884][INFO ][o.e.c.m.MetaDataIndexTemplateService] [0MEKNSQ] adding template [.watch-history-7] for index patterns [.watcher-history-7*]
[2018-05-07T18:25:36,924][INFO ][o.e.c.m.MetaDataIndexTemplateService] [0MEKNSQ] adding template [.triggered_watches] for index patterns [.triggered_watches*]
[2018-05-07T18:25:36,955][INFO ][o.e.c.m.MetaDataIndexTemplateService] [0MEKNSQ] adding template [.watches] for index patterns [.watches*]
[2018-05-07T18:25:37,020][INFO ][o.e.c.m.MetaDataIndexTemplateService] [0MEKNSQ] adding template [.monitoring-logstash] for index patterns [.monitoring-logstash-6-*]
[2018-05-07T18:25:37,070][INFO ][o.e.c.m.MetaDataIndexTemplateService] [0MEKNSQ] adding template [.monitoring-es] for index patterns [.monitoring-es-6-*]
[2018-05-07T18:25:37,107][INFO ][o.e.c.m.MetaDataIndexTemplateService] [0MEKNSQ] adding template [.monitoring-alerts] for index patterns [.monitoring-alerts-6]
[2018-05-07T18:25:37,148][INFO ][o.e.c.m.MetaDataIndexTemplateService] [0MEKNSQ] adding template [.monitoring-beats] for index patterns [.monitoring-beats-6-*]
[2018-05-07T18:25:37,180][INFO ][o.e.c.m.MetaDataIndexTemplateService] [0MEKNSQ] adding template [.monitoring-kibana] for index patterns [.monitoring-kibana-6-*]
[2018-05-07T18:25:37,319][INFO ][o.e.l.LicenseService     ] [0MEKNSQ] license [6aae53f8-5ecf-4fc8-b66a-50f5e31c3ee7] mode [basic] - valid

Perhaps this will shed a little more light on where the time is going:

logger.org.elasticsearch.indices: DEBUG
logger.org.elasticsearch.gateway: TRACE
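(If it's easier than editing config files, the same loggers can usually be bumped at runtime on a node that's already up, via the cluster settings API - a sketch, assuming credentials for the secured node are at hand:)

# Sketch: ELASTIC_PASSWORD is a placeholder for whatever password the test set up.
curl -s -u "elastic:$ELASTIC_PASSWORD" -H 'Content-Type: application/json' \
  -X PUT 'http://127.0.0.1:9200/_cluster/settings' -d '{
    "transient": {
      "logger.org.elasticsearch.indices": "DEBUG",
      "logger.org.elasticsearch.gateway": "TRACE"
    }
  }'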
droberts195 commented 6 years ago

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+packaging-tests/2717/console is another instance of this.

Again the outermost error is:

07:01:21 # Connection failure to: http://127.0.0.1:9200/_xpack/security/user/kibana/_password?pretty failed: Read timed out

and the error in the Elasticsearch log is:

07:01:21 # [2018-05-25T07:01:13,198][ERROR][o.e.x.m.e.l.LocalExporter] failed to set monitoring pipeline [xpack_monitoring_2]
07:01:21 # org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-pipeline-xpack_monitoring_2) within 30s
07:01:21 #  at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:124) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
07:01:21 #  at java.util.ArrayList.forEach(ArrayList.java:1249) ~[?:1.8.0_144]
07:01:21 #  at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1(MasterService.java:123) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
07:01:21 #  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:625) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
07:01:21 #  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
07:01:21 #  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
07:01:21 #  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
DaveCTurner commented 6 years ago

Similar story: adding templates is slow:

# [2018-05-25T07:00:43,002][INFO ][o.e.g.GatewayService     ] [rWPAvXn] recovered [0] indices into cluster_state
# [2018-05-25T07:00:43,134][INFO ][o.e.c.m.MetaDataIndexTemplateService] [rWPAvXn] adding template [.watches] for index patterns [.watches*]
# [2018-05-25T07:00:43,871][INFO ][o.e.c.m.MetaDataIndexTemplateService] [rWPAvXn] adding template [.watch-history-8] for index patterns [.watcher-history-8*]
# [2018-05-25T07:00:44,744][INFO ][o.e.c.m.MetaDataIndexTemplateService] [rWPAvXn] adding template [.triggered_watches] for index patterns [.triggered_watches*]
# [2018-05-25T07:00:49,884][INFO ][o.e.c.m.MetaDataIndexTemplateService] [rWPAvXn] adding template [.monitoring-logstash] for index patterns [.monitoring-logstash-6-*]
# [2018-05-25T07:00:54,904][INFO ][o.e.c.m.MetaDataIndexTemplateService] [rWPAvXn] adding template [.monitoring-es] for index patterns [.monitoring-es-6-*]
# [2018-05-25T07:00:59,906][INFO ][o.e.c.m.MetaDataIndexTemplateService] [rWPAvXn] adding template [.monitoring-beats] for index patterns [.monitoring-beats-6-*]
# [2018-05-25T07:01:04,968][INFO ][o.e.c.m.MetaDataIndexTemplateService] [rWPAvXn] adding template [.monitoring-alerts] for index patterns [.monitoring-alerts-6]
# [2018-05-25T07:01:09,960][INFO ][o.e.c.m.MetaDataIndexTemplateService] [rWPAvXn] adding template [.monitoring-kibana] for index patterns [.monitoring-kibana-6-*]
# [2018-05-25T07:01:13,010][ERROR][o.e.x.m.e.l.LocalExporter] failed to set monitoring pipeline [xpack_monitoring_6]

@andyb-elastic it doesn't look like we're getting the extra detail yet. Could you check that these settings are being applied in the right places?

logger.org.elasticsearch.indices: DEBUG
logger.org.elasticsearch.gateway: TRACE
andyb-elastic commented 6 years ago

They're not applied because I didn't follow up on this - my bad. I'll add them now.

andyb-elastic commented 6 years ago

Setting up that logging the right way ended up being a little involved; I'll push it next week in the interest of not breaking the build on a Friday afternoon.

cbuescher commented 6 years ago

This looks very similar to me: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+packaging-tests/890/console

@andyb-elastic not sure which additional detail you are looking for, maybe this one shows more?

andyb-elastic commented 6 years ago

I turned on the logging David suggested when the tests run on opensuse https://github.com/elastic/elasticsearch/commit/ba8bb1d4a15fd5d26357ea0cc7c04b28645b50bf

@cbuescher it's likely the same; it's hard to tell, since it looks like that one didn't dump the server logs. The detail we're looking for is more information about why the server sometimes takes a long time to start on opensuse.

ywelsch commented 6 years ago

@jasontedor can you assign someone to address these test failures?

jasontedor commented 6 years ago

I think this is caused by internal infrastructure issues. I will follow up internally.

andyb-elastic commented 6 years ago

Here's one with the logging turned up. I'm not sure it really revealed anything, or at least I don't see any additional detail on why the template creation is slow.

packaging-2774.txt

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+packaging-tests/2774/consoleText

andyb-elastic commented 6 years ago

I turned up org.elasticsearch.indices logging to TRACE and org.elasticsearch.cluster to DEBUG https://github.com/elastic/elasticsearch/commit/01d64b128b214c2ac5afc1b0b514dd0d45e60c1b

DaveCTurner commented 6 years ago

The log is still suggestive of extremely slow IO when writing updated cluster states to disk. More detail from logger.org.elasticsearch.cluster is a good move, thanks, and I also opened #31319 because we currently only log the start of the write, not the end.

# [2018-06-13T07:01:12,159][DEBUG][o.e.i.IndicesService     ] [eIhOL9F] creating Index [[Uo4Ce-TTSfWnn5cLyKCReQ/ZwtuWREBSMeWp4v-LQ8hrQ]], shards [1]/[0] - reason [create index]
# [2018-06-13T07:01:12,262][DEBUG][o.e.i.IndicesService     ] [eIhOL9F] [Uo4Ce-TTSfWnn5cLyKCReQ] closing ... (reason [NO_LONGER_ASSIGNED])
# [2018-06-13T07:01:12,262][DEBUG][o.e.i.IndicesService     ] [eIhOL9F] [Uo4Ce-TTSfWnn5cLyKCReQ/ZwtuWREBSMeWp4v-LQ8hrQ] closing index service (reason [NO_LONGER_ASSIGNED][ created for parsing template mapping])
# [2018-06-13T07:01:12,263][DEBUG][o.e.i.IndicesService     ] [eIhOL9F] [Uo4Ce-TTSfWnn5cLyKCReQ/ZwtuWREBSMeWp4v-LQ8hrQ] closed... (reason [NO_LONGER_ASSIGNED][ created for parsing template mapping])
# [2018-06-13T07:01:12,263][INFO ][o.e.c.m.MetaDataIndexTemplateService] [eIhOL9F] adding template [.triggered_watches] for index patterns [.triggered_watches*]
# [2018-06-13T07:01:12,270][TRACE][o.e.g.MetaStateService   ] [eIhOL9F] [_global] writing state, reason [changed]
# [2018-06-13T07:01:16,680][DEBUG][o.e.i.IndicesService     ] [eIhOL9F] creating Index [[ngefNOwdS4OHzxLC6TD_Gg/hwwgND-gSp6iMN5daYAsHg]], shards [1]/[0] - reason [create index]
# [2018-06-13T07:01:16,684][DEBUG][o.e.i.IndicesService     ] [eIhOL9F] [ngefNOwdS4OHzxLC6TD_Gg] closing ... (reason [NO_LONGER_ASSIGNED])
# [2018-06-13T07:01:16,684][DEBUG][o.e.i.IndicesService     ] [eIhOL9F] [ngefNOwdS4OHzxLC6TD_Gg/hwwgND-gSp6iMN5daYAsHg] closing index service (reason [NO_LONGER_ASSIGNED][ created for parsing template mapping])
# [2018-06-13T07:01:16,684][DEBUG][o.e.i.IndicesService     ] [eIhOL9F] [ngefNOwdS4OHzxLC6TD_Gg/hwwgND-gSp6iMN5daYAsHg] closed... (reason [NO_LONGER_ASSIGNED][ created for parsing template mapping])
# [2018-06-13T07:01:16,684][INFO ][o.e.c.m.MetaDataIndexTemplateService] [eIhOL9F] adding template [.watches] for index patterns [.watches*]
# [2018-06-13T07:01:16,685][TRACE][o.e.g.MetaStateService   ] [eIhOL9F] [_global] writing state, reason [changed]
# [2018-06-13T07:01:20,988][DEBUG][o.e.i.IndicesService     ] [eIhOL9F] creating Index [[WAeaTyI4TnK3jFmYrKOvAg/En2gqm75SuW_w5EQxYfTog]], shards [1]/[0] - reason [create index]
# [2018-06-13T07:01:21,001][DEBUG][o.e.i.IndicesService     ] [eIhOL9F] [WAeaTyI4TnK3jFmYrKOvAg] closing ... (reason [NO_LONGER_ASSIGNED])
# [2018-06-13T07:01:21,001][DEBUG][o.e.i.IndicesService     ] [eIhOL9F] [WAeaTyI4TnK3jFmYrKOvAg/En2gqm75SuW_w5EQxYfTog] closing index service (reason [NO_LONGER_ASSIGNED][ created for parsing template mapping])
# [2018-06-13T07:01:21,001][INFO ][o.e.c.m.MetaDataIndexTemplateService] [eIhOL9F] adding template [.watch-history-8] for index patterns [.watcher-history-8*]
# [2018-06-13T07:01:21,002][TRACE][o.e.g.MetaStateService   ] [eIhOL9F] [_global] writing state, reason [changed]
# [2018-06-13T07:01:25,044][DEBUG][o.e.i.IndicesService     ] [eIhOL9F] creating Index [[opPILoh1TcuBw9CXG-GNPg/ULbJ2LOERo-GoWAEkGbHJA]], shards [1]/[0] - reason [create index]
# [2018-06-13T07:01:25,050][DEBUG][o.e.i.IndicesService     ] [eIhOL9F] [opPILoh1TcuBw9CXG-GNPg] closing ... (reason [NO_LONGER_ASSIGNED])
# [2018-06-13T07:01:25,050][DEBUG][o.e.i.IndicesService     ] [eIhOL9F] [opPILoh1TcuBw9CXG-GNPg/ULbJ2LOERo-GoWAEkGbHJA] closing index service (reason [NO_LONGER_ASSIGNED][ created for parsing template mapping])
# [2018-06-13T07:01:25,050][DEBUG][o.e.i.IndicesService     ] [eIhOL9F] [opPILoh1TcuBw9CXG-GNPg/ULbJ2LOERo-GoWAEkGbHJA] closed... (reason [NO_LONGER_ASSIGNED][ created for parsing template mapping])
# [2018-06-13T07:01:25,050][INFO ][o.e.c.m.MetaDataIndexTemplateService] [eIhOL9F] adding template [.monitoring-logstash] for index patterns [.monitoring-logstash-6-*]
# [2018-06-13T07:01:25,056][TRACE][o.e.g.MetaStateService   ] [eIhOL9F] [_global] writing state, reason [changed]
nik9000 commented 6 years ago

Again! https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+packaging-tests/977/console:

12:59:50 > Task :x-pack:qa:vagrant:vagrantOpensuse42#batsPackagingTest
12:59:50 Task ':x-pack:qa:vagrant:vagrantOpensuse42#batsPackagingTest' is not up-to-date because:
12:59:50   Task has not declared any outputs despite executing actions.
12:59:50 Starting process 'command 'vagrant''. Working directory: /var/lib/jenkins/workspace/elastic+elasticsearch+6.x+packaging-tests/x-pack/qa/vagrant Command: vagrant ssh opensuse-42 --command cd $PACKAGING_ARCHIVES && sudo bats --tap $BATS_TESTS/*.bats
12:59:50 Successfully started process 'command 'vagrant''
12:59:50 1..41
12:59:51 ok 1 [X-PACK] install default distribution
12:59:53 ok 2 [X-PACK] verify x-pack installation
12:59:54 ok 3 [X-PACK] verify croneval works
13:00:01 ok 4 [TAR BOOTSTRAP PASSWORD] add bootstrap.password setting
13:00:11 ok 5 [TAR BOOTSTRAP PASSWORD] test bootstrap.password is in setting list
13:00:13 ok 6 [TAR BOOTSTRAP PASSWORD] test auto generated passwords with modified bootstrap.password
13:00:14 ok 7 [TAR BOOTSTRAP PASSWORD] test elasticsearch-sql-cli
13:00:15 ok 8 [TAR BOOTSTRAP PASSWORD] test elasticsearch-sql-cli when user refuses password
13:00:15 ok 9 [TAR BOOTSTRAP PASSWORD] stop Elasticsearch
13:00:23 ok 10 [RPM BOOTSTRAP PASSWORD] add bootstrap.password setting
13:00:43 ok 11 [RPM BOOTSTRAP PASSWORD] test bootstrap.password is in setting list
13:00:46 ok 12 [RPM BOOTSTRAP PASSWORD] test auto generated passwords with modified bootstrap.password
13:00:47 ok 13 [RPM BOOTSTRAP PASSWORD] test elasticsearch-sql-cli
13:00:48 ok 14 [RPM BOOTSTRAP PASSWORD] test elasticsearch-sql-cli when user refuses password
13:00:49 ok 15 [RPM BOOTSTRAP PASSWORD] stop Elasticsearch
13:02:13 not ok 16 [TAR SETUP PASSWORD] test auto generated passwords
13:02:13 # (in test file /project/build/packaging/bats/tests/30_tar_setup_passwords.bats, line 60)
13:02:13 #   `false' failed
13:02:13 # Unpacking tarball to /tmp/elasticsearch
13:02:13 # checking for bin file elasticsearch-certgen
13:02:13 # checking for bin file elasticsearch-certgen.bat
13:02:13 # checking for bin file elasticsearch-certutil
13:02:13 # checking for bin file elasticsearch-certutil.bat
13:02:13 # checking for bin file elasticsearch-croneval
13:02:13 # checking for bin file elasticsearch-croneval.bat
13:02:13 # checking for bin file elasticsearch-migrate
13:02:13 # checking for bin file elasticsearch-migrate.bat
13:02:13 # checking for bin file elasticsearch-saml-metadata
13:02:13 # checking for bin file elasticsearch-saml-metadata.bat
13:02:13 # checking for bin file elasticsearch-setup-passwords
13:02:13 # checking for bin file elasticsearch-setup-passwords.bat
13:02:13 # checking for bin file elasticsearch-sql-cli
13:02:13 # checking for bin file elasticsearch-sql-cli.bat
13:02:13 # checking for bin file elasticsearch-sql-cli-6.4.0-SNAPSHOT.jar
13:02:13 # checking for bin file elasticsearch-syskeygen
13:02:13 # checking for bin file elasticsearch-syskeygen.bat
13:02:13 # checking for bin file elasticsearch-users
13:02:13 # checking for bin file elasticsearch-users.bat
13:02:13 # checking for bin file x-pack-env
13:02:13 # checking for bin file x-pack-env.bat
13:02:13 # checking for bin file x-pack-security-env
13:02:13 # checking for bin file x-pack-security-env.bat
13:02:13 # checking for bin file x-pack-watcher-env
13:02:13 # checking for bin file x-pack-watcher-env.bat
13:02:14 # elasticsearch
13:02:14 # elasticsearch.bat
13:02:14 # elasticsearch-certgen
13:02:14 # elasticsearch-certgen.bat
13:02:14 # elasticsearch-certutil
13:02:14 # elasticsearch-certutil.bat
13:02:14 # elasticsearch-cli
13:02:14 # elasticsearch-cli.bat
13:02:14 # elasticsearch-croneval
13:02:14 # elasticsearch-croneval.bat
13:02:14 # elasticsearch-env
13:02:14 # elasticsearch-env.bat
13:02:14 # elasticsearch-keystore
13:02:14 # elasticsearch-keystore.bat
13:02:14 # elasticsearch-migrate
13:02:14 # elasticsearch-migrate.bat
13:02:14 # elasticsearch-plugin
13:02:14 # elasticsearch-plugin.bat
13:02:14 # elasticsearch-saml-metadata
13:02:14 # elasticsearch-saml-metadata.bat
13:02:14 # elasticsearch-service.bat
13:02:14 # elasticsearch-service-mgr.exe
13:02:14 # elasticsearch-service-x64.exe
13:02:14 # elasticsearch-setup-passwords
13:02:14 # elasticsearch-setup-passwords.bat
13:02:14 # elasticsearch-sql-cli
13:02:14 # elasticsearch-sql-cli-6.4.0-SNAPSHOT.jar
13:02:14 # elasticsearch-sql-cli.bat
13:02:14 # elasticsearch-syskeygen
13:02:14 # elasticsearch-syskeygen.bat
13:02:14 # elasticsearch-translog
13:02:14 # elasticsearch-translog.bat
13:02:14 # elasticsearch-users
13:02:14 # elasticsearch-users.bat
13:02:14 # x-pack
13:02:14 # x-pack-env
13:02:14 # x-pack-env.bat
13:02:14 # x-pack-security-env
13:02:14 # x-pack-security-env.bat
13:02:14 # x-pack-watcher-env
13:02:14 # x-pack-watcher-env.bat
13:02:14 # Expected x-pack elasticsearch-setup-passwords tool exit code to be zero
13:02:14 # Initiating the setup of passwords for reserved users elastic,kibana,logstash_system,beats_system.
13:02:14 # The passwords will be randomly generated and printed to the console.
13:02:14 # Please confirm that you would like to continue [y/N]
13:02:14 # 
13:02:14 # 
13:02:14 # Connection failure to: http://127.0.0.1:9200/_xpack/security/user/kibana/_password?pretty failed: Read timed out
13:02:14 # 
13:02:14 # ERROR: Failed to set password for user [kibana].
jtibshirani commented 6 years ago

Another instance, this time in 25_package_bootstrap_password.bats: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+packaging-tests/2823/console

Full log: bats-packaging-log.txt

jtibshirani commented 6 years ago

@andyb-elastic do you think we should disable these tests while we investigate? I'm not familiar with how to best isolate + mute the bats tests.

andyb-elastic commented 6 years ago

cc @polyfractal since he commented on another one of these issues

We're unable to reproduce this outside of CI, so we're collecting logging in the CI runs to debug it. I can disable these tests on opensuse, but I'm not sure how else to get insight into what's happening.

I'll at least leave them enabled on master and disable them on the other branches we test so there aren't so many failures - does that sound reasonable @rjernst @nik9000?

Also, just to make it easier for people on triage to find this issue, here's another way this can look in the logs:

# /project/build/packaging/bats/utils/xpack.bash: connect: Cannot assign requested address
# /project/build/packaging/bats/utils/xpack.bash: line 100: /dev/tcp/localhost/9200: Cannot assign requested address
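(That message comes from the helper probing the HTTP port with bash's /dev/tcp pseudo-device. For illustration only - this is not the actual xpack.bash code - a probe of that sort with a retry loop looks roughly like:)

# Sketch, not the real helper: poll until something is listening on the port.
wait_for_port() {
  local host=${1:-127.0.0.1} port=${2:-9200} timeout=${3:-120}
  local deadline=$((SECONDS + timeout))
  # The /dev/tcp redirection fails (e.g. "Connection refused" or "Cannot
  # assign requested address") until the port accepts connections.
  until (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; do
    (( SECONDS >= deadline )) && { echo "timed out waiting for ${host}:${port}" >&2; return 1; }
    sleep 1
  done
}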
rjernst commented 6 years ago

Sounds reasonable to me.

droberts195 commented 6 years ago

This happened again in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+packaging-tests/997/console

The full log attached in https://github.com/elastic/elasticsearch/issues/30295#issuecomment-401464157 contains the new trace logging from #31319 and seems to confirm @DaveCTurner's theory from https://github.com/elastic/elasticsearch/issues/30295#issuecomment-397203872. For example, storing cluster state version 2 took 6 seconds:

# [2018-06-29T15:31:24,286][DEBUG][o.e.c.s.MasterService    ] [fqeOyku] cluster state updated, version [2], source [local-gateway-elected-state]
# [2018-06-29T15:31:24,286][DEBUG][o.e.c.s.MasterService    ] [fqeOyku] publishing cluster state version [2]
# [2018-06-29T15:31:24,286][DEBUG][o.e.c.s.ClusterApplierService] [fqeOyku] processing [apply cluster state (from master [master {fqeOyku}{fqeOykunSeWpa5zpqiinkw}{Ospb1N7bSOmKqY9FlOJ6QA}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=8370659328, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} committed version [2] source [local-gateway-elected-state]])]: execute
# [2018-06-29T15:31:24,286][DEBUG][o.e.c.s.ClusterApplierService] [fqeOyku] cluster state updated, version [2], source [apply cluster state (from master [master {fqeOyku}{fqeOykunSeWpa5zpqiinkw}{Ospb1N7bSOmKqY9FlOJ6QA}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=8370659328, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} committed version [2] source [local-gateway-elected-state]])]
# [2018-06-29T15:31:24,286][DEBUG][o.e.c.s.ClusterApplierService] [fqeOyku] applying cluster state version 2
# [2018-06-29T15:31:24,286][DEBUG][o.e.c.s.ClusterApplierService] [fqeOyku] apply cluster state with version 2
# [2018-06-29T15:31:24,287][TRACE][o.e.g.MetaStateService   ] [fqeOyku] [_global] writing state, reason [changed]
# [2018-06-29T15:31:28,799][TRACE][o.e.i.IndexingMemoryController] [fqeOyku] total indexing heap bytes used [0b] vs indices.memory.index_buffer_size [99mb], currently writing bytes [0b]
# [2018-06-29T15:31:30,456][TRACE][o.e.g.MetaDataStateFormat] written state to /var/lib/elasticsearch/nodes/0/_state/global-1.st
# [2018-06-29T15:31:30,457][TRACE][o.e.g.MetaDataStateFormat] cleaned up /var/lib/elasticsearch/nodes/0/_state/global-1.st.tmp
# [2018-06-29T15:31:30,457][TRACE][o.e.g.MetaDataStateFormat] cleanupOldFiles: cleaning up /var/lib/elasticsearch/nodes/0
# [2018-06-29T15:31:30,457][TRACE][o.e.g.MetaDataStateFormat] cleanupOldFiles: cleaned up /var/lib/elasticsearch/nodes/0/_state/global-0.st
# [2018-06-29T15:31:30,457][TRACE][o.e.g.MetaStateService   ] [fqeOyku] [_global] state written
# [2018-06-29T15:31:30,457][DEBUG][o.e.c.s.ClusterApplierService] [fqeOyku] set locally applied cluster state to version 2
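As a rough way to put numbers on these slow state writes from a captured server log, something like the sketch below (assuming GNU date; the log path is just a placeholder) prints how long each _global state write took by pairing the "writing state" and "state written" TRACE lines:

# Sketch: measure the gap between each "[_global] writing state" line and the
# following "[_global] state written" line, at one-second resolution.
log=/var/log/elasticsearch/elasticsearch.log   # placeholder path
grep -E '\[_global\] (writing state|state written)' "$log" |
while read -r line; do
  # Timestamps look like [2018-06-29T15:31:24,287]; the millis are dropped.
  ts=$(sed -E 's/.*\[([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:]{8}),[0-9]{3}\].*/\1/' <<<"$line")
  epoch=$(date -d "${ts/T/ }" +%s)
  if [[ $line == *"writing state"* ]]; then
    start=$epoch
  elif [[ -n ${start:-} ]]; then
    echo "state write took $((epoch - start))s"
  fi
done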
droberts195 commented 6 years ago

I found https://www.electricmonk.nl/log/2016/03/14/terrible-virtualbox-disk-performance/, which recommends enabling the "Host I/O Cache" for better SATA disk performance in VirtualBox.

I wonder if this is also relevant: https://forums.virtualbox.org/viewtopic.php?f=1&t=61428

It claims that IDE controllers are much faster than SATA controllers in VirtualBox. However, as pointed out in the comments, the main reason could be that IDE controllers have Host I/O Cache enabled by default and SATA controllers don't (which tallies with what the first link says). The reason for that decision seems to be the risk of data loss in the VM if there's a power failure on the host, but we don't care about that in packaging tests: http://www.virtualbox.org/manual/ch05.html#iocaching

Interestingly, the opensuse-42 VM has a SATA controller with Host I/O Cache switched off:

    <StorageControllers>
      <StorageController name="IDE Controller" type="PIIX4" PortCount="2" useHostIOCache="true" Bootable="true"/>
      <StorageController name="SATA Controller" type="AHCI" PortCount="1" useHostIOCache="false" Bootable="true" IDE0MasterEmulationPort="0" IDE0SlaveEmulationPort="1" IDE1MasterEmulationPort="2" IDE1SlaveEmulationPort="3">
        <AttachedDevice type="HardDisk" hotpluggable="false" port="0" device="0">
          <Image uuid="{cb0669e5-d742-4549-a85b-1be6ab5ca55c}"/>
        </AttachedDevice>
      </StorageController>
    </StorageControllers>

In contrast, the centos-6 VM, which doesn't suffer from slow writes, has only an IDE controller with Host I/O Cache enabled - it has no SATA controller:

    <StorageControllers>
      <StorageController name="IDE Controller" type="PIIX4" PortCount="2" useHostIOCache="true" Bootable="true">
        <AttachedDevice type="HardDisk" hotpluggable="false" port="0" device="0">
          <Image uuid="{bda286eb-5d96-4970-b89c-3b7dd570fde8}"/>
        </AttachedDevice>
      </StorageController>
    </StorageControllers>

This is all just an unsubstantiated theory at the moment. I'm not sure how to change these low-level VirtualBox settings from within the Vagrant config file.
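For what it's worth, VBoxManage itself can flip that flag on an existing powered-off VM, and presumably the same arguments could be passed from the Vagrantfile through the VirtualBox provider's customize hook - a rough, untested sketch:

# Sketch: enable Host I/O Cache on the SATA controller of a powered-off VM.
# "opensuse-42" is a placeholder; use whatever name `VBoxManage list vms`
# reports for the Vagrant-created box.
VBoxManage storagectl "opensuse-42" --name "SATA Controller" --hostiocache on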

andyb-elastic commented 6 years ago

I disabled these tests on opensuse and sles on 6.x, 6.3, and 5.6: https://github.com/elastic/elasticsearch/pull/31749

Thanks for the info @droberts195, that looks promising. I'll try setting the controller in the Vagrantfile.

andyb-elastic commented 6 years ago

It seems like configuring the disk to be attached to the IDE controller would be more appropriate to do when the image is built; it looks like we do something similar for sles: https://github.com/elastic/infra/blob/a8344443005bf72d8b674e12583e0d5c7a338d6a/packer/packer-sles-vagrantcloud.json#L137-L148

I ran into an issue with the packer build for the opensuse vbox image, so that may need to be resolved first. If it's tricky, I think we can also set it by calling storageattach in the Vagrantfile: https://github.com/elastic/infra/issues/5852

ywelsch commented 6 years ago

More of this: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+packaging-tests/1022/console

martijnvg commented 6 years ago

A similar failure today: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+packaging-tests/1024/console

03:45:36 Starting process 'command 'vagrant''. Working directory: /var/lib/jenkins/workspace/elastic+elasticsearch+6.x+packaging-tests/qa/vagrant Command: vagrant ssh opensuse-42 --command sudo bash "$PACKAGING_TESTS/run-tests.sh"
03:45:36 Successfully started process 'command 'vagrant''
03:45:36 JUnit version 4.12
03:49:43 .....E..E.E........................
03:49:43 Time: 247.113
03:49:43 There were 3 failures:
03:49:43 1) test50StartAndStop(org.elasticsearch.packaging.test.DefaultTarTests)
03:49:43 java.net.SocketTimeoutException: Read timed out
03:49:43    at java.net.SocketInputStream.socketRead0(Native Method)
03:49:43    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
03:49:43    at java.net.SocketInputStream.read(SocketInputStream.java:171)
03:49:43    at java.net.SocketInputStream.read(SocketInputStream.java:141)
03:49:43    at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
03:49:43    at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
03:49:43    at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
03:49:43    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
03:49:43    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
03:49:43    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
03:49:43    at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
03:49:43    at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:167)
03:49:43    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
03:49:43    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
03:49:43    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271)
03:49:43    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
03:49:43    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
03:49:43    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
03:49:43    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
03:49:43    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
03:49:43    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
03:49:43    at org.apache.http.client.fluent.Request.internalExecute(Request.java:173)
03:49:43    at org.apache.http.client.fluent.Request.execute(Request.java:177)
03:49:43    at org.elasticsearch.packaging.util.ServerUtils.makeRequest(ServerUtils.java:113)
03:49:43    at org.elasticsearch.packaging.util.ServerUtils.runElasticsearchTests(ServerUtils.java:98)
03:49:43    at org.elasticsearch.packaging.test.ArchiveTestCase.test50StartAndStop(ArchiveTestCase.java:196)
03:49:43    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
03:49:43    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
03:49:43    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
03:49:43    at java.lang.reflect.Method.invoke(Method.java:498)
03:49:43    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
03:49:43    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
03:49:43    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
03:49:43    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
03:49:43    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
03:49:43    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
03:49:43    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
03:49:43    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
03:49:43    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
03:49:43    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
03:49:43    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
03:49:43    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
03:49:43    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
03:49:43    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
03:49:43    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
03:49:43    at org.junit.runners.Suite.runChild(Suite.java:128)
03:49:43    at org.junit.runners.Suite.runChild(Suite.java:27)
03:49:43    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
03:49:43    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
03:49:43    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
03:49:43    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
03:49:43    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
03:49:43    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
03:49:43    at org.junit.runners.Suite.runChild(Suite.java:128)
03:49:43    at org.junit.runners.Suite.runChild(Suite.java:27)
03:49:43    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
03:49:43    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
03:49:43    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
03:49:43    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
03:49:43    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
03:49:43    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
03:49:43    at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
03:49:43    at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
03:49:43    at org.junit.runner.JUnitCore.runMain(JUnitCore.java:77)
03:49:43    at org.junit.runner.JUnitCore.main(JUnitCore.java:36)
03:49:43    at org.elasticsearch.packaging.VMTestRunner.main(VMTestRunner.java:34)
03:49:43 2) test70CustomPathConfAndJvmOptions(org.elasticsearch.packaging.test.DefaultTarTests)
03:49:43 java.net.SocketTimeoutException: Read timed out
03:49:43    at java.net.SocketInputStream.socketRead0(Native Method)
03:49:43    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
03:49:43    at java.net.SocketInputStream.read(SocketInputStream.java:171)
03:49:43    at java.net.SocketInputStream.read(SocketInputStream.java:141)
03:49:43    at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
03:49:43    at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
03:49:43    at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
03:49:43    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
03:49:43    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
03:49:43    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
03:49:43    at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
03:49:43    at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:167)
03:49:43    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
03:49:43    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
03:49:43    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271)
03:49:43    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
03:49:43    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
03:49:43    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
03:49:43    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
03:49:43    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
03:49:43    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
03:49:43    at org.apache.http.client.fluent.Request.internalExecute(Request.java:173)
03:49:43    at org.apache.http.client.fluent.Request.execute(Request.java:177)
03:49:43    at org.elasticsearch.packaging.util.ServerUtils.makeRequest(ServerUtils.java:113)
03:49:43    at org.elasticsearch.packaging.util.ServerUtils.waitForElasticsearch(ServerUtils.java:93)
03:49:43    at org.elasticsearch.packaging.util.ServerUtils.waitForElasticsearch(ServerUtils.java:45)
03:49:43    at org.elasticsearch.packaging.util.Archives.runElasticsearch(Archives.java:302)
03:49:43    at org.elasticsearch.packaging.test.ArchiveTestCase.test70CustomPathConfAndJvmOptions(ArchiveTestCase.java:258)
03:49:43    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
03:49:43    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
03:49:43    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
03:49:43    at java.lang.reflect.Method.invoke(Method.java:498)
03:49:43    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
03:49:43    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
03:49:43    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
03:49:43    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
03:49:43    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
03:49:43    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
03:49:43    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
03:49:43    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
03:49:43    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
03:49:43    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
03:49:43    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
03:49:43    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
03:49:43    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
03:49:43    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
03:49:43    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
03:49:43    at org.junit.runners.Suite.runChild(Suite.java:128)
03:49:43    at org.junit.runners.Suite.runChild(Suite.java:27)
03:49:43    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
03:49:43    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
03:49:43    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
03:49:43    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
03:49:43    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
03:49:43    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
03:49:43    at org.junit.runners.Suite.runChild(Suite.java:128)
03:49:43    at org.junit.runners.Suite.runChild(Suite.java:27)
03:49:43    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
03:49:43    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
03:49:43    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
03:49:43    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
03:49:43    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
03:49:43    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
03:49:43    at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
03:49:43    at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
03:49:43    at org.junit.runner.JUnitCore.runMain(JUnitCore.java:77)
03:49:43    at org.junit.runner.JUnitCore.main(JUnitCore.java:36)
03:49:43    at org.elasticsearch.packaging.VMTestRunner.main(VMTestRunner.java:34)
03:49:43 3) test80RelativePathConf(org.elasticsearch.packaging.test.DefaultTarTests)
03:49:43 java.net.SocketTimeoutException: Read timed out
03:49:43    at java.net.SocketInputStream.socketRead0(Native Method)
03:49:43    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
03:49:43    at java.net.SocketInputStream.read(SocketInputStream.java:171)
03:49:43    at java.net.SocketInputStream.read(SocketInputStream.java:141)
03:49:43    at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
03:49:43    at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
03:49:43    at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
03:49:43    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
03:49:43    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
03:49:43    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
03:49:43    at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
03:49:43    at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:167)
03:49:43    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
03:49:43    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
03:49:43    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271)
03:49:43    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
03:49:43    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
03:49:43    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
03:49:43    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
03:49:43    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
03:49:43    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
03:49:43    at org.apache.http.client.fluent.Request.internalExecute(Request.java:173)
03:49:43    at org.apache.http.client.fluent.Request.execute(Request.java:177)
03:49:43    at org.elasticsearch.packaging.util.ServerUtils.makeRequest(ServerUtils.java:113)
03:49:43    at org.elasticsearch.packaging.util.ServerUtils.waitForElasticsearch(ServerUtils.java:93)
03:49:43    at org.elasticsearch.packaging.util.ServerUtils.waitForElasticsearch(ServerUtils.java:45)
03:49:43    at org.elasticsearch.packaging.util.Archives.runElasticsearch(Archives.java:302)
03:49:43    at org.elasticsearch.packaging.test.ArchiveTestCase.test80RelativePathConf(ArchiveTestCase.java:303)
03:49:43    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
03:49:43    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
03:49:43    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
03:49:43    at java.lang.reflect.Method.invoke(Method.java:498)
03:49:43    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
03:49:43    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
03:49:43    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
03:49:43    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
03:49:43    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
03:49:43    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
03:49:43    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
03:49:43    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
03:49:43    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
03:49:43    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
03:49:43    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
03:49:43    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
03:49:43    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
03:49:43    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
03:49:43    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
03:49:43    at org.junit.runners.Suite.runChild(Suite.java:128)
03:49:43    at org.junit.runners.Suite.runChild(Suite.java:27)
03:49:43    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
03:49:43    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
03:49:43    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
03:49:43    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
03:49:43    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
03:49:43    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
03:49:43    at org.junit.runners.Suite.runChild(Suite.java:128)
03:49:43    at org.junit.runners.Suite.runChild(Suite.java:27)
03:49:43    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
03:49:43    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
03:49:43    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
03:49:43    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
03:49:43    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
03:49:43    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
03:49:43    at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
03:49:43    at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
03:49:43    at org.junit.runner.JUnitCore.runMain(JUnitCore.java:77)
03:49:43    at org.junit.runner.JUnitCore.main(JUnitCore.java:36)
03:49:43    at org.elasticsearch.packaging.VMTestRunner.main(VMTestRunner.java:34)
03:49:43 
martijnvg commented 6 years ago

This build failed differently than the others reported here, but looking in the logs a lot of things time out, so I assume it is caused by the same issue (super slow IO): https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+packaging-tests/2860/console

23:47:43 # Testing if ES ports are open:
23:47:43 # port 9200 on host 127.0.0.1 is open
23:47:43 # /project/build/packaging/bats/utils/utils.bash: connect: Connection refused
23:47:43 # Testing if ES ports are open:
23:47:43 # port 9200 on host 127.0.0.1 is open
23:47:43 # /project/build/packaging/bats/utils/utils.bash: connect: Connection refused
23:47:43 # /project/build/packaging/bats/utils/utils.bash: line 428: /dev/tcp/127.0.0.1/9201: Connection refused
23:47:43 # /project/build/packaging/bats/utils/utils.bash: line 428: /dev/tcp/127.0.0.1/9201: Connection refused
23:47:43 # port 9201 on host 127.0.0.1 is not open
23:47:43 # port 9201 on host 127.0.0.1 is not open
23:47:57 not ok 13 [RPM BOOTSTRAP PASSWORD] test elasticsearch-sql-cli
23:47:57 not ok 13 [RPM BOOTSTRAP PASSWORD] test elasticsearch-sql-cli
23:47:57 # (in test file /project/build/packaging/bats/tests/25_package_bootstrap_password.bats, line 146)
23:47:57 # (in test file /project/build/packaging/bats/tests/25_package_bootstrap_password.bats, line 146)
23:47:57 #   `false' failed
23:47:57 #   `false' failed
23:47:57 # {
23:47:57 #   "error" : {
23:47:57 #     "root_cause" : [
23:47:57 #       {
23:47:57 #         "type" : "security_exception",
23:47:57 #         "reason" : "failed to authenticate user [elastic]",
23:47:57 #         "header" : {
23:47:57 #           "WWW-Authenticate" : "Basic realm=\"security\" charset=\"UTF-8\""
23:47:57 #         }
23:47:57 #       }
23:47:57 #     ],
23:47:57 #     "type" : "security_exception",
23:47:57 #     "reason" : "failed to authenticate user [elastic]",
23:47:57 #     "header" : {
23:47:57 #       "WWW-Authenticate" : "Basic realm=\"security\" charset=\"UTF-8\""
23:47:57 #     }
23:47:57 #   },
23:47:57 #   "status" : 401
23:47:57 # }
23:47:57 # SQL cli failed:\nJul 11, 2018 11:47:51 PM org.jline.utils.Log logr
23:47:57 # WARNING: Unable to create a system terminal, creating a dumb terminal (enable debug logging for more information)
23:47:57 # password: Client Exception [java.sql.SQLInvalidAuthorizationSpecException: failed to authenticate user [elastic]]
23:47:57 # org.elasticsearch.xpack.sql.client.ClientException: java.sql.SQLInvalidAuthorizationSpecException: failed to authenticate user [elastic]
23:47:57 #  at org.elasticsearch.xpack.sql.cli.command.CliSession.checkConnection(CliSession.java:65)
23:47:57 #  at org.elasticsearch.xpack.sql.cli.Cli.checkConnection(Cli.java:130)
23:47:57 #  at org.elasticsearch.xpack.sql.cli.Cli.execute(Cli.java:120)
23:47:57 #  at org.elasticsearch.xpack.sql.cli.Cli.execute(Cli.java:102)
23:47:57 #  at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124)
23:47:57 #  at org.elasticsearch.cli.Command.main(Command.java:90)
23:47:57 #  at org.elasticsearch.xpack.sql.cli.Cli.main(Cli.java:50)
23:47:57 # Caused by: java.sql.SQLInvalidAuthorizationSpecException: failed to authenticate user [elastic]
23:47:57 #  at org.elasticsearch.xpack.sql.client.JreHttpUrlConnection$SqlExceptionType.asException(JreHttpUrlConnection.java:306)
23:47:57 #  at org.elasticsearch.xpack.sql.client.JreHttpUrlConnection.parserError(JreHttpUrlConnection.java:183)
23:47:57 #  at org.elasticsearch.xpack.sql.client.JreHttpUrlConnection.request(JreHttpUrlConnection.java:158)
23:47:57 #  at org.elasticsearch.xpack.sql.client.HttpClient.lambda$get$4(HttpClient.java:122)
23:47:57 # {
23:47:57 #   "error" : {
23:47:57 #     "root_cause" : [
23:47:57 #       {
23:47:57 #         "type" : "security_exception",
23:47:57 #         "reason" : "failed to authenticate user [elastic]",
23:47:57 #         "header" : {
23:47:57 #           "WWW-Authenticate" : "Basic realm=\"security\" charset=\"UTF-8\""
23:47:57 #         }
23:47:57 #       }
23:47:57 #     ],
23:47:57 #     "type" : "security_exception",
23:47:57 #     "reason" : "failed to authenticate user [elastic]",
23:47:57 #     "header" : {
23:47:57 #       "WWW-Authenticate" : "Basic realm=\"security\" charset=\"UTF-8\""
23:47:57 #     }
23:47:57 #   },
23:47:57 #   "status" : 401
23:47:57 # }
23:47:57 # SQL cli failed:\nJul 11, 2018 11:47:51 PM org.jline.utils.Log logr
23:47:57 # WARNING: Unable to create a system terminal, creating a dumb terminal (enable debug logging for more information)
23:47:57 # password: Client Exception [java.sql.SQLInvalidAuthorizationSpecException: failed to authenticate user [elastic]]
23:47:57 # org.elasticsearch.xpack.sql.client.ClientException: java.sql.SQLInvalidAuthorizationSpecException: failed to authenticate user [elastic]
23:47:57 #  at org.elasticsearch.xpack.sql.cli.command.CliSession.checkConnection(CliSession.java:65)
23:47:57 #  at org.elasticsearch.xpack.sql.cli.Cli.checkConnection(Cli.java:130)
23:47:57 #  at org.elasticsearch.xpack.sql.cli.Cli.execute(Cli.java:120)
23:47:57 #  at org.elasticsearch.xpack.sql.cli.Cli.execute(Cli.java:102)
23:47:57 #  at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124)
23:47:57 #  at org.elasticsearch.cli.Command.main(Command.java:90)
23:47:57 #  at org.elasticsearch.xpack.sql.cli.Cli.main(Cli.java:50)
23:47:57 # Caused by: java.sql.SQLInvalidAuthorizationSpecException: failed to authenticate user [elastic]
23:47:57 #  at org.elasticsearch.xpack.sql.client.JreHttpUrlConnection$SqlExceptionType.asException(JreHttpUrlConnection.java:306)
23:47:57 #  at org.elasticsearch.xpack.sql.client.JreHttpUrlConnection.parserError(JreHttpUrlConnection.java:183)
23:47:57 #  at org.elasticsearch.xpack.sql.client.JreHttpUrlConnection.request(JreHttpUrlConnection.java:158)
23:47:57 #  at org.elasticsearch.xpack.sql.client.HttpClient.lambda$get$4(HttpClient.java:122)
23:47:57 #  at org.elasticsearch.xpack.sql.client.JreHttpUrlConnection.http(JreHttpUrlConnection.java:62)
23:47:57 #  at org.elasticsearch.xpack.sql.client.HttpClient.lambda$get$5(HttpClient.java:121)
23:47:57 #  at java.security.AccessController.doPrivileged(Native Method)
23:47:57 #  at org.elasticsearch.xpack.sql.client.HttpClient.get(HttpClient.java:120)
23:47:57 #  at org.elasticsearch.xpack.sql.client.HttpClient.serverInfo(HttpClient.java:60)
23:47:57 #  at org.elasticsearch.xpack.sql.cli.command.CliSession.checkConnection(CliSession.java:63)
23:47:57 #  ... 6 more
23:47:57 # ERROR: Cannot communicate with the server http://elastic@127.0.0.1:9200. This version of CLI only works with Elasticsearch version v7.0.0-alpha1 [e955ffc]
23:47:57 #  at org.elasticsearch.xpack.sql.client.JreHttpUrlConnection.http(JreHttpUrlConnection.java:62)
23:47:57 #  at org.elasticsearch.xpack.sql.client.HttpClient.lambda$get$5(HttpClient.java:121)
23:47:57 #  at java.security.AccessController.doPrivileged(Native Method)
23:47:57 #  at org.elasticsearch.xpack.sql.client.HttpClient.get(HttpClient.java:120)
23:47:57 #  at org.elasticsearch.xpack.sql.client.HttpClient.serverInfo(HttpClient.java:60)
23:47:57 #  at org.elasticsearch.xpack.sql.cli.command.CliSession.checkConnection(CliSession.java:63)
23:47:57 #  ... 6 more
23:47:57 # ERROR: Cannot communicate with the server http://elastic@127.0.0.1:9200. This version of CLI only works with Elasticsearch version v7.0.0-alpha1 [e955ffc]
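
The 401s above can be checked by hand against a node in the same state; a minimal sketch (the password variable is illustrative, not what the bats helper actually reads):

# ask the node for its root resource with the credentials the test expects to work
curl -u "elastic:${BOOTSTRAP_PASSWORD}" http://127.0.0.1:9200/
# a 401 with "failed to authenticate user [elastic]" means the node is up but
# rejected the credentials, which is the same symptom the test reports
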
andyb-elastic commented 6 years ago

Been at this for a while - I'm not sure swapping the disk to the other controller will work in the Vagrantfile. The commands to do it are basically:

# disconnect the disk from the sata controller
vboxmanage storageattach [machine id] \
  --storagectl 'SATA Controller' \
  --device 0 \
  --port 0 \
  --type hdd \
  --medium none

# attach the disk to the ide controller
vboxmanage storageattach [machine id] \
  --storagectl 'IDE Controller' \
  --device 0 \
  --port 0 \
  --type hdd \
  --medium [disk path or uuid]

You need to know the disk path or uuid, and to get that from vboxmanage showvminfo you need to know the machine id; the machine id doesn't exist yet at the point in the Vagrantfile where you could run that command, because the machine hasn't been created.

I'm going to take another shot at doing this in the packer build
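
For reference, once the machine does exist, the lookup and reattachment can be done by hand (the VM name here is illustrative):

# find the disk path/uuid after the VM has been created
vboxmanage list hdds
vboxmanage showvminfo opensuse-42 --machinereadable | grep 'SATA Controller'

# then detach/attach as above, passing the uuid or path as --medium

The catch is exactly the ordering problem above: Vagrant's provider customizations run before any of this information exists, which is why doing it at image-build time looks more promising.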

andyb-elastic commented 6 years ago

I merged #32053 to turn on host io caching for the opensuse box and re-enabled the tests for the suse boxes on master in 14d7e2c7b2d3761a361edd720f98353f856936a4

We'll see how they do on the opensuse box - I'm not sure where to start on the sles box, as it has its disk on the ide controller. It's possible that's unrelated to what we're seeing here.
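
If you want to poke at the same setting by hand, the VirtualBox knob behind the host io caching change is roughly this (VM and controller names are illustrative):

# enable host I/O caching on the controller the disk is attached to
vboxmanage storagectl opensuse-42 --name 'SATA Controller' --hostiocache on

In a Vagrantfile this would go through the provider's customize hook rather than being run manually.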

andyb-elastic commented 6 years ago

Looks like it did not work

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+packaging-tests/2876/console packaging-2876.txt

andyb-elastic commented 6 years ago

Infra merged https://github.com/elastic/infra/pull/5975, and a new version of the box with the disk attached to the ide controller has landed in vagrant cloud

I reverted enabling host io caching for the sata controller (https://github.com/elastic/elasticsearch/commit/16fe22047e4ce243e9858da568b4e7a8eb73ef35) and turned tests back on for the suse boxes (https://github.com/elastic/elasticsearch/commit/418540e630280d7e4f899ebff2f5202295ac0de0)

andyb-elastic commented 6 years ago

This one timed out when waiting for the server to start - while we don't have the es logs to confirm it's the same slow io problem, it's not encouraging.

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+packaging-tests/2936/console

build-2936.txt
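
The wait that times out here is just HTTP polling of the node - roughly what ServerUtils.waitForElasticsearch in the stack traces above is doing, expressed as shell (a sketch, not the real code):

# poll the HTTP port until the node answers or we give up
for i in $(seq 1 120); do
  if curl -s -o /dev/null http://127.0.0.1:9200/; then
    echo "node is up"
    break
  fi
  sleep 1
done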

albertzaharovits commented 6 years ago

And another one: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+packaging-tests/2938/consoleFull

andyb-elastic commented 6 years ago

We have logs from a bats run now, and we're definitely still seeing the issue. I disabled the tests for these boxes again in https://github.com/elastic/elasticsearch/commit/0b601c663178dcdd9a7361dc193317c3817251eb

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+packaging-tests/2954/console build-2954.txt

andyb-elastic commented 5 years ago

I'm re-enabling tests on these boxes on master to confirm we're still seeing this issue in CI while I dig into it a little more. It did not occur in three PR CI runs in #38864, and I still was not able to reproduce it locally.

To re-disable tests on these boxes, you can either revert 94a4c3227bff5aa19fd859c1541dfab1790b066d or cherry-pick 0b601c663178dcdd9a7361dc193317c3817251eb and push it to the branch again; they should have the same effect.
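
In other words, one of:

git revert 94a4c3227bff5aa19fd859c1541dfab1790b066d
# or
git cherry-pick 0b601c663178dcdd9a7361dc193317c3817251eb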

andyb-elastic commented 5 years ago

Failed again in the periodic job https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+packaging-tests/454/console

build-master-packaging-454.txt.zip

Disabled again in b88760b704ba7a80ecb550b6c360d2b340210e4c

Same symptom: roughly 4 seconds to publish the cluster state:

# {"type": "server", "timestamp": "2019-02-15T13:46:28,387+0000", "level": "DEBUG", "component": "o.e.c.s.MasterService", "cluster.name": "elasticsearch", "node.name": "opensuse-leap", "cluster.uuid": "7t2z5K-BRU6g62p6qkuX3Q", "node.id": "3FfHA4InSKasjqnUYx4oMw",  "message": "publishing cluster state version [2]"  }
# {"type": "server", "timestamp": "2019-02-15T13:46:28,387+0000", "level": "DEBUG", "component": "o.e.c.s.MasterService", "cluster.name": "elasticsearch", "node.name": "opensuse-leap", "cluster.uuid": "7t2z5K-BRU6g62p6qkuX3Q", "node.id": "3FfHA4InSKasjqnUYx4oMw",  "message": "publishing cluster state version [2]"  }
# {"type": "server", "timestamp": "2019-02-15T13:46:28,388+0000", "level": "TRACE", "component": "o.e.g.MetaStateService", "cluster.name": "elasticsearch", "node.name": "opensuse-leap", "cluster.uuid": "7t2z5K-BRU6g62p6qkuX3Q", "node.id": "3FfHA4InSKasjqnUYx4oMw",  "message": "[_meta] writing state, reason [changed]"  }
# {"type": "server", "timestamp": "2019-02-15T13:46:28,388+0000", "level": "TRACE", "component": "o.e.g.MetaStateService", "cluster.name": "elasticsearch", "node.name": "opensuse-leap", "cluster.uuid": "7t2z5K-BRU6g62p6qkuX3Q", "node.id": "3FfHA4InSKasjqnUYx4oMw",  "message": "[_meta] writing state, reason [changed]"  }
# {"type": "server", "timestamp": "2019-02-15T13:46:28,388+0000", "level": "TRACE", "component": "o.e.g.MetaDataStateFormat", "cluster.name": "elasticsearch", "node.name": "opensuse-leap", "cluster.uuid": "7t2z5K-BRU6g62p6qkuX3Q", "node.id": "3FfHA4InSKasjqnUYx4oMw",  "message": "cleaned up /var/lib/elasticsearch/nodes/0/manifest-6.st.tmp"  }
# {"type": "server", "timestamp": "2019-02-15T13:46:28,388+0000", "level": "TRACE", "component": "o.e.g.MetaDataStateFormat", "cluster.name": "elasticsearch", "node.name": "opensuse-leap", "cluster.uuid": "7t2z5K-BRU6g62p6qkuX3Q", "node.id": "3FfHA4InSKasjqnUYx4oMw",  "message": "cleaned up /var/lib/elasticsearch/nodes/0/manifest-6.st.tmp"  }
# {"type": "server", "timestamp": "2019-02-15T13:46:31,013+0000", "level": "TRACE", "component": "o.e.i.IndexingMemoryController", "cluster.name": "elasticsearch", "node.name": "opensuse-leap", "cluster.uuid": "7t2z5K-BRU6g62p6qkuX3Q", "node.id": "3FfHA4InSKasjqnUYx4oMw",  "message": "total indexing heap bytes used [0b] vs indices.memory.index_buffer_size [99mb], currently writing bytes [0b]"  }
# {"type": "server", "timestamp": "2019-02-15T13:46:31,013+0000", "level": "TRACE", "component": "o.e.i.IndexingMemoryController", "cluster.name": "elasticsearch", "node.name": "opensuse-leap", "cluster.uuid": "7t2z5K-BRU6g62p6qkuX3Q", "node.id": "3FfHA4InSKasjqnUYx4oMw",  "message": "total indexing heap bytes used [0b] vs indices.memory.index_buffer_size [99mb], currently writing bytes [0b]"  }
# {"type": "server", "timestamp": "2019-02-15T13:46:32,912+0000", "level": "TRACE", "component": "o.e.g.MetaDataStateFormat", "cluster.name": "elasticsearch", "node.name": "opensuse-leap", "cluster.uuid": "7t2z5K-BRU6g62p6qkuX3Q", "node.id": "3FfHA4InSKasjqnUYx4oMw",  "message": "cleaned up /var/lib/elasticsearch/nodes/0/manifest-6.st.tmp"  }
# {"type": "server", "timestamp": "2019-02-15T13:46:32,912+0000", "level": "TRACE", "component": "o.e.g.MetaDataStateFormat", "cluster.name": "elasticsearch", "node.name": "opensuse-leap", "cluster.uuid": "7t2z5K-BRU6g62p6qkuX3Q", "node.id": "3FfHA4InSKasjqnUYx4oMw",  "message": "cleaned up /var/lib/elasticsearch/nodes/0/manifest-6.st.tmp"  }
# {"type": "server", "timestamp": "2019-02-15T13:46:32,912+0000", "level": "TRACE", "component": "o.e.g.MetaDataStateFormat", "cluster.name": "elasticsearch", "node.name": "opensuse-leap", "cluster.uuid": "7t2z5K-BRU6g62p6qkuX3Q", "node.id": "3FfHA4InSKasjqnUYx4oMw",  "message": "cleanupOldFiles: cleaning up /var/lib/elasticsearch/nodes/0"  }
# {"type": "server", "timestamp": "2019-02-15T13:46:32,912+0000", "level": "TRACE", "component": "o.e.g.MetaDataStateFormat", "cluster.name": "elasticsearch", "node.name": "opensuse-leap", "cluster.uuid": "7t2z5K-BRU6g62p6qkuX3Q", "node.id": "3FfHA4InSKasjqnUYx4oMw",  "message": "cleanupOldFiles: cleaning up /var/lib/elasticsearch/nodes/0"  }
# {"type": "server", "timestamp": "2019-02-15T13:46:32,912+0000", "level": "TRACE", "component": "o.e.g.MetaDataStateFormat", "cluster.name": "elasticsearch", "node.name": "opensuse-leap", "cluster.uuid": "7t2z5K-BRU6g62p6qkuX3Q", "node.id": "3FfHA4InSKasjqnUYx4oMw",  "message": "cleaned up /var/lib/elasticsearch/nodes/0/_state/manifest-5.st"  }
# {"type": "server", "timestamp": "2019-02-15T13:46:32,912+0000", "level": "TRACE", "component": "o.e.g.MetaDataStateFormat", "cluster.name": "elasticsearch", "node.name": "opensuse-leap", "cluster.uuid": "7t2z5K-BRU6g62p6qkuX3Q", "node.id": "3FfHA4InSKasjqnUYx4oMw",  "message": "cleaned up /var/lib/elasticsearch/nodes/0/_state/manifest-5.st"  }
# {"type": "server", "timestamp": "2019-02-15T13:46:32,912+0000", "level": "TRACE", "component": "o.e.g.MetaStateService", "cluster.name": "elasticsearch", "node.name": "opensuse-leap", "cluster.uuid": "7t2z5K-BRU6g62p6qkuX3Q", "node.id": "3FfHA4InSKasjqnUYx4oMw",  "message": "[_meta] state written (generation: 6)"  }
# {"type": "server", "timestamp": "2019-02-15T13:46:32,912+0000", "level": "TRACE", "component": "o.e.g.MetaStateService", "cluster.name": "elasticsearch", "node.name": "opensuse-leap", "cluster.uuid": "7t2z5K-BRU6g62p6qkuX3Q", "node.id": "3FfHA4InSKasjqnUYx4oMw",  "message": "[_meta] state written (generation: 6)"  }
andyb-elastic commented 5 years ago

I did some testing to look at the disk usage of the opensuse image; while I'm not sure it's conclusive, it does seem consistent with our theory here that the disk is causing the issue.

This was running the geonames rally track on a single node and monitoring with iostat at a 1-second interval, with both the node and rally running inside a vm image. The node was started the way we start it in the packaging tests, i.e. rally did not manage it. The conditions I tested were the centos-7 image vs the opensuse-42 image, with and without added stress on the host machine (where "stress" was an elasticsearch gradle check build running at the same time). This was hosted on my macbook, not a CI machine, using 6.6 (there was an issue with rally's use of typeless APIs on master at the time).
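
For anyone trying to repeat this, the setup was roughly as follows (the exact flags used aren't recorded here, so these are assumptions):

# inside the VM: capture extended device stats once per second while the benchmark runs
iostat -dxm 1 > iostat-run.log &

# run the geonames track against the already-running node
esrally --track=geonames --target-hosts=127.0.0.1:9200 --pipeline=benchmark-only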

(these plots are truncated to the x range where most of the activity is; the time range after this doesn't have much going on)

(plots: sda-rrqm, sda-wrqm, sda-avgqu-sz)

So the opensuse runs had a lot more merged read and write requests queued to the disk per second than the centos runs. Strangely, though, the average queue size looks similar for both.

Looking at average service time, the opensuse runs seem spikier, but again not hugely different qualitatively.

(plots: sda-r_await, sda-w_await)

The merged r/w requests queued per second seems like the most interesting bit, but I'm not very familiar with how that queueing works, so I'll look into it more.

Data these plots are from:

opensuse-nostress-iostat-docs-tagged.json.txt centos-stress-iostat-docs-tagged.json.txt opensuse-nostress-iostat-docs-tagged.json.txt opensuse-stress-iostat-docs-tagged.json.txt

rjernst commented 5 years ago

Both opensuse and sles were re-enabled with a recent refactoring of the packaging tests, and we haven't (yet) seen this issue crop up. Additionally, we are moving to running each OS within GCP instead of vagrant (at least for CI). I'm going to optimistically close this again.