elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
68.54k stars 24.34k forks source link

[CI] Failing cluster start in ':qa:repository-multi-version:v8.3.0#Step2NewClusterTest', maybe FIPS related #87106

Open cbuescher opened 2 years ago

cbuescher commented 2 years ago

CI Link

https://gradle-enterprise.elastic.co/s/32omqddh3kule

Repro line

none

Does it reproduce?

No

Applicable branches

main

Failure history

-

Failure excerpt

2022-05-25T10:05:40,491][WARN ][stderr                   ] [yamlRestTestV7CompatTest-0] INFO: Initializing with trust store at path: /dev/shm/elastic+elasticsearch+main+periodic+java-fips-matrix/x-pack/plugin/stack/qa/rest/build/testclusters/yamlRestTestV7CompatTest-0/config/cacerts.bcfks
»   ↑ repeated 4 times ↑
» [2022-05-25T10:05:36,675][ERROR][o.e.x.c.s.SSLService     ] [yamlRestTestV7CompatTest-0] unsupported ciphers [[TLS_AES_256_GCM_SHA384, TLS_AES_128_GCM_SHA256, TLS_CHACHA20_POLY1305_SHA256, TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256, TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384]] were requested but cannot be used in this JVM, however there are supported ciphers that will be used [[TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_256_GCM_SHA384, TLS_RSA_WITH_AES_128_GCM_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA256, TLS_RSA_WITH_AES_128_CBC_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_AES_128_CBC_SHA]]. If you are trying to use ciphers with a key length greater than 128 bits on an Oracle JVM, you will need to install the unlimited strength JCE policy files.
» [2022-05-25T10:05:37,615][WARN ][stderr                   ] [yamlRestTestV7CompatTest-0] May 25, 2022 10:05:37 AM org.bouncycastle.jsse.provider.ProvTrustManagerFactorySpi getDefaultTrustStore
» [2022-05-25T10:05:40,491][WARN ][stderr                   ] [yamlRestTestV7CompatTest-0] May 25, 2022 10:05:40 AM org.bouncycastle.jsse.provider.ProvTrustManagerFactorySpi getDefaultTrustStore
`node{:qa:repository-multi-version:v8.3.0-new-0}` failed to wait for ports files after 120000 MILLISECONDS
elasticmachine commented 2 years ago

Pinging @elastic/es-security (Team:Security)

cbuescher commented 2 years ago

I'm marking this as FIPS just because of this line in the logs:

» [2022-05-25T10:05:36,675][ERROR][o.e.x.c.s.SSLService     ] [yamlRestTestV7CompatTest-0] unsupported ciphers [[TLS_AES_256_GCM_SHA384, TLS_AES_128_GCM_SHA256, TLS_CHACHA20_POLY1305_SHA256, TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256, TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384]] were requested but cannot be used in this JVM, however there are supported ciphers that will be used [[TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_256_GCM_SHA384, TLS_RSA_WITH_AES_128_GCM_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA256, TLS_RSA_WITH_AES_128_CBC_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_AES_128_CBC_SHA]]. If you are trying to use ciphers with a key length greater than 128 bits on an Oracle JVM, you will need to install the unlimited strength JCE policy files.

I don't know if that is enough to explain the cluster not booting up under 120000ms.

slobodanadamovic commented 2 years ago

I don't think that the logged error is the actual cause. It looks like the node just took longer to start.

» [2022-05-25T10:03:37,564][INFO ][o.e.n.Node               ] [yamlRestTestV7CompatTest-0] node name [yamlRestTestV7CompatTest-0], node ID [6ctzuUuKRYqC6USwGww_Rw], cluster name [yamlRestTestV7CompatTest], roles [ingest, data_frozen, ml, data_hot, transform, data_content, data_warm, master, remote_cluster_client, data, data_cold]
» [2022-05-25T10:05:37,689][INFO ][o.e.x.s.Security         ] [yamlRestTestV7CompatTest-0] Security is enabled
» [2022-05-25T10:05:37,968][INFO ][o.e.x.s.a.s.FileRolesStore] [yamlRestTestV7CompatTest-0] parsed [1] roles from file [/dev/shm/elastic+elasticsearch+main+periodic+java-fips-matrix/x-pack/plugin/stack/qa/rest/build/testclusters/yamlRestTestV7CompatTest-0/config/roles.yml]
» [2022-05-25T10:05:39,406][INFO ][o.e.t.n.NettyAllocator   ] [yamlRestTestV7CompatTest-0] creating NettyAllocator with the following configs: [name=unpooled, suggested_max_allocation_size=1mb, factors={es.unsafe.use_unpooled_allocator=null, g1gc_enabled=true, g1gc_region_size=4mb, heap_size=512mb}]
» [2022-05-25T10:05:39,457][INFO ][o.e.i.r.RecoverySettings ] [yamlRestTestV7CompatTest-0] using rate limit [40mb] with [default=40mb, read=0b, write=0b, max=0b]
» [2022-05-25T10:05:39,522][INFO ][o.e.d.DiscoveryModule    ] [yamlRestTestV7CompatTest-0] using discovery type [multi-node] and seed hosts providers [settings, file]
» [2022-05-25T10:05:40.730410371Z] [BUILD] Stopping node
» [2022-05-25T10:05:40,752][INFO ][o.e.n.Node               ] [yamlRestTestV7CompatTest-0] initialized
» [2022-05-25T10:05:40,753][INFO ][o.e.n.Node               ] [yamlRestTestV7CompatTest-0] starting ...

There are 2 minutes delay between node info message and security enabled message.

[2022-05-25T10:03:37,564][INFO ][o.e.n.Node               ] 
[2022-05-25T10:05:37,689][INFO ][o.e.x.s.Security         ] 

The node started immediately after the build initiated stop. Looking at the history of failures it does seem that it mostly happens in FIPS mode. My best guess is that it depends on a number of loaded modules which caused the initialization to sometime take longer.