elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

main - java fips compatibility matrix openjdk17 bwcTestSnapshots general-purpose ccs-rolling-upgrade-remote-cluster #96134

Open kingherc opened 1 year ago

kingherc commented 1 year ago

CI Link

https://gradle-enterprise.elastic.co/s/br3zbuqz6alce

Repro line

N/A

Does it reproduce?

Didn't try

Applicable branches

main, 8.7, possibly others

Failure history

No response

Failure excerpt


:qa:ccs-rolling-upgrade-remote-cluster:v7.17.11#oldClusterTest FAILED

=== Log output of node `node{:qa:ccs-rolling-upgrade-remote-cluster:v7.17.11-local-0}` ===

»    ↓ errors and warnings from /dev/shm/elastic+elasticsearch+main+periodic+java-fips-matrix/qa/ccs-rolling-upgrade-remote-cluster/build/testclusters/v7.17.11-local-0/logs/es.out ↓
» [2023-05-15T21:20:22,822][ERROR][o.e.b.Elasticsearch      ] [v7.17.11-local-0] fatal exception while booting Elasticsearch org.elasticsearch.ElasticsearchException: failed to bind service
»   at org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.node.Node.<init>(Node.java:1169)
»   at org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.node.Node.<init>(Node.java:329)
»   at org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch$2.<init>(Elasticsearch.java:216)
»   at org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch.initPhase3(Elasticsearch.java:216)
»   at org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:67)
»  Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.9.0] is only supported from version [7.17.0].
»   at org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.env.NodeEnvironment.checkForIndexCompatibility(NodeEnvironment.java:515)
»   at org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.env.NodeEnvironment.upgradeLegacyNodeFolders(NodeEnvironment.java:414)
»   at org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:307)
»   at org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.node.Node.<init>(Node.java:485)
»   ... 4 more
»
»  ERROR: Elasticsearch did not exit normally - check the logs at /dev/shm/elastic+elasticsearch+main+periodic+java-fips-matrix/qa/ccs-rolling-upgrade-remote-cluster/build/testclusters/v7.17.11-local-0/logs/v7.17.11-local.log
»
»  ERROR: Elasticsearch exited unexpectedly
»   ↓ last 40 non error or warning messages from /dev/shm/elastic+elasticsearch+main+periodic+java-fips-matrix/qa/ccs-rolling-upgrade-remote-cluster/build/testclusters/v7.17.11-local-0/logs/es.out ↓
» [2023-05-15T21:20:18,670][INFO ][o.e.p.PluginsService     ] [v7.17.11-local-0] loaded module [mapper-version]
» [2023-05-15T21:20:18,670][INFO ][o.e.p.PluginsService     ] [v7.17.11-local-0] loaded module [mapper-extras]
» [2023-05-15T21:20:18,670][INFO ][o.e.p.PluginsService     ] [v7.17.11-local-0] loaded module [apm]
» [2023-05-15T21:20:18,671][INFO ][o.e.p.PluginsService     ] [v7.17.11-local-0] loaded module [x-pack-aggregate-metric]

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-search (Team:Search)

kingherc commented 1 year ago

Another one at https://gradle-enterprise.elastic.co/s/x4zw72rykmhv2/console-log?task=:qa:ccs-rolling-upgrade-remote-cluster:v7.17.11%23oldClusterTest in 8.7

»    ↓ errors and warnings from /dev/shm/elastic+elasticsearch+8.7+periodic+java-fips-matrix/qa/ccs-rolling-upgrade-remote-cluster/build/testclusters/v7.17.11-local-0/logs/es.out ↓
» [2023-05-16T02:30:23,971][ERROR][o.e.b.Elasticsearch      ] [v7.17.11-local-0] fatal exception while booting Elasticsearch org.elasticsearch.ElasticsearchException: failed to bind service
»   at org.elasticsearch.server@8.7.2-SNAPSHOT/org.elasticsearch.node.Node.<init>(Node.java:1148)
»   at org.elasticsearch.server@8.7.2-SNAPSHOT/org.elasticsearch.node.Node.<init>(Node.java:324)
»   at org.elasticsearch.server@8.7.2-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch$2.<init>(Elasticsearch.java:216)
»   at org.elasticsearch.server@8.7.2-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch.initPhase3(Elasticsearch.java:216)
»   at org.elasticsearch.server@8.7.2-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:67)
»  Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.7.2] is only supported from version [7.17.0].
»   at org.elasticsearch.server@8.7.2-SNAPSHOT/org.elasticsearch.env.NodeEnvironment.checkForIndexCompatibility(NodeEnvironment.java:515)
»   at org.elasticsearch.server@8.7.2-SNAPSHOT/org.elasticsearch.env.NodeEnvironment.upgradeLegacyNodeFolders(NodeEnvironment.java:414)
»   at org.elasticsearch.server@8.7.2-SNAPSHOT/org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:307)
»   at org.elasticsearch.server@8.7.2-SNAPSHOT/org.elasticsearch.node.Node.<init>(Node.java:480)
»   ... 4 more

kingherc commented 1 year ago

Also https://gradle-enterprise.elastic.co/s/bb4ztrkx2f3ka/console-log?task=:qa:ccs-rolling-upgrade-remote-cluster:v7.17.11%23oldClusterTest in 8.8

n1v0lg commented 1 year ago

Seeing another failure today: https://gradle-enterprise.elastic.co/s/jmn5iecrre7to

I think the important bit for these is Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.8.0] is only supported from version [7.17.0]

mark-vieira commented 1 year ago

Strange this is only happening in the FIPS jobs. I can't imagine why that would affect backward compatibility.

astefan commented 1 year ago

Another one. FIPS again. https://gradle-enterprise.elastic.co/s/iemgifnzjjdi4

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-core-infra (Team:Core/Infra)

craigtaverner commented 1 year ago

I see this again today at https://gradle-enterprise.elastic.co/s/tfb2o66gvmxyk/console-log?anchor=4546&page=5

[2023-05-26T09:22:41,547][ERROR][o.e.b.Elasticsearch      ] [v7.17.11-local-0] fatal exception while booting Elasticsearch org.elasticsearch.ElasticsearchException: failed to bind service
»  Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.9.0] is only supported from version [7.17.0].

Later in the logs there are perhaps unrelated or irrelevant errors:

[2023-05-26T09:22:33,871][ERROR][o.e.x.c.s.SSLService     ] [v7.17.11-local-1] unsupported ciphers
    [[TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384]] were requested but cannot be used in this JVM,
    however there are supported ciphers that will be used [[TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_256_GCM_SHA384, TLS_RSA_WITH_AES_128_GCM_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA256, TLS_RSA_WITH_AES_128_CBC_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_AES_128_CBC_SHA]]. If you are trying to use ciphers with a key length greater than 128 bits on an Oracle JVM, you will need to install the unlimited strength JCE policy files.

edsavage commented 1 year ago

Another one: https://gradle-enterprise.elastic.co/s/qpj6ucfk3w2hi/console-log?anchor=3399&page=4

ywangd commented 1 year ago

Today again https://gradle-enterprise.elastic.co/s/7vxpvijmzch44

stu-elastic commented 1 year ago

The failures are always "Format version is not supported":

»  Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.7.2] is only supported from version [7.17.0].
»  Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.8.0] is only supported from version [7.17.0].
»  Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.8.1] is only supported from version [7.17.0].
»  Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.9.0] is only supported from version [7.17.0].
»  Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.8.0] is only supported from version [7.17.0].
stu-elastic commented 1 year ago

The test passes for me locally: https://gradle-enterprise.elastic.co/s/d3evq5lhnvqtw

stu-elastic commented 1 year ago

The actual check in NodeEnvironment.checkForIndexCompatibility is a check for node metadata. I wonder if that's having issues on the FIPS machines.

    static void checkForIndexCompatibility(Logger logger, DataPath... dataPaths) throws IOException {
        final Path[] paths = Arrays.stream(dataPaths).map(np -> np.path).toArray(Path[]::new);
        NodeMetadata metadata = PersistedClusterStateService.nodeMetadata(paths);

        // We are upgrading the cluster, but we didn't find any previous metadata. Corrupted state or incompatible version.
        if (metadata == null) {
            throw new CorruptStateException(
                "Format version is not supported. Upgrading to ["
                    + Version.CURRENT
                    + "] is only supported from version ["
                    + Version.CURRENT.minimumCompatibilityVersion()
                    + "]."
            );
        }
stu-elastic commented 1 year ago

Seems to have started failing on ~May 3rd for the fips job https://gradle-enterprise.elastic.co/s/whs4mppb44lz2~ May 2nd https://gradle-enterprise.elastic.co/s/qiqtib7ht77dw

stu-elastic commented 1 year ago

The only three recent successes for the main job are:
https://gradle-enterprise.elastic.co/s/svnovjom4juue - May 19 2023 15:51:39 CDT
https://gradle-enterprise.elastic.co/s/6u5llg576bue2 - May 22 2023 03:52:05 CDT
https://gradle-enterprise.elastic.co/s/2xao3b55zbgra - Jun 1 2023 15:52:02 CDT

stu-elastic commented 1 year ago

Seems to have started around this commit 7ae8408082f7b6cc0172d0612f3a1aa843aeb50f.

I'm wondering if there's an issue with the fips setup. Tagging security to have y'all take a look

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-security (Team:Security)

tvernum commented 1 year ago

Has anyone been able to reproduce this locally? (I can't)

stu-elastic commented 1 year ago

I haven't been able to trigger the FIPS version locally; my runs have all been non-FIPS.

bpintea commented 1 year ago

FWIW, another failure: https://gradle-enterprise.elastic.co/s/25zqbmucs2ylk

davidkyle commented 1 year ago

And another https://gradle-enterprise.elastic.co/s/tohboeyimyarq/console-log?task=:qa:ccs-rolling-upgrade-remote-cluster:v7.17.11%23oldClusterTest

stu-elastic commented 1 year ago

I'm removing my assignment until security can determine it's an index version issue rather than a fips issue.

FWIW, this task has constantly been failing, there have only been three successes recently. @mark-vieira is there a good way to mute this just for fips for now?

mark-vieira commented 1 year ago

> FWIW, this task has constantly been failing, there have only been three successes recently. @mark-vieira is there a good way to mute this just for fips for now?

You can add an assumeFalse() to the test.
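
As a rough illustration of that suggestion, here is a minimal, self-contained sketch of how an assumption can skip a test only when FIPS mode is on. It does not use JUnit: the local `assumeFalse` and `SkippedException` are hypothetical stand-ins for JUnit's assumption mechanism, and reading the `tests.fips.enabled` system property (the flag passed to Gradle later in this thread) inside the test JVM is an assumption, not something the thread confirms.

```java
class FipsMuteSketch {
    /** Hypothetical stand-in for the exception a test framework throws to mark a test as skipped. */
    static class SkippedException extends RuntimeException {
        SkippedException(String message) {
            super(message);
        }
    }

    /** Stand-in for an assumeFalse-style helper: skip (rather than fail) when the condition holds. */
    static void assumeFalse(String message, boolean condition) {
        if (condition) {
            throw new SkippedException(message);
        }
    }

    /** Assumption: the FIPS flag is visible to the test JVM as a system property. */
    static boolean fipsEnabled() {
        return Boolean.parseBoolean(System.getProperty("tests.fips.enabled"));
    }

    /** Plays the role of the test body; returns a status string instead of reporting to a runner. */
    static String runTest() {
        try {
            assumeFalse("muted on FIPS, see #96134", fipsEnabled());
            return "ran";
        } catch (SkippedException e) {
            return "skipped: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(runTest());
    }
}
```

The point of an assumption over an outright mute is that the test keeps running in every non-FIPS job, so coverage there is not lost while the FIPS failure is investigated.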

stu-elastic commented 1 year ago

Silenced in https://github.com/elastic/elasticsearch/pull/96776

tvernum commented 1 year ago

I can reproduce on a dedicated worker.

./gradlew :qa:ccs-rolling-upgrade-remote-cluster:check  -Druntime.java=17 -Dtests.fips.enabled=true
* What went wrong:
Execution failed for task ':qa:ccs-rolling-upgrade-remote-cluster:v7.17.11#oldClusterTest'.
> process was found dead while waiting for ports files, node{:qa:ccs-rolling-upgrade-remote-cluster:v7.17.11-local-0}
[2023-06-13T03:07:38,498][ERROR][o.e.b.Elasticsearch      ] [v7.17.11-local-0] fatal exception while booting Elasticsearch org.elasticsearch.ElasticsearchException: failed to bind service
        at org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.node.Node.<init>(Node.java:1190)
        at org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.node.Node.<init>(Node.java:334)
        at org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch$2.<init>(Elasticsearch.java:231)
        at org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch.initPhase3(Elasticsearch.java:231)
        at org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:71)
Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.9.0] is only supported from version [7.17.0].
        at org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.env.NodeEnvironment.checkForIndexCompatibility(NodeEnvironment.java:515)

It looks like that data directory is empty on the node:

$ ls -sR /dev/shm/elasticsearch/qa/ccs-rolling-upgrade-remote-cluster/build/testclusters/v7.17.11-local-0/data
/dev/shm/elasticsearch/qa/ccs-rolling-upgrade-remote-cluster/build/testclusters/v7.17.11-local-0/data:
total 0
0 node.lock  0 nodes

/dev/shm/elasticsearch/qa/ccs-rolling-upgrade-remote-cluster/build/testclusters/v7.17.11-local-0/data/nodes:
total 0
0 0

/dev/shm/elasticsearch/qa/ccs-rolling-upgrade-remote-cluster/build/testclusters/v7.17.11-local-0/data/nodes/0:
total 0
0 node.lock

~When I run the test without FIPS, and kill it at the same point, there is an indices/ directory, and various data files in _state/~

~So, I think the question is why doesn't any data get written when FIPS is enabled~.

tvernum commented 1 year ago

It looks like I killed the non-FIPS test too late, and my conclusion was wrong.

With more debugging, it now seems that the FIPS test is creating a node.lock etc too early. There should be nothing in the data directory at this point, but there is. I wonder if the clusters are interfering with each other somehow (on FIPS only).

slobodanadamovic commented 1 year ago

Not sure if this is something that could potentially explain the cause of this issue, but I've noticed that there is a pattern for these failures. Whenever the execution fails it seems that :qa:ccs-rolling-upgrade-remote-cluster:v8.9.0 task was started before :qa:ccs-rolling-upgrade-remote-cluster:v7.17.11:

image

In case of successful executions, the versions are in ascending order:

image

mark-vieira commented 1 year ago

The order shouldn't matter. These clusters use fresh working directories so they won't interfere with each other.

tvernum commented 1 year ago

I think (wild theory incoming ...) this test is just broken, and the brokenness shows up differently with FIPS. I'm not sure why it just started happening.

Essentially, when forming the clusters for this test, we start to upgrade the node versions before we make sure a cluster has formed.

On non-FIPS that seems to kill the cluster while it still has an empty data dir, and the upgraded node is fine (it works like a clean boot). On FIPS it kills it while it has a node lock but no actual state. That means that the upgraded node doesn't start because it has a data-dir that isn't safe (the failure message isn't ideal, but the state is genuinely bad)
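
To make the three cases in this theory concrete, the sketch below classifies a data directory the way the paragraph above describes it: nothing there (clean boot), a `nodes/0` folder with only a lock file (the unsafe state seen on FIPS), or a folder with files under `_state`. This is a simplified illustration under stated assumptions, not the real `NodeEnvironment` logic; the class, enum, and the `node-0.st` file name are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

class DataDirSketch {
    enum State { EMPTY, LOCK_ONLY, HAS_STATE }

    /** Simplified classification of a node data directory; hypothetical, not Elasticsearch code. */
    static State classify(Path dataDir) throws IOException {
        Path legacyNodeDir = dataDir.resolve("nodes").resolve("0");
        if (!Files.isDirectory(legacyNodeDir)) {
            return State.EMPTY; // nothing to upgrade: behaves like a clean boot
        }
        Path stateDir = legacyNodeDir.resolve("_state");
        if (Files.isDirectory(stateDir)) {
            try (Stream<Path> files = Files.list(stateDir)) {
                if (files.findAny().isPresent()) {
                    return State.HAS_STATE; // previous metadata exists
                }
            }
        }
        return State.LOCK_ONLY; // folder exists but holds no metadata: the failing case
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("datadir-sketch");
        System.out.println(classify(dir)); // EMPTY

        // Mirror the listing from the failing FIPS run: nodes/0/node.lock and nothing else.
        Files.createDirectories(dir.resolve("nodes").resolve("0"));
        Files.createFile(dir.resolve("nodes").resolve("0").resolve("node.lock"));
        System.out.println(classify(dir)); // LOCK_ONLY

        // Placeholder state file name, chosen only for the demonstration.
        Path stateDir = Files.createDirectories(dir.resolve("nodes").resolve("0").resolve("_state"));
        Files.createFile(stateDir.resolve("node-0.st"));
        System.out.println(classify(dir)); // HAS_STATE
    }
}
```

In this reading, `LOCK_ONLY` is the case where the upgraded node finds a legacy folder but no metadata, which matches the `metadata == null` branch quoted earlier in the thread.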

I tried to add localCluster.get().waitForAllConditions() before upgrading the node version, but then it fails with:

Caused by: java.lang.IllegalStateException: node version [7.17.11] may not join a cluster comprising only nodes of version [8.9.0] or greater
»       at org.elasticsearch.cluster.coordination.NodeJoinExecutor.ensureVersionBarrier(NodeJoinExecutor.java:414) ~[?:?]
»       at org.elasticsearch.cluster.coordination.Coordinator.validateJoinRequest(Coordinator.java:689) ~[elasticsearch-7.17.11-SNAPSHOT.jar:7.17.11-SNAPSHOT]
»       at org.elasticsearch.cluster.coordination.Coordinator$2.onResponse(Coordinator.java:631) ~[elasticsearch-7.17.11-SNAPSHOT.jar:7.17.11-SNAPSHOT]
»       at org.elasticsearch.cluster.coordination.Coordinator$2.onResponse(Coordinator.java:626) ~[elasticsearch-7.17.11-SNAPSHOT.jar:7.17.11-SNAPSHOT]

(even on non-FIPS).

I haven't tried to work out why we're ending up in a state where we're adding a 7.x node to an 8.x cluster. This test confuses me a bit and I'm not sure exactly what it's trying to do.
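
For readers unfamiliar with the join rejection quoted above, the message by itself suggests a check of roughly the following shape: a node is refused if its version is lower than the lowest version already in the cluster. The sketch is inferred from the error text only; the `Version` class and the exact comparison are assumptions and may differ from what `NodeJoinExecutor.ensureVersionBarrier` really does.

```java
class VersionBarrierSketch {
    /** Minimal hypothetical version type: major.minor.patch, compared field by field. */
    static final class Version implements Comparable<Version> {
        final int major;
        final int minor;
        final int patch;

        Version(int major, int minor, int patch) {
            this.major = major;
            this.minor = minor;
            this.patch = patch;
        }

        static Version parse(String text) {
            String[] parts = text.split("\\.");
            return new Version(Integer.parseInt(parts[0]), Integer.parseInt(parts[1]), Integer.parseInt(parts[2]));
        }

        @Override
        public int compareTo(Version other) {
            if (major != other.major) {
                return Integer.compare(major, other.major);
            }
            if (minor != other.minor) {
                return Integer.compare(minor, other.minor);
            }
            return Integer.compare(patch, other.patch);
        }

        @Override
        public String toString() {
            return major + "." + minor + "." + patch;
        }
    }

    /** Assumed rule: reject a joining node older than the oldest node already in the cluster. */
    static void ensureVersionBarrier(Version joining, Version minInCluster) {
        if (joining.compareTo(minInCluster) < 0) {
            throw new IllegalStateException(
                "node version [" + joining + "] may not join a cluster comprising only nodes of version ["
                    + minInCluster + "] or greater"
            );
        }
    }

    public static void main(String[] args) {
        try {
            ensureVersionBarrier(Version.parse("7.17.11"), Version.parse("8.9.0"));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If the rule is of this kind, waiting for the cluster to form before switching versions would expose any step where a 7.17.11 node tries to join after all remaining members are already on 8.9.0, which is consistent with the failure appearing on non-FIPS as well.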

quux00 commented 1 year ago

This is still failing - on 8.8 today: https://gradle-enterprise.elastic.co/s/4zwyhji33bvl4

kingherc commented 1 year ago

Another one on 8.8 at https://gradle-enterprise.elastic.co/s/qdti5ijqsjxaq