Open kingherc opened 1 year ago
Pinging @elastic/es-search (Team:Search)
Another one at https://gradle-enterprise.elastic.co/s/x4zw72rykmhv2/console-log?task=:qa:ccs-rolling-upgrade-remote-cluster:v7.17.11%23oldClusterTest in 8.7
| » ↓ errors and warnings from /dev/shm/elastic+elasticsearch+8.7+periodic+java-fips-matrix/qa/ccs-rolling-upgrade-remote-cluster/build/testclusters/v7.17.11-local-0/logs/es.out ↓ |
| » [2023-05-16T02:30:23,971][ERROR][o.e.b.Elasticsearch ] [v7.17.11-local-0] fatal exception while booting Elasticsearch org.elasticsearch.ElasticsearchException: failed to bind service |
| » at org.elasticsearch.server@8.7.2-SNAPSHOT/org.elasticsearch.node.Node.<init>(Node.java:1148) |
| » at org.elasticsearch.server@8.7.2-SNAPSHOT/org.elasticsearch.node.Node.<init>(Node.java:324) |
| » at org.elasticsearch.server@8.7.2-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch$2.<init>(Elasticsearch.java:216) |
| » at org.elasticsearch.server@8.7.2-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch.initPhase3(Elasticsearch.java:216) |
| » at org.elasticsearch.server@8.7.2-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:67) |
| » Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.7.2] is only supported from version [7.17.0]. |
| » at org.elasticsearch.server@8.7.2-SNAPSHOT/org.elasticsearch.env.NodeEnvironment.checkForIndexCompatibility(NodeEnvironment.java:515) |
| » at org.elasticsearch.server@8.7.2-SNAPSHOT/org.elasticsearch.env.NodeEnvironment.upgradeLegacyNodeFolders(NodeEnvironment.java:414) |
| » at org.elasticsearch.server@8.7.2-SNAPSHOT/org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:307) |
| » at org.elasticsearch.server@8.7.2-SNAPSHOT/org.elasticsearch.node.Node.<init>(Node.java:480) |
| » ... 4 more
Seeing another failure today: https://gradle-enterprise.elastic.co/s/jmn5iecrre7to
I think the important bit for these is Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.8.0] is only supported from version [7.17.0]
Strange that this is only happening in the FIPS jobs. I can't imagine why that would affect backward compatibility.
Another one. FIPS again. https://gradle-enterprise.elastic.co/s/iemgifnzjjdi4
Pinging @elastic/es-core-infra (Team:Core/Infra)
I see this again today at https://gradle-enterprise.elastic.co/s/tfb2o66gvmxyk/console-log?anchor=4546&page=5
[2023-05-26T09:22:41,547][ERROR][o.e.b.Elasticsearch ] [v7.17.11-local-0] fatal exception while booting Elasticsearch org.elasticsearch.ElasticsearchException: failed to bind service |
| » Caused by: org.elasticsearch.gateway.CorruptStateException:
Format version is not supported. Upgrading to [8.9.0] is only supported from version [7.17.0].
Later in the logs there are perhaps unrelated or irrelevant errors:
[2023-05-26T09:22:33,871][ERROR][o.e.x.c.s.SSLService ] [v7.17.11-local-1] unsupported ciphers
[[TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384]] were requested but cannot be used in this JVM,
however there are supported ciphers that will be used [[TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_256_GCM_SHA384, TLS_RSA_WITH_AES_128_GCM_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA256, TLS_RSA_WITH_AES_128_CBC_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_AES_128_CBC_SHA]]. If you are trying to use ciphers with a key length greater than 128 bits on an Oracle JVM, you will need to install the unlimited strength JCE policy files. |
The failures are always "Format version is not supported":
» Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.7.2] is only supported from version [7.17.0].
» Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.8.0] is only supported from version [7.17.0].
» Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.8.1] is only supported from version [7.17.0].
» Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.9.0] is only supported from version [7.17.0].
» Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.8.0] is only supported from version [7.17.0].
The test passes for me locally: https://gradle-enterprise.elastic.co/s/d3evq5lhnvqtw
The actual check in NodeEnvironment.checkForIndexCompatibility is a check for metadata. I wonder if that's having issues on the FIPS machines.
static void checkForIndexCompatibility(Logger logger, DataPath... dataPaths) throws IOException {
    final Path[] paths = Arrays.stream(dataPaths).map(np -> np.path).toArray(Path[]::new);
    NodeMetadata metadata = PersistedClusterStateService.nodeMetadata(paths);
    // We are upgrading the cluster, but we didn't find any previous metadata. Corrupted state or incompatible version.
    if (metadata == null) {
        throw new CorruptStateException(
            "Format version is not supported. Upgrading to ["
                + Version.CURRENT
                + "] is only supported from version ["
                + Version.CURRENT.minimumCompatibilityVersion()
                + "]."
        );
    }
Seems to have started failing on ~May 3rd for the fips job https://gradle-enterprise.elastic.co/s/whs4mppb44lz2~ May 2nd https://gradle-enterprise.elastic.co/s/qiqtib7ht77dw
The only three recent successes for the main job are:
https://gradle-enterprise.elastic.co/s/svnovjom4juue - May 19 2023 15:51:39 CDT
https://gradle-enterprise.elastic.co/s/6u5llg576bue2 - May 22 2023 03:52:05 CDT
https://gradle-enterprise.elastic.co/s/2xao3b55zbgra - Jun 1 2023 15:52:02 CDT
Seems to have started around this commit 7ae8408082f7b6cc0172d0612f3a1aa843aeb50f.
I'm wondering if there's an issue with the fips setup. Tagging security to have y'all take a look
Pinging @elastic/es-security (Team:Security)
Has anyone been able to reproduce this locally? (I can't)
I haven't been able to trigger the fips version locally, my runs have all been non-fips.
FWIW, another failure: https://gradle-enterprise.elastic.co/s/25zqbmucs2ylk
I'm removing my assignment until security can determine it's an index version issue rather than a fips issue.
FWIW, this task has constantly been failing, there have only been three successes recently. @mark-vieira is there a good way to mute this just for fips for now?
You can add an assumeFalse() to the test. For example.
I can reproduce on a dedicated worker.
./gradlew :qa:ccs-rolling-upgrade-remote-cluster:check -Druntime.java=17 -Dtests.fips.enabled=true
* What went wrong:
Execution failed for task ':qa:ccs-rolling-upgrade-remote-cluster:v7.17.11#oldClusterTest'.
> process was found dead while waiting for ports files, node{:qa:ccs-rolling-upgrade-remote-cluster:v7.17.11-local-0}
[2023-06-13T03:07:38,498][ERROR][o.e.b.Elasticsearch ] [v7.17.11-local-0] fatal exception while booting Elasticsearch org.elasticsearch.ElasticsearchException: failed to bind service
at org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.node.Node.<init>(Node.java:1190)
at org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.node.Node.<init>(Node.java:334)
at org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch$2.<init>(Elasticsearch.java:231)
at org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch.initPhase3(Elasticsearch.java:231)
at org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:71)
Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.9.0] is only supported from version [7.17.0].
at org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.env.NodeEnvironment.checkForIndexCompatibility(NodeEnvironment.java:515)
It looks like that data directory is empty on the node:
$ ls -sR /dev/shm/elasticsearch/qa/ccs-rolling-upgrade-remote-cluster/build/testclusters/v7.17.11-local-0/data
/dev/shm/elasticsearch/qa/ccs-rolling-upgrade-remote-cluster/build/testclusters/v7.17.11-local-0/data:
total 0
0 node.lock 0 nodes
/dev/shm/elasticsearch/qa/ccs-rolling-upgrade-remote-cluster/build/testclusters/v7.17.11-local-0/data/nodes:
total 0
0 0
/dev/shm/elasticsearch/qa/ccs-rolling-upgrade-remote-cluster/build/testclusters/v7.17.11-local-0/data/nodes/0:
total 0
0 node.lock
~When I run the test without FIPS, and kill it at the same point, there is an indices/ directory, and various data files in _state/~
~So, I think the question is why doesn't any data get written when FIPS is enabled.~
It looks like I killed the non-FIPS test too late, and my conclusion was wrong.
With more debugging, it now seems that the FIPS test is creating a node.lock etc. too early. There should be nothing in the data directory at this point, but there is.
I wonder if the clusters are interfering with each other somehow (on FIPS only).
Not sure if this is something that could potentially explain the cause of this issue, but I've noticed that there is a pattern for these failures. Whenever the execution fails, it seems that the :qa:ccs-rolling-upgrade-remote-cluster:v8.9.0 task was started before the :qa:ccs-rolling-upgrade-remote-cluster:v7.17.11 task.
In case of successful executions, the versions are in ascending order.
The order shouldn't matter. These clusters use fresh working directories so they won't interfere with each other.
I think (wild theory incoming ...) this test is just broken, and the brokenness shows up differently with FIPS. I'm not sure why it just started happening.
Essentially, when forming the clusters for this test, we start to upgrade the node versions before we make sure a cluster has formed.
On non-FIPS that seems to kill the cluster while it still has an empty data dir, and the upgraded node is fine (it works like a clean boot). On FIPS it kills it while it has a node lock but no actual state. That means that the upgraded node doesn't start because it has a data dir that isn't safe (the failure message isn't ideal, but the state is genuinely bad).
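The distinction drawn here (an empty data dir versus a node lock with no state) can be checked with a small standalone helper when debugging a killed node. This is only a sketch: `DataDirState`, its `State` enum and `classify` are hypothetical names, not Elasticsearch APIs, and the helper only looks at file names (`node.lock`), not at anything Elasticsearch itself validates.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DataDirState {

    /** Hypothetical classification for debugging; not an Elasticsearch type. */
    enum State { EMPTY, LOCK_ONLY, HAS_FILES }

    /**
     * Walks the data directory and classifies it by the regular files it contains.
     * Directories are ignored, so a tree of empty directories counts as EMPTY.
     */
    static State classify(Path dataDir) throws IOException {
        if (Files.notExists(dataDir)) {
            return State.EMPTY;
        }
        try (Stream<Path> walk = Files.walk(dataDir)) {
            List<String> fileNames = walk.filter(Files::isRegularFile)
                .map(p -> p.getFileName().toString())
                .collect(Collectors.toList());
            if (fileNames.isEmpty()) {
                return State.EMPTY;
            }
            // Assumption: lock files are always named "node.lock", as in the listing in this thread.
            boolean onlyLocks = fileNames.stream().allMatch("node.lock"::equals);
            return onlyLocks ? State.LOCK_ONLY : State.HAS_FILES;
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the node's data directory as the first argument; "data" is just a placeholder default.
        Path dir = Path.of(args.length > 0 ? args[0] : "data");
        System.out.println(dir + " -> " + classify(dir));
    }
}
```

For a directory laid out like the `ls -sR` output earlier in this thread (only `node.lock` and `nodes/0/node.lock`), this helper would report `LOCK_ONLY`, which is the state the theory says the upgraded node refuses to boot from.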
I tried to add localCluster.get().waitForAllConditions() before upgrading the node version, but then it fails with:
Caused by: java.lang.IllegalStateException: node version [7.17.11] may not join a cluster comprising only nodes of version [8.9.0] or greater
» at org.elasticsearch.cluster.coordination.NodeJoinExecutor.ensureVersionBarrier(NodeJoinExecutor.java:414) ~[?:?]
» at org.elasticsearch.cluster.coordination.Coordinator.validateJoinRequest(Coordinator.java:689) ~[elasticsearch-7.17.11-SNAPSHOT.jar:7.17.11-SNAPSHOT]
» at org.elasticsearch.cluster.coordination.Coordinator$2.onResponse(Coordinator.java:631) ~[elasticsearch-7.17.11-SNAPSHOT.jar:7.17.11-SNAPSHOT]
» at org.elasticsearch.cluster.coordination.Coordinator$2.onResponse(Coordinator.java:626) ~[elasticsearch-7.17.11-SNAPSHOT.jar:7.17.11-SNAPSHOT]
(even on non-FIPS).
I haven't tried to work out why we're ending up in a state where we're adding a 7.x node to an 8.x cluster. This test confuses me a bit and I'm not sure exactly what it's trying to do.
This is still failing - on 8.8 today: https://gradle-enterprise.elastic.co/s/4zwyhji33bvl4
Another one 8.8 at https://gradle-enterprise.elastic.co/s/qdti5ijqsjxaq
This issue has been closed because it has been open for too long with no activity.
Any muted tests that were associated with this issue have been unmuted.
If the tests begin failing again, a new issue will be opened, and they may be muted again.
Reopening because the test is still excluded from running in FIPS mode.
CI Link
https://gradle-enterprise.elastic.co/s/br3zbuqz6alce
Repro line
N/A
Does it reproduce?
Didn't try
Applicable branches
main, 8.7, possibly others
Failure history
No response
Failure excerpt