OpenLiberty / open-liberty

Open Liberty is a highly composable, fast to start, dynamic application server runtime environment
https://openliberty.io
Eclipse Public License 2.0
1.14k stars 587 forks source link

unexpected eof while reading:ssl/record/rec_layer_s3.c:320 when enable semeruCloudCompiler with Liberty Java21 #27665

Open tam512 opened 6 months ago

tam512 commented 6 months ago

Test InstantOn with Open Liberty 24.0.0.2 Java21 on Power9 OCP 4.14 cluster. I deployed the checkpoint app image to the cluster using Open Liberty Operator 1.3.1 with JIT enable (.spec.semeruCloudCompiler.enable: true)

apiVersion: apps.openliberty.io/v1
kind: OpenLibertyApplication
metadata:
  name: myapp
spec:
  applicationImage: <my-checkpoint-app-image>
  semeruCloudCompiler:
    enable: true
    replicas: 1
    ...............

JITServer is ready to accept incoming requests A0F04FE4FE7F0000:error:0A000126:SSL routines:ssl3_read_n:unexpected eof while reading:ssl/record/rec_layer_s3.c:320: A0F04FE4FE7F0000:error:0A000126:SSL routines:ssl3_read_n:unexpected eof while reading:ssl/record/rec_layer_s3.c:320: A0F04FE4FE7F0000:error:0A000126:SSL routines:ssl3_read_n:unexpected eof while reading:ssl/record/rec_layer_s3.c:320: A0F04FE4FE7F0000:error:0A000126:SSL routines:ssl3_read_n:unexpected eof while reading:ssl/record/rec_layer_s3.c:320:

JITServer: t= 21652 A new client (clientUID=18080393426921679733) connected. Server allocated a new client session.

A0F04FE4FE7F0000:error:0A000126:SSL routines:ssl3_read_n:unexpected eof while reading:ssl/record/rec_layer_s3.c:320: A0F04FE4FE7F0000:error:0A000126:SSL routines:ssl3_read_n:unexpected eof while reading:ssl/record/rec_layer_s3.c:320: A0F04FE4FE7F0000:error:0A000126:SSL routines:ssl3_read_n:unexpected eof while reading:ssl/record/rec_layer_s3.c:320: A0F04FE4FE7F0000:error:0A000126:SSL routines:ssl3_read_n:unexpected eof while reading:ssl/record/rec_layer_s3.c:320: A0F04FE4FE7F0000:error:0A000126:SSL routines:ssl3_read_n:unexpected eof while reading:ssl/record/rec_layer_s3.c:320: A0F04FE4FE7F0000:error:0A000126:SSL routines:ssl3_read_n:unexpected eof while reading:ssl/record/rec_layer_s3.c:320: A0F04FE4FE7F0000:error:0A000126:SSL routines:ssl3_read_n:unexpected eof while reading:ssl/record/rec_layer_s3.c:320: A0F04FE4FE7F0000:error:0A000126:SSL routines:ssl3_read_n:unexpected eof while reading:ssl/record/rec_layer_s3.c:320: A0F04FE4FE7F0000:error:0A000126:SSL routines:ssl3_read_n:unexpected eof while reading:ssl/record/rec_layer_s3.c:320: A0F04FE4FE7F0000:error:0A000126:SSL routines:ssl3_read_n:unexpected eof while reading:ssl/record/rec_layer_s3.c:320: A0F04FE4FE7F0000:error:0A000126:SSL routines:ssl3_read_n:unexpected eof while reading:ssl/record/rec_layer_s3.c:320: A0F04FE4FE7F0000:error:0A000126:SSL routines:ssl3_read_n:unexpected eof while reading:ssl/record/rec_layer_s3.c:320: A0F04FE4FE7F0000:error:0A000126:SSL routines:ssl3_read_n:unexpected eof while reading:ssl/record/rec_layer_s3.c:320: A0F04FE4FE7F0000:error:0A000126:SSL routines:ssl3_read_n:unexpected eof while reading:ssl/record/rec_layer_s3.c:320: A0F04FE4FE7F0000:error:0A000126:SSL routines:ssl3_read_n:unexpected eof while reading:ssl/record/rec_layer_s3.c:320: A0F04FE4FE7F0000:error:0A000126:SSL routines:ssl3_read_n:unexpected eof while reading:ssl/record/rec_layer_s3.c:320:

tam512 commented 6 months ago

I also see the same errors on Azure Kubernetes Service (AKS) cluster

mpirvu commented 6 months ago

I'll have a look. A question though: is this seen only on Power (i.e. is x86-64 running fine)?

tam512 commented 6 months ago

@mpirvu I also see the errors on AKS cluster which is x86-64

mpirvu commented 6 months ago

Thanks. I have a few more questions to limit my search space:

mpirvu commented 6 months ago

Another question: did we ever had a combination of InstantOn + Semeru Cloud Compiler that worked ok in OCP when deployed with the Liberty Operator?

I have tested Liberty 24.0.0.1 + Java 21 release 0.42.0 with InstantON and Semeru Cloud Compiler in containers, but not in OCP and it's working ok. This is why I am trying to understand whether this test using the Liberty Operator has ever passed.

tam512 commented 6 months ago

this is the first time I test InstantOn + Semeru Cloud Compiler on OCP4.14

mpirvu commented 6 months ago

this is the first time I test InstantOn + Semeru Cloud Compiler on OCP4.14

Then, it's possible that there is something wrong at the Operator level on how the client and server are setup. The symptoms indicate that the certificate at the client does not match the one at the server.

mpirvu commented 6 months ago

@leochr could you please weigh-in on this issue?

tam512 commented 6 months ago

I also see the error on x86 OCP 4.14 cluster with Java 21 but not with Java 17

mpirvu commented 6 months ago

@tam512 In another issue(performance related) you deployed Liberty+Semeru Cloud Compiler with Java 21 in OCP using yaml files. Did that experiment use encryption? If so, I would rather say it's an Operator issue rather than a JVM issue. Thanks

leochr commented 6 months ago

InstantOn app with Liberty Operator and Semeru Cloud Compiler was validated on OCP as part of this issue: https://github.com/OpenLiberty/open-liberty-operator/issues/438#issuecomment-1894359940

I am not sure which Java version was used, but Liberty images with Java 21 weren't available then.

Liberty Operator generates the certificate and sets up the server and client the same way whether the app is InstantOn or not. The only difference is that the InstantOn app client gets the server details via the OPENJ9_RESTORE_JAVA_OPTIONS env.

leochr commented 6 months ago

I'll suggest trying this scenario on OCP without Liberty Operator.

I also wonder whether InstantOn could possibly be overridding the certificate, so instead of using the certificate mounted at runtime (by the Operator), a certificate from the snapshot is used.

vijaysun-omr commented 6 months ago

I agree with @leochr last thought, and one way to verify if InstantOn could be failing to pick up from the restore environment is to try the same test (using Operator and Semeru Cloud Compiler) that @tam512 was running without InstantOn being used.

tam512 commented 6 months ago

I tried the same test, deploy app without InstantOn using Open Liberty operator+semeruCloudCompiler, I still see the same errors so it's not instantOn related

mpirvu commented 6 months ago

Once, I'll get my hands on a K8 cluster I'll try to reproduce. Just using containers I don't see the issue.

mpirvu commented 6 months ago

I created a Kubernetes cluster based on Microk8s and a container with Liberty 24.0.0.1 and AcmeAir app. If I deploy the container with OLO using Java 21 I see the problem mentioned in this issue. If I deploy the container with OLO using Java 17, everything is fine. If I deploy the container with yaml, using either Java 17 or Java 21 everything is fine.

Thinking that the certificates created by OLO+Java21 may be wrong, I deployed my client+server solution with yaml files, but pointing at the certificates from the TLS secret created by OLO. This was also successful which means that the certificates are good.

In another experiment, I connected my AcmeAir container deployed with yaml to the server deployed by OLO. This also fails, so this may point at a difference between the server launched by OLO and the server launched with yaml, but so far I haven't found any.

mpirvu commented 6 months ago

I added verbose options to the client and I see that it successfully connects to the server and the server successfully compiles methods even though those "unexpected eof" messages appear at the server.

I googled about those messages and found this text that may or may be not relevant:

OpenSSL 3 (a major release) changed some behaviour compared to 1.1.1 with respect to peers that 
fail to shutdown a TLS connection cleanly.
Previously, if a peer unexpectedly shutdown a connection an OpenSSL IO function (such as SSL_read()) 
would report an error and SSL_get_error() would report SSL_ERROR_SYSCALL and errno would be 0. 
This was considered a bug in 1.1.1 (you should never get SSL_ERROR_SYSCALL but with errno as 0). 
However fixing it in 1.1.1 broke some apps. We delayed the fix until the next major version (OpenSSL 3.0).
In OpenSSL 3.0 this error is now reported from SSL_get_error() as SSL_ERROR_SSL and the 
unexpeced eof while reading error is put on the OpenSSL error stack. 
We also added a new option SSL_OP_IGNORE_UNEXPECTED_EOF which treats an unexpected EOF 
from the peer as if they had performed an orderly shutdown.

In order to further debug this issue, it would be very helpful to enable logging at the server by defining an environment variable: _JAVA_OPTIONS="-Xjit:verbose={JITServer},verbose={compilePerformance} @leochr Is there a way to inject JVM options to the Semeru Cloud Compiler pod? I tried

  semeruCloudCompiler:
    enable: true
    replicas: 1
    env:
      - name: _JAVA_OPTIONS
        value: "-Xjit:verbose"

but spec.semeruCloudCompiler.env is not defined.

mpirvu commented 6 months ago

When I create my containers with Java 21 I see

#JITServer: Successfully initialized SSL context (OpenSSL 3.0.7 1 Nov 2022)

but when I create my containers with Java 17 I see

#JITServer: Successfully initialized SSL context (OpenSSL 1.1.1k  FIPS 25 Mar 2021)

Based on the text that I pasted in the previous comment, it's very likely that the change from OpenSSL 1.1.1 to OpenSSL 3.0.7 made those "unexpected eof" messages appear. I still cannot explain why those messages do not appear when I start JITServer with yaml instead of OLO.

leochr commented 6 months ago

@mpirvu It's not possible to set env on Semeru Cloud Compiler via the OpenLibertyApplication CR. But you could set it directly on the Semeru Cloud Compiler Deployment (i.e. -semeru-compiler-1) that's created by OLO. Make sure to scale down the OLO Deployment olo-controller-manager to 0, otherwise OLO will revert the manual changes to the compiler's Deployment. Once debug is complete, scale OLO Deployment back to 1 to resume Operator reconciliation.

Please attach the yamls that were using. I'll take a look to see if I can spot any difference that could cause the behaviour change. Thank you

mpirvu commented 6 months ago

I think I understand what's causing those 'unexpected eof' messages : the server defined by OLO adds liveness and readiness probes. I believe the probes establish a connection on the 38400 port, but not a full ssl connection. I managed to replicate the errors on my JITServer deployed with yaml files by adding liveness/readiness probes like OLO does. I was puzzled by the fact that the server shows some connection errors, while the client does not show any trace of a connection problem. The probing mechanism explains the mystery.

mpirvu commented 6 months ago

Searching the net I did not find a way to send a probe on a TCP port using encryption.

There are two possible solutions here:

  1. Filter out the error message (basically going for an OpenSSL 1.1.1 solution).
  2. Add another port that will be dedicated to receiving unencrypted probes.

The second solution is used in Hazelcast. Citing from their documentation:

since Kubernetes liveness/readiness probes cannot use SSL, you need to prepare a Hazelcast configuration with a separate non-secured port opened for health checks.