RADAR-base / RADAR-Kubernetes

Kubernetes deployment of RADAR-base
Apache License 2.0

Management portal expects secret #201

Closed: 2bPro closed this issue 1 year ago

2bPro commented 2 years ago

Describe the bug: Management Portal service crashing because of a missing secret.

Error: unable to build kubernetes objects from release manifest: error validating "": error validating data: unknown object type "nil" in Secret.data.keystore.p12

The installation documentation gives the impression that secrets and passwords are configured inside the etc/production.yaml file, but there is also a mention of a missing /secrets directory and an instruction "To create an encrypted password string and put it inside kube_prometheus_stack.nginx_auth variable.", which is confusing.

To Reproduce: no specific steps provided.

Expected behavior: The installation documentation specifies that secrets other than those in the etc/production.yaml file are required for the Management Portal and gives an example of how to set one up. The Management Portal service starts.

Version of Helm and Kubernetes:

version.BuildInfo{Version:"v3.9.0", GitCommit:"7ceeda6c585217a19a1131663d8cd1f7d641b2a7", GitTreeState:"clean", GoVersion:"go1.17.5"}
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.8+k3s2", GitCommit:"fe3cecc219175ea85d7a95ed9e44349d94734bc7", GitTreeState:"clean", BuildDate:"2022-07-06T20:35:20Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.8+k3s2", GitCommit:"fe3cecc219175ea85d7a95ed9e44349d94734bc7", GitTreeState:"clean", BuildDate:"2022-07-06T20:35:20Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/amd64"}

Additional context: One-node dev install.

blootsvoets commented 2 years ago

I can revisit the README regarding the secrets directory. The keystore.p12 needs to be generated with bin/keystore-init and loaded in your etc/production.yaml.gotmpl file. There are two examples for that in the file.
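
A minimal sketch of that workflow, assuming the default repository layout (the grep just locates the commented examples to adapt):

bin/keystore-init                             # generates etc/management-portal/keystore.p12
grep -n keystore etc/production.yaml.gotmpl   # find the example entries to uncomment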

2bPro commented 2 years ago

Aaah, got it. Yes, it would be great if that could be clarified in the README. Thanks. I'm now getting:

in helmfile.d/10-managementportal.yaml: error during 10-managementportal.yaml.part.1 parsing: template: stringTemplate:27:23: executing "stringTemplate" at <.Values.management_portal._chart_version>: can't evaluate field _chart_version in type interface {}

Any idea what could be causing this?

blootsvoets commented 2 years ago

This can happen if you specify

management_portal:
# nothing or just comments

This makes management_portal a nil value, which overrides any values given to it in other values files.
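
A minimal illustration of that YAML behaviour (standard YAML/helmfile value-merging semantics, not RADAR-specific):

management_portal:      # nothing under the key parses as null and erases merged defaults

management_portal: {}   # an empty mapping merges cleanly, keeping defaults from other values files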

2bPro commented 2 years ago

Thanks for the quick replies. Is that in the etc/production.yaml.gotmpl? Because I just followed the comments and have:

management_portal:
  {{/* keystore: {{ readFile "../etc/management-portal/keystore.p12" | b64enc | quote }} */}}

I double-checked that the path is correct and the key is there, and both seem to be OK.

2bPro commented 2 years ago

Oh, I thought I'd try a fresh re-install, and interestingly helmfile destroy doesn't work now either. It gives me a similar error but for oauth_clients:

$ helmfile destroy
Adding repo radar https://radar-base.github.io/radar-helm-charts
"radar" has been added to your repositories

Listing releases matching ^velero$
in helmfile.d/30-push-endpoint.yaml: error during 30-push-endpoint.yaml.part.1 parsing: template: stringTemplate:36:26: executing "stringTemplate" at <.Values.management_portal.oauth_clients>: nil pointer evaluating interface {}.oauth_clients

keyvaann commented 2 years ago

@2bPro The {{/* and */}} markers are also part of the comment; remove them so your file looks like this:

management_portal:
  keystore: {{ readFile "../etc/management-portal/keystore.p12" | b64enc | quote }}

2bPro commented 2 years ago

Bah, I had a feeling I was doing something stupid. Many thanks to both of you!

2bPro commented 2 years ago

Sorry to open this again, but while the service started, it's not stable. It has restarted close to 200 times since yesterday and is currently stuck in a CrashLoopBackOff state. Here are the pod logs:

INFO 1 --- [main] com.hazelcast.core.LifecycleService: [10.42.0.159]:5701 [dev] [3.12.10] [10.42.0.159]:5701 is STARTED
INFO 1 --- [main] c.h.h.HazelcastCacheRegionFactory: Starting up HazelcastCacheRegionFactory
INFO 1 --- [main] c.h.h.instance.HazelcastInstanceFactory: Using existing HazelcastInstance [ManagementPortal].
INFO 1 --- [main] s.j.ManagementPortalOauthKeyStoreHandler: Using Management Portal base-url http://localhost:8080/managementportal
WARN 1 --- [main] s.j.ManagementPortalOauthKeyStoreHandler: JWT key store class path resource [config/keystore.p12] does not contain private key pair for alias radarbase-managementportal-ec
WARN 1 --- [main] s.j.ManagementPortalOauthKeyStoreHandler: JWT key store class path resource [config/keystore.p12] does not contain private key pair for alias radarbase-managementportal-ec
INFO 1 --- [main] o.r.management.config.WebConfigurer: Web application configuration, using profiles: prod
INFO 1 --- [main] o.r.management.config.WebConfigurer: Web application fully configured
WARN 1 --- [main] s.j.ManagementPortalOauthKeyStoreHandler: JWT key store class path resource [config/keystore.p12] does not contain private key pair for alias radarbase-managementportal-ec
WARN 1 --- [main] ConfigServletWebServerApplicationContext: Exception encountered during context initialization - cancelling refresh attempt: org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'OAuth2ServerConfiguration.ResourceServerConfiguration': Unsatisfied dependency expressed through field 'tokenStore'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'tokenStore' defined in class path resource [org/radarbase/management/config/OAuth2ServerConfiguration$AuthorizationServerConfiguration.class]: Bean instantiation via factory method failed; nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [org.springframework.security.oauth2.provider.token.TokenStore]: Factory method 'tokenStore' threw exception; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'accessTokenConverter' defined in class path resource [org/radarbase/management/config/OAuth2ServerConfiguration$AuthorizationServerConfiguration.class]: Bean instantiation via factory method failed; nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [org.radarbase.management.security.jwt.ManagementPortalJwtAccessTokenConverter]: Factory method 'accessTokenConverter' threw exception; nested exception is java.lang.IllegalArgumentException: Cannot load JWT signing key radarbase-managementportal-ec from JWT key store.
INFO 1 --- [main] c.h.h.HazelcastCacheRegionFactory: Shutting down HazelcastCacheRegionFactory
WARN 1 --- [main] c.h.h.instance.HazelcastInstanceFactory: hibernate.cache.hazelcast.shutdown_on_session_factory_close property is set to 'false'. Leaving current HazelcastInstance active! (Warning: Do not disable Hazelcast hazelcast.shutdownhook.enabled property!)
INFO 1 --- [main] com.hazelcast.core.LifecycleService: [10.42.0.159]:5701 [dev] [3.12.10] [10.42.0.159]:5701 is SHUTTING_DOWN
INFO 1 --- [main] com.hazelcast.instance.Node: [10.42.0.159]:5701 [dev] [3.12.10] Shutting down multicast service...
INFO 1 --- [main] com.hazelcast.instance.Node: [10.42.0.159]:5701 [dev] [3.12.10] Shutting down connection manager...
INFO 1 --- [main] com.hazelcast.instance.Node: [10.42.0.159]:5701 [dev] [3.12.10] Shutting down node engine...
INFO 1 --- [main] com.hazelcast.instance.NodeExtension: [10.42.0.159]:5701 [dev] [3.12.10] Destroying node NodeExtension.
INFO 1 --- [main] com.hazelcast.instance.Node: [10.42.0.159]:5701 [dev] [3.12.10] Hazelcast Shutdown is completed in 17 ms.
INFO 1 --- [main] com.hazelcast.core.LifecycleService: [10.42.0.159]:5701 [dev] [3.12.10] [10.42.0.159]:5701 is SHUTDOWN
INFO 1 --- [main] o.r.m.config.CacheConfiguration: Closing Cache Manager
ERROR 1 --- [main] o.s.boot.SpringApplication: Application run failed

I restarted the pod, but with no success.

keyvaann commented 2 years ago

[org/radarbase/management/config/OAuth2ServerConfiguration$AuthorizationServerConfiguration.class]: Bean instantiation via factory method failed; nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [org.radarbase.management.security.jwt.ManagementPortalJwtAccessTokenConverter]: Factory method 'accessTokenConverter' threw exception; nested exception is java.lang.IllegalArgumentException: Cannot load JWT signing key radarbase-managementportal-ec from JWT key store.

Looks like the keystore.p12 file isn't correctly loaded.
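
One quick way to check what actually reached the cluster (a sketch: the secret name matches the chart output shown further down in this thread, while the namespace and the storepass radarbase default are assumptions):

kubectl get secret management-portal-keystore -o jsonpath='{.data.keystore\.p12}' | base64 -d > /tmp/keystore.p12
keytool -list -keystore /tmp/keystore.p12 -storepass radarbase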

2bPro commented 2 years ago

Thanks for your reply. I tried lots of things: making sure the reference to the key in etc/production.yaml.gotmpl is as you suggested, restarting the pod, completely destroying and cleaning everything and re-installing, re-generating the key... no luck. I don't know what's going on, but the above error seems to alternate with the following on each pod crash and restart:

INFO 1 --- [main] o.r.management.ManagementPortalApp: The following profiles are active: prod,swagger
WARN 1 --- [main] o.s.boot.actuate.endpoint.EndpointId: Endpoint ID 'hystrix.stream' contains invalid characters, please migrate to a valid format.
WARN 1 --- [main] c.n.c.sources.URLConfigurationSource: No URLs will be polled as dynamic configuration sources.
INFO 1 --- [main] c.n.c.sources.URLConfigurationSource: To enable URLs as dynamic configuration sources, define System property archaius.configurationSource.additionalUrls or make config.properties available on classpath.
INFO 1 --- [main] c.netflix.config.DynamicPropertyFactory: DynamicPropertyFactory is initialized with configuration sources: com.netflix.config.ConcurrentCompositeConfiguration@33ecbd6c
DEBUG 1 --- [main] i.g.j.c.liquibase.AsyncSpringLiquibase: Starting Liquibase synchronously
WARN 1 --- [l-1 housekeeper] com.zaxxer.hikari.pool.ProxyLeakTask: Connection leak detection triggered for org.postgresql.jdbc.PgConnection@288f173f on thread main, stack trace follows

java.lang.Exception: Apparent connection leak detected
...

Any ideas?

ThomasKassiotis commented 2 years ago

Got the same issue. Something wrong with the keystore.p12 file.

[screenshot: ManagementPortal logs showing that the private key pair for alias radarbase-managementportal-ec2022-07-27 cannot be found]

nivemaham commented 2 years ago

@ThomasKassiotis I don't think what you have is the same as what @2bPro is facing, though both may be on ManagementPortal or related. Your logs from ManagementPortal say that it can't find the private key pair for alias radarbase-managementportal-ec2022-07-27, so there is something wrong with the keystore that was generated. The keystore should contain keys with the aliases radarbase-managementportal-ec and selfsigned; see keystore-init. I'm not sure why the date is appended to the alias; this could be the bug. @ThomasKassiotis, can you try removing the keystore, creating the keystore file again, and then restarting ManagementPortal?

@2bPro I can't really locate the issue with the last logs you have shared. Can you share a longer stack trace?

Can you both share your Java version, and the version of keytool if you can find it?

2bPro commented 2 years ago

@nivemaham, I did a fresh install from the internal-chart-version branch, but I'm seeing the same issue. By "fresh install" I mean I destroyed and cleaned everything on k8s, re-cloned the repo, re-set up the configuration, and re-generated the keystore file. You can find the longer stack trace attached, including which pods are currently up and running and a look at the Management Portal pod logs:

radark8s_stacktrace.txt

The Java version:

$ java -version
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)

I can't seem to find the version of keytool itself, though.
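
keytool doesn't report its own version on these releases; a workaround sketch is to resolve which JDK the binary belongs to (the example path is a typical Debian/Ubuntu layout, an assumption):

which keytool
readlink -f "$(which keytool)"   # e.g. /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/keytool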

blootsvoets commented 2 years ago

I have fixed the issue on the app-config-frontend shown in your stack trace in commit 3ae49cba4dd6af9b92cce66178b53c1c2c4f1559.

You are still somehow missing the keystore. When you run

helmfile -f helmfile.d/10-managementportal.yaml --selector name=management-portal template

you should have an entry

# Source: management-portal/templates/secrets-keystore.yaml
apiVersion: v1
kind: Secret
metadata:
  name: management-portal-keystore
  labels:
    app: management-portal
    chart: management-portal-0.2.5
    release: "management-portal"
    heritage: "Helm"
type: Opaque
data:
  keystore.p12: <approximately 7000 characters>

If you substitute the value of keystore.p12 from the YAML above for CHARACTERS in the commands below:

base64 -d <<< CHARACTERS > keystore.p12
diff keystore.p12 etc/management-portal/keystore.p12

then you should get no output. If it says the binary files differ, then the keystore file is not being loaded properly.

When running

keytool -list -keystore etc/management-portal/keystore.p12 -storepass radarbase

you should get the following output:

Keystore type: PKCS12
Keystore provider: SUN

Your keystore contains 2 entries

radarbase-managementportal-ec, 13 May 2019, PrivateKeyEntry, 
Certificate fingerprint (SHA-256): 
<fingerprint>
selfsigned, 13 May 2019, PrivateKeyEntry, 
Certificate fingerprint (SHA-256):
<fingerprint>

2bPro commented 2 years ago

Thanks for the quick reply. The diff comes back empty but the output of the last command is different:

Keystore type: PKCS12
Keystore provider: SUN

Your keystore contains 1 entry

selfsigned, Aug 16, 2022, PrivateKeyEntry, 
Certificate fingerprint (SHA-256): 
<fingerprint>

blootsvoets commented 2 years ago

I'm wondering whether it could be because of some whitespace or overriding issue. I've updated the keystore-init script in ef8c7e95a93e41c695d489fac35be3b62d82c97a and 76eccfc. Could you please remove etc/management-portal/keystore.p12 and try bin/keystore-init again with these updates? If you provide a DNAME as follows:

DNAME="CN=<your name>,O=<your organization>,L=<your city>,C=<2 letter country code>" bin/keystore-init 

you don't have to go through keytool's interactive prompts for these values. For the full DNAME syntax, see https://docs.oracle.com/javase/8/docs/technotes/tools/windows/keytool.html#CHDHBFGJ.

2bPro commented 2 years ago

From 73e7f91 I get this:

$ DNAME="CN=Test,O=Test,L=Test,C=TS" bin/keystore-init
--> Generating keystore to hold EC keypair for JWT signing
Illegal option:  -groupname
keytool -genkeypair [OPTION]...

Generates a key pair

Options:

 -alias <alias>                  alias name of the entry to process
 -keyalg <keyalg>                key algorithm name
 -keysize <keysize>              key bit size
 -sigalg <sigalg>                signature algorithm name
 -destalias <destalias>          destination alias
 -dname <dname>                  distinguished name
 -startdate <startdate>          certificate validity start date/time
 -ext <value>                    X.509 extension
 -validity <valDays>             validity number of days
 -keypass <arg>                  key password
 -keystore <keystore>            keystore name
 -storepass <arg>                keystore password
 -storetype <storetype>          keystore type
 -providername <providername>    provider name
 -providerclass <providerclass>  provider class name
 -providerarg <arg>              provider argument
 -providerpath <pathlist>        provider classpath
 -v                              verbose output
 -protected                      password through protected mechanism

Use "keytool -help" for all available commands

--> Generating keystore to hold RSA keypair for JWT signing

FAILED TO CREATE ECDSA KEY radarbase-managementportal-ec in etc/management-portal/keystore.p12. Please try again.

blootsvoets commented 2 years ago

It looks like your keytool is still from Java 7, even though your Java version is 8. I've changed the script in d2d7ac8 to allow for that as well. Could you please try again?
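
For context, a sketch of what such a fallback can look like; this illustrates the idea only and is not the actual keystore-init code (the validity, key password, and curve values are assumptions):

# Newer keytool selects the EC curve with -groupname; older keytool versions
# reject that option, so fall back to -keysize when the first attempt fails.
keytool -genkeypair -alias radarbase-managementportal-ec -keyalg EC \
    -groupname secp256r1 -dname "$DNAME" -validity 3650 \
    -keystore etc/management-portal/keystore.p12 -storetype PKCS12 \
    -storepass radarbase -keypass radarbase \
  || keytool -genkeypair -alias radarbase-managementportal-ec -keyalg EC \
    -keysize 256 -dname "$DNAME" -validity 3650 \
    -keystore etc/management-portal/keystore.p12 -storetype PKCS12 \
    -storepass radarbase -keypass radarbase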

2bPro commented 2 years ago

I tried this, but now I can't even get as far as the management portal pod being installed. I now get this:

Error: Ingress.extensions "kube-prometheus-stack-grafana" is invalid: annotations.kubernetes.io/ingress.class: Invalid value: "nginx": can not be set when the class field is also set

I updated the Helm charts but still got the error, so I thought maybe the latest commits caused it. I made a couple of reverts to older commits, keeping the same key file generated with the updated script, but that doesn't work either. Should I open a new issue for this?

blootsvoets commented 2 years ago

I can fix that error. For now, can you install management-portal with

helmfile -f helmfile.d/10-managementportal.yaml apply

blootsvoets commented 2 years ago

I've updated the keystore-init script again in c5bdba5, since the current version seems to have decreased the validity to only 90 days.

2bPro commented 2 years ago

The management portal is now running. Do you know by any chance how I can test it? I tried curling its external address, did pod port-forwarding, and entered the container and curled localhost, but all I get is 404.

blootsvoets commented 2 years ago

You should be able to access it via https://myhost/managementportal/

2bPro commented 2 years ago

Hmm, getting 502 now. I don't have a certificate, so I just tried http://myhost/managementportal/. I thought the port might be inaccessible from the outside, but I also tried http://localhost/managementportal from inside the EC2 instance and got the same 502 response. I added 8080 to the EC2 inbound rules but, as suspected, it didn't make a difference. Any idea why this might happen?

blootsvoets commented 2 years ago

cert-manager should be creating an HTTPS certificate for you if the host is reachable from the internet. I'm not sure how to proceed either; apparently nginx cannot successfully connect to managementportal. kubectl get pods should show ManagementPortal as 1/1 ready. Is that the case?
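
If the pod is ready, the next things to check are the service endpoints and the nginx logs; a rough sketch (the service name and the ingress-nginx label are assumptions for a default install):

kubectl get endpoints management-portal                          # should list the pod IP on port 8080
kubectl logs -l app.kubernetes.io/name=ingress-nginx --tail=50   # look for upstream connect errors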

2bPro commented 2 years ago

That's correct: the pod is up and running (1/1 ready), hasn't restarted or gotten stuck in a crash loop, and the logs say it's up.

WalkerWalker commented 1 year ago

WARN 1 --- [l-1 housekeeper] com.zaxxer.hikari.pool.ProxyLeakTask: Connection leak detection triggered for org.postgresql.jdbc.PgConnection@288f173f on thread main, stack trace follows

java.lang.Exception: Apparent connection leak detected

The management portal ran stably for quite a long time (more than 210 days) and now suddenly gets this "connection leak detected" error, exactly as shown here. I tried to reinstall with

helmfile sync --concurrency 1

and it went well, but the error remains the same in the new management_portal pod. Yet the discussion following this error report (for example, the error trace posted later) is no longer about this error. I hope to get some help.

My Java version, if that helps:

openjdk 11.0.16 2022-07-19
OpenJDK Runtime Environment (build 11.0.16+8-post-Debian-1deb11u1)
OpenJDK 64-Bit Server VM (build 11.0.16+8-post-Debian-1deb11u1, mixed mode, sharing)

I also posted for help in Slack: https://radardevelopment.slack.com/archives/C021AGGESC9/p1685913110854589

Thank you in advance.