canonical / istio-operators

Charmed Istio

Integration with charms for Let's encrypt certificates #379

Closed natalytvinova closed 4 months ago

natalytvinova commented 4 months ago

Bug Description

I deployed Kubeflow 1.8 and tried to integrate the istio-pilot charm with the httpreq ACME operator by following their how-to guide. Unfortunately, after adding the relation, the certificate is not created, because Let's Encrypt doesn't allow provisioning certificates for IP addresses, and the istio-pilot charm supplies the IP address in the CSR it sends to the certificate charm. This can be seen in the juju debug-log below.

It makes sense to me that the istio-pilot charm uses the IP, because it is not aware of the URL that needs to be used in our case. This URL is supplied only as a parameter to the oidc-gatekeeper and dex-auth charms in my bundle. In the same way, in Kubernetes, this service is not aware of this URL:

istio-ingressgateway                              ClusterIP      10.152.183.184   <none>                             65535/TCP                               10d
istio-ingressgateway-endpoints                    ClusterIP      None             <none>                             <none>                                  10d
istio-ingressgateway-workload                     LoadBalancer   10.152.183.89    <IP>                       80:30481/TCP,443:30282/TCP              10d
istio-pilot                                       ClusterIP      10.152.183.57    <none>                             65535/TCP                               10d

I'm not sure whether this is a bug, or whether there is a way to configure istio-pilot to use the URL we need.

To Reproduce

  1. deploy kubeflow 1.8/stable
  2. juju deploy httprequest-lego-k8s
  3. juju config httprequest-lego-k8s \ server= \ email= \ httpreq_endpoint=
  4. juju integrate istio-pilot httprequest-lego-k8s

Environment

Kubeflow 1.8/stable https://github.com/canonical/bundle-kubeflow/tree/main/releases/1.8/stable/kubeflow
Charmed Kubernetes 1.28 on Charmed OpenStack Yoga
juju 3.1

Relevant Log Output

unit-istio-pilot-0: 08:33:38 INFO juju.worker.uniter.operation ran "certificates-relation-changed" hook (via hook dispatching script: dispatch)
unit-httprequest-lego-k8s-0: 08:33:37 INFO unit.httprequest-lego-k8s/0.juju-log certificates:172: Received Certificate Creation Request for domain <IP>
unit-httprequest-lego-k8s-0: 08:33:38 ERROR unit.httprequest-lego-k8s/0.juju-log certificates:172: Exited with code 1. Stderr:
unit-httprequest-lego-k8s-0: 08:33:38 ERROR unit.httprequest-lego-k8s/0.juju-log certificates:172:     2024/02/08 08:33:37 No key found for account <customer-email>. Generating a P256 key.
unit-httprequest-lego-k8s-0: 08:33:38 ERROR unit.httprequest-lego-k8s/0.juju-log certificates:172:     2024/02/08 08:33:37 Saved key to /tmp/.lego/accounts/acme-v02.api.letsencrypt.org/<customer-email>/keys/<customer-email>.key
unit-httprequest-lego-k8s-0: 08:33:38 ERROR unit.httprequest-lego-k8s/0.juju-log certificates:172:     2024/02/08 08:33:38 [INFO] acme: Registering account for <customer-email>
unit-httprequest-lego-k8s-0: 08:33:38 ERROR unit.httprequest-lego-k8s/0.juju-log certificates:172:     2024/02/08 08:33:38 [INFO] [<IP>, istio-pilot-0.istio-pilot-endpoints.kubeflow.svc.cluster.local] acme: Obtaining bundled SAN certificate given a CSR
unit-httprequest-lego-k8s-0: 08:33:38 ERROR unit.httprequest-lego-k8s/0.juju-log certificates:172:     2024/02/08 08:33:38 Could not obtain certificates:
unit-httprequest-lego-k8s-0: 08:33:38 ERROR unit.httprequest-lego-k8s/0.juju-log certificates:172:      acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:unsupportedIdentifier :: NewOrder request included invalid non-DNS type identifier: type "ip", value "<IP>"

Additional Context

No response

syncronize-issues-to-jira[bot] commented 4 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5337.

This message was autogenerated

DnPlas commented 4 months ago

Hi @natalytvinova, thanks for filing this issue.

I have identified the following:

  1. The Istio ingress configuration is not correct. From the output that you provided, I gather that the ingress gateway service (istio-ingressgateway-workload) is not getting any IP.
istio-ingressgateway                              ClusterIP      10.152.183.184   <none>                             65535/TCP                               10d
istio-ingressgateway-endpoints                    ClusterIP      None             <none>                             <none>                                  10d
istio-ingressgateway-workload                     LoadBalancer   10.152.183.89    <IP>                       80:30481/TCP,443:30282/TCP              10d
istio-pilot                                       ClusterIP      10.152.183.57    <none>                             65535/TCP                               10d

This can be caused by the k8s node not having a LoadBalancer by default. Could you please share the networking details of your node? Also, if you are using something that is not a LB, I recommend changing the service type with the gateway_service_type config option.
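A quick way to check this kind of situation is to scan the service listing for LoadBalancer services that have no external address. The sketch below is only an illustration: it assumes the default `kubectl get svc` column layout (NAME, TYPE, CLUSTER-IP, EXTERNAL-IP, PORT(S), AGE), and the function name `services_missing_external_ip` is hypothetical, not part of any charm.

```python
def services_missing_external_ip(kubectl_output: str) -> list:
    """Return names of LoadBalancer services with no external address.

    Assumes the default `kubectl get svc` columns:
    NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
    """
    missing = []
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that does not look like a service row.
        if len(fields) < 6 or fields[0] == "NAME":
            continue
        name, svc_type, _cluster_ip, external_ip = fields[:4]
        # kubectl prints <pending> while a LoadBalancer waits for an address.
        if svc_type == "LoadBalancer" and external_ip in ("<none>", "<pending>"):
            missing.append(name)
    return missing
```

A redacted value such as `<IP>` is not flagged by this check, since it cannot be told apart from a real address without more context.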

  2. It seems that acme is complaining internally when trying to obtain certificates, with the message
...
acme: Obtaining bundled SAN certificate given a CSR
Could not obtain certificates:
...NewOrder request included invalid non-DNS type identifier: type "ip", value "<IP>"

This is happening because, whenever istio-pilot generates the CSR, we share the ingress gateway service IP. Because of (1), we are sharing an incorrect value, <IP>. We have to make sure that the ingress gateway service is correctly configured and has an IP address.

  3. I have tried to reproduce this issue locally and have found that even with a correctly configured ingress gateway service, acme still complains:
unit-httprequest-lego-k8s-0: 11:13:16 ERROR unit.httprequest-lego-k8s/0.juju-log certificates:5: Exited with code 1. Stderr:
unit-httprequest-lego-k8s-0: 11:13:16 ERROR unit.httprequest-lego-k8s/0.juju-log certificates:5:     2024/02/14 10:13:15 [INFO] [10.64.140.43, istio-pilot-0.istio-pilot-endpoints.test-tls.svc.cluster.local] acme: Obtaining bundled SAN certificate given a CSR
unit-httprequest-lego-k8s-0: 11:13:16 ERROR unit.httprequest-lego-k8s/0.juju-log certificates:5:     2024/02/14 10:13:16 Could not obtain certificates:
unit-httprequest-lego-k8s-0: 11:13:16 ERROR unit.httprequest-lego-k8s/0.juju-log certificates:5:        acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:unsupportedIdentifier :: NewOrder request included invalid non-DNS type identifier: type "ip", value "10.64.140.43"

To debug this further, I need to connect with the maintainers of this charm to understand the cause, as it happens not only with istio but also with traefik:

unit-httprequest-lego-k8s-0: 11:12:29 ERROR unit.httprequest-lego-k8s/0.juju-log certificates:4: Exited with code 1. Stderr:
unit-httprequest-lego-k8s-0: 11:12:29 ERROR unit.httprequest-lego-k8s/0.juju-log certificates:4:     2024/02/14 10:12:29 [INFO] [10.64.140.44] acme: Obtaining bundled SAN certificate given a CSR
unit-httprequest-lego-k8s-0: 11:12:29 ERROR unit.httprequest-lego-k8s/0.juju-log certificates:4:     2024/02/14 10:12:29 Could not obtain certificates:
unit-httprequest-lego-k8s-0: 11:12:29 ERROR unit.httprequest-lego-k8s/0.juju-log certificates:4:        acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:unsupportedIdentifier :: NewOrder request included invalid non-DNS type identifier: type "ip", value "10.64.140.44"

natalytvinova commented 4 months ago

@DnPlas Hi! Thank you for investigating this. So we're using a LB. This is my config:

gateway_service_type:
  default: LoadBalancer
  description: |
    Type of service for the ingress gateway out of: 'ClusterIP', 'LoadBalancer', or 'NodePort'.
  source: default
  type: string
  value: LoadBalancer

Please correct me if I'm wrong, you're saying that istio-ingressgateway service should get an ip, not the istio-ingressgateway-workload service? Because the second one does have an IP, I just redacted it.

DnPlas commented 4 months ago

Please correct me if I'm wrong, you're saying that istio-ingressgateway service should get an ip, not the istio-ingressgateway-workload service? Because the second one does have an IP, I just redacted it.

Yes, the ingressgateway-workload is the svc that should have the IP, so:

istio-ingressgateway-workload                     LoadBalancer   10.152.183.89    <IP>                       80:30481/TCP,443:30282/TCP              10d

Should have some IP instead of just <IP> and the service type should match what you have in your host.

So we're using a LB. This is my config:

gateway_service_type:
  default: LoadBalancer
  description: |
    Type of service for the ingress gateway out of: 'ClusterIP', 'LoadBalancer', or 'NodePort'.
  source: default
  type: string
  value: LoadBalancer

Right, so that is the configuration of the charm. In your node, are you positive you have configured a load balancer for your ingress (the one that sits at the edge of your k8s node/cluster)? I understand your deployment is in Charmed Kubernetes. Historically, people have just used NodePort instead of LoadBalancer and configured the istio-ingressgateway charm accordingly.

natalytvinova commented 4 months ago

@DnPlas oh, it does have the IP, sorry for the confusion. We have:

istio-ingressgateway-workload                     LoadBalancer   10.152.183.89    <IP redacted>              80:30481/TCP,443:30282/TCP              10d

That LoadBalancer does exist in OpenStack, if that is what you're asking about.

natalytvinova commented 4 months ago

@DnPlas also, as I understand how Let's Encrypt works, it can't issue a certificate for an IP address. So the CSR needs to contain the URL in order for this to work properly: https://community.letsencrypt.org/t/neworder-request-included-invalid-non-dns-type-identifier-type-ip/170623. The istio-pilot charm generates the CSR, so we need the CSR to be generated with the URL instead of the IP.

DnPlas commented 4 months ago

@DnPlas also, as I understand how Let's Encrypt works, it can't issue a certificate for an IP address. So the CSR needs to contain the URL in order for this to work properly: https://community.letsencrypt.org/t/neworder-request-included-invalid-non-dns-type-identifier-type-ip/170623. The istio-pilot charm generates the CSR, so we need the CSR to be generated with the URL instead of the IP.

Correct, but what URL should that be? Is it the name of the service, or is it something else?

natalytvinova commented 4 months ago

@DnPlas my understanding was that it should be the name of the service, but from istio-pilot's point of view there is no service name to see. And that is true, because istio-ingressgateway does not expose it, but maybe it should? It would be a good idea to confirm this with both the Telco and istio-ingressgateway teams.

DnPlas commented 4 months ago

After talking to the maintainers of the tls-interface library and of the certificate provider charms, we can confirm that:

all lego charms will only work with CSRs for domain names, and not for IPs.

This confirms what we stated in a previous comment and here.

The problem with istio-pilot at the moment is that it only shares the ingress gateway Service IP to generate the CSR. This works for most certificate providers, but not for the lego charms, which simply reject the request.
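To illustrate the distinction, here is a minimal sketch that separates IP identifiers from name identifiers in a list of subject alternative names, such as the one that appears in the lego log lines earlier in this thread. It uses only the Python standard library; the function name `split_identifiers` is hypothetical and is not part of the tls-certificates-interface library or of any charm.

```python
import ipaddress


def split_identifiers(sans):
    """Split SAN values into (ip_identifiers, name_identifiers).

    Only the identifier type is checked; whether a name is publicly
    resolvable or acceptable to a given CA is out of scope here.
    """
    ips, names = [], []
    for san in sans:
        try:
            ipaddress.ip_address(san)
        except ValueError:
            # Not parseable as IPv4/IPv6, so treat it as a DNS-style name.
            names.append(san)
        else:
            ips.append(san)
    return ips, names
```

For the CSR in the local reproduction, `split_identifiers(["10.64.140.43", "istio-pilot-0.istio-pilot-endpoints.test-tls.svc.cluster.local"])` puts `10.64.140.43` in the first list, which is the identifier named in the `unsupportedIdentifier` error.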

In order to fix this issue and be able to actually connect istio-pilot with lego charms, we need to start sharing the domain name instead of the IP. Before committing any changes, let's consider the following:

  1. How will the DNS configuration work?
  2. What DNS configuration should we have?
  3. What domain name will istio-pilot put in the CSR?
  4. Should this be a default from now on, or should we have a special case for integrating with lego charms?

@natalytvinova I will allocate a bit more time to work on this and see how we can extend support for this integration.

DnPlas commented 4 months ago

Update on the things we have to do in order to support the lego integration better.

Update the current implementation

The proposed architecture for the integration with different certificate providers is described below. We expect istio-pilot to keep using the tls-certificates-interface, but instead of just using the IP address of the istio-ingressgateway-workload Service to generate the CSR, it will now generate a cert_subject, which will be "calculated" as follows:

  1. A domain name provided by users through a configuration option. If this is in fact a domain name (not an IP), the cert_subject will be exactly this. The validity of the name and the host DNS configuration are the responsibility of the user.
  2. In the absence of a domain name, but in the presence of an IP address, we'll try to guess the domain name of that IP.
  3. In the absence of any of the above, an IP address will be used.

The istio-pilot charm will provide enough information to users in case of an invalid or missing domain-name.
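The selection order above could be sketched roughly as follows. This is not the charm's implementation: the function and parameter names (`compute_cert_subject`, `domain_name`, `gateway_ip`, `resolve`) are hypothetical, and using a reverse DNS lookup to "guess" the domain name is an assumption made for illustration only.

```python
import ipaddress
import socket
from typing import Callable, Optional


def _is_ip(value: str) -> bool:
    """Return True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def _reverse_lookup(ip: str) -> Optional[str]:
    """Assumed 'guess' strategy: reverse DNS lookup, None on failure."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def compute_cert_subject(
    domain_name: Optional[str],
    gateway_ip: Optional[str],
    resolve: Callable[[str], Optional[str]] = _reverse_lookup,
) -> Optional[str]:
    """Pick a certificate subject following the three rules above."""
    # 1. A user-provided domain name wins, as long as it is not an IP.
    if domain_name and not _is_ip(domain_name):
        return domain_name
    if gateway_ip:
        # 2. No usable domain name: try to guess one from the IP.
        guessed = resolve(gateway_ip)
        if guessed and not _is_ip(guessed):
            return guessed
        # 3. Nothing better available: fall back to the IP itself.
        return gateway_ip
    # Neither input is usable; the caller should report this to the user.
    return None
```

Passing the lookup as the `resolve` argument keeps the function testable without network access; for example, `compute_cert_subject(None, "10.64.140.43", resolve=lambda ip: None)` falls back to the IP.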

The following diagram presents a high level overview of the proposed model:

[image: high-level overview of the proposed model]

Execution plan

  1. Update istio-pilot charm with the new configuration option, and to "calculate" the cert_subject based on the conditions above.
  2. Test that the charm works with self-signed certificates and with other CAs, like Let's Encrypt. This task will be completed on an EKS cluster to be able to configure domain names.
  3. Test how this affects the ingress and authentication story. We may need to provide extra steps at deployment time.
  4. Have public documentation that states how to integrate with certificate providers and user responsibilities when connecting with CAs like Let's encrypt.

Timeline

EDIT: we are going to work on this improvement for 24.04, but the priority now will be https://github.com/canonical/istio-operators/issues/380. After discussing with @natalytvinova, we agreed that for now her deployment doesn't require an integration with a TLS certificates provider, as they already have an SSL key and cert, which can be passed to istio-pilot for configuring the Gateway accordingly.

If we are able to land these changes in main before Wednesday, that'll move the release date for 1.17 one or two days earlier.

EDIT: after discussing with the telco team about best approaches to integrate with the TLS certificate providers, we have concluded that it would be better to enable istio-pilot to get a domain_name (see image above) so it can correctly send a CSR to the tls-cert-provider charm, which in turn will handle all the logic to get a signed cert from a CA. I will explain this in more detail in a later comment.