canonical / traefik-k8s-operator

https://charmhub.io/traefik-k8s
Apache License 2.0

Unable to use Load Balancer's IP address for the ingress gateway #361

Closed Gmerold closed 1 week ago

Gmerold commented 6 months ago

Bug Description

New version of pydantic-core breaks falling back to the Load Balancer's IP for the ingress gateway when the external-hostname is not configured:

pydantic_core._pydantic_core.ValidationError: 1 validation error for IngressProviderAppData
ingress.url
  Input should be a valid URL, invalid IPv4 address [type=url_parsing, input_value='http://sdcore-nms.10.0.0.2/', input_type=str]
    For further information visit https://errors.pydantic.dev/2.6/v/url_parsing

Potential solution here could be using nip.io to pretend LB IP is a legit URL (e.g. 10.0.0.2.nip.io)

To Reproduce

https://canonical-charmed-aether-sd-core.readthedocs-hosted.com/en/stable/tutorials/getting_started/

Environment

Juju 3.4 Microk8s 1.27-strict/stable Traefik latest/stable

Relevant log output

pydantic_core._pydantic_core.ValidationError: 1 validation error for IngressProviderAppData
ingress.url
  Input should be a valid URL, invalid IPv4 address [type=url_parsing, input_value='http://sdcore-nms.10.0.0.2/', input_type=str]
    For further information visit https://errors.pydantic.dev/2.6/v/url_parsing

Additional context

No response

PietroPasotti commented 6 months ago

We think the issue is that the URL being submitted to traefik is wrong: http://sdcore-nms.10.0.0.2/ is in fact not a valid IPv4 address. pydantic deduces it's IPv4 because it ends in digits.

Is it an option to turn the address around and let it be http://10.0.0.2.sdcore-nms/ instead, which would be a valid DNS record?

Gmerold commented 6 months ago

I agree with your thinking ;) That's why I proposed using nip.io. It turns the IP into a valid URL, eliminates a need of adding entries to /etc/hosts and makes the URL feel natural (unlike http://10.0.0.2.sdcore-nms/, which kinda reverses the natural order, don't you think?).

mmkay commented 3 months ago

@Gmerold: I see that the documentation is using nip.io at the moment. Is there anything that you think we should do on the traefik side as well? Or maybe this is something we should improve in traefik's documentation?

Gmerold commented 3 months ago

Hello @mmkay, which documentation do you mean? SD-Core? We are indeed using nip.io (as an alternative to setting up a DNS server), but Traefik is still broken. I don't think it's a matter of documentation, but rather of handling the case when the external-hostname is not set and the charm falls back to the LB's IP.

lucabello commented 3 months ago

Currently, the ingress library is using AnyHttpUrl to validate the field; however, that fails.

We could solve this by either contributing a change upstream to pydantic (so that AnyHttpUrl accepts this type of url), or by writing a custom validator to accept it.

ca-scribner commented 3 months ago

I think what @PietroPasotti is getting at is that the linked doc uses https://sdcore-nms.10.0.0.4.nip.io, but this bug report used https://sdcore-nms.10.0.0.4 (which should not be valid, because a top-level domain cannot be purely numerical).

Doing some pure pydantic testing (not with traefik's lib, just pydantic itself), we can see:

from pydantic import BaseModel, AnyHttpUrl, ValidationError

class MyModel(BaseModel):
    url: AnyHttpUrl

# Will pass validation
MyModel(url="http://valid.com")  # a control
MyModel(url="http://valid.com1")  # Valid even though it ends with a number
MyModel(url="http://10.0.0.4.nip.io")
MyModel(url="http://sdcore-nms.10.0.0.4.nip.io")

# Will fail validation
try:
    MyModel(url="http://invalid url")  # a control
except ValidationError:
    pass
else:
    raise Exception("I should have failed")

try:
    # fails because last segment is entirely numeric
    MyModel(url="http://sdcore-nms.10.0.0.4")
except ValidationError:
    pass
else:
    raise Exception("I should have failed")

This feels consistent with other places too. For example, type https://sdcore-nms.10.0.0.4 in your chrome url bar and it'll automatically notice it is not a url and search on it instead.
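The rejection can be approximated without pydantic. The sketch below is a simplified take on the "ends in a number" check described in the WHATWG URL standard, which appears to be the rule at work here: a host whose last label is numeric is treated as an IPv4 address and must then parse as one. The names `host_ends_in_number` and `classify_host` are made up for illustration, and Python's `ipaddress` module is stricter than a real URL parser, so treat this as an approximation rather than what pydantic actually runs.

```python
import ipaddress


def host_ends_in_number(host: str) -> bool:
    """Return True if the last dot-separated label looks numeric.

    Simplified: only decimal digits and 0x-prefixed hex count as numbers.
    """
    labels = host.split(".")
    # A single trailing dot (as in "example.com.") is ignored.
    if labels[-1] == "" and len(labels) > 1:
        labels.pop()
    last = labels[-1]
    if last == "":
        return False
    if last.isascii() and last.isdigit():
        return True
    if last[:2].lower() == "0x":
        return all(c in "0123456789abcdefABCDEF" for c in last[2:])
    return False


def classify_host(host: str) -> str:
    """Classify a host as 'domain', 'ipv4' or 'invalid'."""
    if not host_ends_in_number(host):
        return "domain"
    try:
        ipaddress.IPv4Address(host)
    except ValueError:
        # Ends in a number, so it must be IPv4, but it does not parse as one.
        return "invalid"
    return "ipv4"


for host in ("valid.com1", "sdcore-nms.10.0.0.4.nip.io", "10.0.0.4", "sdcore-nms.10.0.0.4"):
    print(host, "->", classify_host(host))
```

Under this approximation, sdcore-nms.10.0.0.4 is the only one of the four hosts that ends in a number without being a parseable IPv4 address, which lines up with the validation results above.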

So having said all that (and having not actually looked at the traefik charm), is the missing .nip.io in the url because it was missing in the input, or did traefik strip it somewhere?

Gmerold commented 2 months ago

Hi @sed-i, Actually it's neither :) First of all, the Chrome behavior you are describing is new; Chrome used to accept https://sdcore-nms.10.0.0.4. But that's not the main problem.

The external_hostname config of the Traefik charm is optional. If you don't specify it, the LB IP will be used for building the URLs of the proxied applications. In our case, we don't have an external, publicly available URL for Traefik. We're using nip.io to keep things as simple as possible. The problem is that the default "URL" produced by Traefik (client application name + Traefik's LB IP) doesn't pass validation anymore, and that fails the deployment of the bundle. On the other hand, we can't use nip.io to set the external_hostname config before Traefik is deployed, because we don't know the LB IP (it's assigned from the pool).

That's why I'm proposing using nip.io at the charm level - to make sure that if the optional external_hostname is not set by the user, we still end up with a valid URL instead of a charm in error state.
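To make the proposal concrete, here is a minimal sketch of the fallback, assuming the charm has the configured external_hostname (possibly unset) and the load balancer IP at hand. The function name `effective_hostname` and its parameters are hypothetical and are not taken from the traefik-k8s charm code.

```python
import ipaddress
from typing import Optional


def effective_hostname(external_hostname: Optional[str], lb_ip: str) -> str:
    """Pick the hostname used to build ingress URLs.

    Hypothetical helper: prefers the user-supplied external_hostname and
    otherwise wraps the load balancer IP in a nip.io name.
    """
    if external_hostname:
        return external_hostname
    # Raises ValueError if lb_ip is not a valid IPv4 address.
    ipaddress.IPv4Address(lb_ip)
    # nip.io resolves <anything>.<ip>.nip.io to <ip>.
    return f"{lb_ip}.nip.io"


host = effective_hostname(None, "10.0.0.2")
print(f"http://sdcore-nms.{host}/")
```

With external_hostname unset and an LB IP of 10.0.0.2, the hostname becomes 10.0.0.2.nip.io, so the per-application URL no longer ends in a numeric label.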

gruyaume commented 2 months ago

Can this issue be prioritised? Every one of our charmed 5G deployments is affected by it. In addition, our tutorials and documentation look bad, as we're having to reference this issue and let users know that it's expected for traefik to be in an error state.

Reference:

Model      Controller                  Cloud/Region                Version  SLA          Timestamp
private5g  microk8s-classic-localhost  microk8s-classic/localhost  3.4.5    unsupported  08:08:50Z

App                       Version  Status   Scale  Charm                     Channel        Rev  Address         Exposed  Message
amf                       1.4.4    active       1  sdcore-amf-k8s            1.5/edge       707  10.152.183.176  no       
ausf                      1.4.2    active       1  sdcore-ausf-k8s           1.5/edge       520  10.152.183.65   no       
grafana-agent             0.32.1   waiting      1  grafana-agent-k8s         latest/stable   45  10.152.183.221  no       installing agent
mongodb                            active       1  mongodb-k8s               6/beta          38  10.152.183.92   no       Primary
nms                       1.0.0    active       1  sdcore-nms-k8s            1.5/edge       580  10.152.183.141  no       
nrf                       1.4.1    active       1  sdcore-nrf-k8s            1.5/edge       580  10.152.183.130  no       
nssf                      1.4.1    active       1  sdcore-nssf-k8s           1.5/edge       462  10.152.183.62   no       
pcf                       1.4.3    active       1  sdcore-pcf-k8s            1.5/edge       512  10.152.183.144  no       
router                             active       1  sdcore-router-k8s         1.5/edge       341  10.152.183.218  no       
self-signed-certificates           active       1  self-signed-certificates  latest/stable  155  10.152.183.33   no       
smf                       1.5.2    active       1  sdcore-smf-k8s            1.5/edge       590  10.152.183.64   no       
traefik                   v2.11.0  waiting      1  traefik-k8s               latest/stable  194  10.152.183.198  no       installing agent
udm                       1.4.3    active       1  sdcore-udm-k8s            1.5/edge       489  10.152.183.31   no       
udr                       1.4.1    active       1  sdcore-udr-k8s            1.5/edge       486  10.152.183.82   no       
upf                       1.4.0    active       1  sdcore-upf-k8s            1.5/edge       591  10.152.183.164  no       

Unit                         Workload  Agent  Address      Ports  Message
amf/0*                       active    idle   10.1.10.181         
ausf/0*                      active    idle   10.1.10.186         
grafana-agent/0*             blocked   idle   10.1.10.133         grafana-cloud-config: off, logging-consumer: off
mongodb/0*                   active    idle   10.1.10.155         Primary
nms/0*                       active    idle   10.1.10.174         
nrf/0*                       active    idle   10.1.10.151         
nssf/0*                      active    idle   10.1.10.136         
pcf/0*                       active    idle   10.1.10.146         
router/0*                    active    idle   10.1.10.145         
self-signed-certificates/0*  active    idle   10.1.10.141         
smf/0*                       active    idle   10.1.10.154         
traefik/0*                   error     idle   10.1.10.160         hook failed: "ingress-relation-changed"
udm/0*                       active    idle   10.1.10.187         
udr/0*                       active    idle   10.1.10.176         
upf/0*                       active    idle   10.1.10.169

simskij commented 1 month ago

@dstathis can you please make sure this is included in the pulse that starts on Monday? Thanks.

dstathis commented 1 month ago

Yup no problem

ca-scribner commented 1 month ago

I think the issue here is just misconfiguration. Traefik has two routing_modes: path and subdomain.

If you're using the loadbalancer IP as the domain, then subdomain really isn't valid (since mymodel.myapp.1.2.3.4 isn't a valid domain, based on the above conversation). Feels like path is the only valid config here.

Is there a reason why path wouldn't work here? That seems like the easy fix: it can be implemented user-side and carries none of the side-effect risk of adding .nip.io.
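The difference between the two modes can be sketched as follows. The subdomain shape matches the URL in the error log above; the path shape is an assumption about how the charm lays out path-routed URLs, and `ingress_url` is a hypothetical helper, not part of the charm or its library.

```python
def ingress_url(routing_mode: str, host: str, prefix: str, scheme: str = "http") -> str:
    """Build the URL handed to a proxied app (illustrative only)."""
    if routing_mode == "path":
        # The app is distinguished by a path under the shared host.
        return f"{scheme}://{host}/{prefix}"
    if routing_mode == "subdomain":
        # The app is distinguished by a label prepended to the host.
        return f"{scheme}://{prefix}.{host}/"
    raise ValueError(f"unknown routing_mode: {routing_mode!r}")


print(ingress_url("path", "10.0.0.2", "sdcore-nms"))
print(ingress_url("subdomain", "10.0.0.2", "sdcore-nms"))
```

With a bare IP as the host, path mode keeps the host a plain IPv4 address, while subdomain mode prepends a label to it and produces the host that fails validation.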

Gmerold commented 3 weeks ago

This kinda reminds me of a story about my buddy. He used to have a car with a broken gearbox; only second and fourth gear would work. One day I had to drive this car and obviously I wanted to start in first gear. After I struggled for a short while, my buddy told me to use second gear instead. After starting in second, I had to push the RPMs really high to be able to change directly to fourth, because third wouldn't work either. When I asked him about fixing the gearbox, he was like "nah, two of them still work".

Traefik has two routing modes and it should be the user's decision which one to use. If a correct charm configuration produces incorrect output, it is a problem in the charm. If you're afraid of the side effects of using .nip.io, the alternative approach could be making the charm require external_hostname when subdomain is used.

ca-scribner commented 3 weeks ago

Yes agreed, the root issue here is that if subdomain is used, then we need to require an external_hostname to be configured. I'm working to implement that constraint now. In future, expect that this charm will (more gracefully) block someone from using IP+subdomain
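The constraint itself is small. A minimal sketch, assuming only the two config options discussed in this thread (`check_routing_config` is a hypothetical name and this is not the charm's actual implementation):

```python
from typing import Optional


def check_routing_config(routing_mode: str, external_hostname: Optional[str]) -> Optional[str]:
    """Return a description of the problem, or None if the combination is fine."""
    if routing_mode == "subdomain" and not external_hostname:
        return "routing_mode=subdomain requires external_hostname to be set"
    return None


print(check_routing_config("subdomain", None))
print(check_routing_config("path", None))
```

Returning a message instead of raising leaves it to the caller to decide whether to log a warning or block on the result.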

ca-scribner commented 2 weeks ago

420 adds a fix for this, in that we now clearly state that this charm should not be deployed with routing_mode=subdomain and an unset external_hostname. That's added to the config descriptions, and there are some warning messages that'll appear if this comes up.

420 stops short of actually putting the charm into BlockedStatus and forcing a user to avoid this setting combination. tl/dr: the current architecture of the charm makes actually blocking on bad config difficult. There's a near-term plan (definitely this sprint, probably in the next month or two) to refactor the charm entirely and hopefully address this better, but for now we get just the warnings.