CLI should not use non-public Cilium APIs

joestringer commented 1 year ago

The following code uses a non-public API from https://github.com/cilium/cilium in order to extract configurations out from an individual agent:

https://github.com/cilium/cilium-cli/blob/e97bddd0e98139d70416371c50e3740625e9e4d7/connectivity/check/features.go#L274

There are a few issues with this:

This is the internal structure used by Cilium and is subject to change without warning
Over time, Cilium will migrate away from including all daemon configurations in this structure towards per-Cell configuration, so it will not contain all configurations.
This ties the Cilium CLI to a specific version of the internal agent code, which may cause compatibility issues between multiple versions.
- https://github.com/cilium/cilium/pull/27352 points out an example where a new flag was introduced during cilium development, then cilium-cli was synced to cilium, then the flag was changed. This caused connectivity failures because the type was made invalid: connectivity test failed: unmarshaling cilium runtime config json: json: cannot unmarshal bool into Go struct field DaemonConfig.HubbleRedact of type []string

This code should probably unmarshal the configuration into a more liberal type to account for differences between Cilium versions.

joestringer commented 1 year ago

Would it be reasonable to do this autodetection from the helm configuration level or using cilium-config ConfigMap rather than executing into a Cilium pod to assess desired state from the agent?

michi-covalent commented 1 year ago

yeah i was thinking we should just look at cilium-config config map 💡

tklauser commented 1 year ago

FWIW, the PR adding the runtime feature detection (https://github.com/cilium/cilium-cli/pull/1022) has some context on why this is needed and why the information cannot easily be gathered from the ConfigMap:

Previously, we have been relying on the Cilium ConfigMap to detect the presence of features. Relying on the ConfigMap has however some downsides, as one has to know the default value of a certain feature if the option is absent from the ConfigMap. To make things worse, this default value is also version dependent in some cases.

/cc @gandro

michi-covalent commented 1 year ago

that's a good point. i guess helm values will have the same issue? 💭

tklauser commented 1 year ago

Yeah, I think we'll have the same issue with helm values. IIRC we don't specify defaults for them (in all cases).

michi-covalent commented 1 year ago

struggle

3u13r commented 1 year ago

There was also an issue with the default used for the tunneling feature, which tried to use the config-map value if available (https://github.com/cilium/cilium-cli/pull/1870/commits/da5c0c9eb9cff9e8baebabb0d9f16acd371c38db).

Therefore I tried to document the current guidelines in the code. I mostly copied the descriptions from the commits, bit I thought it makes sense to also have them in code (https://github.com/cilium/cilium-cli/pull/1870/commits/7c54c5fc7b20b26e4bc434a7b58635161d929b30).

gandro commented 1 year ago

I agree that relying on the agent-runtime-config.json is a hack. But as already mentioned, we want to know the runtime configuration of the agent, not what has been configured via Helm or ConfigMap, since Cilium does a lot of auto-detection or runtime-dependent configuration of certain features. I opted to use the JSON file because it was already there and worked with existing versions of Cilium (something that the CLI also needs to consider).

I think one way answer here is to start exposing per-cell configurations in the Cilium API. We already have an API endpoint that exposes a subset the global runtime daemon configuration https://github.com/cilium/cilium/blob/95e25bfb132750e97304dfbe0f5770f4b6debd96/api/v1/openapi.yaml#L74-L85

The /config endpoint is lacking per-cell configurations, but that can be solved by having cells exposing their configuration via API too. It could even be semi-automated, i.e. have a registry where cells register themselves as "having some configuration" which then automatically exposes that state via an API endpoint.

joestringer commented 1 year ago

I wonder if we could put some infrastructure in place so that those configuration API endpoints are somehow tied to feature stability in the code. For instance, in the case from the issue description, the feature is brand new, so could be considered alpha. Maybe for alpha options we don't expose it in the API because it's subject to change. Then we wouldn't trigger this overall issue in the first place because Cilium-CLI wouldn't be relying on an alpha setting having a specific type or format. For other settings, we could graduate the features themselves to beta/GA and somewhere later in that process we start exposing the flags as a public API from Cilium.

github-actions[bot] commented 3 days ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

cilium / cilium-cli

CLI should not use non-public Cilium APIs #1896