OpenCHAMI / deployment-recipes


[DEV] Add CoreDHCP to Helm #86

Open rainest opened 2 weeks ago

rainest commented 2 weeks ago

Short Description

Remove the existing dnsmasq Deployment from the chart and replace it with CoreDHCP, for https://github.com/OpenCHAMI/roadmap/issues/50

Definition of Done

Additional context

Ref https://github.com/OpenCHAMI/deployment-recipes/pull/78 and https://github.com/OpenCHAMI/deployment-recipes/pull/84 for the equivalent work on the Podman side.

rainest commented 2 weeks ago

I'm not sure how much we considered the needs of the DHCP server for the original dnsmasq Deployment. It was running, but I don't know if we ever had a proof of concept of anything actually talking to it.

The DHCP server will handle requests from hosts outside the Kubernetes network. Normal DHCP broadcast delivery won't reach it there, so we'd need to forward that traffic to it (e.g. via a relay or forwarding rules on the node network).

This is apparently how CSM handles DHCP as well: it runs a Kea DHCP instance exposed through a LoadBalancer Service, with MetalLB BGP peering to the node networks and forwarding rules pointing at the LB address on the node network (see CSM's PXE troubleshooting and DHCP troubleshooting guides).

I'm unsure where all of that gets configured for full CSM setups, but I found a basic example of the minimal configuration needed for such a setup.
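For illustration only, the MetalLB side of such a setup could look roughly like the following; every address, ASN, and name here is a placeholder rather than anything taken from CSM or an actual OpenCHAMI deployment:

```yaml
# Hypothetical MetalLB config: advertise a single LB address for DHCP over BGP.
# All addresses, ASNs, and names below are placeholders.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: dhcp-pool
  namespace: metallb-system
spec:
  addresses:
    - 10.0.0.100/32        # the address the DHCP Service will claim
---
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: node-network-router
  namespace: metallb-system
spec:
  myASN: 64500
  peerASN: 64501
  peerAddress: 10.0.0.1    # router on the node network
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: dhcp-advertisement
  namespace: metallb-system
spec:
  ipAddressPools:
    - dhcp-pool
```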

I don't think there's any way to populate the server_id or router config dynamically, or that dynamic handling would even be desirable. AFAIK these will need to be config parameters that we simply trust you've set to the correct values. The v4 server ID needs to match the Service's spec.loadBalancerIP; I don't think any other in-cluster config cares about the v6 ID.
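Concretely, the only hard requirement is that the address on the chart's Service and the address handed to CoreDHCP's server_id plugin agree. A rough sketch with placeholder addresses, not a claim about the chart's actual templates:

```yaml
# Kubernetes Service exposing CoreDHCP (placeholder address).
apiVersion: v1
kind: Service
metadata:
  name: coredhcp
spec:
  type: LoadBalancer
  loadBalancerIP: 10.0.0.100   # must match server_id in the CoreDHCP config
  selector:
    app: coredhcp
  ports:
    - name: dhcpv4
      protocol: UDP
      port: 67
      targetPort: 67
```

```yaml
# CoreDHCP config fragment following the upstream example layout;
# server_id and router are the values we'd have to trust the user to set.
server4:
  listen:
    - "0.0.0.0:67"
  plugins:
    - server_id: 10.0.0.100    # same as spec.loadBalancerIP above
    - router: 10.0.0.1         # gateway on the hardware/node network
    - lease_time: 3600s
```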


Oddly, we already had a tftpd key, but it wasn't used in any of the templates. It was added alongside dnsmasq and the dnsmasq built-in TFTP server configuration.

rainest commented 2 weeks ago

https://github.com/OpenCHAMI/deployment-recipes/pull/87 provides a basic "it runs!" set of values and templates, with some caveats (quoted in the replies below).

synackd commented 2 weeks ago

> The coresmd image is currently busted after some possibly incomplete file rearrangement upstream. The /coredhcp path in the image is a directory with a README.md; it's apparently supposed to get replaced with a binary built from some templated Go. I hulk smashed the (working) release binary from the repo into a local image build instead.

Does the latest version, v0.0.5, work for you? I examined it, and /coredhcp is a binary in that version.

alexlovelltroy commented 2 weeks ago

> SMD in the chart does not appear to serve TLS. I stuffed a fake cert into the SMD plugin config. The plugin appears to have connected over HTTP fine (it logged `level=debug msg="EthernetInterfaces: map[]" prefix="plugins/coresmd"` for my empty SMD instance with no errors).

At this point we have not enabled TLS at the SMD level; we rely on the API gateway for TLS termination and on signed tokens for authN/authZ. Having said that, we do have the ACME pieces running, so we could create and rotate TLS certificates for the microservices using that, or we could protect them with an mTLS service mesh. This matters more for k8s deployments than it does for our Podman deployments.
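For example, assuming those ACME pieces were wired up through cert-manager (an assumption on my part, not a statement about what we run today), a per-service certificate could be as simple as:

```yaml
# Hypothetical cert-manager Certificate for SMD;
# issuer name and DNS names are placeholders.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: smd-tls
spec:
  secretName: smd-tls
  dnsNames:
    - smd.openchami.svc.cluster.local
  issuerRef:
    name: openchami-acme-issuer   # placeholder issuer
    kind: ClusterIssuer
```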

Do you have a proposal for mTLS within k8s for SMD that doesn't preclude the current operations?

alexlovelltroy commented 2 weeks ago

> I don't think there's any way to populate the server_id or router config dynamically, or that dynamic handling would even be desirable. AFAIK these will need to be config parameters that we simply trust you've set to the correct values. The v4 server ID needs to match the Service's spec.loadBalancerIP; I don't think any other in-cluster config cares about the v6 ID.

You're driving at the right stuff here. We may need to explore options outside of the standard k8s networking in order to get this to work reliably. I've never understood how networking would work to bring DHCP properly into a remote k8s cluster without complex and unpleasant VLAN incantations. The solution in CSM only works because of direct connections to the worker nodes and plenty of VLAN tagging.

synackd commented 2 weeks ago

> > The coresmd image is currently busted after some possibly incomplete file rearrangement upstream. The /coredhcp path in the image is a directory with a README.md; it's apparently supposed to get replaced with a binary built from some templated Go. I hulk smashed the (working) release binary from the repo into a local image build instead.
>
> Does the latest version, v0.0.5, work for you? I examined it, and /coredhcp is a binary in that version.

@rainest Ah, I found the issue. We were originally pushing `coresmd` as the container name and then started pushing `coredhcp`. This led to the former not working while the latter did. We have deleted the `coresmd` container to eliminate confusion. Going forward, we should use `ghcr.io/openchami/coredhcp` as the CoreDHCP container that has the coresmd plugins built in.
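In the Helm chart that presumably translates into something like the following values fragment (the actual key names depend on how #87 structures them, and the tag is just a placeholder for whatever release is current):

```yaml
# Hypothetical values fragment; the real keys come from the chart in #87.
coredhcp:
  image:
    repository: ghcr.io/openchami/coredhcp
    tag: v0.0.6   # placeholder; pin to the current release
```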

Thanks for reporting the issue!

synackd commented 2 weeks ago

I will update the quickstart docker-compose recipe to use the correct container.

synackd commented 1 week ago

The above PR also fixes an issue where a 'permission denied' error would be seen when binding to port 67; the fix shipped in coresmd v0.0.6.
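If anyone still hits this on an older image, a generic Kubernetes-side mitigation (not necessarily the change that went into v0.0.6) is to open low ports to unprivileged processes at the pod level:

```yaml
# Generic pod-spec mitigation for binding to privileged port 67 without root;
# not necessarily how coresmd v0.0.6 fixes it.
spec:
  securityContext:
    sysctls:
      - name: net.ipv4.ip_unprivileged_port_start
        value: "0"
```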