flatcar / Flatcar

Flatcar project repository for issue tracking, project documentation, etc.
https://www.flatcar.org/
Apache License 2.0

locksmith fails when -etcd-cafile is specified #948

Open adborden opened 1 year ago

adborden commented 1 year ago

Description

When -etcd-cafile is specified without a client cert/key, locksmith fails with the error:

$ locksmithctl -etcd-cafile=/etc/ssl/certs/ca-certificates.crt status
Error initializing etcd client: open : no such file or directory

I've configured etcd with TLS using self-signed certificates but not TLS client authentication. locksmith seems to be looking for a certificate and key, even though these options are not applicable.

Impact

The error message is confusing because it refers to a different command-line option than the one that was specified.

Environment and steps to reproduce

  1. Set-up: Flatcar Linux 3374.2.2
  2. Task: Configuring locksmith with TLS communication and server-only authentication
  3. Action(s): a. locksmithctl -etcd-cafile=/etc/ssl/certs/ca-certificates.crt status
  4. Error: Error initializing etcd client: open : no such file or directory

Expected behavior

locksmith uses the specified CA to authenticate the server without client authentication.
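The expected behavior can be sketched as follows. This is only an illustration using Python's standard `ssl` module, not locksmith's actual Go code, and the function name `build_tls_context` is hypothetical. The idea is that the CA file alone is enough to verify the server, and the client certificate and key are only opened when they were actually given. The empty path in `open : no such file or directory` is consistent with an empty file name being opened for an option that was never set.

```python
import ssl
from typing import Optional


def build_tls_context(
    cafile: Optional[str] = None,
    certfile: Optional[str] = None,
    keyfile: Optional[str] = None,
) -> ssl.SSLContext:
    """Build a client TLS context in which the client cert/key pair is optional."""
    # Server verification: trust the given CA bundle, or the system store if none is given.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=cafile)

    # Client authentication: only touch the cert/key files when they were requested.
    if certfile or keyfile:
        if not (certfile and keyfile):
            # A clear error that names the options actually involved.
            raise ValueError("certfile and keyfile must be specified together")
        ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)
    return ctx


if __name__ == "__main__":
    # CA-only configuration, as in this report: no client certificate is loaded.
    ctx = build_tls_context()
    print(ctx.verify_mode == ssl.CERT_REQUIRED)
```

With this shape, passing only a CA file never reaches the code path that opens the certificate or key.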

Additional information

N/A.

bmbeverst commented 4 months ago

I am seeing this issue with k3s and the embedded etcd as well. k3s with an embedded etcd does work with etcdctl but not with locksmithctl.

I noticed a pull request (link) to upgrade locksmith to etcd3; maybe that is the issue?

Impact

Not able to use etcd-based locksmith reboots with k3s.

Environment and steps to reproduce

  1. Set-up: Flatcar Linux 3815.2.1
  2. Task: Install k3s with etcd and configure locksmith with TLS communication
  3. Action(s):
    1. Install k3s: curl -sfL https://get.k3s.io | sh -s - --secrets-encryption --token SuperSecrect --cluster-init
    2. Test with etcdctl: etcdctl --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt --cert=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key=/var/lib/rancher/k3s/server/tls/etcd/server-client.key endpoint status
    3. Test with locksmith: locksmithctl --etcd-cafile="/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt" --etcd-certfile="/var/lib/rancher/k3s/server/tls/etcd/server-client.crt" --etcd-keyfile="/var/lib/rancher/k3s/server/tls/etcd/server-client.key" status
  4. Error: Error initializing etcd client: creating etcd lock client: EOF

Edit:

Manually passing endpoints instead of using the defaults got a little further:

locksmithctl --etcd-cafile="/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt" --etcd-certfile="/var/lib/rancher/k3s/server/tls/etcd/server-client.crt" --etcd-keyfile="/var/lib/rancher/k3s/server/tls/etcd/server-client.key" --endpoint https://127.0.0.1:2379,https://10.10.1.41:2379,https://10.10.1.41:2380 status 
Error initializing etcd client: creating etcd lock client: net/http: HTTP/1.x transport connection broken: malformed HTTP response "\x00\x00\x06\x04\x00\x00\x00\x00\x00\x00\x05\x00\x00@\x00"

Also tried the peer files, and those understandably didn't work.

Expected behavior

locksmith works with TLS-enabled etcd in k3s.

tormath1 commented 4 months ago

@bmbeverst can you configure etcd with --enable-v2 to confirm whether the issue comes from the v2/v3 API difference? I'll try to restart the PR you linked.

bmbeverst commented 4 months ago

Thanks @tormath1

I was unable to run etcd with enable-v2: when I set enable-v2: true in the /var/lib/rancher/k3s/server/db/etcd/config file (where I got the TLS config), it was removed when I rebooted. The node is already part of a cluster, and I guess that prevents the setting from being changed. I didn't find any help in the k3s docs or on Google for enabling v2 in the embedded etcd.

So I did the reverse and configured etcdctl to use v2, which results in the same error as locksmith:

ETCDCTL_API=2 etcdctl --ca-file="/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt" --cert-file="/var/lib/rancher/k3s/server/tls/etcd/server-client.crt" --key-file="/var/lib/rancher/k3s/server/tls/etcd/server-client.key" --endpoints https://127.0.0.1:2379 cluster-health
cluster may be unhealthy: failed to list members
Error:  client: etcd cluster is unavailable or misconfigured; error #0: net/http: HTTP/1.x transport connection broken: malformed HTTP response "\x00\x00\x06\x04\x00\x00\x00\x00\x00\x00\x05\x00\x00@\x00"

error #0: net/http: HTTP/1.x transport connection broken: malformed HTTP response "\x00\x00\x06\x04\x00\x00\x00\x00\x00\x00\x05\x00\x00@\x00"

That is the same error that locksmith gave.

          net/http: HTTP/1.x transport connection broken: malformed HTTP response "\x00\x00\x06\x04\x00\x00\x00\x00\x00\x00\x05\x00\x00@\x00"
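The byte string quoted in this error can be decoded by hand. The sketch below (the helper name `decode_frame` is made up for illustration) reads it as an HTTP/2 frame: a 9-byte frame header followed by 6-byte settings entries. Decoded that way, it is a SETTINGS frame on stream 0, which is consistent with the endpoint answering in HTTP/2 (the transport gRPC uses) while the client expects an HTTP/1.x response. That interpretation is an assumption drawn from the bytes alone, not something confirmed from the etcd or locksmith sources.

```python
import struct

# Bytes quoted in the "malformed HTTP response" error.
RAW = b"\x00\x00\x06\x04\x00\x00\x00\x00\x00\x00\x05\x00\x00@\x00"

# Frame types and settings identifiers from the HTTP/2 specification.
FRAME_TYPES = {0x0: "DATA", 0x1: "HEADERS", 0x4: "SETTINGS", 0x6: "PING", 0x7: "GOAWAY"}
SETTINGS = {
    0x1: "HEADER_TABLE_SIZE",
    0x2: "ENABLE_PUSH",
    0x3: "MAX_CONCURRENT_STREAMS",
    0x4: "INITIAL_WINDOW_SIZE",
    0x5: "MAX_FRAME_SIZE",
    0x6: "MAX_HEADER_LIST_SIZE",
}


def decode_frame(raw: bytes) -> dict:
    """Decode one HTTP/2 frame: 24-bit length, type, flags, 31-bit stream id, payload."""
    length = int.from_bytes(raw[0:3], "big")
    frame_type, flags = raw[3], raw[4]
    stream_id = int.from_bytes(raw[5:9], "big") & 0x7FFFFFFF
    payload = raw[9:9 + length]

    settings = {}
    if frame_type == 0x4:
        # Each SETTINGS entry is a 16-bit identifier followed by a 32-bit value.
        for offset in range(0, len(payload) - 5, 6):
            ident, value = struct.unpack_from(">HI", payload, offset)
            settings[SETTINGS.get(ident, hex(ident))] = value

    return {
        "length": length,
        "type": FRAME_TYPES.get(frame_type, hex(frame_type)),
        "flags": flags,
        "stream_id": stream_id,
        "settings": settings,
    }


if __name__ == "__main__":
    print(decode_frame(RAW))
```

The single setting in the payload is MAX_FRAME_SIZE with the value 16384 (0x4000; the `@` in the error message is byte 0x40).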

I grabbed the etcdctl version, and it is 3.5.

# etcdctl version
etcdctl version: 3.5.0
API version: 3.5

Also in the startup log of k3s I see this line

"embed/etcd.go:309","msg":"starting an etcd server","etcd-version":"3.5.9","git-sha":"Not provided

And the latest k3s release shows it as running:

Etcd v3.5.9-k3s1

tormath1 commented 4 months ago

@bmbeverst that should be doable with:

$ curl -sfL https://get.k3s.io | sh -s - --secrets-encryption --token SuperSecrect --cluster-init --etcd-arg=-experimental-enable-v2v3=v2 --etcd-arg=-enable-v2=true

but I kept getting the error, even though I can see the flags being processed:

Apr 02 14:36:45 localhost k3s[21600]: {"level":"warn","ts":"2024-04-02T14:36:45.1322Z","caller":"embed/etcd.go:739","msg":"Flag `enable-v2` is deprecated and will get removed in etcd 3.6."}
Apr 02 14:36:45 localhost k3s[21600]: {"level":"warn","ts":"2024-04-02T14:36:45.132252Z","caller":"embed/etcd.go:741","msg":"Flag `experimental-enable-v2v3` is deprecated and will get removed in etcd 3.6."}

bmbeverst commented 4 months ago

I see the same issue: the etcd server is detecting the new configuration, and I see it in the config file, but the etcd clients still cannot connect. The errors are the same as before.

Any luck with the PR?

tormath1 commented 4 months ago

@bmbeverst yes, I confirm it works correctly with the upgrade PR:

$ sudo ./locksmithctl --etcd-cafile="/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt" --etcd-certfile="/var/lib/rancher/k3s/server/tls/etcd/server-client.crt" --etcd-keyfile="/var/lib/rancher/k3s/server/tls/etcd/server-client.key" --endpoint=https://127.0.0.1:2379 status
Available: 9
Max: 10

MACHINE ID
1325649ad50e4756bf05107701cfca69

pothos commented 4 months ago

Since you are using k3s, which is Kubernetes, I think you could use FLUO https://github.com/flatcar/flatcar-linux-update-operator/ or kured https://github.com/kubereboot/kured/ instead of locksmith, couldn't you?

bmbeverst commented 4 months ago

The simplicity of locksmith is what I like: a simple process to reboot nodes without needing any additional Kubernetes configuration. Ideally, Kubernetes should be able to tolerate a node rebooting without any issues.

I didn't like kured because, after creating a cluster, it still needs the update service to be deployed and configured in Kubernetes. I did not know about the Flatcar Linux Update Operator, but it also requires Kubernetes setup. Perhaps I am mistaken and this is the best path forward.

I am trying to create a setup where I can fully automate the deployment of a multi-node k3s cluster with automatic updates.

@tormath1 to test the PR, do I build locksmith with your PR and overwrite the binaries in the flatcar OS?

tormath1 commented 4 months ago

> I am trying to create a setup where I can fully automate the deployment of a multi-node k3s cluster with automatic updates.

In this case, I would recommend investigating the FLUO or kured approach further. Kured is only a daemon set that runs on each node (and is compatible with Flatcar); it can be easily deployed, and it takes care of cleanly draining the nodes before reboot.

To try the PR, you can build it locally and then upload the binary to your nodes, in /opt/bin for example. You might need to copy /opt/bin/locksmithctl to /opt/bin/locksmithd if you want to update locksmithd.service to use this new binary (by overriding the ExecStart= setting). Warning: the PR has not been updated; use this for testing only.
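One way to override ExecStart= is a systemd drop-in file. The sketch below only renders the text of such a drop-in. The helper name `render_dropin` and the drop-in file name are assumptions made for illustration, and any arguments the packaged unit passes to locksmithd would have to be carried over by hand.

```python
def render_dropin(binary_path: str) -> str:
    """Return the text of a systemd drop-in that replaces the unit's ExecStart=."""
    return (
        "[Service]\n"
        # An empty ExecStart= clears the command inherited from the packaged unit.
        "ExecStart=\n"
        # The second assignment sets the replacement command.
        f"ExecStart={binary_path}\n"
    )


if __name__ == "__main__":
    # Hypothetical location; the file name 10-test-binary.conf is an assumption.
    print("# /etc/systemd/system/locksmithd.service.d/10-test-binary.conf")
    print(render_dropin("/opt/bin/locksmithd"), end="")
```

After writing the file, systemd needs `systemctl daemon-reload` and a restart of locksmithd.service before the new binary is used.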

bmbeverst commented 4 months ago

Thanks for the advice! Totally understand that the PR is not production ready.

Really appreciate the help with this issue.