etcd not init etcd.pem with services.kubernetes.roles master

apeyroux commented 5 years ago

Issue description

When the services.kubernetes.roles = ["master"] is enabled, I have this error when starting the service etcd :

avril 12 18:10:01 xps15.px.io etcd[29989]: open /var/lib/kubernetes/secrets/etcd.pem: no such file or directory

Steps to reproduce

Use services.kubernetes.roles = ["master"] in /etc/nixos/configuration.nix

Technical details

system: "x86_64-linux"
host os: Linux 4.19.34, NixOS, 19.03.172138.5c52b25283a (Koi)
multi-user?: yes
sandbox: yes
version: nix-env (Nix) 2.2
channels(root): "nixos-19.03.172138.5c52b25283a"
nixpkgs: /nix/var/nix/profiles/per-user/root/channels/nixos

zarelit commented 5 years ago

FWIW it's not happening on the 18.09 release

zarelit commented 5 years ago

cc @johanot @arianvp

arianvp commented 5 years ago

@zarelit that it's not happening on 18.09 is expected. We moved to mandatory pki in 19.03 (https://nixos.org/nixos/manual/index.html#sec-kubernetes)

However, this config should work because "master" should imply easyCerts = true and bootstrap the certificates automatically.

arianvp commented 5 years ago

@apeyroux was the error transient or fatal? Did etcd eventually start up successfully with the certs, or did it just fail to start up at all?

We recently merged a PR (yesterday) that changes the order of components being started to reduce the amount of transient non-fatal errors https://github.com/NixOS/nixpkgs/pull/56789

There might be some time where the certs are still being generated, but etcd is already started. However, after the certs appear, etcd should start just fine

I can take upto serveral minutes for the kubernetes cluster to stabilise. Did it eventually complete bootstrapping?

zarelit commented 5 years ago

@arianvp I have the same issue, even though I don't know if the cause is the same. It looks like it's not transient

relevant parts of my configuration, trying to set up k8s on my laptop:

  services.kubernetes = {
   roles = ["master" "node"];
   addons.dashboard.enable = true;
   kubelet.extraOpts = "--fail-swap-on=false";
   masterAddress = "localhost";
  };

Certmgr unit is stuck in this loop

Apr 18 13:37:06 tsundoku systemd[1]: Starting certmgr...
Apr 18 13:37:06 tsundoku qr84y7ksrgydiska07vjd6q9vbymlkbj-unit-script-certmgr-pre-start[11269]: 2019/04/18 13:37:06 [INFO] certmgr: loading from config file /nix/store/q6s4lclkzmr1g59dqnjz6kdi6azqy8fj-certmgr.yaml
Apr 18 13:37:06 tsundoku qr84y7ksrgydiska07vjd6q9vbymlkbj-unit-script-certmgr-pre-start[11269]: 2019/04/18 13:37:06 [INFO] manager: loading certificates from /nix/store/ms3w33cai719d8971hsdmi4j21fs25pq-certmgr.d
Apr 18 13:37:06 tsundoku qr84y7ksrgydiska07vjd6q9vbymlkbj-unit-script-certmgr-pre-start[11269]: 2019/04/18 13:37:06 [INFO] manager: loading spec from /nix/store/ms3w33cai719d8971hsdmi4j21fs25pq-certmgr.d/addonManager.json
Apr 18 13:37:06 tsundoku qr84y7ksrgydiska07vjd6q9vbymlkbj-unit-script-certmgr-pre-start[11269]: 2019/04/18 13:37:06 [ERROR] cert: failed to fetch remote CA: {"code":7400,"message":"failed POST to https://localhost:8888/api/v1/cfssl/info: Post https://localhost:8888/api/v1/cfssl/info: x509: certificate is valid for tsundoku.lan, not localhost"}
Apr 18 13:37:06 tsundoku qr84y7ksrgydiska07vjd6q9vbymlkbj-unit-script-certmgr-pre-start[11269]: Failed: {"code":7400,"message":"failed POST to https://localhost:8888/api/v1/cfssl/info: Post https://localhost:8888/api/v1/cfssl/info: x509: certificate is valid for tsundoku.lan, not localhost"}
Apr 18 13:37:06 tsundoku systemd[1]: certmgr.service: Control process exited, code=exited status=1
Apr 18 13:37:06 tsundoku systemd[1]: certmgr.service: Failed with result 'exit-code'.
Apr 18 13:37:06 tsundoku systemd[1]: Failed to start certmgr.

Without any clue I tried to change the kubernetes.masterAddress from localhost to the hostname tsundoku.lan and now it complains like this:

Apr 18 13:43:05 tsundoku systemd[1]: Starting certmgr...
Apr 18 13:43:05 tsundoku 796ywa0556grby68vb4p73mn5yn3l74x-unit-script-certmgr-pre-start[22733]: 2019/04/18 13:43:05 [INFO] certmgr: loading from config file /nix/store/xy17dlkik1rcyvdxb6n6xa5fqq7hgdxk-certmgr.yaml
Apr 18 13:43:05 tsundoku 796ywa0556grby68vb4p73mn5yn3l74x-unit-script-certmgr-pre-start[22733]: 2019/04/18 13:43:05 [INFO] manager: loading certificates from /nix/store/d4njrd8r64mqgq4h6dxmbg6iysha5wgn-certmgr.d
Apr 18 13:43:05 tsundoku 796ywa0556grby68vb4p73mn5yn3l74x-unit-script-certmgr-pre-start[22733]: 2019/04/18 13:43:05 [INFO] manager: loading spec from /nix/store/d4njrd8r64mqgq4h6dxmbg6iysha5wgn-certmgr.d/addonManager.json
Apr 18 13:43:15 tsundoku 796ywa0556grby68vb4p73mn5yn3l74x-unit-script-certmgr-pre-start[22733]: 2019/04/18 13:43:15 [ERROR] cert: failed to fetch remote CA: {"code":7400,"message":"failed POST to https://tsundoku.lan:8888/api/v1/cfssl/info: Post https://tsundoku.lan:8888/api/v1/cfssl/info: dial tcp: lookup tsundoku.lan: device or resource busy"}
Apr 18 13:43:15 tsundoku 796ywa0556grby68vb4p73mn5yn3l74x-unit-script-certmgr-pre-start[22733]: Failed: {"code":7400,"message":"failed POST to https://tsundoku.lan:8888/api/v1/cfssl/info: Post https://tsundoku.lan:8888/api/v1/cfssl/info: dial tcp: lookup tsundoku.lan: device or resource busy"}
Apr 18 13:43:15 tsundoku systemd[1]: certmgr.service: Control process exited, code=exited status=1
Apr 18 13:43:15 tsundoku systemd[1]: certmgr.service: Failed with result 'exit-code'.
Apr 18 13:43:15 tsundoku systemd[1]: Failed to start certmgr.

cawilliamson commented 5 years ago

Has anyone figured out either a solution or a workaround for this at all?

I've been struggling to get a k8s cluster up on NixOS with this issue for a few days now. :(

The following seems to be the cause of the issue - a cert is being generated for "127.0.0.1" instead of "localhost" I guess?

# /nix/store/c1dcbf3c4jb4jlcadzh05i0di98lm6zz-unit-script-certmgr-pre-start
2019/04/18 21:38:39 [INFO] certmgr: loading from config file /nix/store/jvygi3li8pjmx0vf3jldamz8j3m1a03s-certmgr.yaml
2019/04/18 21:38:39 [INFO] manager: loading certificates from /nix/store/p4c93zcbdh9fcs634n1cn2scd5rwwjf0-certmgr.d
2019/04/18 21:38:39 [INFO] manager: loading spec from /nix/store/p4c93zcbdh9fcs634n1cn2scd5rwwjf0-certmgr.d/addonManager.json
2019/04/18 21:38:39 [ERROR] cert: failed to fetch remote CA: {"code":7400,"message":"failed POST to https://localhost:8888/api/v1/cfssl/info: Post https://localhost:8888/api/v1/cfssl/info: x509: certificate is valid for 127.0.0.1, not localhost"}
Failed: {"code":7400,"message":"failed POST to https://localhost:8888/api/v1/cfssl/info: Post https://localhost:8888/api/v1/cfssl/info: x509: certificate is valid for 127.0.0.1, not localhost"}

johanot commented 5 years ago

Seems there are multiple (possibly unrelated) issues being raised here. Will try to look into them individually tomorrow, if someone else doesn't beat me to it :-).

Regarding easyCerts: It seemed less intrusive to not enable that option by default, in order not to mess with custom PKI-setups of existing clusters. I personally still prefer that easyCerts is opt-in, not opt-out. I would have expected a build failure though. IMHO, it is not nice that etcd fails at runtime, because of a missing certfile.

cawilliamson commented 5 years ago

@johanot Even when I turn on easyCerts though sadly it still fails. The main problem seems to be the following error:

x509: certificate is valid for 127.0.0.1, not localhost

Sounds like the cert is being generated for an IP and not a "domain" (kind of.)

Icerius commented 5 years ago

I just went through setting up a kubernetes cluster on a new 19.03 install, started with errors and then a success. My master node is defined with only role master.

First masterAddress was defined as an IP address and then I got the error shown in the beginning of this isssue, that is no certificate got generated. When looking at cfssl logs there were errors about a "bad certificate" and "cannot validate certificate for [IP] because it doesn't contain any IP SANs".

Then I changed the masterAddress to be a hostname and got the error "x509: certificate is valid for [ip] not [host]".

Then I:

Removed all kubernetes references from the config, rebuilt and switched to config
Deleted /var/lib/kubernetes/secrets and /var/lib/cfssl folders
Readded kubernetes to the config with masterAddress as a hostname, rebuilt and switched to config

After that the certificates got generated and my kubernetes cluster seems to be running. I have also added another node to my cluster with role node through the nixos-kubernetes-node-join script.

cawilliamson commented 5 years ago

@Icerius Thanks for the info - I think, however, that the primary issue being discussed here is running a master and node on the same box. That's where things seem to fall apart (i.e. when your masterAddress is localhost.)

I'll spent some time today looking in to this and see if I can find a solution - this is really starting to bug me since I'm spending a lot of time with k8s lately and having it on my home server would be a helpful lab.

Icerius commented 5 years ago

Thanks @cawilliamson for the reply.

Based on the original bug report from @apeyroux it does not seem to be the case that he is running master and node on the same box since his he only specifies role master and he does not mention how he set masterAddress.

It is true that @zarelit seems to be running master and node on the same box with localhost as masterAddress. A question for @zarelit, did you start your configuration with localhost or did you first try with masterAddress = "tsundoku.lan" The reason I ask is that I see the message "certificate is valid for tsundoku.lan, not localhost" and when I got a similar message it was because the certificate that had been generated had incorrect info and I had to clean out the old certificates (the reason for step 2 in my description which I was able to do since it was a clean setup). I see you tries setting masterAddress to tsundoku.lan after having it localhost, what does tsundoku.lan resolve too on the machine?

Regarding the error you are seeing @cawilliamson it seems to also be because of incorrectly generated certificates. Did you start with masterAddress localhost or start with 127.0.0.1 and then move to localhost?

cawilliamson commented 5 years ago

@Icerius I just spent some time on this and it turns out to be a very simple fix for me - I did start with "127.0.0.1" and switched to "localhost" (I didn't RTFM first!)

Anyway - the fix for me was to delete the cached certs so basically

1.) Disable kubernetes (remove refs from /etc/nixos/) 2.) rm -rf /var/lib/cfssl /var/lib/kubernetes 3.) Enable kubernetes again (add refs back to /etc/nixos/) 4.) If the first build fails - run the rebuild again and it should succeed the second time.)

I have another problem now but that's unrelated so I'm all good on this one. :+1:

zarelit commented 5 years ago

A question for @zarelit, did you start your configuration with localhost or did you first try with masterAddress = "tsundoku.lan"

At first I didn't read the docs and thought it was like the address where to bind something so (IIRC but I may be wrong) I put 127.0.0.1, then 0.0.0.0, then read the docs and put localhost, saw the messages and put tsundoku.lan at last. At some point in these tests I have rolled the nixos version back and forth to/from 18.09 to fix other unrelated issues (i.e. machine with swap enabled, disk pressure alert)

The reason I ask is that I see the message "certificate is valid for tsundoku.lan, not localhost" and when I got a similar message it was because the certificate that had been generated had incorrect info and I had to clean out the old certificates (the reason for step 2 in my description which I was able to do since it was a clean setup).

Yeah I understand but I was with a friend in an ongoing "live tests" frenzy ^_^'' so I don't recall the exact steps to reproduce. I'm going to clear the the certificate cache and report

I see you tries setting masterAddress to tsundoku.lan after having it localhost, what does tsundoku.lan resolve too on the machine?

I thought I had tsundoku.lan in my /etc/hosts but I actually don't so what happened is that I tried to rebuild-switch both in a network that resolves it to my external address and in a network that does not resolve it at all

zarelit commented 5 years ago

@Icerius After clearing the data directories and rebuilding with masterAddress = "localhost"; I received an error about etcd again but it was transient and thus I believe it will be fixed when the I update the channel as @johanot pointed out.

onixie commented 5 years ago

I have the same issue and it is fatal. I find certmgr is looping at a failure of no IP SAN.

My configuration is like

  services.kubernetes = {
   roles = ["master"];
   masterAddress = "10.0.5.2";
  };

Any idea to solve this problem?

Gonzih commented 5 years ago

@onixie master address needs to be a hostname, not an IP

onixie commented 5 years ago

@Gonzih Thanks. hostname works for me.

stale[bot] commented 4 years ago

Thank you for your contributions. This has been automatically marked as stale because it has had no activity for 180 days. If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity. Here are suggestions that might help resolve this more quickly:

Search for maintainers and people that previously touched the related code and @ mention them in a comment.
Ask on the NixOS Discourse. 3. Ask on the #nixos channel on irc.freenode.net.

saschagrunert commented 4 years ago

still important to me

gleber commented 4 years ago

I have suspicion that #95885 (by bringing in https://github.com/golang/go/issues/39568) broke NixOS kubernetes modules 22 days ago. Looks like easyCerts automation is broken by stricter cert verification logic in Go 1.15

gleber commented 4 years ago

Yes, using nixpkgs from de5a644adf0ea226b475362cbe7e862789f2849d allows for the certmgr to talk to cfssl without errors.

The symptoms I've seen were:

certmgr showing this in logs:

Sep 11 16:26:11 luna certmgr-pre-start[2899]: 2020/09/11 16:26:11 [ERROR] cert: failed to fetch remote CA: {"code":7400,"message":"failed POST to https://api.kube:8888/api/v1/cfssl/info: Post \"https://api.kube:8888/api/v1/cfssl/info\": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0"}
Sep 11 16:26:11 luna certmgr-pre-start[2899]: Failed: {"code":7400,"message":"failed POST to https://api.kube:8888/api/v1/cfssl/info: Post \"https://api.kube:8888/api/v1/cfssl/info\": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0"}

cfssl showing this in logs:

Sep 11 16:32:19 luna cfssl[1734]: 2020/09/11 16:32:19 http: TLS handshake error from 192.168.1.21:33026: remote error: tls: bad>

My setup is following https://nixos.wiki/wiki/Kubernetes

johanot commented 4 years ago

@gleber I believe it should be fixed once #96446 is merged.

gleber commented 4 years ago

@johanot I can confirm that it is fixed, but NixOS-based tests using k8s I've tried were flaky. It happened to me that kube-apiserver would get marked as failed after a couple of restarts through the StartLimitIntervalSec/StartLimitBurst mechanism. It would fail to start due to certificates not being present yet in the right locations (it looks like certmgr is provisioning them with a delay, I do not yet understand what its mechanism of work). This would happen to me in 1/4 test runs when running test from https://github.com/xtruder/kubenix/blob/master/examples/nginx-deployment/default.nix#L24

Rjected commented 4 years ago

~~I'm still having this issue and haven't been able to fix it, really curious if there is any progress.~~

EDIT: I did what @cawilliamson suggested but had to rebuild three times to get it to work - I have no idea why it worked

nixos-discourse commented 4 years ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/use-nixos-as-single-node-kubernetes-cluster/8858/8

NobbZ commented 4 years ago

I had a lot of problems with this as well.

In https://discourse.nixos.org/t/use-nixos-as-single-node-kubernetes-cluster/8858/7?u=nobbz I learned that some files have to be deleted after a failed run. The list of those files comes from this thread basically.

Also, there I found out, that the field masterAddress has to be a string describing the hostname, it seems that an IP can not be used here. Additionally apiserver.advertiseAddress has to be an IP, not a hostname.

These are my observations. Not sure if changing those fields actually fixed it or was just coincidental, but after that it worked for me.

stale[bot] commented 3 years ago

I marked this as stale due to inactivity. → More info

samrose commented 3 years ago

I am seeing these issues in 21.05 on aws ec2

fkautz commented 2 years ago

Issues continue to persist in 21.11.

bbqsrc commented 2 years ago

Still issues.

luciusf commented 1 year ago

Still seeing this on a fresh VM with 22.11. Following the official documentation get's me the same error. I was not able so far, to use the workarounds in this thread successfully.

zeratax commented 1 year ago

so to me the issue seems that /var/lib/kubernetes/secrets/ca.pem is empty? maybe this is a race condition? by just copying the file from /var/lib/cfssl/ca.pem now it seems to work?

Edit: Seems to require two nixos-rebuild switch before all the files are actually there tho

nixos-discourse commented 1 year ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/simulating-a-kubernetes-cluster-with-containers/26297/1

cafkafk commented 1 year ago

I had this issue when following the wiki article for kubernetes (on NixOS 23.11 (Tapir)). I used an identical configuration to the wiki.

First I had the issue with ETCD being unable to find the certs. I removed the old certs:

sudo rm /var/lib/kubernetes/secrets/ /var/lib/cfssl/ -rf

Then I changed the kubeMasterHostname = "api.kube" to kubeMasterHostname = "localhost";

Finally, I commented out the kubernetes configuration from my config. Ran rebuild switch. Then removed the comments, ran rebuild switch again.

Now I have a working master node :D

(also, if you follow the wiki, you'll need to run kubectl with sudo -E, or change the permissions of the certificate dirs, good luck!)

JamesofScout commented 1 year ago

So i tested it the Master must refer to itself on Creation, otherwise it will not work this can wither be done per localhost or ip 127.0.0.1

MichelV69 commented 5 months ago

Continues to be an issue in : system.stateVersion = "24.05";

That's a bit disappointing, because it's causing a lot of annoyance in setting up a simple 2 Master / 3 Worker build

cafkafk commented 4 months ago

That's a bit disappointing, because it's causing a lot of annoyance in setting up a simple 2 Master / 3 Worker build

In general, the whole of kubernetes in nixos could use some love, contributions welcome :p

MichelV69 commented 4 months ago

In general, the whole of kubernetes in nixos could use some love, contributions welcome :p

I've been making some notes based on rumaging source code.

Current progress is at https://github.com/MichelV69/nixos/blob/main/kubernetes.nix ... in specific, the

apiserver.extraSANs option is crucial to progress, as is pki.cfsslAPIExtraSANs

... additionally, I think a lot of this would be much easier with a prequisite for an LB'd FQDN to exist in advance of set up.

Unfortunately, I haven't had time to fight with it in the past couple of weeks. I'll update as I struggle with it.

noizu commented 3 days ago

It seems like there is a permission issue in populating the kubernetes files. Using:

users.users.keith.extraGroups = ["docker" "kubernetes"];

in my config allows the pem files to be binplaced correctly (two builds required the first to build the second to run with placed pems.)

NixOS / nixpkgs