apeyroux opened this issue 5 years ago
FWIW it's not happening on the 18.09 release
cc @johanot @arianvp
@zarelit that it's not happening on 18.09 is expected. We moved to mandatory PKI in 19.03 (https://nixos.org/nixos/manual/index.html#sec-kubernetes). However, this config should work, because "master" should imply easyCerts = true and bootstrap the certificates automatically.
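For reference, a config along these lines should then be enough for the certificates to be bootstrapped (a rough sketch; the hostname is a placeholder, and the explicit easyCerts line should be redundant if the master role already implies it):

services.kubernetes = {
  roles = [ "master" ];
  masterAddress = "kube-master.example.local";  # placeholder; a hostname, not an IP
  easyCerts = true;                             # should already be implied by the master role
};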
@apeyroux was the error transient or fatal? Did etcd eventually start up successfully with the certs, or did it just fail to start at all?
We recently merged a PR (yesterday) that changes the order in which components are started, to reduce the amount of transient non-fatal errors: https://github.com/NixOS/nixpkgs/pull/56789
There might be some time where the certs are still being generated but etcd is already started; however, after the certs appear, etcd should start just fine.
It can take up to several minutes for the kubernetes cluster to stabilise. Did it eventually complete bootstrapping?
@arianvp I have the same issue, even though I don't know if the cause is the same. It looks like it's not transient
relevant parts of my configuration, trying to set up k8s on my laptop:
services.kubernetes = {
roles = ["master" "node"];
addons.dashboard.enable = true;
kubelet.extraOpts = "--fail-swap-on=false";
masterAddress = "localhost";
};
The certmgr unit is stuck in this loop:
Apr 18 13:37:06 tsundoku systemd[1]: Starting certmgr...
Apr 18 13:37:06 tsundoku qr84y7ksrgydiska07vjd6q9vbymlkbj-unit-script-certmgr-pre-start[11269]: 2019/04/18 13:37:06 [INFO] certmgr: loading from config file /nix/store/q6s4lclkzmr1g59dqnjz6kdi6azqy8fj-certmgr.yaml
Apr 18 13:37:06 tsundoku qr84y7ksrgydiska07vjd6q9vbymlkbj-unit-script-certmgr-pre-start[11269]: 2019/04/18 13:37:06 [INFO] manager: loading certificates from /nix/store/ms3w33cai719d8971hsdmi4j21fs25pq-certmgr.d
Apr 18 13:37:06 tsundoku qr84y7ksrgydiska07vjd6q9vbymlkbj-unit-script-certmgr-pre-start[11269]: 2019/04/18 13:37:06 [INFO] manager: loading spec from /nix/store/ms3w33cai719d8971hsdmi4j21fs25pq-certmgr.d/addonManager.json
Apr 18 13:37:06 tsundoku qr84y7ksrgydiska07vjd6q9vbymlkbj-unit-script-certmgr-pre-start[11269]: 2019/04/18 13:37:06 [ERROR] cert: failed to fetch remote CA: {"code":7400,"message":"failed POST to https://localhost:8888/api/v1/cfssl/info: Post https://localhost:8888/api/v1/cfssl/info: x509: certificate is valid for tsundoku.lan, not localhost"}
Apr 18 13:37:06 tsundoku qr84y7ksrgydiska07vjd6q9vbymlkbj-unit-script-certmgr-pre-start[11269]: Failed: {"code":7400,"message":"failed POST to https://localhost:8888/api/v1/cfssl/info: Post https://localhost:8888/api/v1/cfssl/info: x509: certificate is valid for tsundoku.lan, not localhost"}
Apr 18 13:37:06 tsundoku systemd[1]: certmgr.service: Control process exited, code=exited status=1
Apr 18 13:37:06 tsundoku systemd[1]: certmgr.service: Failed with result 'exit-code'.
Apr 18 13:37:06 tsundoku systemd[1]: Failed to start certmgr.
Without any clue, I tried to change kubernetes.masterAddress from localhost to the hostname tsundoku.lan, and now it complains like this:
Apr 18 13:43:05 tsundoku systemd[1]: Starting certmgr...
Apr 18 13:43:05 tsundoku 796ywa0556grby68vb4p73mn5yn3l74x-unit-script-certmgr-pre-start[22733]: 2019/04/18 13:43:05 [INFO] certmgr: loading from config file /nix/store/xy17dlkik1rcyvdxb6n6xa5fqq7hgdxk-certmgr.yaml
Apr 18 13:43:05 tsundoku 796ywa0556grby68vb4p73mn5yn3l74x-unit-script-certmgr-pre-start[22733]: 2019/04/18 13:43:05 [INFO] manager: loading certificates from /nix/store/d4njrd8r64mqgq4h6dxmbg6iysha5wgn-certmgr.d
Apr 18 13:43:05 tsundoku 796ywa0556grby68vb4p73mn5yn3l74x-unit-script-certmgr-pre-start[22733]: 2019/04/18 13:43:05 [INFO] manager: loading spec from /nix/store/d4njrd8r64mqgq4h6dxmbg6iysha5wgn-certmgr.d/addonManager.json
Apr 18 13:43:15 tsundoku 796ywa0556grby68vb4p73mn5yn3l74x-unit-script-certmgr-pre-start[22733]: 2019/04/18 13:43:15 [ERROR] cert: failed to fetch remote CA: {"code":7400,"message":"failed POST to https://tsundoku.lan:8888/api/v1/cfssl/info: Post https://tsundoku.lan:8888/api/v1/cfssl/info: dial tcp: lookup tsundoku.lan: device or resource busy"}
Apr 18 13:43:15 tsundoku 796ywa0556grby68vb4p73mn5yn3l74x-unit-script-certmgr-pre-start[22733]: Failed: {"code":7400,"message":"failed POST to https://tsundoku.lan:8888/api/v1/cfssl/info: Post https://tsundoku.lan:8888/api/v1/cfssl/info: dial tcp: lookup tsundoku.lan: device or resource busy"}
Apr 18 13:43:15 tsundoku systemd[1]: certmgr.service: Control process exited, code=exited status=1
Apr 18 13:43:15 tsundoku systemd[1]: certmgr.service: Failed with result 'exit-code'.
Apr 18 13:43:15 tsundoku systemd[1]: Failed to start certmgr.
Has anyone figured out either a solution or a workaround for this at all?
I've been struggling to get a k8s cluster up on NixOS with this issue for a few days now. :(
The following seems to be the cause of the issue - a cert is being generated for "127.0.0.1" instead of "localhost" I guess?
# /nix/store/c1dcbf3c4jb4jlcadzh05i0di98lm6zz-unit-script-certmgr-pre-start
2019/04/18 21:38:39 [INFO] certmgr: loading from config file /nix/store/jvygi3li8pjmx0vf3jldamz8j3m1a03s-certmgr.yaml
2019/04/18 21:38:39 [INFO] manager: loading certificates from /nix/store/p4c93zcbdh9fcs634n1cn2scd5rwwjf0-certmgr.d
2019/04/18 21:38:39 [INFO] manager: loading spec from /nix/store/p4c93zcbdh9fcs634n1cn2scd5rwwjf0-certmgr.d/addonManager.json
2019/04/18 21:38:39 [ERROR] cert: failed to fetch remote CA: {"code":7400,"message":"failed POST to https://localhost:8888/api/v1/cfssl/info: Post https://localhost:8888/api/v1/cfssl/info: x509: certificate is valid for 127.0.0.1, not localhost"}
Failed: {"code":7400,"message":"failed POST to https://localhost:8888/api/v1/cfssl/info: Post https://localhost:8888/api/v1/cfssl/info: x509: certificate is valid for 127.0.0.1, not localhost"}
Seems there are multiple (possibly unrelated) issues being raised here. Will try to look into them individually tomorrow, if someone else doesn't beat me to it :-).
Regarding easyCerts: it seemed less intrusive not to enable that option by default, in order not to mess with custom PKI setups of existing clusters. I personally still prefer that easyCerts is opt-in, not opt-out. I would have expected a build failure though. IMHO, it is not nice that etcd fails at runtime because of a missing cert file.
@johanot Even when I turn on easyCerts though sadly it still fails. The main problem seems to be the following error:
x509: certificate is valid for 127.0.0.1, not localhost
Sounds like the cert is being generated for an IP and not a "domain" (kind of.)
I just went through setting up a kubernetes cluster on a new 19.03 install, started with errors and then a success. My master node is defined with only role master.
First masterAddress was defined as an IP address, and then I got the error shown at the beginning of this issue, i.e. no certificate got generated. When looking at the cfssl logs there were errors about a "bad certificate" and "cannot validate certificate for [IP] because it doesn't contain any IP SANs".
Then I changed the masterAddress to be a hostname and got the error "x509: certificate is valid for [ip] not [host]".
Then I:
After that the certificates got generated and my kubernetes cluster seems to be running. I have also added another node to my cluster with role node through the nixos-kubernetes-node-join script.
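For what it's worth, the node side of such a setup can be sketched roughly like this (the hostname is a placeholder; the join script itself is run on the node afterwards):

services.kubernetes = {
  roles = [ "node" ];
  masterAddress = "kube-master.example.local";  # placeholder; same hostname as used on the master
  easyCerts = true;
};

The node then fetches its certificates via certmgr from the master's cfssl endpoint (the :8888 service seen in the logs above).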
@Icerius Thanks for the info. I think, however, that the primary issue being discussed here is running a master and node on the same box. That's where things seem to fall apart (i.e. when your masterAddress is localhost).
I'll spend some time today looking into this and see if I can find a solution; this is really starting to bug me, since I'm spending a lot of time with k8s lately and having it on my home server would be a helpful lab.
Thanks @cawilliamson for the reply.
Based on the original bug report from @apeyroux, it does not seem to be the case that he is running master and node on the same box, since he only specifies the master role and does not mention how he set masterAddress.
It is true that @zarelit seems to be running master and node on the same box with localhost as masterAddress. A question for @zarelit: did you start your configuration with localhost, or did you first try with masterAddress = "tsundoku.lan"? The reason I ask is that I see the message "certificate is valid for tsundoku.lan, not localhost", and when I got a similar message it was because the certificate that had been generated had incorrect info and I had to clean out the old certificates (the reason for step 2 in my description, which I was able to do since it was a clean setup). I see you tried setting masterAddress to tsundoku.lan after having it at localhost; what does tsundoku.lan resolve to on the machine?
Regarding the error you are seeing, @cawilliamson: it also seems to be because of incorrectly generated certificates. Did you start with masterAddress localhost, or start with 127.0.0.1 and then move to localhost?
@Icerius I just spent some time on this and it turns out to be a very simple fix for me - I did start with "127.0.0.1" and switched to "localhost" (I didn't RTFM first!)
Anyway, the fix for me was to delete the cached certs, so basically:
1. Disable kubernetes (remove refs from /etc/nixos/)
2. rm -rf /var/lib/cfssl /var/lib/kubernetes
3. Enable kubernetes again (add refs back to /etc/nixos/)
4. If the first build fails, run the rebuild again and it should succeed the second time.
I have another problem now but that's unrelated so I'm all good on this one. :+1:
A question for @zarelit: did you start your configuration with localhost, or did you first try with masterAddress = "tsundoku.lan"?
At first I didn't read the docs and thought it was something like the address to bind to, so (IIRC, but I may be wrong) I put 127.0.0.1, then 0.0.0.0, then read the docs and put localhost, and at last, after seeing the messages, put tsundoku.lan.
At some point in these tests I have rolled the nixos version back and forth to/from 18.09 to fix other unrelated issues (i.e. machine with swap enabled, disk pressure alert)
The reason I ask is that I see the message "certificate is valid for tsundoku.lan, not localhost", and when I got a similar message it was because the certificate that had been generated had incorrect info and I had to clean out the old certificates (the reason for step 2 in my description, which I was able to do since it was a clean setup).
Yeah, I understand, but I was with a friend in an ongoing "live tests" frenzy ^_^'' so I don't recall the exact steps to reproduce. I'm going to clear the certificate cache and report back.
I see you tried setting masterAddress to tsundoku.lan after having it at localhost; what does tsundoku.lan resolve to on the machine?
I thought I had tsundoku.lan in my /etc/hosts, but I actually don't, so what happened is that I tried to rebuild-switch both in a network that resolves it to my external address and in a network that does not resolve it at all.
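One way to make the chosen masterAddress resolve consistently on the machine itself, independent of which network it is on, is to pin it from the NixOS config (a sketch; whether 127.0.0.1 or a LAN address is the right target depends on the setup):

networking.extraHosts = ''
  127.0.0.1 tsundoku.lan
'';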
@Icerius After clearing the data directories and rebuilding with masterAddress = "localhost"; I received an error about etcd again, but it was transient, and thus I believe it will be fixed when I update the channel, as @johanot pointed out.
I have the same issue and it is fatal. I found certmgr looping on a "no IP SANs" failure.
My configuration is like:
services.kubernetes = {
roles = ["master"];
masterAddress = "10.0.5.2";
};
Any idea to solve this problem?
@onixie master address needs to be a hostname, not an IP
@Gonzih Thanks. hostname works for me.
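For the record, keeping the existing address but fronting it with a name could look like this (a sketch; the hostname is made up):

networking.hosts."10.0.5.2" = [ "kube-master.local" ];    # made-up name for the existing IP
services.kubernetes.masterAddress = "kube-master.local";  # hostname instead of the raw IP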
Thank you for your contributions. This has been automatically marked as stale because it has had no activity for 180 days. If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity.
still important to me
I have a suspicion that #95885 (by bringing in https://github.com/golang/go/issues/39568) broke the NixOS kubernetes modules 22 days ago. It looks like the easyCerts automation is broken by the stricter cert verification logic in Go 1.15.
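The error message itself points at a temporary escape hatch. Until the generated certs carry proper SANs, something like this might serve as a stopgap (an untested sketch; it assumes the unit is named certmgr, and the GODEBUG knob only works as long as the Go toolchain used to build it still honours it):

# Re-enable legacy Common Name matching, as suggested by the x509 error
# message; other kube* units talking TLS may need the same treatment.
systemd.services.certmgr.environment.GODEBUG = "x509ignoreCN=0";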
Yes, using nixpkgs from de5a644adf0ea226b475362cbe7e862789f2849d allows certmgr to talk to cfssl without errors.
The symptoms I've seen were:
certmgr showing this in logs:
Sep 11 16:26:11 luna certmgr-pre-start[2899]: 2020/09/11 16:26:11 [ERROR] cert: failed to fetch remote CA: {"code":7400,"message":"failed POST to https://api.kube:8888/api/v1/cfssl/info: Post \"https://api.kube:8888/api/v1/cfssl/info\": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0"}
Sep 11 16:26:11 luna certmgr-pre-start[2899]: Failed: {"code":7400,"message":"failed POST to https://api.kube:8888/api/v1/cfssl/info: Post \"https://api.kube:8888/api/v1/cfssl/info\": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0"}
cfssl showing this in logs:
Sep 11 16:32:19 luna cfssl[1734]: 2020/09/11 16:32:19 http: TLS handshake error from 192.168.1.21:33026: remote error: tls: bad>
My setup is following https://nixos.wiki/wiki/Kubernetes
@gleber I believe it should be fixed once #96446 is merged.
@johanot I can confirm that it is fixed, but the NixOS-based tests using k8s that I've tried were flaky. It happened to me that kube-apiserver would get marked as failed after a couple of restarts through the StartLimitIntervalSec/StartLimitBurst mechanism. It would fail to start due to the certificates not yet being present in the right locations (it looks like certmgr provisions them with a delay; I do not yet understand its mechanism of work). This would happen to me in about 1 out of 4 test runs when running the test from https://github.com/xtruder/kubenix/blob/master/examples/nginx-deployment/default.nix#L24
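If the root cause is just certmgr writing the certs after kube-apiserver has already burned through its restart budget, relaxing the start-rate limit on that unit is one way to paper over the race (a sketch; the unit name is taken from the comment above and the numbers are arbitrary):

systemd.services.kube-apiserver.unitConfig = {
  # Allow more restart attempts while certmgr is still provisioning the certs.
  StartLimitIntervalSec = 300;
  StartLimitBurst = 20;
};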
I'm still having this issue and haven't been able to fix it, really curious if there is any progress.
EDIT: I did what @cawilliamson suggested but had to rebuild three times to get it to work - I have no idea why it worked
This issue has been mentioned on NixOS Discourse. There might be relevant details there:
https://discourse.nixos.org/t/use-nixos-as-single-node-kubernetes-cluster/8858/8
I had a lot of problems with this as well.
In https://discourse.nixos.org/t/use-nixos-as-single-node-kubernetes-cluster/8858/7?u=nobbz I learned that some files have to be deleted after a failed run. The list of those files basically comes from this thread.
Also, I found out there that the field masterAddress has to be a string describing the hostname; it seems an IP cannot be used here. Additionally, apiserver.advertiseAddress has to be an IP, not a hostname.
These are my observations. I'm not sure whether changing those fields actually fixed it or it was just coincidental, but after that it worked for me.
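Put together, the combination described above would look roughly like this (a sketch; the name, address, and role list are placeholders):

services.kubernetes = {
  roles = [ "master" "node" ];                  # single-box setup, as in the Discourse thread
  masterAddress = "kube.example.local";         # hostname, per the observation above
  apiserver.advertiseAddress = "192.168.1.10";  # IP, per the observation above
  easyCerts = true;
};
networking.hosts."192.168.1.10" = [ "kube.example.local" ];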
I marked this as stale due to inactivity.
I am seeing these issues in 21.05 on aws ec2
Issues continue to persist in 21.11.
Still issues.
Still seeing this on a fresh VM with 22.11. Following the official documentation gets me the same error. So far, I have not been able to use the workarounds in this thread successfully.
So to me the issue seems to be that /var/lib/kubernetes/secrets/ca.pem is empty? Maybe this is a race condition? By just copying the file over from /var/lib/cfssl/ca.pem, it now seems to work?
Edit: it seems to require two nixos-rebuild switch runs before all the files are actually there, though.
This issue has been mentioned on NixOS Discourse. There might be relevant details there:
https://discourse.nixos.org/t/simulating-a-kubernetes-cluster-with-containers/26297/1
I had this issue when following the wiki article for kubernetes (on NixOS 23.11 (Tapir)). I used an identical configuration to the wiki.
First I had the issue with ETCD being unable to find the certs. I removed the old certs:
sudo rm /var/lib/kubernetes/secrets/ /var/lib/cfssl/ -rf
Then I changed kubeMasterHostname = "api.kube" to kubeMasterHostname = "localhost";
Finally, I commented out the kubernetes configuration from my config. Ran rebuild switch. Then removed the comments, ran rebuild switch again.
Now I have a working master node :D
(Also, if you follow the wiki, you'll need to run kubectl with sudo -E, or change the permissions of the certificate dirs. Good luck!)
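In wiki-style terms, that change amounts to something like this (a sketch; kubeMasterHostname is the only binding named above, and the surrounding structure is an assumption about the wiki example):

let
  kubeMasterHostname = "localhost";  # was "api.kube" in the wiki example
in {
  services.kubernetes = {
    roles = [ "master" ];
    masterAddress = kubeMasterHostname;
    easyCerts = true;
  };
}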
So I tested it: the master must refer to itself on creation, otherwise it will not work. This can be done either via localhost or the IP 127.0.0.1.
Continues to be an issue in: system.stateVersion = "24.05";
That's a bit disappointing, because it's causing a lot of annoyance in setting up a simple 2 Master / 3 Worker build
That's a bit disappointing, because it's causing a lot of annoyance in setting up a simple 2 Master / 3 Worker build
In general, the whole of kubernetes in nixos could use some love, contributions welcome :p
In general, the whole of kubernetes in nixos could use some love, contributions welcome :p
I've been making some notes based on rummaging through the source code.
Current progress is at https://github.com/MichelV69/nixos/blob/main/kubernetes.nix ... specifically, the apiserver.extraSANs option is crucial to progress, as is pki.cfsslAPIExtraSANs ... additionally, I think a lot of this would be much easier with a prerequisite that an LB'd FQDN exists in advance of setup.
Unfortunately, I haven't had time to fight with it in the past couple of weeks. I'll update as I struggle with it.
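Based on the options named above, the SAN-related part of such a config might look roughly like this (a sketch; the FQDN is a placeholder, and the option paths are taken from the notes above rather than verified against the module):

services.kubernetes = {
  masterAddress = "kube-api.example.org";            # placeholder for the load-balanced FQDN
  apiserver.extraSANs = [ "kube-api.example.org" ];
  pki.cfsslAPIExtraSANs = [ "kube-api.example.org" ];
};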
It seems like there is a permission issue in populating the kubernetes files. Using:
users.users.keith.extraGroups = ["docker" "kubernetes"];
in my config allows the .pem files to be placed correctly (two builds required: the first to build, the second to run with the placed .pem files).
Issue description
When services.kubernetes.roles = ["master"] is enabled, I get this error when starting the etcd service:
Apr 12 18:10:01 xps15.px.io etcd[29989]: open /var/lib/kubernetes/secrets/etcd.pem: no such file or directory
Steps to reproduce
Use services.kubernetes.roles = ["master"] in /etc/nixos/configuration.nix
Technical details
"x86_64-linux"
Linux 4.19.34, NixOS, 19.03.172138.5c52b25283a (Koi)
yes
yes
nix-env (Nix) 2.2
"nixos-19.03.172138.5c52b25283a"
/nix/var/nix/profiles/per-user/root/channels/nixos