ansible-collections / kubernetes.core

The collection includes a variety of Ansible content to help automate the management of applications in Kubernetes and OpenShift clusters, as well as the provisioning and maintenance of clusters themselves.

Failed to load kubeconfig due to Invalid kube-config file #56

Open danielburrell opened 3 years ago

danielburrell commented 3 years ago
SUMMARY

When k8s_info module runs on a non-controller host, it looks for a kubeconfig file on the controller host, but returns "invalid kube-config file" error even though the kubeconfig file on the controller is valid.

To demonstrate this, I have two scenarios to compare; the first (CI) seems to work by coincidence, while the second (local development) reveals the bug.

CI Scenario:

Local Development Scenario:

The documentation does not state where the kubeconfig file must be located, nor does it state that the k8s_info task must be run on the control box. So until now I had no reason to think anything was wrong. When my k8s_info task ran on the target machine, I assumed it was using the target machine's kubeconfig (not the control box's kubeconfig).

When I try to run a k8s_info action in this scenario with hosts: server (the target), specifying the kubeconfig file as ~centos/.kube/config, it says the file cannot be found on the AnsibleControl machine (of course, on the control machine it is located at ~daniel/.kube/config).

This suggests that, regardless of hosts, the role expects to find the kubeconfig file on the control machine. Can you confirm this is the case?

Assuming this is true: if I tell the installer to use ~daniel/.kube/config (which exists on the control machine) with hosts: server, then it tells me that the config isn't valid!

The only scenario that works with hosts: server is if my target and control both have a kubeconfig file in the same location.
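One thing I suspect contributes to the confusion: ~user paths are expanded against the password database of whichever machine evaluates them, so the same string can name different files (or nothing at all) on the control and target machines. A minimal stdlib-only Python illustration (the user name below is made up):

```python
import os.path

# os.path.expanduser resolves "~" and "~user" against the *local*
# environment and password database. The same path string can
# therefore point at different files depending on which machine
# evaluates it -- controller vs. managed node.
print(os.path.expanduser("~/.kube/config"))  # resolves for the current local user

# For a user that does not exist on this machine, the path is
# returned unchanged, and any later file lookup will fail.
print(os.path.expanduser("~no_such_user_xyz/.kube/config"))
```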

This seems to be a bug as

Can you clarify the following:

failed: [10.50.52.94] (item={'name': 'coredns', 'quantity': 1}) => {"ansible_loop_var": "item", "attempts": 5, "changed": false, "item": {"name": "coredns", "quantity": 1}, "msg": "Failed to load kubeconfig due to Invalid kube-config file. No configuration found."}
ls ~/.kube/config
/home/daniel/.kube/config
cat ~/.kube/config

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: redactedbase64==
    server: https://10.50.52.94:6443
  name: default
contexts:
- context:
    cluster: default
    user: default
  name: default
current-context: default
kind: Config
preferences: {}
users:
- name: default
  user:
    password: redacted
    username: redacted

This is all very strange: this causes the playbook to fail, yet I can cat the file, it's perfectly valid, and it works with kubectl.
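For what it's worth, the "No configuration found" wording looks like the loader parsed a file that came back empty, rather than a file with bad syntax. A rough stdlib-only stand-in for that check (this is an assumption about how the kubernetes Python client behaves, not its actual code):

```python
import os
import tempfile

def load_kube_config_sketch(path):
    """Rough stand-in for the client-side check that produces
    'Invalid kube-config file. No configuration found.'
    Assumption: the real loader raises when the parsed YAML is empty."""
    with open(path) as f:
        content = f.read()
    if not content.strip():
        # An empty (or effectively empty) file parses to nothing.
        raise ValueError("Invalid kube-config file. No configuration found.")
    return content

# An empty file reproduces the error message; a populated one does not.
empty = tempfile.NamedTemporaryFile(mode="w", suffix=".kubeconfig", delete=False)
empty.close()
try:
    load_kube_config_sketch(empty.name)
except ValueError as e:
    print(e)  # Invalid kube-config file. No configuration found.
finally:
    os.unlink(empty.name)
```

So one possibility is that the file the module actually reads is not the file being catted here.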

Any ideas?

ISSUE TYPE
COMPONENT NAME

k8s_info

ANSIBLE VERSION
[WARNING]: Ansible is being run in a world writable directory (/home/daniel/projects/prom/project), ignoring it as an ansible.cfg source. For more
information see https://docs.ansible.com/ansible/devel/reference_appendices/config.html#cfg-in-world-writable-dir
ansible-playbook 2.10.6
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/home/daniel/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/daniel/.pex/installed_wheels/04c26471cb05787fcd8372d2f2bea63afb042678/ansible_base-2.10.6-py3-none-any.whl/ansible
  executable location = ansible-playbook
  python version = 3.6.8 (default, Nov 16 2020, 16:55:22) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]
CONFIGURATION
OS / ENVIRONMENT

centos 7.9

STEPS TO REPRODUCE

- name: Copy kubeconfig from the server to the control machine
  fetch:
    # in CI, both control_user_id and ansible_user are centos; in development, control_user_id is daniel and ansible_user is centos
    src: ~{{ ansible_user }}/.kube/config
    dest: ~{{ control_user_id }}/.kube/config
    flat: yes
  become: no

# SNIP
# run once on host 'server' (target)
- name: "Verify k3s deployment"
  community.kubernetes.k8s_info:
    kind: Deployment
    wait: yes
    name: "{{ item.name }}"
    namespace: kube-system
    wait_timeout: 360
    wait_sleep: 10
    # apparently this kubeconfig file is supposed to be on the control box; can this be added to the docs?
    kubeconfig: "~{{ control_user_id }}/.kube/config"   # in development control_user_id is daniel; in CI it's centos
  register: deployment_status
  run_once: true
  become: yes
  until: (deployment_status.resources[0].status.readyReplicas | default(0) == item.quantity)
  retries: 5
  loop:
    - { name: 'coredns', quantity: 1 }
EXPECTED RESULTS

I would have expected my kubeconfig file to be considered valid.

ACTUAL RESULTS
failed: [10.50.52.94] (item={'name': 'coredns', 'quantity': 1}) => {"ansible_loop_var": "item", "attempts": 5, "changed": false, "item": {"name": "coredns", "quantity": 1}, "msg": "Failed to load kubeconfig due to Invalid kube-config file. No configuration found."}
gravesm commented 3 years ago

@danielburrell I can confirm that the module looks for the kubeconfig on whichever host it is being run on. It will work fine on a managed node and look for the kubeconfig on that node. There should be no need to copy the kubeconfig from the managed node to the controller. I am unable to reproduce the behavior you describe using:

ansible 2.10.8
community.kubernetes 1.2.1
openshift 0.12.0
kubernetes 12.0.1

I tried running the playbook on a managed node with a different user and it found the kubeconfig whether it was in the default location (~/.kube/config) or if I moved it to a non-default location and specified the path to it using the kubeconfig parameter. If I put a bunch of garbage in the kubeconfig on the managed node, the kubeconfig will fail to load because of that, so it's clearly finding the correct kubeconfig.

I'm not sure what to suggest other than to double check that the kubeconfig exists at the path you are specifying.

danielburrell commented 3 years ago

ansible-2.10.7 (the latest version available in my organization)
ansible_base-2.10.6
openshift-0.11.2 (because of a bug in openshift 0.12.0)
kubernetes-11.0.0

Not sure what the version of community.kubernetes is; I guess it is determined by the ansible version, if it's bundled with ansible?

I'll try to come up with a test-case repo. I ran it again, and I think the following output demonstrates that the file does exist at the path:

An exception occurred during task execution. To see the full traceback, use -vvv. The error was: If you are using a module and expect the file to exist on the remote, see the remote_src option
failed: [10.50.52.94] (item={'name': 'traefik', 'quantity': 1}) => {"ansible_loop_var": "item", "attempts": 5, "changed": false, "item": {"name": "traefik", "quantity": 1}, "msg": "Could not find or access '/home/centos/.kube/config' on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}

NO MORE HOSTS LEFT ********************************************************************************************************************************************************

PLAY RECAP ****************************************************************************************************************************************************************
10.50.52.94                : ok=65   changed=19   unreachable=0    failed=1    skipped=4    rescued=0    ignored=0   
localhost                  : ok=11   changed=4    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

ssh -i ../cloudtls.pem centos@10.50.52.94
Last login: Wed Apr 14 08:14:11 2021 from <redacted>
[centos@10.50.52.94 ~]$ sudo su
[root@10.50.52.94 centos]# ls -lart /home/centos/.kube/config 
-rw-r--r--. 1 centos root 1054 Apr 14 08:13 /home/centos/.kube/config

So at the point of failure, the playbook terminates, and if you log back into the box, the file is right there.

Side note: in the case where the file is missing, the error message is:

An exception occurred during task execution. To see the full traceback, use -vvv. The error was: If you are using a module and expect the file to exist on the remote, see the remote_src option
failed: [10.50.52.94] (item={'name': 'metrics-server', 'quantity': 1}) => {"ansible_loop_var": "item", "attempts": 5, "changed": false, "item": {"name": "metrics-server", "quantity": 1}, "msg": "Could not find or access '/home/centos/.kube/config' on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}

Why does it say "on the Ansible Controller"? The Ansible controller is the machine you run Ansible from, correct? Anything else is a managed node or host. Am I wrong, or is the message generic/misleading?

gravesm commented 3 years ago

I'm confused because the error message about "the Ansible Controller" doesn't exist in this repo. Can you post the output of a failing playbook with full verbosity (-vvvv)?

danielburrell commented 3 years ago
<10.50.52.94> ESTABLISH SSH CONNECTION FOR USER: centos
<10.50.52.94> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o 'IdentityFile="/home/daniel/projects/prom/cloudtls.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="centos"' -o ConnectTimeout=10 -o ControlPath=/home/daniel/.ansible/cp/b5fb5e1ade 10.50.52.94 '/bin/sh -c '"'"'( umask 77 && mkdir -p "` echo /home/centos/.ansible/tmp `"&& mkdir "` echo /home/centos/.ansible/tmp/ansible-tmp-1618407874.488945-17766-127442268166527 `" && echo ansible-tmp-1618407874.488945-17766-127442268166527="` echo /home/centos/.ansible/tmp/ansible-tmp-1618407874.488945-17766-127442268166527 `" ) && sleep 0'"'"''
<10.50.52.94> (0, b'ansible-tmp-1618407874.488945-17766-127442268166527=/home/centos/.ansible/tmp/ansible-tmp-1618407874.488945-17766-127442268166527\n', b'OpenSSH_7.4p1, OpenSSL 1.0.2k-fips  26 Jan 2017\r\ndebug1: Reading configuration data /home/daniel/.ssh/config\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 58: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 4 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 16322\r\ndebug3: mux_client_request_session: session request sent\r\ndebug1: mux_client_request_session: master session id: 2\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 0\r\n')
<10.50.52.94> ESTABLISH SSH CONNECTION FOR USER: centos
<10.50.52.94> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o 'IdentityFile="/home/daniel/projects/prom/cloudtls.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="centos"' -o ConnectTimeout=10 -o ControlPath=/home/daniel/.ansible/cp/b5fb5e1ade 10.50.52.94 '/bin/sh -c '"'"'rm -f -r /home/centos/.ansible/tmp/ansible-tmp-1618407874.488945-17766-127442268166527/ > /dev/null 2>&1 && sleep 0'"'"''
<10.50.52.94> (0, b'', b'OpenSSH_7.4p1, OpenSSL 1.0.2k-fips  26 Jan 2017\r\ndebug1: Reading configuration data /home/daniel/.ssh/config\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 58: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 4 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 16322\r\ndebug3: mux_client_request_session: session request sent\r\ndebug1: mux_client_request_session: master session id: 2\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 0\r\n')
The full traceback is:
Traceback (most recent call last):
  File "/home/daniel/.pex/installed_wheels/c708016249a31ecd1c4fc3c5b03d3dd85e595252/ansible-2.10.7-py3-none-any.whl/ansible_collections/community/kubernetes/plugins/action/k8s_info.py", line 51, in run
    kubeconfig = self._find_needle('files', kubeconfig)
  File "/home/daniel/.pex/installed_wheels/04c26471cb05787fcd8372d2f2bea63afb042678/ansible_base-2.10.6-py3-none-any.whl/ansible/plugins/action/__init__.py", line 1232, in _find_needle
    return self._loader.path_dwim_relative_stack(path_stack, dirname, needle)
  File "/home/daniel/.pex/installed_wheels/04c26471cb05787fcd8372d2f2bea63afb042678/ansible_base-2.10.6-py3-none-any.whl/ansible/parsing/dataloader.py", line 327, in path_dwim_relative_stack
    raise AnsibleFileNotFound(file_name=source, paths=[to_native(p) for p in search])
ansible.errors.AnsibleFileNotFound: Could not find or access '/home/centos/.kube/config' on the Ansible Controller.
If you are using a module and expect the file to exist on the remote, see the remote_src option
failed: [10.50.52.94] (item={'name': 'traefik', 'quantity': 1}) => {
    "ansible_loop_var": "item",
    "attempts": 5,
    "changed": false,
    "item": {
        "name": "traefik",
        "quantity": 1
    },
    "msg": "Could not find or access '/home/centos/.kube/config' on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"
}
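The traceback shows the action plugin resolving the path with _find_needle, i.e. a file search carried out on the machine running the plugin. A minimal stdlib sketch of that kind of controller-side lookup (hypothetical names; the real _find_needle searches a stack of role/play directories):

```python
import os
import tempfile

class FileNotFoundOnController(Exception):
    pass

def find_needle_sketch(search_dirs, needle):
    """Minimal sketch of a controller-side file search in the spirit of
    Ansible's _find_needle/path_dwim_relative_stack: each candidate
    directory is checked on the machine running the plugin, never on
    the managed node."""
    for base in search_dirs:
        candidate = os.path.join(os.path.expanduser(base), needle)
        if os.path.exists(candidate):
            return candidate
    raise FileNotFoundOnController(
        "Could not find or access '%s' on the Ansible Controller." % needle)

# Found locally -> path returned; absent locally -> controller-side error,
# no matter what exists on the remote host.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "config"), "w").close()
    print(find_needle_sketch([d], "config"))
    try:
        find_needle_sketch(["/nonexistent_dir_xyz"], "config")
    except FileNotFoundOnController as e:
        print(e)
```

That would explain why the failure mentions the controller even though the file is present on 10.50.52.94.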
gravesm commented 3 years ago

OK, I'm pretty sure what's going on here is that you have an old version of community.kubernetes. I would suggest upgrading to 1.2.1 and seeing if that fixes your problem.

danielburrell commented 3 years ago

Is it possible to select a version for that library? I thought it was determined by, and bundled with, the ansible binary. Is there a way to check which version I'm using?

gravesm commented 3 years ago

Since you're using ansible 2.10 you should be able to do:

$ ansible-galaxy collection list | grep kubernetes

You can install the latest version by doing:

$ ansible-galaxy collection install community.kubernetes

More info: https://docs.ansible.com/ansible/latest/user_guide/collections_using.html
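If you'd rather check the version programmatically, something like this can pull it out of the listing. The sample output below is illustrative only, not captured from a real run:

```python
import re

# Illustrative sample of `ansible-galaxy collection list` output
# (format assumed: a "Collection"/"Version" table; real output may
# also include a "# /path/to/collections" header per search path).
sample = """\
Collection           Version
-------------------- -------
community.general    1.3.6
community.kubernetes 1.1.1
"""

def collection_version(listing, name):
    """Pick one collection's version out of the tabular listing,
    or return None if the collection is not installed."""
    m = re.search(r"^%s\s+(\S+)" % re.escape(name), listing, re.M)
    return m.group(1) if m else None

print(collection_version(sample, "community.kubernetes"))  # 1.1.1
```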

gravesm commented 3 years ago

@danielburrell have you been able to test this after upgrading? Could you report if it is working for you?

danielburrell commented 3 years ago

Sorry for the delay. It seems that when building our 'pex' we have been using the Python dependency known as 'ansible', which bundles some of the community collections, including the kubernetes and helm ones. I cannot see a more up-to-date version of this ansible package on PyPI, though I note that the community collection bundled with the version I'm using is not the latest.

So this means that if I want to upgrade the community collections, I'll have to wait for a new ansible release, or else migrate to ansible-core (previously known as ansible-base?) and find a way to get the community collections across the airgap.

I am going to try the latter, as it seems to me that ansible-core plus collections is maintained more actively than the ansible package, and if I can get it to work it will mean being able to pick up bug fixes in collections.

The supply of collections is a bit trickier, though, and I'm not sure how the installation of these collections interacts with custom virtual environments (some of our tasks specify the Python version from a particular venv).

I think using Ansible to deploy software behind an airgap (with no public internet connection) is a bit trickier, so sorry if it takes a while longer to confirm.