clusterinthecloud / ansible

Ansible config for Cluster in the Cloud
https://cluster-in-the-cloud.readthedocs.io
MIT License

installation of telegraf fails when building node image for `aarch64` #131

Closed by boegel 1 year ago

boegel commented 1 year ago
$ sudo /usr/local/bin/run-packer aarch64
...
    amazon-ebs.aws:  ______________________________________________
    amazon-ebs.aws: < TASK [monitoring : install telegraf package] >
    amazon-ebs.aws:  ----------------------------------------------
    amazon-ebs.aws:         \   ^__^
    amazon-ebs.aws:          \  (oo)\_______
    amazon-ebs.aws:             (__)\       )\/\
    amazon-ebs.aws:                 ||----w |
    amazon-ebs.aws:                 ||     ||
    amazon-ebs.aws:
    amazon-ebs.aws: fatal: [default]: FAILED! => {"changed": false, "failures": [], "msg": "Depsolve Error occurred: \n Problem: cannot install the best candidate for the job\n  - package telegraf-1.14.5-1.arm64 does not have a compatible architecture", "rc": 1, "results": []}

The problem seems to be that the arm64 repo is still being used, while the aarch64 repo is now populated as expected, see https://repos.influxdata.com/centos/8/aarch64/stable.

This should be fixed in https://github.com/clusterinthecloud/ansible/blob/9edc1db0ed6d0f500178b6388aae14d9643c561e/roles/monitoring/tasks/main.yml#L14-L21
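To make the difference between the two repo paths concrete, here is a minimal shell sketch that builds the repo URL from an architecture name. The URL layout is taken from the link above; the function name `influx_repo_url` and the `arm64` to `aarch64` mapping are illustrative assumptions, not necessarily how the role does it.

```shell
#!/bin/sh
# Build the InfluxData CentOS 8 repo URL for a machine architecture.
# Hypothetical helper: the name and the arm64 -> aarch64 mapping are
# assumptions for illustration only.
influx_repo_url() {
    # Default to the architecture reported by the running kernel.
    arch=${1:-$(uname -m)}
    case $arch in
        arm64) arch=aarch64 ;;   # name used by RPM/dnf for 64-bit ARM
    esac
    printf 'https://repos.influxdata.com/centos/8/%s/stable\n' "$arch"
}
```

Calling `influx_repo_url` without arguments on the build instance shows which path would be used for that machine.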

I'm trying to figure out where I should make that change for an active cluster... Are those Ansible tasks copied on the head node somewhere, where I can make the necessary change to be able to build the node image?

milliams commented 1 year ago

The cluster keeps a local copy of the Ansible playbook in /root/citc-ansible. If you make any local changes to this copy, you can then run the script /root/run_ansible, which runs the local playbook including your changes. This should also rebuild the node image in the process.

boegel commented 1 year ago

Thanks for the tip about live patching /root/citc-ansible/roles/monitoring/tasks/main.yml, that's very useful. Running /root/run_ansible didn't trigger a rebuild of the node image, but I can run /usr/local/bin/run-packer aarch64 after editing that file, and then the changes seem to be picked up when building the node image.

I made the change as proposed in #132, but that didn't fully fix the problem. I'm no longer seeing the "package telegraf-1.14.5-1.arm64 does not have a compatible architecture" error, but I am seeing this:

    amazon-ebs.aws:  ______________________________________________
    amazon-ebs.aws: < TASK [monitoring : install telegraf package] >
    amazon-ebs.aws:  ----------------------------------------------
    amazon-ebs.aws:         \   ^__^
    amazon-ebs.aws:          \  (oo)\_______
    amazon-ebs.aws:             (__)\       )\/\
    amazon-ebs.aws:                 ||----w |
    amazon-ebs.aws:                 ||     ||
    amazon-ebs.aws:
    amazon-ebs.aws: fatal: [default]: FAILED! => {"changed": false, "msg": "Failed to download metadata for repo 'influxdb': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried", "rc": 1, "results": []}

That may point to a known problem with the influxdb repo, cf. https://github.com/influxdata/telegraf/issues/7899

I don't understand why that's happening, since https://repos.influxdata.com/centos/8/aarch64/stable/repodata/repomd.xml exists.
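One way to check this from the build instance itself, rather than from a browser, is to probe the metadata URL with curl. This is only a sketch: the helper names `repomd_url` and `probe_repomd` are made up here, and a successful HEAD request does not guarantee that dnf resolves the same URL.

```shell
#!/bin/sh
# Derive the repomd.xml location from a repo base URL.
repomd_url() {
    # Strip one trailing slash so the result has no double slash.
    printf '%s/repodata/repomd.xml\n' "${1%/}"
}

# Send a HEAD request, following redirects, and report the outcome.
# -f fails on HTTP errors, -sS hides progress but keeps error messages.
probe_repomd() {
    url=$(repomd_url "$1")
    if curl -fsSIL "$url" >/dev/null; then
        printf 'OK   %s\n' "$url"
    else
        printf 'FAIL %s\n' "$url"
        return 1
    fi
}
```

For example, `probe_repomd https://repos.influxdata.com/centos/8/aarch64/stable` tests the hardcoded path. Running it with whatever base URL ends up in the generated repo file would show whether the two differ.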

When I try with everything hardcoded (https://repos.influxdata.com/centos/8/aarch64/stable), I run into a different problem:

    amazon-ebs.aws:  ______________________________________________
    amazon-ebs.aws: < TASK [monitoring : install telegraf package] >
    amazon-ebs.aws:  ----------------------------------------------
    amazon-ebs.aws:         \   ^__^
    amazon-ebs.aws:          \  (oo)\_______
    amazon-ebs.aws:             (__)\       )\/\
    amazon-ebs.aws:                 ||----w |
    amazon-ebs.aws:                 ||     ||
    amazon-ebs.aws:
    amazon-ebs.aws: changed: [default]
    amazon-ebs.aws:  _________________________________________________
    amazon-ebs.aws: < TASK [monitoring : enable the telegraf service] >
    amazon-ebs.aws:  -------------------------------------------------
    amazon-ebs.aws:         \   ^__^
    amazon-ebs.aws:          \  (oo)\_______
    amazon-ebs.aws:             (__)\       )\/\
    amazon-ebs.aws:                 ||----w |
    amazon-ebs.aws:                 ||     ||
    amazon-ebs.aws:
    amazon-ebs.aws: fatal: [default]: FAILED! => {"changed": false, "msg": "Unable to start service telegraf: Job for telegraf.service failed because the control process exited with error code.\nSee \"systemctl status telegraf.service\" and \"journalctl -xe\" for details.\n"}

I'm not sure how to figure out what exactly goes wrong there. Would a more detailed log be available somewhere after the image build process has exited?

This does tell me that https://repos.influxdata.com/centos/8/$ansible_architecture/stable and https://repos.influxdata.com/centos/8/aarch64/stable aren't the same, so $ansible_architecture is not equal to aarch64 somehow?!
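A possible way to narrow this down is to look at the baseurl that actually lands in the generated repo file and flag any variable that dnf would not expand on its own. The sketch below is hedged accordingly: `check_baseurl` is a hypothetical helper, the list of known dnf variables is deliberately short (`$basearch`, `$releasever`), and the file name /etc/yum.repos.d/influxdb.repo is an assumption based on the repo id in the error message.

```shell
#!/bin/sh
# Print the baseurl lines of a .repo file and warn about any $variable
# other than the ones dnf substitutes itself. Not exhaustive: custom
# variables defined under /etc/dnf/vars are not taken into account.
check_baseurl() {
    repo_file=$1
    grep -E '^[[:space:]]*baseurl[[:space:]]*=' "$repo_file" |
    while IFS= read -r line; do
        printf '%s\n' "$line"
        # Remove the variables dnf knows, then look for a leftover '$'.
        rest=$(printf '%s' "$line" |
            sed -e 's/\$basearch//g' -e 's/\$releasever//g')
        case $rest in
            *'$'*) printf 'WARNING: variable not expanded by dnf in: %s\n' "$line" ;;
        esac
    done
}
```

Running `check_baseurl /etc/yum.repos.d/influxdb.repo` (path assumed) on the instance would show whether a literal `$ansible_architecture` made it into the file unexpanded.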

I actually don't need telegraf at all... Is there an easy way to disable this part? Would it be sufficient to just comment out a part of monitoring/tasks/main.yml, or can I easily exclude the whole monitoring part somehow? Do I just comment out the - monitoring line in /root/citc-ansible/compute.yml?

boegel commented 1 year ago

> I actually don't need telegraf at all... Is there an easy way to disable this part? Would it be sufficient to just comment out a part of monitoring/tasks/main.yml, or can I easily exclude the whole monitoring part somehow? Do I just comment out the - monitoring line in /root/citc-ansible/compute.yml?

To answer my own question: I can just kick out the monitoring role in /root/citc-ansible/compute.yml to circumvent the problem with installing/starting telegraf:

--- /root/citc-ansible/compute.yml.orig 2023-01-12 16:30:38.565893137 +0000
+++ /root/citc-ansible/compute.yml  2023-01-12 15:57:00.045641752 +0000
@@ -21,4 +21,4 @@
     - packages
     - mpi
     - slurm
-    - monitoring
+      #- monitoring

milliams commented 1 year ago

Your fix of commenting out the monitoring role is a reasonable one and will get you past this issue.
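For anyone wanting to script the workaround shown in the diff above, a minimal sketch follows. The helper name `disable_role` is made up for illustration, and it assumes the role appears on its own line as `- monitoring` and that the role name contains no regex metacharacters.

```shell
#!/bin/sh
# Comment out one role entry in a playbook, keeping a .orig backup.
# Hypothetical helper; assumes the role is listed as "- <role>" on a
# line of its own.
disable_role() {
    playbook=$1
    role=$2
    # Keep the first backup; do not overwrite it on a second run.
    [ -e "$playbook.orig" ] || cp -p "$playbook" "$playbook.orig"
    scratch=$(mktemp)
    # Preserve the original indentation and put '#' in front of the entry.
    sed "s/^\([[:space:]]*\)- ${role}[[:space:]]*\$/\1#- ${role}/" \
        "$playbook" > "$scratch"
    # Write back in place so ownership and permissions stay unchanged.
    cat "$scratch" > "$playbook"
    rm -f "$scratch"
}
```

Calling `disable_role /root/citc-ansible/compute.yml monitoring` as root and then running /usr/local/bin/run-packer aarch64 should reproduce the manual edit, with the original file kept as compute.yml.orig.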

The issue you hit after hardcoding the repo URL is fixed by #133.

The problem of not finding repomd.xml is still open and I will look into it when I get a chance.