EESSI / compatibility-layer

Compatibility layer of the EESSI project
https://eessi.github.io/docs/compatibility_layer
GNU General Public License v2.0

Intermediate aarch64 build notes #82

Closed terhorstd closed 1 year ago

terhorstd commented 3 years ago

These are notes taken by a novice while installing the aarch64 compatibility layer on a test machine.

Aarch64 compatibility layer compile

  ssh some-machine.somewhe.re
  [...]
  Failed to set locale, defaulting to C.UTF-8
  [...]

The machine is

  uname -a
  Linux ip-172-31-42-108.eu-west-1.compute.internal 4.18.0-240.15.1.el8_3.aarch64 #1 SMP Wed Feb 3 03:16:05 EST 2021 aarch64 aarch64 aarch64 GNU/Linux

  cat /etc/os-release
  NAME="Red Hat Enterprise Linux"
  VERSION="8.3 (Ootpa)"
  ID="rhel"
  ID_LIKE="fedora"
  VERSION_ID="8.3"
  PLATFORM_ID="platform:el8"
  PRETTY_NAME="Red Hat Enterprise Linux 8.3 (Ootpa)"
  ANSI_COLOR="0;31"
  CPE_NAME="cpe:/o:redhat:enterprise_linux:8.3:GA"
  HOME_URL="https://www.redhat.com/"
  BUG_REPORT_URL="https://bugzilla.redhat.com/"
  REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
  REDHAT_BUGZILLA_PRODUCT_VERSION=8.3
  REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
  REDHAT_SUPPORT_PRODUCT_VERSION="8.3"

Install attempt on machine itself

Seems to be very laborious...

Remote install

On my local machine, fetch the repository.

  git clone https://github.com/EESSI/compatibility-layer
  cd compatibility-layer/ansible/playbooks
  cat README.md
  vim hosts

Credentials as provided, key added to ssh-agent.

  [cvmfsstratum0servers]
  ec2-aaa-bbb-ccc-ddd.eu-west-1.compute.amazonaws.com ansible_ssh_user=ec2-user eessi_host_arch=aarch64 eessi_host_os=linux

Error /usr/bin/python not found

Error

  fatal: [ec2-aaa-bbb-ccc-ddd.eu-west-1.compute.amazonaws.com]: FAILED! => {"changed": false, "module_stderr": "Shared connection to ec2-aaa-bbb-ccc-ddd.eu-west-1.compute.amazonaws.com closed.\r\n", "module_stdout": "/bin/sh: /usr/bin/python: No such file or directory\r\n", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 127}

Resolved by

  sudo yum install python38

and adding ansible_python_interpreter=/usr/bin/python3 to hosts

I still had Ansible installed in a conda environment, so I reused that one.

  ansible-playbook -i hosts -b install.yml

'ansible_os_family' is undefined

Error

  TASK [compatibility_layer : Fail if host OS is not supported]
fatal: [ec2-aaa-bbb-ccc-ddd.eu-west-1.compute.amazonaws.com]: FAILED! => {"msg": "The conditional check 'not(ansible_os_family == \"RedHat\" and ansible_distribution_major_version is version(\"8\", \"==\"))' failed. The error was: error while evaluating conditional (not(ansible_os_family == \"RedHat\" and ansible_distribution_major_version is version(\"8\", \"==\"))): 'ansible_os_family' is undefined\n\nThe error appears to have been in '/home/.../eessi/compatibility-layer/ansible/playbooks/roles/compatibility_layer/tasks/install_prefix.yml': line 4, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Fail if host OS is not supported\n  ^ here\n"}

When debug-printing all of {{ ansible_facts }}, none of the distribution-related variables are defined, so they are not injected into the global ansible_ namespace either.

Resolved by manual verification in /etc/os-release and commenting out the OS check in the playbook.
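
The manual check against /etc/os-release can also be scripted. The sketch below is only an illustration: parse_os_release and is_rhel8 are hypothetical helper names, and the test on ID and VERSION_ID is a simplified stand-in for the playbook's ansible_os_family / ansible_distribution_major_version condition, not the playbook's own logic.

```python
import os
import shlex


def parse_os_release(text):
    """Parse KEY=value lines (values may be quoted) into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        parts = shlex.split(value)  # strips surrounding quotes
        info[key] = parts[0] if parts else ""
    return info


def is_rhel8(info):
    """Simplified check (assumption): RHEL itself, major version 8."""
    major = info.get("VERSION_ID", "").split(".")[0]
    return info.get("ID") == "rhel" and major == "8"


if __name__ == "__main__":
    path = "/etc/os-release"
    if os.path.exists(path):
        with open(path) as handle:
            print("RHEL 8:", is_rhel8(parse_os_release(handle.read())))
```

Unlike the playbook condition, this simplified check would reject other members of the RedHat family, so treat it as a quick sanity check only.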

Could not detect which major revision of yum is in use

Error

  TASK [compatibility_layer : Install EPEL]
fatal: [ec2-aaa-bbb-ccc-ddd.eu-west-1.compute.amazonaws.com]: FAILED! => {"changed": false, "msg": ["Could not detect which major revision of yum is in use, which is required to determine module backend.", "You can manually specify use_backend to tell the module whether to use the yum (yum3) or dnf (yum4) backend})"]}

Manually checked

  yum --version
  Failed to set locale, defaulting to C.UTF-8
  4.2.23
    Installed: dnf-0:4.2.23-4.el8.noarch at Tue Feb  9 15:46:00 2021

Version 4 corresponds to the dnf (yum4) backend named in the error message, so resolved by setting use_backend explicitly:

  --- a/ansible/playbooks/roles/compatibility_layer/tasks/install_prefix.yml
  +++ b/ansible/playbooks/roles/compatibility_layer/tasks/install_prefix.yml
   - name: "Install EPEL"
     yum:
       name:
         - https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
       disable_gpg_check: yes
       state: present
  +    use_backend: yum4
     tags:
       - build_prefix

python3-dnf package required

Error

  TASK [compatibility_layer : Install EPEL]
  fatal: [ec2-aaa-bbb-ccc-ddd.eu-west-1.compute.amazonaws.com]: FAILED! => {"changed": false, "msg": "Could not import the dnf python module. Please install `python3-dnf` package.", "results": []}

Note, however, that the package is already installed:

  sudo yum install python3-dnf
  Failed to set locale, defaulting to C.UTF-8
  Last metadata expiration check: 2:29:36 ago on Wed Feb 24 14:07:03 2021.
  Package python3-dnf-4.2.23-4.el8.noarch is already installed.
  Dependencies resolved.
  Nothing to do.
  Complete!

Ansible still cannot find the package, though. NOT SOLVED!

update ansible 2.7 -> 3

Maybe the old Ansible does not detect packages correctly?

python3-dnf package required

Error

  TASK [compatibility_layer : Install EPEL]
fatal: [ec2-aaa-bbb-ccc-ddd.eu-west-1.compute.amazonaws.com]: FAILED! => {"changed": false, "cmd": "dnf install -y python3-dnf", "msg": "Could not import the dnf python module using /usr/bin/python3 (3.8.3 (default, Aug 18 2020, 13:06:44) [GCC 8.3.1 20191121 (Red Hat 8.3.1-5)]). Please install `python3-dnf` package or ensure you have specified the correct ansible_python_interpreter.", "rc": 0, "results": [], "stderr": "", "stderr_lines": [], "stdout": "Last metadata expiration check: 2:37:06 ago on Wed Feb 24 14:07:03 2021.\nPackage python3-dnf-4.2.23-4.el8.noarch is already installed.\nDependencies resolved.\nNothing to do.\nComplete!\n", "stdout_lines": ["Last metadata expiration check: 2:37:06 ago on Wed Feb 24 14:07:03 2021.", "Package python3-dnf-4.2.23-4.el8.noarch is already installed.", "Dependencies resolved.", "Nothing to do.", "Complete!"]}

nope.

  sudo pip3 install dnf

try again

  TASK [compatibility_layer : Install EPEL] ****************************************************************
fatal: [ec2-aaa-bbb-ccc-ddd.eu-west-1.compute.amazonaws.com]: FAILED! => {"changed": false, "cmd": "dnf install -y python3-dnf", "msg": "Could not import the dnf python module using /usr/bin/python3 (3.8.3 (default, Aug 18 2020, 13:06:44) [GCC 8.3.1 20191121 (Red Hat 8.3.1-5)]). Please install `python3-dnf` package or ensure you have specified the correct ansible_python_interpreter.", "rc": 0, "results": [], "stderr": "", "stderr_lines": [], "stdout": "Last metadata expiration check: 2:39:52 ago on Wed Feb 24 14:07:03 2021.\nPackage python3-dnf-4.2.23-4.el8.noarch is already installed.\nDependencies resolved.\nNothing to do.\nComplete!\n", "stdout_lines": ["Last metadata expiration check: 2:39:52 ago on Wed Feb 24 14:07:03 2021.", "Package python3-dnf-4.2.23-4.el8.noarch is already installed.", "Dependencies resolved.", "Nothing to do.", "Complete!"]}

Manually test the import:

/usr/bin/python3
Python 3.8.3 (default, Aug 18 2020, 13:06:44)
[GCC 8.3.1 20191121 (Red Hat 8.3.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dnf
/usr/local/lib/python3.8/site-packages/dnf.py:15: UserWarning: The DNF Python API is not currently available via PyPI.

Please install it with your distro package manager (typically called
'python2-dnf' or 'python3-dnf'), and ensure that any virtual environments
needing the API are configured to be able to see the system site packages
directory.

  warnings.warn(warning_msg)

Really strange. Try overwriting already installed package...

sudo dnf reinstall python3-dnf
Failed to set locale, defaulting to C.UTF-8
Last metadata expiration check: 2:43:07 ago on Wed Feb 24 14:07:03 2021.
Dependencies resolved.
==========================================================================================================
 Package               Architecture     Version                   Repository                         Size
==========================================================================================================
Reinstalling:
 python3-dnf           noarch           4.2.23-4.el8              rhel-8-baseos-rhui-rpms           526 k

Transaction Summary
==========================================================================================================

Total download size: 526 k
Installed size: 1.8 M
Is this ok [y/N]: y
Downloading Packages:
python3-dnf-4.2.23-4.el8.noarch.rpm                                       4.7 MB/s | 526 kB     00:00
----------------------------------------------------------------------------------------------------------
Total                                                                     2.3 MB/s | 526 kB     00:00
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                                  1/1
  Reinstalling     : python3-dnf-4.2.23-4.el8.noarch                                                  1/2
  Cleanup          : python3-dnf-4.2.23-4.el8.noarch                                                  2/2
  Running scriptlet: python3-dnf-4.2.23-4.el8.noarch                                                  2/2
  Verifying        : python3-dnf-4.2.23-4.el8.noarch                                                  1/2
  Verifying        : python3-dnf-4.2.23-4.el8.noarch                                                  2/2

Reinstalled:
  python3-dnf-4.2.23-4.el8.noarch

Complete!

Still the same error:

TASK [compatibility_layer : Install EPEL]
fatal: [ec2-aaa-bbb-ccc-ddd.eu-west-1.compute.amazonaws.com]: FAILED! => {"changed": false, "cmd": "dnf install -y python3-dnf", "msg": "Could not import the dnf python module using /usr/bin/python3 (3.8.3 (default, Aug 18 2020, 13:06:44) [GCC 8.3.1 20191121 (Red Hat 8.3.1-5)]). Please install `python3-dnf` package or ensure you have specified the correct ansible_python_interpreter.", "rc": 0, "results": [], "stderr": "", "stderr_lines": [], "stdout": "Last metadata expiration check: 2:43:32 ago on Wed Feb 24 14:07:03 2021.\nPackage python3-dnf-4.2.23-4.el8.noarch is already installed.\nDependencies resolved.\nNothing to do.\nComplete!\n", "stdout_lines": ["Last metadata expiration check: 2:43:32 ago on Wed Feb 24 14:07:03 2021.", "Package python3-dnf-4.2.23-4.el8.noarch is already installed.", "Dependencies resolved.", "Nothing to do.", "Complete!"]}

Removed ansible_python_interpreter from the hosts file again, after which the run got past this task.
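
To see which file an "import dnf" actually resolves to under a given interpreter, a small diagnostic like the one below can help. It is a sketch, not part of the playbook: describe_module is a hypothetical helper name. In the session above the warning came from /usr/local/lib/python3.8/site-packages/dnf.py, a path that points to a pip-installed file rather than the distro package. The notes do not establish why dropping ansible_python_interpreter helped. One possibility (an assumption) is that the distro bindings are installed for a different Python than /usr/bin/python3.

```python
import importlib.util
import sys


def describe_module(name):
    """Return (interpreter path, file the module would be loaded from).

    The second element is None when the module cannot be found.
    """
    spec = importlib.util.find_spec(name)
    origin = spec.origin if spec is not None else None
    return sys.executable, origin


if __name__ == "__main__":
    interpreter, origin = describe_module("dnf")
    print("interpreter:", interpreter)
    print("dnf resolves to:", origin)
```

Run it with exactly the interpreter Ansible is configured to use on the target, and again with any other Python present on the machine, then compare the two results.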

RUN STARTS: Wed Feb 24 17:53:07 CET 2021

RUN BREAKS: Wed Feb 24 20:54:00 CET 2021 (very approximate time)

Error lmod masked by missing keyword

TASK [compatibility_layer : Install package set ['eessi-2021.02-linux-aarch64']]
failed: [ec2-aaa-bbb-ccc-ddd.eu-west-1.compute.amazonaws.com] (item=eessi-2021.02-linux-aarch64) => {"ansible_loop_var": "item", "changed": false, "cmd": ["/cvmfs/pilot.eessi-hpc.org/2021.02/compat/linux/aarch64/usr/bin/emerge", "--noreplace", "--ask=n", "@eessi-2021.02-linux-aarch64"], "item": "eessi-2021.02-linux-aarch64", "msg": "Packages not installed.", "rc": 1, "stderr": "\n!!! All ebuilds that could satisfy \"sys-cluster/lmod\" have been masked.\n!!! One of the following masked packages is required to complete your request:\n- sys-cluster/lmod-9999::gentoo (masked by: missing keyword)\n- sys-cluster/lmod-8.4.20::gentoo (masked by: missing keyword)\n\n(dependency required by \"@eessi-2021.02-linux-aarch64\" [argument])\nFor more information, see the MASKED PACKAGES section in the emerge\nman page or refer to the Gentoo Handbook.\n\n", "stderr_lines": ["", "!!! All ebuilds that could satisfy \"sys-cluster/lmod\" have been masked.", "!!! One of the following masked packages is required to complete your request:", "- sys-cluster/lmod-9999::gentoo (masked by: missing keyword)", "- sys-cluster/lmod-8.4.20::gentoo (masked by: missing keyword)", "", "(dependency required by \"@eessi-2021.02-linux-aarch64\" [argument])", "For more information, see the MASKED PACKAGES section in the emerge", "man page or refer to the Gentoo Handbook.", ""], "stdout": "Calculating dependencies  \n * IMPORTANT: 4 news items need reading for repository 'gentoo'.\n * Use eselect news read to view new items.\n\n... done!\n", "stdout_lines": ["Calculating dependencies  ", " * IMPORTANT: 4 news items need reading for repository 'gentoo'.", " * Use eselect news read to view new items.", "", "... done!"]}

That is:

!!! All ebuilds that could satisfy "sys-cluster/lmod" have been masked.
!!! One of the following masked packages is required to complete your request:
- sys-cluster/lmod-9999::gentoo (masked by: missing keyword)
- sys-cluster/lmod-8.4.20::gentoo (masked by: missing keyword)

(dependency required by "@eessi-2021.02-linux-aarch64" [argument])

The package should have an aarch64 keyword (arm64 or ~arm64 in Gentoo's naming), but it only lists ~amd64 and ~x86:

/cvmfs/pilot.eessi-hpc.org/2021.02/compat/linux/aarch64/usr/bin/equery meta lmod
 * sys-cluster/lmod [gentoo]
Maintainer:  gentoo@aisha.cc (Aisha Tammy)
Maintainer:  sci@gentoo.org (Gentoo Science Project)
Upstream:    Remote-ID:   TACC/Lmod ID: github
Homepage:    https://lmod.readthedocs.io/en/latest
Homepage:    https://github.com/TACC/Lmod
Location:    /cvmfs/pilot.eessi-hpc.org/2021.02/compat/linux/aarch64/var/db/repos/gentoo/sys-cluster/lmod
Keywords:    8.4.20:0: ~amd64 ~x86
Keywords:    9999:0:
License:     MIT

Suggestion by Bob:

  EPREFIX=/cvmfs/pilot.eessi-hpc.org/2021.02/compat/linux/aarch64
  vi $EPREFIX/etc/portage/package.accept_keywords
  +sys-cluster/lmod ~amd64
  +dev-lua/luaposix ~amd64
  +dev-lua/lua-bit32 amd64
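
The same edit can be made repeatable instead of going through vi. The following is a minimal sketch assuming package.accept_keywords is a plain file (Portage also allows a directory there). add_accept_keywords is a hypothetical helper, and the demo writes to a temporary directory rather than to the real $EPREFIX.

```python
import os
import tempfile
from pathlib import Path


def add_accept_keywords(path, entries):
    """Append entries not yet present in the file; return those added."""
    path = Path(path)
    text = path.read_text() if path.exists() else ""
    present = {line.strip() for line in text.splitlines()}
    missing = [entry for entry in entries if entry not in present]
    if missing:
        with path.open("a") as handle:
            # Avoid gluing the first new entry onto an unterminated last line.
            if text and not text.endswith("\n"):
                handle.write("\n")
            for entry in missing:
                handle.write(entry + "\n")
    return missing


if __name__ == "__main__":
    # Entries exactly as suggested in the notes above.
    entries = [
        "sys-cluster/lmod ~amd64",
        "dev-lua/luaposix ~amd64",
        "dev-lua/lua-bit32 amd64",
    ]
    demo = os.path.join(tempfile.mkdtemp(), "package.accept_keywords")
    print("added:", add_accept_keywords(demo, entries))
    print("added on second run:", add_accept_keywords(demo, entries))
```

Because entries already present are skipped, running it again after a failed build does not duplicate lines.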

RUN RESTART: Thu Feb 25 16:35:24 CET 2021

RUN FINISHED: Thu Feb 25 16:59:34 CET 2021 SUCCESS.

boegel commented 3 years ago

@terhorstd Thanks a lot for keeping detailed notes!

Takeaways:

@pescobar: Any feedback on this w.r.t. the Ansible parts?

bedroge commented 3 years ago

I often use Ansible 2.9.x or 2.10.x, Terje has been using 2.10.x. So I'm guessing that 2.7 is too old, and 3.0 (released a few weeks ago) too new. But it would be good if we can test this and specify which version(s) should be used / are known to work.
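
Until the working versions are pinned down, a guard along these lines could at least flag untested ones. It is a sketch only: the set of release series simply mirrors the 2.9.x and 2.10.x mentioned above, and KNOWN_GOOD_SERIES, release_series and is_known_good are hypothetical names.

```python
# Release series reported as working in this thread (assumption, not tested here).
KNOWN_GOOD_SERIES = {(2, 9), (2, 10)}


def release_series(version):
    """Turn '2.10.7' into (2, 10); a missing minor part counts as 0."""
    parts = (version.split(".") + ["0"])[:2]
    return int(parts[0]), int(parts[1])


def is_known_good(version):
    """True if the version belongs to a series listed as known to work."""
    return release_series(version) in KNOWN_GOOD_SERIES


if __name__ == "__main__":
    for candidate in ("2.7.0", "2.9.18", "2.10.7", "3.0.0"):
        print(candidate, "known good:", is_known_good(candidate))
```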

terhorstd commented 3 years ago
  • We need to make it clear which Ansible version to use (do we expect Ansible 3.0?), and which python command it should use. I'm not sure how to explain the python3-dnf trouble though...

The python3-dnf problem seems to be caused by a kind of dummy package. I've seen similar "broken" installs in containers before. The package is marked as installed (dnf install python3-dnf reports that everything is fine), but it is not usable because some parts have been stripped. The test in plain Python shows that Python does find the files, but all they do is complain:

>>> import dnf
/usr/local/lib/python3.8/site-packages/dnf.py:15: UserWarning: The DNF Python API is not currently available via PyPI.

Reinstalling the package then overwrites the broken instance (and usually inflates the image quite a bit).

  • Some problems seem to be due to locale issues, those can probably be fixed in the playbook?

This could also be caused by SSH itself. Depending on the local configuration, ssh transfers some of the LC_* environment variables to the target machine. Different operating systems ship different defaults in /etc/ssh/ssh_config:

    SendEnv LANG LC_*

If the target container is heavily stripped, only LC_ALL=C may be available... The playbook could of course build e.g. European locales. Anyway, I think this only affects the build process; once the system is complete, I expect no big influence on the cvmfs mount. Does anyone know whether the existing locales affect the build artifacts?
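
To check which local variables a "SendEnv LANG LC_*" line would offer to the remote side, the patterns can be applied to the current environment. This is a rough sketch: forwarded_variables is a hypothetical helper, and fnmatch is used as an approximation of ssh's wildcard matching. Whether the server accepts the variables depends on its own AcceptEnv setting, which this does not check.

```python
import fnmatch
import os

# Patterns taken from the ssh_config line quoted above.
SEND_ENV_PATTERNS = ["LANG", "LC_*"]


def forwarded_variables(environ, patterns=SEND_ENV_PATTERNS):
    """Return the variables whose names match any SendEnv pattern."""
    return {
        name: value
        for name, value in environ.items()
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)
    }


if __name__ == "__main__":
    for name, value in sorted(forwarded_variables(dict(os.environ)).items()):
        print(f"{name}={value}")
```

If this prints a locale that does not exist on the target, that would be consistent with the "Failed to set locale" message seen throughout the logs.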

pescobar commented 3 years ago

I usually store a requirements.txt file with my playbooks generated with pip freeze (as you can do with any python project) to define the specific ansible version and deps which were used when developing the playbook.
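
As a small illustration of how such a file documents the version in use, the sketch below reads a pin back out of pip freeze style text. pinned_version is a hypothetical helper and the sample content is made up for the example, not taken from any real requirements.txt in this repository.

```python
def pinned_version(requirements_text, package):
    """Return the version pinned with '==' for a package, or None."""
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        if name.strip().lower() == package.lower():
            return version.strip()
    return None


if __name__ == "__main__":
    # Made-up sample in the format written by "pip freeze".
    sample = "ansible==2.10.7\nJinja2==2.11.3\n"
    print("ansible pinned to:", pinned_version(sample, "ansible"))
```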

As for which Ansible version to choose, I am personally using the latest release in the 2.10.x branch until I find time to test my playbooks with 3.x. Ansible 3.x was released quite recently and introduced some breaking changes. It is up to you whether to start testing Ansible 3.x now or stay on 2.10.x for a while. All the details are here: https://www.ansible.com/blog/announcing-the-community-ansible-3.0.0-package and https://www.ansible.com/blog/ansible-3.0.0-qa

For the python3-dnf problem I would try defining ansible_python_interpreter. I don't manage any system that uses dnf, but I have seen similar issues with the python-yum module; see https://stackoverflow.com/questions/47069450/ansible-yum-not-working. You can define the variable ansible_python_interpreter at the task level only, if needed.

my two cents. Hope it helps :)

bedroge commented 1 year ago

We now use a build container that includes Ansible, and the installation can be fired off using a bash script. With this approach we can completely control the build environment, and this should address all of the issues mentioned here. So, I'll go ahead and close this issue.