coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/
263 stars 60 forks source link

`coreos.ignition.ssh.key` failing in `kola-azure` #1553

Closed marmijo closed 1 week ago

marmijo commented 1 year ago

coreos.ignition.ssh.key has been failing in the kola-azure jenkins job since August 15, 2023 with the following error:

[2023-08-17T15:08:35.184Z] --- FAIL: coreos.ignition.ssh.key (331.60s)
[2023-08-17T15:08:35.184Z]         harness.go:1789: mach.Start() failed: machine "kola-975f064f-cccec36d80" failed basic checks: detected failed or stuck systemd units

Looking into the journal log file for the test shows: journal.txt console.txt

Aug 17 19:46:17.528507 afterburn[1523]: Aug 17 19:46:17.528 INFO Fetching http://168.63.129.16/?comp=versions: Attempt #1
Aug 17 19:46:17.552714 afterburn[1523]: Aug 17 19:46:17.552 INFO Fetch successful
Aug 17 19:46:17.569624 afterburn[1523]: Aug 17 19:46:17.569 INFO Fetching http://168.63.129.16/machine/?comp=goalstate: Attempt #1
Aug 17 19:46:17.574856 afterburn[1523]: Aug 17 19:46:17.574 INFO Fetch successful
Aug 17 19:46:17.883244 afterburn[1523]: Aug 17 19:46:17.883 INFO Fetching http://168.63.129.16/machine/5c91c37d-f891-4428-93c9-81e602a91a74/a1fc44fd%2Dc611%2D4ea8%2Dbc6b%2D678f2ba3b71e.%5Fkola%2Dc954503e%2D36b448701c?comp=certificates&incarnation=1: Attempt #1
Aug 17 19:46:17.898618 afterburn[1523]: Aug 17 19:46:17.898 INFO Fetch successful
Aug 17 19:46:17.927929 afterburn[1523]: Error: failed to run
Aug 17 19:46:17.927929 afterburn[1523]: Caused by:
Aug 17 19:46:17.927929 afterburn[1523]:     0: writing ssh keys
Aug 17 19:46:17.927929 afterburn[1523]:     1: failed to convert pkcs12 blob to ssh pubkey
Aug 17 19:46:17.927929 afterburn[1523]:     2: failed to get chain from pkcs12
Aug 17 19:46:17.929015 systemd[1]: afterburn-sshkeys@core.service: Main process exited, code=exited, status=1/FAILURE
Aug 17 19:46:17.929193 systemd[1]: afterburn-sshkeys@core.service: Failed with result 'exit-code'.
Aug 17 19:46:17.934176 systemd[1]: Failed to start afterburn-sshkeys@core.service - Afterburn (SSH Keys).
Aug 17 19:46:17.942790 systemd[1]: Dependency failed for afterburn-sshkeys.target - Synchronize afterburn-sshkeys@.service template instances.
Aug 17 19:46:17.947667 systemd[1]: Dependency failed for multi-user.target - Multi-User System.
Aug 17 19:46:17.951000 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=afterburn-sshkeys@core comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Aug 17 19:46:17.952533 systemd[1]: multi-user.target: Job multi-user.target/start failed with result 'dependency'.
Aug 17 19:46:17.952960 kernel: audit: type=1130 audit(1692301577.951:80): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=afterburn-sshkeys@core comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Aug 17 19:46:17.952564 systemd[1]: afterburn-sshkeys.target: Job afterburn-sshkeys.target/start failed with result 'dependency'.

Afterburn fails to convert pkcs12 blob to ssh pubkey Before this failure, the fetching section looked like this:

Aug 14 00:49:48.769452 afterburn[1511]: Aug 14 00:49:48.769 INFO Fetching http://168.63.129.16/?comp=versions: Attempt #1    
Aug 14 00:49:48.790498 afterburn[1511]: Aug 14 00:49:48.790 INFO Fetch successful                                            
Aug 14 00:49:48.809532 afterburn[1511]: Aug 14 00:49:48.809 INFO Fetching http://168.63.129.16/machine/?comp=goalstate: Attem
Aug 14 00:49:48.819160 afterburn[1511]: Aug 14 00:49:48.818 INFO Fetch successful                                            
Aug 14 00:49:48.819462 afterburn[1511]: Aug 14 00:49:48.819 WARN SSH pubkeys requested, but not provisioned for this instance

Which is what other test's logs look like when an SSH key IS set. So it's looking like an SSH key is being set somehow, even though one isnt specified. This is causing the .ssh/authorized_keys.d/afterburn file to be created and the kola test fails.

The certificates endpoint is now included in http://168.63.129.16/machine/?comp=goalstate so now it's being fetched and afterburn is failing directly after fetching the certificate.

jlebon commented 9 months ago

@marmijo As part of this, since it's touching the same part of the codebase, could you look at question 1 from https://github.com/coreos/coreos-assembler/issues/1835? Possibly it can be closed out if it's no longer an issue, or if there's a valid rationale for the Afterburn message.

marmijo commented 5 months ago

For some reason, azure started populating the certs endpoint even when an SSH key is not provisioned. The afterburn logic currently expects the certs endpoint to either exist with an SSH public key in the certificate chain, or not exist at all. It looks like we are hitting a third case where the certs endpoint exists, but the certificate chain inside the referenced PKCS#12 file is empty. After speaking with @dustymabe, we decided to allow this case to succeed and still warn that an SSH public key was not provisioned.

https://github.com/coreos/afterburn/pull/1074 should resolve this issue.

marmijo commented 3 months ago

I'll reopen this issue until a new Afterburn release can be cut (which will include the change made in https://github.com/coreos/afterburn/pull/1074) so we can continue to use it to snooze coreos.ignition.ssh.key in the FCOS pipeline.

gursewak1997 commented 1 week ago

Test passes in the latest azure builds with afterburn-5.7.0-1