crc-org / crc

CRC is a tool to help you run containers. It manages a local OpenShift 4.x cluster, MicroShift, or a Podman VM optimized for testing and development purposes.
https://crc.dev
Apache License 2.0

[BUG] Failed to update pull secret on the disk on Centos 9 Stream #3446

Open danpawlik opened 1 year ago

danpawlik commented 1 year ago

General information

CRC version

CRC version: 2.11.0+823e40d
OpenShift version: 4.11.13
Podman version: 4.2.0

CRC config

- consent-telemetry                     : no
- kubeadmin-password                   : 123456789
- pull-secret-file                      : pull-secret.txt

Host Operating System

NAME="CentOS Stream"
VERSION="9"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="9"
PLATFORM_ID="platform:el9"
PRETTY_NAME="CentOS Stream 9"
ANSI_COLOR="0;31"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:centos:centos:9"
HOME_URL="https://centos.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux 9"
REDHAT_SUPPORT_PRODUCT_VERSION="CentOS Stream"

Steps to reproduce

  1. Update CentOS Stream 9
  2. Set up CRC
  3. Start CRC

Expected

CRC will start normally.

Actual

DEBU Making call to close connection to plugin binary
Failed to update pull secret on the disk: Temporary error: pull secret not updated to disk (x159)

Logs

https://gist.github.com/danpawlik/608fd45ce9e8642ce43baace625575d4

Before gathering the logs, try the following to see if it fixes your issue

$ crc delete -f
$ crc cleanup
$ crc setup
$ crc start --log-level debug

This does not help either.

praveenkumar commented 1 year ago

@danpawlik are you able to reproduce it consistently?

danpawlik commented 1 year ago

Unfortunately yes.

praveenkumar commented 1 year ago

I can also see "Running CRC on: VM" in the logs, which means you are using a nested virtualization setup, which we do not test. Can you follow https://github.com/crc-org/crc/wiki/Debugging-guide to ssh into the VM and check whether the /var/lib/kubelet/config.json file exists and contains your pull secret? (This is the check that is failing for you.)
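
As a rough illustration of the file check described above, the sketch below tests whether JSON read from standard input looks like a populated pull secret. It is not CRC's own check, only an approximation: it assumes a pull secret always carries a top-level "auths" key, and the helper name pull_secret_present is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of CRC.
# Reads JSON from stdin and succeeds (exit 0) only if it contains an
# "auths" key, which a populated pull secret is assumed to have.
# An empty document such as "{}" fails the check.
pull_secret_present() {
    grep -q '"auths"'
}
```

Inside the VM it could be used as: sudo cat /var/lib/kubelet/config.json | pull_secret_present && echo "pull secret present"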

danpawlik commented 1 year ago

Thanks, @praveenkumar.

So on the VM, the file contains:

{}

so it was not copied.

I see a lot of tracebacks in dmesg:

[  880.704548] RIP: 0033:0x7f8b19a3ec6b
[  880.704736] Code: 73 01 c3 48 8b 0d b5 b1 1b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 85 b1 1b 00 f7 d8 64 89 01 48
[  880.705596] RSP: 002b:00007f88b7ffe4a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  880.705966] RAX: ffffffffffffffda RBX: 00007f88c4ff8e50 RCX: 00007f8b19a3ec6b
[  880.706317] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000001c
[  880.706684] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000000000ff
[  880.707052] R10: 00007f88b0051580 R11: 0000000000000246 R12: 000055b9750bf620
[  880.707412] R13: 00007f88c4ff8ff0 R14: 9ebdce38c25acb00 R15: 00007f88c4ff8e48
[  880.707754]  </TASK>
[  880.707887] Call Trace:
[  880.708010]  <TASK>
[  880.708116]  x86_pmu_stop+0x50/0xb0
[  880.708289]  x86_pmu_del+0x73/0x190
[  880.708463]  event_sched_out.part.0+0x7a/0x1f0
[  880.708679]  group_sched_out.part.0+0x93/0xf0
[  880.708898]  ctx_sched_out+0x124/0x2a0
[  880.709083]  perf_event_context_sched_out+0x1a5/0x460
[  880.709329]  __perf_event_task_sched_out+0x50/0x170
[  880.709572]  ? pick_next_task+0x51/0x940
[  880.709766]  prepare_task_switch+0xbd/0x2a0
[  880.709997]  __schedule+0x1cb/0x620
[  880.710172]  schedule+0x5a/0xc0
[  880.710331]  xfer_to_guest_mode_handle_work+0xac/0xe0
[  880.710578]  vcpu_run+0x1f5/0x250 [kvm]
[  880.710801]  kvm_arch_vcpu_ioctl_run+0x104/0x620 [kvm]
[  880.711079]  kvm_vcpu_ioctl+0x271/0x670 [kvm]
[ 1898.648039] RIP: 0033:0x7f8b19a3ec6b
[ 1898.648213] Code: 73 01 c3 48 8b 0d b5 b1 1b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 85 b1 1b 00 f7 d8 64 89 01 48
[ 1898.649089] RSP: 002b:00007f88c57f84a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1898.649448] RAX: ffffffffffffffda RBX: 00007f88c5ffae50 RCX: 00007f8b19a3ec6b
[ 1898.649790] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000001b
[ 1898.650141] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000000000ff
[ 1898.650481] R10: 00007f88b0051580 R11: 0000000000000246 R12: 000055b9750af730
[ 1898.650815] R13: 00007f88c5ffaff0 R14: 9ebdce38c25acb00 R15: 00007f88c5ffae48
[ 1898.651154]  </TASK>
[ 1898.651824] Call Trace:
[ 1898.652155]  <TASK>
[ 1898.652401]  amd_pmu_enable_all+0x44/0x60
[ 1898.652851]  __perf_install_in_context+0x16c/0x220
[ 1898.653372]  remote_function+0x47/0x50
[ 1898.653781]  generic_exec_single+0x78/0xb0
[ 1898.654254]  smp_call_function_single+0xeb/0x130
[ 1898.654569]  ? sw_perf_event_destroy+0x60/0x60
[ 1898.654871]  perf_install_in_context+0xcf/0x200
[ 1898.655173]  ? ctx_resched+0xe0/0xe0
[ 1898.655416]  perf_event_create_kernel_counter+0x114/0x180
[ 1898.655776]  pmc_reprogram_counter.constprop.0+0xec/0x220 [kvm]
[ 1898.656230]  amd_pmu_set_msr+0x106/0x170 [kvm_amd]
[ 1898.656562]  ? __svm_vcpu_run+0x67/0x110 [kvm_amd]
[ 1898.656898]  ? get_gp_pmc_amd+0x129/0x200 [kvm_amd]
[ 1898.657235]  __kvm_set_msr+0x7f/0x1c0 [kvm]
[ 1898.657567]  kvm_emulate_wrmsr+0x52/0x1b0 [kvm]
[ 1898.657923]  vcpu_enter_guest+0x667/0x1010 [kvm]
[ 1898.658277]  ? kvm_get_rflags+0xe/0x30 [kvm]
[ 1898.658606]  ? svm_get_if_flag+0x1d/0x50 [kvm_amd]
[ 1898.658931]  ? kvm_apic_has_interrupt+0x32/0x90 [kvm]
[ 1898.659311]  ? kvm_cpu_has_interrupt+0x60/0x80 [kvm]
[ 1898.659681]  vcpu_run+0x33/0x250 [kvm]
[ 1898.659977]  kvm_arch_vcpu_ioctl_run+0x104/0x620 [kvm]
[ 1898.660365]  kvm_vcpu_ioctl+0x271/0x670 [kvm]
[ 1898.660702]  ? __seccomp_filter+0x45/0x470
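
A log like the one above can be summarized before attaching it to a report. The sketch below counts the "Call Trace:" headers and lists the kernel modules named in stack frames (the bracketed suffix such as [kvm_amd]). Both helper names are hypothetical, and the pattern assumes frames end with the module name in square brackets, as they do in this log.

```shell
#!/bin/sh
# Hypothetical helpers for summarizing dmesg text read from stdin.

# Print the number of "Call Trace:" headers in the input.
count_call_traces() {
    grep -c 'Call Trace:'
}

# Print the distinct module tags (e.g. [kvm], [kvm_amd]) that appear
# at the end of stack-frame lines, sorted in C locale order.
trace_modules() {
    grep -o '\[[a-z_0-9]*]$' | LC_ALL=C sort -u
}
```

For example: dmesg | count_call_traces, or dmesg | trace_modules.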

The odd thing here is that it works normally on CentOS Stream 8 (same hypervisor with an AMD CPU; only the instance was rebuilt).

I ran one more test on a different cloud provider with the same image, and it works normally there (but that host has an Intel CPU).

I will dig further into what is breaking crc start there. It may be helpful to others who hit the same issue.
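
When comparing hosts like the AMD and Intel ones mentioned above, it can help to record the CPU vendor and the KVM nested setting on each. The following is a minimal sketch; the helper names and their optional path arguments are hypothetical and exist only so the functions can be pointed at test files instead of /proc and /sys.

```shell
#!/bin/sh
# Hypothetical helpers for comparing hosts.

# Print the CPU vendor (e.g. AuthenticAMD, GenuineIntel) from a
# cpuinfo-style file; defaults to /proc/cpuinfo.
cpu_vendor() {
    awk -F': ' '/^vendor_id/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# Print the value of the KVM "nested" module parameter for the given
# vendor, or "unknown" if the vendor is not recognized or the module
# parameter file cannot be read. $2 overrides the /sys/module root.
nested_status() {
    case "$1" in
        AuthenticAMD) module=kvm_amd ;;
        GenuineIntel) module=kvm_intel ;;
        *) echo unknown; return 0 ;;
    esac
    cat "${2:-/sys/module}/$module/parameters/nested" 2>/dev/null || echo unknown
}
```

On a host, running nested_status "$(cpu_vendor)" prints the current setting for whichever KVM module matches the CPU.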

danpawlik commented 1 year ago

Workaround, ansible-playbook:

---
# This playbook deploys CRC and prepares the VM for a snapshot that can
# later be deployed in CI.
- hosts: crc.dev
  become: true
  tasks:
    - name: Install packages
      yum:
        name:
          - qemu-kvm-common
        state: present

    - name: Ensure CentOS runs with selinux permissive
      selinux:
        policy: targeted
        state: permissive

    - name: Enable nested virtualization
      lineinfile:
        path: /etc/modprobe.d/kvm.conf
        regexp: '^#options kvm_amd nested=1'
        line: 'options kvm_amd nested=1'

    # From https://lore.kernel.org/lkml/20220830235537.4004585-8-seanjc@google.com/T/
    - name: Disable ept
      shell: |
        sed -i 's/net.ifnames=0/net.ifnames=0 ept=0/g' /etc/default/grub

    - name: Regenerate grub
      shell: |
        grub2-mkconfig -o /boot/grub2/grub.cfg

# REBOOT HOST.

After applying the playbook, it seems to get further. IMO the issue is not on the CRC side but in KVM/libvirt.
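
After the reboot, one way to confirm that the parameter added to /etc/default/grub actually reached the running kernel is to look for it in /proc/cmdline. This is a sketch only; the helper name cmdline_has and its optional file argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the exact parameter $1
# appears as a whole word on the kernel command line.
# $2 overrides the file to read; defaults to /proc/cmdline.
cmdline_has() {
    param=$1
    file=${2:-/proc/cmdline}
    # Put each parameter on its own line, then match whole lines literally.
    tr ' ' '\n' < "$file" | grep -qxF -- "$param"
}
```

For example: cmdline_has ept=0 && echo "ept=0 is active"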

danpawlik commented 1 year ago

It still does not deploy CRC. I filed a bug against KVM: https://bugzilla.redhat.com/show_bug.cgi?id=2151878

cfergeau commented 1 year ago

For what it's worth, there have been multiple similar reports in the past: https://github.com/crc-org/crc/issues/3366#issuecomment-1264304842 and https://github.com/crc-org/crc/issues/1830

(searching closed issues for "AMD Intel" might give more results)

gbraad commented 1 year ago

Marking as unsupported. This is not something we can resolve, as it is related to nested virtualization and the 'incompatibility' (read: existing bugs, https://marc.info/?l=kvm&m=166886061623174&w=2) with some AMD Ryzen/Epyc CPUs.

danpawlik commented 1 year ago

Workaround on CentOS Stream 9: install the kernel from elrepo.org.

Steps:

  1. rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
  2. dnf install -y https://www.elrepo.org/elrepo-release-9.el9.elrepo.noarch.rpm
  3. dnf --enablerepo=elrepo-kernel install -y kernel-ml
  4. reboot
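
After the reboot in the steps above, it is worth confirming that the host actually booted the newly installed kernel and not the stock one. The sketch below assumes that ELRepo kernel release strings contain the substring "elrepo" (for example, something like 6.1.0-1.el9.elrepo.x86_64); that naming and the helper name is_elrepo_kernel are assumptions, not something stated in this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the given kernel release
# string looks like an ELRepo build, i.e. contains "elrepo".
is_elrepo_kernel() {
    case "$1" in
        *elrepo*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example: is_elrepo_kernel "$(uname -r)" || echo "still running the stock kernel"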