coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

Kernel Errors w/latest next/testing likely CIFS related. #1381

Closed fifofonix closed 1 year ago

fifofonix commented 1 year ago

Describe the bug

Servers upgraded to the latest next or testing release operate successfully with CIFS mounts for some time, but typically hang within 24 hours, with the kernel reporting an error trace.

Reproduction steps

  1. Upgrade an existing server w/CIFS mounts
  2. Wait

Expected behavior

Server continues to operate

Actual behavior

Server hangs. Three separate journals were captured from two separate environments. In our environment we typically see cifs.upcall logs in our journals every 15 minutes, which we assume is related to keeping Kerberos tickets fresh. In all cases seen so far, the error messages (or the outright last messages) occur during cifs.upcall.
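
To cut a journal down to just the cifs.upcall and kernel lines shown in the examples, something like the following may help. This is a minimal sketch and not part of the original report: the function name cifs_journal_filter is made up for illustration, and the usage line assumes the journal from the boot that hung was persisted to disk.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original report): read journal text on
# stdin and keep only lines logged by cifs.upcall[PID] or by the kernel.
cifs_journal_filter() {
    grep -E ' (cifs\.upcall\[[0-9]+\]|kernel):'
}

# Example use against the previous boot (assumes a persistent journal):
#   journalctl -b -1 --no-pager | cifs_journal_filter | tail -n 200
```

Filtering on the syslog identifier column rather than on the word "cifs" keeps the full kernel trace, including frames that do not mention cifs at all.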

Example 1 Journal Tail (Shortest Example):

Server had been up for 50 minutes and simply hung during cifs.upcall with no further messages.

Jan 13 01:51:44 t-pdm-c1-2 cifs.upcall[33050]: key description: ****
Jan 13 01:51:44 t-pdm-c1-2 cifs.upcall[33051]: ver=2
Jan 13 01:51:44 t-pdm-c1-2 cifs.upcall[33051]: host=*****
Jan 13 01:51:44 t-pdm-c1-2 cifs.upcall[33051]: ip=*****
Jan 13 01:51:44 t-pdm-c1-2 cifs.upcall[33051]: sec=1
Jan 13 01:51:44 t-pdm-c1-2 cifs.upcall[33051]: uid=0

Example 2 Journal Tail:

Server had been up for nearly 3 hours executing many cifs.upcalls successfully.

Jan 13 01:09:40 d-pdm-c1-2 cifs.upcall[117577]: ver=2
Jan 13 01:09:40 d-pdm-c1-2 cifs.upcall[117577]: host=*****
Jan 13 01:09:40 d-pdm-c1-2 cifs.upcall[117577]: ip=*****
Jan 13 01:09:40 d-pdm-c1-2 cifs.upcall[117577]: sec=1
Jan 13 01:09:40 d-pdm-c1-2 cifs.upcall[117577]: uid=0
Jan 13 01:09:40 d-pdm-c1-2 cifs.upcall[117577]: creduid=0
Jan 13 01:09:40 d-pdm-c1-2 cifs.upcall[117577]: user=*****
Jan 13 01:09:40 d-pdm-c1-2 cifs.upcall[117577]: pid=110528
Jan 13 01:09:40 d-pdm-c1-2 cifs.upcall[117576]: get_cachename_from_process_env: pid == 0
Jan 13 01:09:40 d-pdm-c1-2 cifs.upcall[117576]: get_existing_cc: default ccache is FILE:/tmp/krb5cc_0
Jan 13 01:09:40 d-pdm-c1-2 cifs.upcall[117576]: get_tgt_time: unable to get principal
Jan 13 01:09:40 d-pdm-c1-2 cifs.upcall[117576]: handle_krb5_mech: getting service ticket for p-sys-fs-03.mc.cumc.columbia.edu
Jan 13 01:09:40 d-pdm-c1-2 cifs.upcall[117576]: handle_krb5_mech: obtained service ticket
Jan 13 01:09:40 d-pdm-c1-2 cifs.upcall[117576]: Exit status 0
Jan 13 01:09:40 d-pdm-c1-2 kernel: CIFS: VFS: \\***** Send error in SessSetup = -11
Jan 13 01:09:40 d-pdm-c1-2 kernel: ------------[ cut here ]------------
Jan 13 01:09:40 d-pdm-c1-2 kernel: kernel BUG at mm/slub.c:386!
Jan 13 01:09:40 d-pdm-c1-2 kernel: invalid opcode: 0000 [#1] PREEMPT SMP PTI
Jan 13 01:09:40 d-pdm-c1-2 kernel: CPU: 3 PID: 110528 Comm: kworker/3:3 Not tainted 6.0.18-300.fc37.x86_64 #1
Jan 13 01:09:40 d-pdm-c1-2 kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
Jan 13 01:09:40 d-pdm-c1-2 kernel: Workqueue: cifsiod smb2_reconnect_server [cifs]
Jan 13 01:09:40 d-pdm-c1-2 kernel: RIP: 0010:kfree+0x3ce/0x400
Jan 13 01:09:40 d-pdm-c1-2 kernel: Code: 08 78 f9 ff 0f 0b 49 8b 46 08 f0 48 83 28 01 0f 85 c3 fd ff ff 49 8b 46 08 4c 89 f7 48 8b 40 08 ff d0 0f 1f 00 e9 ae fd ff ff <0f> 0b 0f 0b 48 c7 c6 20 68 77 9e 4c 89 e7 e8 cf 77 f9 ff 0f 0b 4c
Jan 13 01:09:40 d-pdm-c1-2 kernel: RSP: 0018:ffffb1d6c78efd20 EFLAGS: 00010246
Jan 13 01:09:40 d-pdm-c1-2 kernel: RAX: ffff9fa1c66a6e80 RBX: ffff9fa1c66a6e80 RCX: ffff9fa1c66a6e90
Jan 13 01:09:40 d-pdm-c1-2 kernel: RDX: 00000000fb092003 RSI: ffffffffc07488ce RDI: ffff9fa1c66a6e80
Jan 13 01:09:40 d-pdm-c1-2 kernel: RBP: ffff9fa1c0042400 R08: ffffb1d6c78efc38 R09: 0000000000000000
Jan 13 01:09:40 d-pdm-c1-2 kernel: R10: ffff9fa1c66a6e80 R11: 0000000000000000 R12: ffffd9340419a980
Jan 13 01:09:40 d-pdm-c1-2 kernel: R13: ffffffffc07488ce R14: ffff9fa1daab4a10 R15: 0000000000000001
Jan 13 01:09:40 d-pdm-c1-2 kernel: FS:  0000000000000000(0000) GS:ffff9fa2f7d80000(0000) knlGS:0000000000000000
Jan 13 01:09:40 d-pdm-c1-2 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 13 01:09:40 d-pdm-c1-2 kernel: CR2: 00007f9d03f78b7c CR3: 0000000175ab2006 CR4: 00000000003706e0
Jan 13 01:09:40 d-pdm-c1-2 kernel: Call Trace:
Jan 13 01:09:40 d-pdm-c1-2 kernel:  <TASK>
Jan 13 01:09:40 d-pdm-c1-2 kernel:  cifs_setup_session+0x21e/0x340 [cifs]
Jan 13 01:09:40 d-pdm-c1-2 kernel:  smb2_reconnect+0x334/0x5d0 [cifs]
Jan 13 01:09:40 d-pdm-c1-2 kernel:  ? preempt_count_add+0x6a/0xa0
Jan 13 01:09:40 d-pdm-c1-2 kernel:  ? _raw_spin_lock+0x13/0x40
Jan 13 01:09:40 d-pdm-c1-2 kernel:  ? preempt_count_add+0x6a/0xa0
Jan 13 01:09:40 d-pdm-c1-2 kernel:  smb2_reconnect_server+0x20d/0x610 [cifs]
Jan 13 01:09:40 d-pdm-c1-2 kernel:  process_one_work+0x1c4/0x380
Jan 13 01:09:40 d-pdm-c1-2 kernel:  worker_thread+0x1d6/0x380
Jan 13 01:09:40 d-pdm-c1-2 kernel:  ? _raw_spin_lock_irqsave+0x23/0x50
Jan 13 01:09:40 d-pdm-c1-2 kernel:  ? rescuer_thread+0x380/0x380
Jan 13 01:09:40 d-pdm-c1-2 kernel:  kthread+0xe6/0x110
Jan 13 01:09:40 d-pdm-c1-2 kernel:  ? kthread_complete_and_exit+0x20/0x20
Jan 13 01:09:40 d-pdm-c1-2 kernel:  ret_from_fork+0x1f/0x30
Jan 13 01:09:40 d-pdm-c1-2 kernel:  </TASK>
Jan 13 01:09:40 d-pdm-c1-2 kernel: Modules linked in: xt_REDIRECT ip_vs_rr xt_ipvs ip_vs vxlan ip6_udp_tunnel udp_tunnel xt_policy xt_u32 nf_conntrack_netlink br_netfilter xt_multiport xt_nat xt_addrtype xt_mark xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment nft_compat nf_tables nfnetlink veth bridge stp llc nls_utf8 overlay cifs cifs_arc4 cifs_md4 dns_resolver fscache netfs rfkill vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock intel_rapl_msr intel_rapl_common rapl vmwgfx joydev vmw_balloon drm_ttm_helper vmw_vmci ttm i2c_piix4 xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel serio_raw vmxnet3 ata_generic vmw_pvscsi pata_acpi scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables dm_multipath fuse
Jan 13 01:09:40 d-pdm-c1-2 kernel: ---[ end trace 0000000000000000 ]---
Jan 13 01:09:40 d-pdm-c1-2 kernel: RIP: 0010:kfree+0x3ce/0x400
Jan 13 01:09:40 d-pdm-c1-2 kernel: Code: 08 78 f9 ff 0f 0b 49 8b 46 08 f0 48 83 28 01 0f 85 c3 fd ff ff 49 8b 46 08 4c 89 f7 48 8b 40 08 ff d0 0f 1f 00 e9 ae fd ff ff <0f> 0b 0f 0b 48 c7 c6 20 68 77 9e 4c 89 e7 e8 cf 77 f9 ff 0f 0b 4c
Jan 13 01:09:40 d-pdm-c1-2 kernel: RSP: 0018:ffffb1d6c78efd20 EFLAGS: 00010246
Jan 13 01:09:40 d-pdm-c1-2 kernel: RAX: ffff9fa1c66a6e80 RBX: ffff9fa1c66a6e80 RCX: ffff9fa1c66a6e90
Jan 13 01:09:40 d-pdm-c1-2 kernel: RDX: 00000000fb092003 RSI: ffffffffc07488ce RDI: ffff9fa1c66a6e80
Jan 13 01:09:40 d-pdm-c1-2 kernel: RBP: ffff9fa1c0042400 R08: ffffb1d6c78efc38 R09: 0000000000000000
Jan 13 01:09:40 d-pdm-c1-2 kernel: R10: ffff9fa1c66a6e80 R11: 0000000000000000 R12: ffffd9340419a980
Jan 13 01:09:40 d-pdm-c1-2 kernel: R13: ffffffffc07488ce R14: ffff9fa1daab4a10 R15: 0000000000000001
Jan 13 01:09:40 d-pdm-c1-2 kernel: FS:  0000000000000000(0000) GS:ffff9fa2f7d80000(0000) knlGS:0000000000000000
Jan 13 01:09:40 d-pdm-c1-2 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 13 01:09:40 d-pdm-c1-2 kernel: CR2: 00007f9d03f78b7c CR3: 0000000175ab2006 CR4: 00000000003706e0

Example 3:

In this example kernel messages cycle repeatedly.

Jan 13 08:07:02 d-pdm-c1-3 cifs.upcall[497872]: ver=2
Jan 13 08:07:02 d-pdm-c1-3 cifs.upcall[497872]: host=*****
Jan 13 08:07:02 d-pdm-c1-3 cifs.upcall[497872]: ip=*****
Jan 13 08:07:02 d-pdm-c1-3 cifs.upcall[497872]: sec=1
Jan 13 08:07:02 d-pdm-c1-3 cifs.upcall[497872]: uid=0
Jan 13 08:07:02 d-pdm-c1-3 cifs.upcall[497872]: creduid=0
Jan 13 08:07:02 d-pdm-c1-3 cifs.upcall[497872]: user=*****
Jan 13 08:07:02 d-pdm-c1-3 cifs.upcall[497872]: pid=496218
Jan 13 08:07:02 d-pdm-c1-3 cifs.upcall[497871]: get_cachename_from_process_env: pid == 0
Jan 13 08:07:02 d-pdm-c1-3 cifs.upcall[497871]: get_existing_cc: default ccache is FILE:/tmp/krb5cc_0
Jan 13 08:07:02 d-pdm-c1-3 cifs.upcall[497871]: get_tgt_time: unable to get principal
Jan 13 08:07:02 d-pdm-c1-3 kernel: general protection fault, probably for non-canonical address 0xd663e6570d097ed7: 0000 [#1] PREEMPT SMP PTI
Jan 13 08:07:02 d-pdm-c1-3 kernel: CPU: 0 PID: 497871 Comm: cifs.upcall Not tainted 6.0.18-300.fc37.x86_64 #1
Jan 13 08:07:02 d-pdm-c1-3 kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
Jan 13 08:07:02 d-pdm-c1-3 kernel: RIP: 0010:kmem_cache_alloc_trace+0xed/0x2d0
Jan 13 08:07:02 d-pdm-c1-3 kernel: Code: 79 10 00 48 89 04 24 0f 84 80 01 00 00 48 85 c0 0f 84 77 01 00 00 8b 4d 28 48 8b 7d 00 48 8b 9d b8 00 00 00 48 01 c1 48 89 ce <48> 33 19 48 0f ce 48 31 f3 40 f6 c7 0f 0f 85 8c 01 00 00 48 8d 8a
Jan 13 08:07:02 d-pdm-c1-3 kernel: RSP: 0018:ffffb9cd056afd68 EFLAGS: 00010286
Jan 13 08:07:02 d-pdm-c1-3 kernel: RAX: d663e6570d097ec7 RBX: a6e182405e958139 RCX: d663e6570d097ed7
Jan 13 08:07:02 d-pdm-c1-3 kernel: RDX: 000000038ab72000 RSI: d663e6570d097ed7 RDI: 0000000000036080
Jan 13 08:07:02 d-pdm-c1-3 kernel: RBP: ffff9c5300042400 R08: ffffb9cd056afda8 R09: 0000000000000800
Jan 13 08:07:02 d-pdm-c1-3 kernel: R10: ffff9c5317c42800 R11: 0000000000000001 R12: 0000000000000000
Jan 13 08:07:02 d-pdm-c1-3 kernel: R13: 0000000000000dc0 R14: 0000000000000020 R15: ffffffffab60adbd
Jan 13 08:07:02 d-pdm-c1-3 kernel: FS:  00007fe7e1d1b1c0(0000) GS:ffff9c53b9c00000(0000) knlGS:0000000000000000
Jan 13 08:07:02 d-pdm-c1-3 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 13 08:07:02 d-pdm-c1-3 kernel: CR2: 0000555a8a6835e0 CR3: 0000000113892002 CR4: 00000000003706f0
Jan 13 08:07:02 d-pdm-c1-3 kernel: Call Trace:
Jan 13 08:07:02 d-pdm-c1-3 kernel:  <TASK>
Jan 13 08:07:02 d-pdm-c1-3 kernel:  ? rtnetlink_net_exit+0x30/0x30
Jan 13 08:07:02 d-pdm-c1-3 kernel:  selinux_sk_alloc_security+0x3d/0xa0
Jan 13 08:07:02 d-pdm-c1-3 kernel:  security_sk_alloc+0x37/0x60
Jan 13 08:07:02 d-pdm-c1-3 kernel:  sk_prot_alloc+0xa1/0x120
Jan 13 08:07:02 d-pdm-c1-3 kernel:  sk_alloc+0x2c/0x1d0
Jan 13 08:07:02 d-pdm-c1-3 kernel:  __netlink_create+0x32/0xc0
Jan 13 08:07:02 d-pdm-c1-3 kernel:  netlink_create+0x15c/0x240
Jan 13 08:07:02 d-pdm-c1-3 kernel:  __sock_create+0x107/0x1c0
Jan 13 08:07:02 d-pdm-c1-3 kernel:  __sys_socket+0x61/0xe0
Jan 13 08:07:02 d-pdm-c1-3 kernel:  ? syscall_trace_enter.constprop.0+0x124/0x1a0
Jan 13 08:07:02 d-pdm-c1-3 kernel:  __x64_sys_socket+0x13/0x20
Jan 13 08:07:02 d-pdm-c1-3 kernel:  do_syscall_64+0x58/0x80
Jan 13 08:07:02 d-pdm-c1-3 kernel:  entry_SYSCALL_64_after_hwframe+0x63/0xcd
Jan 13 08:07:02 d-pdm-c1-3 kernel: RIP: 0033:0x7fe7e237df0b
Jan 13 08:07:02 d-pdm-c1-3 kernel: Code: 73 01 c3 48 8b 0d 25 4f 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 29 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 4e 0c 00 f7 d8 64 89 01 48
Jan 13 08:07:02 d-pdm-c1-3 kernel: RSP: 002b:00007fff7af79888 EFLAGS: 00000246 ORIG_RAX: 0000000000000029
Jan 13 08:07:02 d-pdm-c1-3 kernel: RAX: ffffffffffffffda RBX: 00007fff7af7a210 RCX: 00007fe7e237df0b
Jan 13 08:07:02 d-pdm-c1-3 kernel: RDX: 0000000000000000 RSI: 0000000000080003 RDI: 0000000000000010
Jan 13 08:07:02 d-pdm-c1-3 kernel: RBP: 00007fff7af799d0 R08: 0000000000000000 R09: 0000000000000064
Jan 13 08:07:02 d-pdm-c1-3 kernel: R10: 00007fff7af79e96 R11: 0000000000000246 R12: 00007fff7af7a242
Jan 13 08:07:02 d-pdm-c1-3 kernel: R13: 00007fff7af79a80 R14: 0000000000000002 R15: 00007fff7af7a210
Jan 13 08:07:02 d-pdm-c1-3 kernel:  </TASK>
Jan 13 08:07:02 d-pdm-c1-3 kernel: Modules linked in: xt_REDIRECT ip_vs_rr xt_ipvs ip_vs vxlan ip6_udp_tunnel udp_tunnel xt_policy xt_u32 nf_conntrack_netlink br_netfilter xt_multiport xt_nat xt_addrtype xt_mark xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment nft_compat nf_tables nfnetlink veth bridge stp llc nls_utf8 overlay cifs cifs_arc4 cifs_md4 dns_resolver fscache netfs rfkill vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock intel_rapl_msr intel_rapl_common rapl vmw_balloon vmwgfx joydev drm_ttm_helper ttm vmw_vmci i2c_piix4 xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel serio_raw vmw_pvscsi vmxnet3 ata_generic pata_acpi scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables dm_multipath fuse
Jan 13 08:07:02 d-pdm-c1-3 kernel: ---[ end trace 0000000000000000 ]---

System details

Ignition config

No response

Additional information

No response

jlebon commented 1 year ago

I couldn't find any reports of other users hitting this. I think we need kernel/SMB SMEs looking at this at this point. Would you be able to file a ticket at https://bugzilla.redhat.com/ against the kernel component?

jlebon commented 1 year ago

Looking at the v6.0.17 and v6.0.18 release notes, the CIFS-related items are:

v6.0.17

Paulo Alcantara (2):
      cifs: fix static checker warning
      cifs: don't leak -ENOMEM in smb2_open_file()

v6.0.18

Paulo Alcantara (5):
      cifs: fix confusing debug message
      cifs: set correct tcon status after initial tree connect
      cifs: set correct ipc status after initial tree connect
      cifs: set correct status of tcon ipc when reconnecting
      cifs: prevent copying past input buffer boundaries

Steve French (1):
      cifs: fix missing display of three mount options

The "cifs: prevent copying past input buffer boundaries" patch is the one that fixes the Bugzilla reports in https://github.com/coreos/fedora-coreos-tracker/issues/1379. Could this issue be caused by one of the other connection-related patches?

fifofonix commented 1 year ago

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160777

fifofonix commented 1 year ago

Additional observations:

jlebon commented 1 year ago

This was discussed in today's community meeting:

AGREED: It appears this issue may affect older machines that are upgraded but we are still investigating to get more details. Since currently this issue only has one reported affected user/environment and they have pinned on a known working version we will release the next stable as usual.

(@dustymabe I changed testing to stable in that message since I assume that's what you meant.)

We agreed to revisit this if more information comes in that may impel us to hold stable.

fifofonix commented 1 year ago

I am still working on an easy way to reproduce and making progress.

However, at this point it is clear that:

On most of my nodes I have a script running in a loop that executes df, which incidentally triggers Kerberos ticket renewals or keeps a ticket active. When this script is not running, the issue reported here seems to occur.
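
The script itself is not shown in the thread. A rough sketch of such a loop, with made-up function names and an arbitrary interval, might look like this:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the kind of loop described above; the author's
# actual script is not shown in the thread.

# Print the mount point of every cifs filesystem in a mount table
# (defaults to /proc/mounts; fields are: device mountpoint fstype ...).
# Note: mount points containing spaces appear escaped (\040) in /proc/mounts
# and are not handled here.
list_cifs_mounts() {
    awk '$3 == "cifs" { print $2 }' "${1:-/proc/mounts}"
}

# Run df once against each CIFS mount, discarding the output.
poke_cifs_mounts() {
    local mnt
    list_cifs_mounts "$@" | while IFS= read -r mnt; do
        df -- "$mnt" > /dev/null 2>&1 || echo "df failed for $mnt" >&2
    done
}

# Example loop (the 300 second interval is an assumption, not from the thread):
#   while true; do poke_cifs_mounts; sleep 300; done
```

Stopping such a loop on a test node would be one way to check whether the hang depends on the mounts sitting idle, as the observation above suggests.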

dustymabe commented 1 year ago

@fifofonix is this still an issue?

fifofonix commented 1 year ago

Not sure. I skirted around the issue by removing Kerberos auth on my fleet. Since no one else has reported or encountered it, I'm fine with this being closed. I don't have time right now to stand up some Kerberos-authenticating nodes to try and reproduce.

dustymabe commented 1 year ago

Thanks. If anyone is able to reproduce this please re-open the issue.

MattPOlson commented 2 months ago

Hello, we've seen a similar issue in our OpenShift (OCP) cluster and opened a ticket with Red Hat Support. They pointed us to this known issue: https://issues.redhat.com/browse/RHEL-25787 https://access.redhat.com/solutions/7055908

They provided us with a kernel patch, which we applied to our cluster and which seems to have fixed the issue. Would it be possible to include this fix in the Fedora CoreOS code base? We are currently experiencing the issue in our OKD clusters as well.

jlebon commented 2 months ago

@MattPOlson Thanks for the references.

From the links you posted, it looks like the upstream patches claiming to fix the issues are:

The rawhide kernel in the latest FCOS rawhide build is kernel-6.9.0-0.rc4.20240419git2668e3ae2ef3.41.fc41 and has those patches (and looks like many more fixes in that same area). It'll eventually come to f40 and so into the other FCOS streams.

Meanwhile if you'd like, you could also test this by overriding the kernel. Obviously being an rc kernel, other unrelated issues might pop up.

jwklijnsma commented 2 months ago

@MattPOlson I have the same issue on my OKD cluster. What was the fix for OpenShift?

jwklijnsma commented 1 month ago

@jlebon https://github.com/openshift/os/blob/master/docs/faq.md#replacing-kernel-with-a-different-version

So following what is documented on that Red Hat page gives this error?

rpm-ostree override replace kernel-{,modules-,modules-extra-,core-}6.9.0-0.rc7.58.fc41.x86_64.rpm

error: Could not depsolve transaction; 4 problems detected:
 Problem 1: conflicting requests
  nothing provides kernel-modules-core-uname-r = 6.9.0-0.rc7.58.fc41.x86_64 needed by kernel-modules-extra-6.9.0-0.rc7.58.fc41.x86_64 from @commandline
 Problem 2: conflicting requests
  nothing provides kernel-modules-core-uname-r = 6.9.0-0.rc7.58.fc41.x86_64 needed by kernel-modules-6.9.0-0.rc7.58.fc41.x86_64 from @commandline
 Problem 3: conflicting requests
  nothing provides kernel-modules-core-uname-r = 6.9.0-0.rc7.58.fc41.x86_64 needed by kernel-core-6.9.0-0.rc7.58.fc41.x86_64 from @commandline
 Problem 4: conflicting requests
  nothing provides kernel-modules-core-uname-r = 6.9.0-0.rc7.58.fc41.x86_64 needed by kernel-6.9.0-0.rc7.58.fc41.x86_64 from @commandline

jlebon commented 1 month ago

@jwklijnsma Yeah, the instructions need to be adapted for FCOS since the package set is not exactly the same. A better suggestion on Fedora would've been to use the Koji/Bodhi integration. E.g. rpm-ostree override replace https://koji.fedoraproject.org/koji/buildinfo?buildID=2441070 should work.
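
For anyone who still wants to pass local RPM files, the depsolve error above asks for kernel-modules-core-uname-r, which none of the four RPMs on that command line provides. On current Fedora that capability comes from a separate kernel-modules-core package, so one plausible adaptation is to add it to the set. This is an untested assumption and was not confirmed in this thread; the helper name kernel_rpm_set is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the kernel RPM file names for a given
# version-release.arch string, including kernel-modules-core.
kernel_rpm_set() {
    local nvra="$1" sub
    for sub in "" core- modules- modules-core- modules-extra-; do
        printf 'kernel-%s%s.rpm\n' "$sub" "$nvra"
    done
}

# Example (assumes the five RPMs were already downloaded to the current dir):
#   rpm-ostree override replace $(kernel_rpm_set 6.9.0-0.rc7.58.fc41.x86_64)
```

The Koji build URL form given above avoids having to track the package list by hand.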

jwklijnsma commented 1 month ago

@jlebon But this error is on RHCOS from OpenShift. We would like to test whether the fix works there.

MattPOlson commented 3 weeks ago

@jlebon How do I determine if the fix is in f40 yet?

Thanks, Matt

jlebon commented 3 weeks ago

Assuming it wasn't later reverted, it should be in the 6.9 kernel, which is already stable in Fedora 40. It should be in next week's releases of the testing and next streams, and in the stable release two weeks after that. You can override the kernel with this Bodhi update, which is the same kernel version currently in testing-devel.
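
A quick way to check whether a given node is already on a 6.9 or newer kernel is a version comparison like the one below. This is a sketch only: the function name kernel_at_least is made up, it relies on GNU sort's -V ordering, and it checks the version number rather than the presence of the specific patches (so it cannot detect a later revert, and it treats 6.9 release candidates as 6.9).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if kernel release $1 is at least version $2,
# compared using GNU sort's version ordering (sort -V).
kernel_at_least() {
    local have="$1" want="$2"
    [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Example on a running node:
#   kernel_at_least "$(uname -r)" 6.9 && echo "kernel is 6.9 or newer"
```

On an FCOS node, rpm-ostree status also lists the booted deployment and its version, which can be matched against the stream's release notes.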