fifofonix closed this issue 1 year ago
I couldn't find any reports of other users hitting this. I think we need kernel/SMB SMEs looking at this at this point. Would you be able to file a ticket at https://bugzilla.redhat.com/ against the kernel component?
Looking at the v6.0.17 and v6.0.18 release notes, the CIFS-related items are:
v6.0.17
Paulo Alcantara (2):
      cifs: fix static checker warning
      cifs: don't leak -ENOMEM in smb2_open_file()
v6.0.18
Paulo Alcantara (5):
      cifs: fix confusing debug message
      cifs: set correct tcon status after initial tree connect
      cifs: set correct ipc status after initial tree connect
      cifs: set correct status of tcon ipc when reconnecting
      cifs: prevent copying past input buffer boundaries
Steve French (1):
      cifs: fix missing display of three mount options
The "cifs: prevent copying past input buffer boundaries" patch is the one that fixes the Bugzilla reports in https://github.com/coreos/fedora-coreos-tracker/issues/1379. Could be caused by the other connection-related ones?
Additional observations:
After the kernel BUG at mm/slub.c:386! error message, journal messages cease shortly thereafter, yet cadvisor somehow continues to send some metric data for some time, indicating containers spinning out of control (e.g. traefik eventually exhausting memory).
This was discussed in today's community meeting:
AGREED: It appears this issue may affect older machines that are upgraded but we are still investigating to get more details. Since currently this issue only has one reported affected user/environment and they have pinned on a known working version we will release the next stable as usual.
(@dustymabe I changed testing to stable in that message since I assume that's what you meant.)
We agreed to revisit this if more information comes in that may impel us to hold stable.
I am still working on an easy way to reproduce and making progress.
However, at this point it is clear that:
On most of my nodes I have a script running on a loop that executes df, which incidentally triggers Kerberos ticket renewals or keeps a ticket active. In the absence of this script, the issue reported here seems to occur.
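For anyone trying to reproduce this, the loop described above might look roughly like the sketch below. The original script was not posted, so the function name df_keepalive, the mount path, and the interval are all hypothetical.

```shell
# Hypothetical helper: run df against a mount point in a loop.
# Arguments: mount point, seconds between runs, number of runs (0 = run forever).
df_keepalive() {
    mount_point=$1
    interval=$2
    runs=$3
    i=0
    while [ "$runs" -eq 0 ] || [ "$i" -lt "$runs" ]; do
        # df queries the filesystem, which exercises the CIFS session
        df "$mount_point" >/dev/null || return 1
        i=$((i + 1))
        sleep "$interval"
    done
}

# Example (path and interval are placeholders):
#   df_keepalive /var/mnt/share 300 0
```

Stopping this loop on a test node, rather than starting it, is what would correspond to the failing condition described here.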
@fifofonix is this still an issue?
Not sure. I skirted around the issue by removing kerberos auth on my fleet. Since no one else has reported/encountered I'm fine with this being closed. I don't have time right now to stand-up some kerberos-authing nodes to try and reproduce.
Thanks. If anyone is able to reproduce this please re-open the issue.
Hello, We've seen a similar issue in our OpenShift (OCP) cluster and opened a ticket with RedHat Support. They pointed us to this known issue: https://issues.redhat.com/browse/RHEL-25787 https://access.redhat.com/solutions/7055908
They provided us a kernel patch which we applied to our cluster and this has seemed to fix this issue. Would it be possible to include this fix into the Fedora-coreos code base? We are currently experiencing the issue in our OKD clusters also
@MattPOlson Thanks for the references.
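From the links you posted, it looks like the upstream patches claiming to fix the issues are:
- https://lore.kernel.org/all/20240401170044.86991-1-pc@manguebit.com/
- https://lore.kernel.org/all/20240402193404.236159-1-pc@manguebit.com/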
The rawhide kernel in the latest FCOS rawhide build is kernel-6.9.0-0.rc4.20240419git2668e3ae2ef3.41.fc41 and has those patches (and looks like many more fixes in that same area). It'll eventually come to f40 and so into the other FCOS streams.
Meanwhile if you'd like, you could also test this by overriding the kernel. Obviously being an rc kernel, other unrelated issues might pop up.
@MattPOlson I have the same issue on my OKD cluster. What was the fix for OpenShift, then?
@jlebon https://github.com/openshift/os/blob/master/docs/faq.md#replacing-kernel-with-a-different-version
So it is documented there, but following the Red Hat page gives this error:
rpm-ostree override replace \
    kernel-{,modules-,modules-extra-,core-}6.9.0-0.rc7.58.fc41.x86_64.rpm
error: Could not depsolve transaction; 4 problems detected:
 Problem 1: conflicting requests
  nothing provides kernel-modules-core-uname-r = 6.9.0-0.rc7.58.fc41.x86_64 needed by kernel-modules-extra-6.9.0-0.rc7.58.fc41.x86_64 from @commandline
 Problem 2: conflicting requests
  nothing provides kernel-modules-core-uname-r = 6.9.0-0.rc7.58.fc41.x86_64 needed by kernel-modules-6.9.0-0.rc7.58.fc41.x86_64 from @commandline
 Problem 3: conflicting requests
  nothing provides kernel-modules-core-uname-r = 6.9.0-0.rc7.58.fc41.x86_64 needed by kernel-core-6.9.0-0.rc7.58.fc41.x86_64 from @commandline
 Problem 4: conflicting requests
  nothing provides kernel-modules-core-uname-r = 6.9.0-0.rc7.58.fc41.x86_64 needed by kernel-6.9.0-0.rc7.58.fc41.x86_64 from @commandline
@jwklijnsma Yeah, the instructions need to be adapted for FCOS since the package set is not exactly the same. A better suggestion on Fedora would've been to use the Koji/Bodhi integration. E.g. rpm-ostree override replace https://koji.fedoraproject.org/koji/buildinfo?buildID=2441070 should work.
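The depsolve error above names kernel-modules-core-uname-r as the unmet requirement, which suggests the kernel-modules-core RPM was not among the files passed on the command line: the brace expansion covers only kernel, kernel-modules, kernel-modules-extra, and kernel-core. Below is a sketch for building the file list when overriding from local RPMs. The helper name kernel_rpm_files is hypothetical, and whether these five packages are sufficient on a given RHCOS or FCOS build is an assumption to verify.

```shell
# Hypothetical helper: print the kernel RPM file names for one
# version-release.arch string, including kernel-modules-core, the package
# that provides kernel-modules-core-uname-r.
kernel_rpm_files() {
    vra=$1   # e.g. 6.9.0-0.rc7.58.fc41.x86_64
    for pkg in kernel kernel-core kernel-modules kernel-modules-core kernel-modules-extra; do
        printf '%s-%s.rpm\n' "$pkg" "$vra"
    done
}

# Example, run from the directory holding the downloaded RPMs:
#   rpm-ostree override replace $(kernel_rpm_files 6.9.0-0.rc7.58.fc41.x86_64)
```

The Koji/Bodhi form suggested above avoids this bookkeeping, since the package set is resolved from the build rather than from a hand-written list.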
@jlebon But this error is on RHCOS from OpenShift. We would like to test whether the fix is in use.
> @MattPOlson Thanks for the references.
> From the links you posted, it looks like the upstream patches claiming to fix the issues are:
> - https://lore.kernel.org/all/20240401170044.86991-1-pc@manguebit.com/
> - https://lore.kernel.org/all/20240402193404.236159-1-pc@manguebit.com/
> The rawhide kernel in the latest FCOS rawhide build is kernel-6.9.0-0.rc4.20240419git2668e3ae2ef3.41.fc41 and has those patches (and looks like many more fixes in that same area). It'll eventually come to f40 and so into the other FCOS streams.
> Meanwhile if you'd like, you could also test this by overriding the kernel. Obviously being an rc kernel, other unrelated issues might pop up.
@jlebon How do I determine if the fix is in f40 yet?
Thanks, Matt
Assuming it wasn't later reverted, it should be in the 6.9 kernel, which is already stable in Fedora 40. It should be in the next testing and next releases next week, and the stable release two weeks after that. You can override the kernel with this Bodhi update, which is the same kernel version currently in testing-devel.
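A quick way to check a node against the "6.9 or later" criterion above is to compare the running kernel release with that version. This is a minimal sketch: the helper name kernel_at_least is hypothetical, and it only compares version numbers, so it cannot tell whether the patches were backported to an older kernel or reverted in a newer one.

```shell
# Hypothetical helper: succeed when the kernel release in $1 is at least the
# MAJOR.MINOR version given in $2 (for example "6.9").
kernel_at_least() {
    cur_major=${1%%.*}
    cur_rest=${1#*.}
    cur_minor=${cur_rest%%[!0-9]*}   # keep the leading digits of the minor version
    min_major=${2%%.*}
    min_rest=${2#*.}
    min_minor=${min_rest%%[!0-9]*}
    [ "$cur_major" -gt "$min_major" ] ||
        { [ "$cur_major" -eq "$min_major" ] && [ "$cur_minor" -ge "$min_minor" ]; }
}

# Check the running kernel
if kernel_at_least "$(uname -r)" 6.9; then
    echo "running kernel is 6.9 or later"
else
    echo "running kernel is older than 6.9"
fi
```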
Describe the bug
Servers upgraded to the latest next or testing operate successfully with CIFS mounts for some time but typically hang within 24 hours, with the kernel reporting some kind of error trace.
Reproduction steps
Expected behavior
Server continues to operate
Actual behavior
Server hangs. Three separate journals captured from two separate environments. In our environment we typically see cifs.upcall logs in our journals every 15 minutes, which is assumed to be related to keeping Kerberos tickets fresh. In all cases seen so far, the error messages (or outright last messages) occur during cifs.upcall.
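To check whether a hang on another node fits the same pattern, one option is to count cifs.upcall lines among the last lines of the journal from the boot that hung. This is a sketch only: the helper name upcalls_in_tail is hypothetical, and it assumes a persistent journal so that the previous boot's messages are still available.

```shell
# Hypothetical helper: count lines mentioning cifs.upcall among the last N
# lines of journal text read from stdin. Prints the count.
upcalls_in_tail() {
    tail -n "$1" | grep -c 'cifs\.upcall'
}

# Example (assumes the journal of the previous boot was persisted):
#   journalctl -b -1 --no-pager | upcalls_in_tail 20
```

A non-zero count would be consistent with the hang occurring during cifs.upcall, as in the examples that follow.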
Example 1 Journal Tail (Shortest Example):
Server had been up for 50 minutes and simply hangs during cifs.upcall with no further messages.
Example 2 Journal Tail:
Server had been up for nearly 3 hours executing many cifs.upcalls successfully.
Example 3:
In this example kernel messages cycle repeatedly.
System details
Ignition config
No response
Additional information
No response