Closed FanOfABT closed 3 days ago
Don't have much to add except that I have been affected by this three times personally in less than two weeks; it happens nearly every reboot for me. fsck fixes the home partition, thankfully, but this is inexcusable.
Taken from deathblade user comment on reddit:
Yeah, so did I, many times while testing 3.6 and 3.7, to the point that I eventually had to go back to 3.5, which hasn't had the issue once. The most common thing it does, from what I saw, is that it either deletes the whole .config folder or dumps the whole deck folder into lost+found, making the data basically unrecoverable.
If you run into this issue, please go to Settings->System and create a system report so we can look at the logs to diagnose what happened and work towards fixing the issue.
After creating the report you can save it to desktop and paste it here, or submit it to steam support and you can send me your steam username and I can retrieve it.
One extra note: a system report includes logs from the current boot and the previous boot. This might not cover the session where things actually went wrong. If you have more than one boot in between these two sessions you can collect all the previous logs by running this command in a desktop mode console:
journalctl > ~/Desktop/journal.log
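The command above writes every stored boot into one file. If it is easier to find the failing session with one file per boot, something like the following sketch could be used instead. It assumes the journal is persistent (so older boots are still stored); the function name `collect_boot_logs` and the output file names are made up for illustration.

```shell
#!/bin/sh
# Sketch: save the journal of each of the last N boots into its own file.
# Assumption: older boots are still present in the journal.
# Usage: collect_boot_logs [COUNT] [OUTPUT_DIR]
# The body runs in a subshell, so its variables do not leak into the caller.
collect_boot_logs() (
    count=${1:-3}
    outdir=${2:-$HOME/Desktop}
    mkdir -p "$outdir" || exit 1
    i=0
    while [ "$i" -lt "$count" ]; do
        # -b -0 is the current boot, -b -1 the boot before it, and so on.
        journalctl -b "-$i" --no-pager > "$outdir/journal-boot-$i.log"
        i=$((i + 1))
    done
)
```

For example, `collect_boot_logs 5` would write journal-boot-0.log through journal-boot-4.log to the Desktop folder. Running `journalctl --list-boots` first shows how many boots are actually stored.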
If you'd rather not paste your steam userid here, feel free to open a steam support ticket and send me the ticket id and I can find you via that method. Also, if you have a deck stuck in this state and you are having trouble recovering it, feel free to contact me via steam support as well.
@lostgoat
Would it be possible for Valve to push an experimental branch of SteamOS 3.6 running on the Linux 6.8 kernel?
This way, those users affected by this file-system corruption bug on Linux 6.5 could easily verify whether it also occurs on Linux 6.8 or not.
The alternative route is Valve trying to fix an obscure file-system bug on an obsolete Linux kernel revision (good luck with that, BTW).
Your choice...
@FanOfABT thanks for bringing this issue to our attention, but comments like that don't help us get towards a resolution to the problem.
The same happened to me about 4 times in 2 weeks. I had to reimage/format SteamOS from scratch every time. I tried to repair it with fsck, booting from an Ubuntu live USB; it booted, but after a while I couldn't open any program, not even Firefox. I was scared that it was something hardware-related, but since I changed to stable I haven't had any issue at all.
Replying to https://github.com/ValveSoftware/SteamOS/issues/1591#issuecomment-2278457378
This is not possible, because when the error happens you can't even boot; it gets stuck in an emergency mode loop. I would like to help, but right now I'm scared to be on beta again. I have spent about 2 days trying to resolve this issue and don't want to install everything again.
@lostgoat
Today I went to the beta channel just to collect logs for you.
What happened is that I waited for the issue to happen (on the first system reboot, actually). After the system went into emergency mode, I fired up the recovery image and copied every var/log folder that I could find.
log2.zip: the other log file. Please also check Reddit; this is affecting so many people that releasing it to stable would create a huge issue. It takes above-average skills to fix and might flood Steam Support with RMA requests.
I have also had this issue multiple times already and I misinterpreted it as a failing SSD, so I RMAed. This is a huge issue, especially for people with less technical expertise. Running fsck doesn't always fix it either; it seems the only reliable way to fix it is reimaging the Steam Deck, which really sucks. Reinstalling SteamOS doesn't fix it either.
I just switched my Deck into the beta branch, and within days my filesystem was completely corrupt as FanOfABT posted. Trying to switch back to the previous "B" SteamOS install by holding the "..." button during power up did not work either.
I was unable to recover the file system and had to reimage the deck.
What happened is that I waited for the issue to happen (on the first system reboot, actually). After the system went into emergency mode, I fired up the recovery image and copied every var/log folder that I could find.
Thanks, looking into it.
I wonder if these filesystem corruption events happen when rebooting the Deck, or maybe when suspending/resuming it?
Could anyone who has had this problem let us know exactly what they have?
steamos-systemreport > sysrep.txt
Will tell you (and a bunch of other potentially useful info).
Also the following:
Just so we can make sure there are no bad interactions. (None of the tweaks we are aware of should be able to cause these problems, but it's always better to check).
@fledermaus
Currently unable to access my Steam Deck, so I can't tell you the exact storage model yet, but it happened on my 256 GB LCD model twice.
First time was with all of A.B.T.'s SteamOS tweaks applied, however the second time it happened with a stock SteamOS 3.6 Beta installation.
Therefore I concur that applying some software tweaks to SteamOS is certainly not the culprit here.
My bet still stands that the most likely culprit is a very subtle file-system bug within the outdated & obsolete Linux 6.5 kernel revision Valve insists on using for SteamOS 3.6, for very mysterious & unknown reasons...
Replying to https://github.com/ValveSoftware/SteamOS/issues/1591#issuecomment-2284194431
There were plans to upgrade to the Linux 6.10 kernel, but the expected merges didn't take place.
Replying to https://github.com/ValveSoftware/SteamOS/issues/1591#issuecomment-2284134260
I have a Stock 64GB LCD model with a 256GB SD card. SteamOS was untweaked, although I did frequently use Desktop mode and had some applications installed through Discover.
Just want to re-iterate.
If you've ever experienced this issue, please submit a system report and let us know your steam username. This is useful even if you aren't experiencing the problem right now.
If you are experiencing the problem, run fsck to restore the unit or restore the files in lost+found. Then submit a system report so that we can see what happened. If you are unfamiliar with how to do the above, open a support ticket and we can try a couple more things to get you up and running again while preserving the error logs.
I've also seen this issue being linked in a couple of unrelated places online, so I wanted to clarify what the actual symptoms of this failure are:
Can someone that has experienced this issue confirm if the above is correct?
@lostgoat @fledermaus sysrep.txt Went back to 3.5 to be able to boot. I hope that my previous full logs and this system report are useful. Having faced this issue more than 10 times, I have seen it on both vanilla and modified setups. Booting either current or previous from the recovery menu fails. Interestingly enough, erasing user data solved this for me at least a couple of times. Restoring lost+found from a 3.5 setup doesn't solve the issue.
I just submitted a system report, but I don't know how much use it is. Fsck was not able to unfsck my filesystem. It was only fixed with a complete reimage, so potential useful log files were gone.
The only additional thing I can add is that my Steam Deck had separate OS and Steam Client beta-participation settings enabled. I had been running the OS Update Channel in "Beta", and the "Steam Client Update Channel" in "Steam Deck Stable".
Starting with the SteamOS 3.6.9 Beta, I noticed my Steam Deck started behaving oddly. Every single cold boot would result in "Verifying Installation" appearing before eventually making it to the SteamOS Home. It otherwise worked fine, however. I then decided to switch the "Steam Client Update Channel" to Steam Deck Beta. It seemed to install the Beta, but the Steam Deck then hung when it tried to restart the Steam Client.
I had to hard power off the Steam Deck. Upon powering back up, it would never make it past the Steam logo. I then pushed the "..." button and power at the same time, and selected the current OS to get verbose output. The output text said something about the system being in Emergency mode, and "press enter to continue". I plugged in a keyboard, hit enter, and it automatically attempted to run fsck. It got to about 25%, failed, and then said "press enter to continue", and the same thing happened each time.
"Verifying installation" can happen, iirc, if there was an abrupt shutdown or steam didn't get a chance to clean up properly. That part doesn't necessarily indicate a deeper problem. The rest is quite weird though. More data for the investigation. Thanks.
A.B.T.'s SteamOS tweaks
I immediately wonder if there is a correlation between people having these issues and those tweaks.
Also it should be noted that Decky had an issue with a specific plugin that they resolved.
@Intoxicus
Is it really too much to ask to read a thread fully before diving in head-first?
As I had already stated, the file-system corruption bug happened on a stock SteamOS 3.6 Beta installation.
So no, neither Decky nor A.B.T.'s SteamOS tweaks are the culprit here.
@Intoxicus
This happens in stock installation also.
I hope we get updates on this soon.
Replying to https://github.com/ValveSoftware/SteamOS/issues/1591#issuecomment-2289869311
What the ABT Tweaks do would persist unless there's a full wipe/reset/reinstall as I understand it.
Have you ever used ABT Tweaks at all?
The Decky Boot Loop thing is interesting in how its timing correlates with when this bug appeared. It could be related in the sense that the same bug can cause different issues in different circumstances.
Unless a BNiB Deck has presented with this bug, it is valid to consider whether ABT and/or Decky is a possible cause. And/or a factor, but not the cause in and of itself.
I can say I have not had these issues at all. I use Decky, but did not get hit by that recent Boot Loop issue that happened. I think it occurred on the same update where this bug presented.
I do not use ABT, and would not, except for that one memory lock tweak. I want to fact check that one before applying it independently. The idea of permanently being in performance mode seems silly to me. I'll just use PowerTools from Decky to do it for games that actually need it. The thought process behind those tweaks makes me apprehensive that they're not well thought out and short sighted. I don't even know who "ABT" is. I didn't even hear of these ABT Tweaks until reading this bug report thread. Right away it strikes me as sus in terms of "does doing this actually make sense."
I've done all sorts of troubleshooting, and one thing I've learned is to always keep a beginner's mind and recheck your assumptions. If you're sure that can't be possible, double check anyway (within reason). I've been surprised where the thing I thought it could not possibly be actually was the solution. I'll double check things that seem silly just to be absolutely certain (within reason).
Valve Devs are going to have a clearer and more complete data set and idea of what is actually going on. They're likely 23 steps ahead of us mere mortal towels. ;)
I have an LE OLED that has neither the boot loop issue with Decky nor this config wipe bug. I've not modded anything. I only use Decky and Cryobytes, nothing fancy or crazy. Even with Decky I keep it minimal. There are some Decky plugins that I just would not bother with for various reasons.
Being able to compare a unit that has never had the issue with the logs of affected units could be helpful. Valve Devs let me know if you want "clean" logs from an unaffected Deck "in the wild."
I've hidden a couple of topics to avoid de-railing the discussion. For now some users have reported that they see the problem with no modifications so we are treating it that way.
So far we've been stress testing the filesystem and trying different ways to hard reset the unit to trigger an unclean filesystem exit and we haven't been able to repro.
If anyone has their hands on a unit that is in this failure mode please contact us as there could be a lot of useful information in that unit.
Replying to https://github.com/ValveSoftware/SteamOS/issues/1591#issuecomment-2290176167
I had previously provided logs from a failed system through recovery image, any good come from those?
For now I will use only stable, as I'm travelling and need my Steam Deck. For the future, though, what will be needed from a failed system other than the logs provided? I could chroot into the system and provide what is needed.
Also, I would like to know if the main channel has a newer kernel than beta; I could test whether this also happens on main or not.
Right now I think the latest main image has only a couple of display blanking/refresh rate related patches over what's in beta.
Replying to #1591 (comment)
I had previously provided logs from a failed system through recovery image, any good come from those?
Some, but not yet enough. We can see it happening, as well as a few messages which might be leads or might be unrelated - we're still chasing those down.
We cannot yet tell why it happens and our stress testing hasn't triggered the problem here (yet).
Interestingly enough, erasing user data solved this for me at least a couple of times. Restoring lost+found from a 3.5 setup doesn't solve the issue.
Erase user data will reformat /home, which is what's getting corrupted. So it makes sense that doing that would make the problem go away, at least temporarily.
We've been taking a look at units that report this bug via Steam Support or other channels. Most units were experiencing a different kind of error (incorrect modifications of pacman packages, errors in third-party mods, etc.). But we did find two units that were failing to boot due to filesystem errors as described in this bug.
There will be an update that aims to address the issues we saw on these two systems. But it is still useful to receive more reports as our current data sample size is fairly small.
If you run into something that looks like this bug[1], please boot into the recovery image, connect to a wifi network, and then run the following command to collect logs from the failed boot attempts:
curl https://raw.githubusercontent.com/lostgoat/steamos-diagnostics/main/diagnostics.sh | bash
This will generate a ~/diagnostics.tar.xz archive with the relevant data. Please attach that to your reply.
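Piping curl straight into bash runs the script without a chance to read it first. For anyone who would rather look at it before running it, a cautious variant could be sketched as below. The function name `fetch_for_review` and the default output path are made up for illustration, and the sketch assumes the script behaves the same when run from a saved file as when it is piped.

```shell
#!/bin/sh
# Sketch: download a script to a file so it can be read before it is run.
# Usage: fetch_for_review URL [OUTPUT_FILE]
# The body runs in a subshell, so its variables do not leak into the caller.
fetch_for_review() (
    url=$1
    out=${2:-./diagnostics.sh}
    # -f: fail on HTTP errors, -sS: quiet but still show errors, -L: follow redirects
    curl -fsSL "$url" -o "$out" || exit 1
    echo "Saved to $out"
)
```

After calling `fetch_for_review https://raw.githubusercontent.com/lostgoat/steamos-diagnostics/main/diagnostics.sh`, the file can be read with `less ./diagnostics.sh` and then run with `bash ./diagnostics.sh`.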
[1] Symptoms for this issue: boot into an emergency mode shell, and falling back to "Previous OS" via the ... recovery menu still results in boot failure.
Replying to https://github.com/ValveSoftware/SteamOS/issues/1591#issuecomment-2299852384
Coming back from vacation, would be willing to test the fix when released and report back, will also provide the needed logs if the issue reappears after testing the possible fix.
Can confirm. Had to factory reset my Deck six times since I purchased it 3 months ago due to this disk corruption bug. Ofc been using the beta branch all this time.
I believe a temporary fix is setting your SteamOS to Stable and the Steam Client to Beta. That way you can still enjoy the Beta features normally. I'm not sure if this is actually a fix, I will contact the support when this issue happens again, perhaps I can send in my Steam Deck or something if it helps narrow down the issue.
Also, if it helps, the first time I encountered this issue was on the 10th or 11th of May this year.
Thanks. Not a smoking gun, but another datum for the pile.
I'll add that the first time this happened to me was in May as well after switching over to 3.6 to see how the switch to zram worked in more intense games. I thought it was probably caused by some 3rd party software I had installed, but it happened to me again a couple weeks or so later with an essentially stock SteamOS install. Only a complete re-image has been able to fix the problem either time.
I've been on stable the last couple of months, but I'm willing to switch over to 3.6 and see what happens once again if it means getting this bug ironed out. I will report back if/when my deck is in that state.
If you have experienced this problem the first thing we'd like is the sysreport info described in an earlier comment:
https://github.com/ValveSoftware/SteamOS/issues/1591#issuecomment-2284134260
You don't have to be on beta to give us this info, even a sysreport collected on stable is useful to us.
systemreport-20240826180428.txt
System report is attached. Still plan to switch back over to 3.6, and will be back with any updates.
This is mine. I've been on stable with zero issues so far, but when I was on beta it was a mess. I knew something was not working right because I couldn't update any program through Discover. This situation was also triggered when I restarted or powered off the Deck. Hope this helps solve the issue.
Thanks to everyone who gave us sysreports - we have a couple of theories as to what's going on here and we're zeroing in on the cause.
I tried switching the default boot to Desktop mode:
steamos-session-select plasma-persistent
and then switching the default boot back to Game mode:
steamos-session-select gamescope
This can make my Steam Deck get stuck at the boot logo.
I think that describes a different problem - could you open a separate issue for that?
Replying to https://github.com/ValveSoftware/SteamOS/issues/1591#issuecomment-2299852384
For what it's worth, and you may have addressed this already, the units where I've most commonly seen this problem are those with 64 GB eMMC storage. It may be unrelated, but from reports across Discord servers it's a correlation I've seen.
Hoping we can start testing soon
Rest assured we'll post an update as soon as we have something concrete.
there is actually a very similar bug upstream in Fedora that only affected 64GB Decks still using the original storage, and people reported no issues with installation after replacing the internal 64GB drive, so I wonder if this was actually the same bug all along: https://universal-blue.discourse.group/t/what-is-the-upstream-bug-related-to-the-emmc-on-the-steam-deck/519/10
unfortunately no one was able to bisect the issue, but would be interested to hear the root cause if it's determined.
I agree, this immediately came to mind when I saw how prevalent it was with 64gb users on the steam deck.
Your system information
Please describe your issue in as much detail as possible:
When using the current SteamOS 3.6 Beta, file-system corruption has been observed by multiple Steam Deck users over the span of multiple months, which makes SteamOS unbootable and thus unusable.
The unfortunate nature of this very severe bug is that it seemingly occurs randomly without any clear pattern to trigger it, thus making it practically impossible to reliably reproduce.
Having used Linux extensively for more than a decade myself, I have a gut-feeling that the Linux kernel Valve chose for SteamOS 3.6, namely Linux 6.5, most likely contains a very subtle bug inside its very complex file-systems code area.
IMHO, the choice of using Linux 6.5 for SteamOS 3.6 is a poor one by Valve, because Linux 6.5 is already considered obsolete by upstream kernel developers, whereas Linux 6.6 is an LTS release and therefore still supported and actively used in production systems.
On the other hand, I'm aware that Valve is already actively preparing to use Linux 6.8 within SteamOS, so perhaps switching from Linux 6.5 to 6.8 for the current SteamOS 3.6 Beta would already resolve this very serious file-system corruption bug.
Hope Valve at least considers doing so, thanks!