ananjaser1211 / FloydQ_Reports

BUG Reporting for Exynos8890 OneUI 2.0 Project

[BUG] Random freezing while being active #286

Closed RoughlyAdvanced closed 2 years ago

RoughlyAdvanced commented 2 years ago

Description: The system just randomly freezes forever, no action after that gives any response except a hard restart. Doesn't occur when the device is idle, only when it's actively used, sometimes in the home screen, sometimes inside a system or a third party app, even at the splash screen when booting (Even the first boot did freeze). No hiccups or any errors before the freeze.

Smartphone:

Using FloydQ V6.0.

I might wait for 12 hours to get it to freeze, so I will post this without logs and will try to add them later. Any advice on how to log a situation like this is welcome, with utmost thanks.

ananjaser1211 commented 2 years ago

ive got a log on my mail for this issue, cant see it here, but regardless, i was unable to see any usable info in it regarding the freeze. i suspect it is still a kernel issue, but i have been unable to recreate it with V7 to debug it properly, for now if you can try morogoku kernel, and let me know if things change.

it is likely to be the same issue we faced with hades Q port, it was a ram issue due to bad ram configs and aggressive SWAP. we fixed it by setting swappiness to 100. again im not sure if its the same issue here, but thats my best guess since they are practically the same ROM.
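For anyone who wants to check the swappiness value mentioned above on their own device, here is a minimal sketch. It assumes the standard Linux procfs path `/proc/sys/vm/swappiness`; how the ROM or kernel actually sets the value (init script, sysctl, kernel default) is not stated in this thread, and the helper names are made up for illustration.

```python
from pathlib import Path

# Standard procfs location of the knob on Linux-based systems (assumed to
# apply to this device as well; not confirmed in the thread).
SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")


def parse_swappiness(text: str) -> int:
    """Parse the contents of the swappiness file into an integer."""
    value = int(text.strip())
    if value < 0:
        raise ValueError(f"swappiness cannot be negative: {value}")
    return value


def read_swappiness(path: Path = SWAPPINESS_PATH) -> int:
    """Read the current value. Writing a new value would require root."""
    return parse_swappiness(path.read_text())


if __name__ == "__main__":
    try:
        print("vm.swappiness =", read_swappiness())
    except OSError as exc:
        print("could not read swappiness:", exc)
```

This only reads the value; it does not change anything on the device.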

i will finish my semester this week, hopefully ill have enough time then to continue working on V7

RoughlyAdvanced commented 2 years ago

I actually inserted a full log but felt that it contained some sensitive information so I removed the comment. If you may tell me which logs are most useful for you I will try to create them.

I would like to mention that the freezes are even happening at the initial boot screen (The splash screen with the device name). The freezes don't always happen randomly, a random freeze occurs then the device would keep freezing about 5-6 times in a row for about 30-60 minutes. Yesterday I likely reproduced the issue in some weird way, I brought the phone to freeze like 3 times by opening the office lens app and trying to share a pdf file from it. It sounds weird and I have a feeling that it's just random timing but I will try to do it again. And in the meantime while the device is in a panic, after locking the phone it becomes too sluggish and unresponsive to wake it up, until it freezes again. The freeze panics are becoming less frequent but really annoying and worrying. To stop this panic I usually leave the phone off for about 15 minutes then it returns to normal after booting.

I've never worked with kernels, and this is my primary device so I can't risk my time experimenting with it, and I'm a student too, studying computer engineering, so good luck in your exams.

Edit 1: Forgot to mention that the phone heats up very much during the freeze.

RoughlyAdvanced commented 2 years ago

I noticed that the notification LED gets stuck in its last state when the device freezes (blinking for a notification, red while charging, ...). Here is an error log that was being captured when the device froze; logging stopped at the moment of the freeze. log_e.log

RoughlyAdvanced commented 2 years ago

I did a loop kernel log and managed to bring the device to crash. Kernel_msg.txt

RoughlyAdvanced commented 2 years ago

I just tried to flash morokernel and the device froze in the middle of two installations, and if I remember correctly the device hung in TWRP during my first installation of Floyd Q. So I guess it's a kernel issue.

Edit: The system just won't stay up for a couple of minutes on morokernel. I have some exams coming and I need my phone to be usable when I need it. I'm flashing stock pie for now, I really loved the ROM, but losing some of the Spen functionality and the iris scanner convinced me to go back to pie. I will try Floyd Q V7 when it comes out. Good luck.

ananjaser1211 commented 2 years ago

ok so ive done some investigation over the past couple weeks. with the new V7 base im not getting the freeze at all. however, you mentioned the phone hanging on the TWRP installation screen and on the boot logo, which are not characteristics of the freeze issue we have/had. im wondering if your bootloader and modem are up-to-date, and whether your phone is suffering from some other issue, as its very odd to hang in TWRP or on the initial boot screen.

so for starters id say make sure the latest stock is flashed via odin and is up-to-date (latest bootloader package), and from there, do a full format of internal storage. since the phone is working fine on stock, im not sure why it is acting up yet. will need to look more into the logs.

RoughlyAdvanced commented 2 years ago

It seems that my device is messed up for some reason. Before I flashed Floyd Q I was using a debloated version of stock pie, and out of nowhere I started to have some random short freezes followed by a restart about once every three days. It wasn't a huge problem so I didn't mind it too much.

Now, after I reflashed the official stock pie package, I'm having the exact same freezing issue I had in Floyd Q, but in the form of the same restarts I used to have before. I actually had such restarts in Floyd Q too, but only a few. So my best guess is that it's a hardware issue that was being handled differently by different kernels? I flashed using odin without any wiping or formatting and then used the device immediately, so could there be any leftovers that are causing the increased restart rate?

I'm still trying to find the cause of the issue. It seems that the main trigger is a medium to high load on the CPU. The bootloader I'm using is U6 and the latest is S8, so I don't think it's old enough to cause such problems.

I left the phone aside for now, it isn't reliable anymore. The issue is probably a hardware or some low-level issue and has nothing to do with Floyd Q. Thanks for the help either way, I'll try to post an update to conclude the issue.

Again, I'm trying to examine every kind of log I know of. If there are any logs that you think may be useful for finding such an issue, could you please mention them? And please tell me if the issue should be closed in case it turns out not to be Floyd Q related.

ananjaser1211 commented 2 years ago

as per Dave in our group, "Try a new battery as the first option. If it is a hardware issue this is a possible culprit and the cheapest option to start with"

a shutdown / freeze on high load implies that. assuming you never changed your battery, it is likely dead. the other culprit would be UFS, which broke on my S7 just a couple of months ago until i swapped the board.

your best bet to log this issue is to run the normal log_v and use the phone until it reboots, so it captures whatever happened on the system end. then, once the phone fully boots back up, run log_kmsg, which will take a kernel log of the previous crash. note: this must be done right away, as that log is recreated on every reboot, so another reboot might overwrite your crash reason.

the KMSG might show something interesting, such as a storage (ufs) issue, or undervolting issue (battery)

dont worry about the issue and github technicalities, ill be managing them once i finish V7

RoughlyAdvanced commented 2 years ago

Thank you so much for the info.

last_kmsg is the first thing I thought of, but the file is unreachable without root and I did not root yet, so definitely this is the first thing I'll check.

From my past experiences, changing the battery was a solution to random reboots on my previous galaxy S4 and galaxy S2, but the batteries were already horrible at that time. The battery of my device now is not in a bad condition and it still has about 75% of its health. Yes you're right, but finding a good battery for a Note FE is not an easy task locally, so I'll have to find the issue before considering buying a new battery.

I couldn't find anything useful in the full logs that I took; they usually get cut off in the middle of nowhere and the last recorded actions are some normal application activities. It wouldn't cost me anything, so I'll do it again once I root, to get the kernel logs as well.

ulikolek commented 2 years ago

I am experiencing the same thing, it is after messing up the kernel. Is there any stock kernel to be installed? I am using cronos kernel. Tried to install moro but failed and then the freezing started to happen.

ulikolek commented 2 years ago

It is just my hypothesis. I often get stuck in recovery too when I am trying to install a rom, with the same freezing problem. Then I let the phone heat up just enough (I also use the flip cover). After that, I am able to install the rom normally. Weird but works. Maybe it is a hardware problem, idk.

ExtremeXT commented 2 years ago

Why not try to FULLY wipe your device with ODIN and install the LATEST stock ROM? https://technastic.com/odin-nand-erase-samsung/

RoughlyAdvanced commented 2 years ago

> I am experiencing the same thing, it is after messing up the kernel. Is there any stock kernel to be installed? I am using cronos kernel. Tried to install moro but failed and then the freezing started to happen.

You'll have to flash the stock rom I guess. Try to flash the stock rom and tell us how it behaves. BTW, what device are you using?

> It is just my hypothesis. I often get stuck on recovery too when I am trying to install rom, the freezing problem. Then i let the phone heat up just enough, I also use the flip cover. After it, I am able to install rom normally. Weird but works. Maybe it is a hardware problem, idk.

The freezing in recovery also happened but with no restarts in both stock recovery and TWRP after flashing stock rom and kernel, so I guess it's not a kernel issue. I have never let the phone heat up for more than 2 hours, how long did it take to unfreeze? BTW, I was using the TWRP version Anan posted in the device forum, here.

> Why not try to FULLY wipe your device with ODIN and install the LATEST stock ROM? https://technastic.com/odin-nand-erase-samsung/

I don't know if a NAND wipe is necessary, since the problem was there before flashing any custom rom but was not too critical. I might give it a shot anyway, especially since download mode is the only place safe from freezing. Looking at the behavior of the device, I'm leaning towards the issue being battery related. I'm busy this week so I'll do some research again after I finish.

anonWilder commented 2 years ago

hi, i recently got the floyd rom and im experiencing some freeze between usage. I saw the bug has been removed so im asking, when are we getting the new release

RoughlyAdvanced commented 2 years ago

I got these using dump state logs, could they be useful somehow?

dumpstate_lastkmsg_20220611_013329_DP.log.gz
dumpstate_lastkmsg_20220611_151453_DP.log.gz
dumpstate_lastkmsg_20220611_151828_DP.log.gz
dumpstate_lastkmsg_20220611_152855_KP.log.gz
dumpstate_lastkmsg_20220611_154756_DP.log.gz

Edit: This is the full dump state log file, I will try to find something useful in it. In the meantime, I would be happy if anyone could figure anything out of these logs. log_half1.zip log_half2.zip

RoughlyAdvanced commented 2 years ago

In the file with KP at the end, the last_kmsg has a line saying "<0>[ 5.984659] [1: swapper/0: 1] Unable to handle kernel NULL pointer dereference at virtual address 000000b0", could this be the reason?
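For reading lines like the one quoted above, it can help to split them into fields. The sketch below assumes the layout seen in this device's logs: `<level>[timestamp] [cpu: task: pid] message`. The field names and the `KmsgLine` type are my own labels, not something defined by the kernel or the ROM. The leading number is the standard printk log level, where 0 is the most severe (emergency) and 6 is informational.

```python
import re
from typing import NamedTuple, Optional


class KmsgLine(NamedTuple):
    level: int        # printk level: 0 = emergency ... 6 = info, 7 = debug
    timestamp: float  # seconds since boot
    cpu: int
    task: str
    pid: int
    message: str


# Layout assumed from the logs in this thread; other kernels may differ.
KMSG_RE = re.compile(
    r"^<(?P<level>\d+)>"
    r"\[\s*(?P<ts>\d+\.\d+)\]\s*"
    r"\[(?P<cpu>\d+):\s*(?P<task>.+?):\s*(?P<pid>\d+)\]\s?"
    r"(?P<msg>.*)$"
)


def parse_kmsg_line(line: str) -> Optional[KmsgLine]:
    """Split one kmsg line into fields, or return None if it does not match."""
    m = KMSG_RE.match(line.strip())
    if m is None:
        return None
    return KmsgLine(
        level=int(m["level"]),
        timestamp=float(m["ts"]),
        cpu=int(m["cpu"]),
        task=m["task"],
        pid=int(m["pid"]),
        message=m["msg"],
    )


if __name__ == "__main__":
    quoted = ("<0>[ 5.984659] [1: swapper/0: 1] Unable to handle kernel "
              "NULL pointer dereference at virtual address 000000b0")
    print(parse_kmsg_line(quoted))
```

Here the task is `swapper/0` with pid 1 at about 6 seconds after boot, which is consistent with a crash during early kernel initialisation rather than in an app.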

RoughlyAdvanced commented 2 years ago

I just brought the device to restart with debug mode set to mid, and the screen that appears after the crash now shows "WatchDog Reset". I think that this is what causes the restarts, but what could prevent the watchdog timer from resetting and cause the crash?

All the logs I retrieved lately were taken after I did a nand erase, so it's definitely a hardware issue. (Even though, the restarts became less frequent after a nand erase) IMG_20220612_135142

ananjaser1211 commented 2 years ago

Been trying to read the dumpstates. as you pointed out, the crash is "Unable to handle kernel NULL pointer dereference at virtual address 000000b0", which traces to

<0>[    6.016209]  [0:      swapper/0:    1] [<ffffffc000085f64>] el1_da+0x24/0x84
<0>[    6.016230]  [0:      swapper/0:    1] [<ffffffc000464284>] pm_generic_runtime_resume+0x28/0x38
<0>[    6.016249]  [0:      swapper/0:    1] [<ffffffc000855a78>] sysmmu_runtime_resume+0x118/0x130
<0>[    6.016268]  [0:      swapper/0:    1] [<ffffffc00046c864>] pm_genpd_default_restore_state+0x68/0x70
<0>[    6.016286]  [0:      swapper/0:    1] [<ffffffc00046dbc0>] pm_genpd_runtime_resume+0x160/0x200
<0>[    6.016304]  [0:      swapper/0:    1] [<ffffffc0004659dc>] __rpm_callback+0x40/0x74
<0>[    6.016321]  [0:      swapper/0:    1] [<ffffffc000465a6c>] rpm_callback+0x5c/0x80
<0>[    6.016336]  [0:      swapper/0:    1] [<ffffffc000466384>] rpm_resume+0x444/0x4d8
<0>[    6.016352]  [0:      swapper/0:    1] [<ffffffc000466f00>] __pm_runtime_resume+0x4c/0x70
<0>[    6.016369]  [0:      swapper/0:    1] [<ffffffc000720be0>] exynos_smfc_probe+0x1e0/0x574
<0>[    6.016386]  [0:      swapper/0:    1] [<ffffffc00045f67c>] platform_drv_probe+0x50/0x9c
<0>[    6.016405]  [0:      swapper/0:    1] [<ffffffc00045ddf4>] driver_probe_device+0xd4/0x238
<0>[    6.016422]  [0:      swapper/0:    1] [<ffffffc00045e008>] __driver_attach+0x64/0x90
<0>[    6.016440]  [0:      swapper/0:    1] [<ffffffc00045c560>] bus_for_each_dev+0x80/0xb0
<0>[    6.016457]  [0:      swapper/0:    1] [<ffffffc00045e0f0>] driver_attach+0x20/0x28
<0>[    6.016474]  [0:      swapper/0:    1] [<ffffffc00045ce38>] bus_add_driver+0xf0/0x1b8
<0>[    6.016491]  [0:      swapper/0:    1] [<ffffffc00045eb88>] driver_register+0x94/0xe0
<0>[    6.016507]  [0:      swapper/0:    1] [<ffffffc0004600ec>] __platform_driver_register+0x60/0x68
<0>[    6.016526]  [0:      swapper/0:    1] [<ffffffc0013ac5c8>] exynos_smfc_driver_init+0x18/0x20
<0>[    6.016544]  [0:      swapper/0:    1] [<ffffffc001379d3c>] do_one_initcall+0x188/0x1a4
<0>[    6.016561]  [0:      swapper/0:    1] [<ffffffc001379f08>] kernel_init_freeable+0x1b0/0x274
<0>[    6.016578]  [0:      swapper/0:    1] [<ffffffc000affccc>] kernel_init+0x10/0xfc
<0>[    6.016595]  [0:      swapper/0:    1] Code: 7100081f 2a0003e2 540000a0 d0005d61 (f9405a60) 
<4>[    6.016673]  [0:      swapper/0:    1] ---[ end trace 0d7f4a42c3dbcf3a ]---
<2>[    6.059962]  [0:      swapper/0:    1] sec_debug_store_backtrace
<0>[    6.060072]  [0:      swapper/0:    1] Kernel panic - not syncing: Fatal exception
<0>[    6.060072]  [0:      swapper/0:    1] PC is at smfc_runtime_resume+0x38/0x54
<0>[    6.060072]  [0:      swapper/0:    1] LR is at smfc_runtime_resume+0x28/0x54

Panic at smfc_runtime_resume which is here

dev_err(smfc->dev, "fail to set cfw protection (%d)\n", ret);
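To make a backtrace like the one above easier to read, the function names can be pulled out and listed in call order. This is a small sketch, assuming the `[<address>] function+0xoffset/0xsize` frame format shown in this log; kernel backtraces list the innermost frame first, so reversing the list gives the order in which the functions were called. The helper name is hypothetical.

```python
import re

# One stack frame: [<address>] function+0xoffset/0xsize
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<func>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)

# Three frames copied from the trace posted above.
SAMPLE = "\n".join([
    "<0>[    6.016369]  [0:      swapper/0:    1] [<ffffffc000720be0>] exynos_smfc_probe+0x1e0/0x574",
    "<0>[    6.016386]  [0:      swapper/0:    1] [<ffffffc00045f67c>] platform_drv_probe+0x50/0x9c",
    "<0>[    6.016405]  [0:      swapper/0:    1] [<ffffffc00045ddf4>] driver_probe_device+0xd4/0x238",
])


def extract_frames(text):
    """Return function names in the order they appear (innermost first)."""
    return [m["func"] for m in FRAME_RE.finditer(text)]


if __name__ == "__main__":
    frames = extract_frames(SAMPLE)
    # Reverse so the outermost caller comes first.
    print(" -> ".join(reversed(frames)))
```

Lines such as "PC is at ..." carry no bracketed address, so this pattern skips them; they would need a separate pattern if wanted.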

we can see in your kmsg

exynos5-scaler 15000000.scaler: fail to set cfw protection (-1)
exynos5-scaler 15010000.scaler: fail to set cfw protection (-1)
[g2d_runtime_resume:651] fail to set cfw protection (-1)

Then the crash happens: Unable to handle kernel NULL pointer dereference at virtual address 000000b0

i dont know what they mean by CFW in this context, cache first write? anyway, all this SMFC CFW junk comes back to secure path protection, which as far as i know is for DRM playback

config EXYNOS_CONTENT_PATH_PROTECTION
    bool "Exynos Content Path Protection"
    default y
    help
      Enable content path protection of EXYNOS.

This extends the MFC and scaler drivers; the latter is the last thing to load before the kernel panic happens.

i should mention that content protection is also critical for IRIS usage, so i guess if you use iris, disable it? i was just debugging iris today and i noticed this from SecDrv

<6>[    4.175049]  [2:    kworker/2:1: 1706] Trustonic TEE: 601|[srpmb] Driver: SRPMB F/W version: 11160608
<6>[    4.175120]  [2:    kworker/2:1: 1706] Trustonic TEE: 601|SEC_DRIVER_2.0_2018.09.12
<6>[    4.175133]  [2:    kworker/2:1: 1706] Trustonic TEE: 601|[Error]:SecDrv:: drCfwInit(): cfw_disp11 drApiStartThread failed

Now looking at a healthy N7FE kmsg and comparing it to yours, i notice a few things for SRPMB

your device crashes and reports

[ldfw] Pass LDFW partition!
[ldfw] read whole CM partition from the storage
ldfw: 0th ldfw's version 0x30160930 name : CryptoManagerV30
ldfw: 1th ldfw's version 0x20160406 name : fmp_fw_V20
ldfw: 2th ldfw's version 0x11181119 name : drm_fw
ldfw: 3th ldfw's version 0x11160608 name : srpmb_fw
ldfw: 4th ldfw's version 0x20150828 name : tail_fw
ldfw: init ldfw(s). whole ldfws size 0x255110
[ldfw] signature of ldfw is corrupted.!
[mobi_drv] add: 0xd009d290, size: 17005
MobiCore RTM failed to initialize

a good N7FE reports


[ldfw] Pass LDFW partition!
[ldfw] read whole CM partition from the storage
ldfw: 0th ldfw's version 0x30160930 name : CryptoManagerV30
ldfw: 1th ldfw's version 0x20160406 name : fmp_fw_V20
ldfw: 2th ldfw's version 0x11181119 name : drm_fw
ldfw: 3th ldfw's version 0x11160608 name : srpmb_fw
ldfw: 4th ldfw's version 0x20150828 name : tail_fw
ldfw: init ldfw(s). whole ldfws size 0x255110
[ldfw] try to init 4 ldfw(s). except 0 ldfw 4 ldfw(s) have been inited done.
[mobi_drv] add: 0xd009d290, size: 17005
MobiCore IDLE flag = 0
MobiCore Driver loaded and RTM IDLE!
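The comparison between the crashing and the healthy boot log can also be done mechanically. The sketch below is a simple set-based comparison, assuming the two excerpts have no timestamps (as in the blocks above); with timestamped logs the prefix would have to be stripped first. The excerpts are shortened copies of the logs posted here, and the function name is made up.

```python
# Shortened excerpts from the two logs posted above.
CRASHING = """
ldfw: init ldfw(s). whole ldfws size 0x255110
[ldfw] signature of ldfw is corrupted.!
[mobi_drv] add: 0xd009d290, size: 17005
MobiCore RTM failed to initialize
"""

HEALTHY = """
ldfw: init ldfw(s). whole ldfws size 0x255110
[ldfw] try to init 4 ldfw(s). except 0 ldfw 4 ldfw(s) have been inited done.
[mobi_drv] add: 0xd009d290, size: 17005
MobiCore IDLE flag = 0
MobiCore Driver loaded and RTM IDLE!
"""


def only_in_each(first: str, second: str):
    """Return (lines only in first, lines only in second), keeping log order."""
    a = [line.strip() for line in first.splitlines() if line.strip()]
    b = [line.strip() for line in second.splitlines() if line.strip()]
    a_set, b_set = set(a), set(b)
    return ([x for x in a if x not in b_set], [x for x in b if x not in a_set])


if __name__ == "__main__":
    only_crashing, only_healthy = only_in_each(CRASHING, HEALTHY)
    print("only in crashing log:")
    for line in only_crashing:
        print("  ", line)
    print("only in healthy log:")
    for line in only_healthy:
        print("  ", line)
```

For these excerpts the lines unique to the crashing device are the corrupted ldfw signature and the MobiCore RTM initialisation failure, which matches what was spotted by eye.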

i dont know what the "CM" partition is, or if this is caused by the earlier panic (keeping in mind that the N7FE i took the kmsg from also had a nice panic). digging more into RPMB

your N7FE

[EFUSE] Already fused for Market status
[EFUSE] This is commercial device.
eSE Protection error! ffffffff
eSE Protection!!
Secure camera - invalid fastcall id ffffffff
Authenticated data read request (Swapped)
Authenticated data read response (Swapped)
[CM] RPMB: error in CryptoManager F/W: 0xFFFFFFFF
RPMB: get hamc value: fail: 0xFFFFFFFF
HMAC compare fail !!
HMAC Host value
HMAC Device value blk_cnt 1 i 1
Authenticated data read response (Not Swapped)
Authenticated data read response (Swapped)
[CM] RPMB: failed to block key: 0xFFFFFFFF
initialize_secdata_rpmb: usable! (0x52504d42), 1
set_fuse_history : no reason
fuse history 0x0
DDR SIZE: 4G

we expect (working n7fe)

[EFUSE] Already fused for Market status
[EFUSE] This is commercial device.
eSE Protection!!
Secure camera - set_tzpc_secure_camera: successfully protected 0
Authenticated data read request (Swapped)
Authenticated data read response (Swapped)
RPMB: get hmac value: success
HMAC compare success !!

initialize_secdata_rpmb: usable! (0x52504d42), 1
set_fuse_history : no reason
fuse history 0x0
DDR SIZE: 4G

CryptoManager goes back to mobicore, RPMB ? i dont know, RPMB rings a bell, im not sure where.

sd 0:0:0:1: rpmb wsm smc init failed: ffffffff

RPMB also seems to have issues, if this log line is to be believed, as i dont see it in the working N7FE log

i dont know to be honest, another part of your KMSG shows RPMB as being "fine" so i am confused, for all i know this might be some stock rom security mitigations (similar to defex etc) maybe its best if you continue your tests on a custom ROM since we disable stuff like WSM, DEFEX and other security shit that goes haywire with knox fuse.

ive only gone through the KP log (kernel panic) and prev_dump from log_half2

if i get time ill check the rest of the logs

RoughlyAdvanced commented 2 years ago

Man I just don't know how to thank you for doing so much research for this.

As I said, I did a successful NAND erase and the device is clean with no apps or accounts, not even a lockscreen. Everything is stock, so the IRIS scanner has not been used for a while. The problem is that there are so few custom roms for this device, and I can't guarantee that these roms will not add noise from other problems. I would like to try and flash Floyd Q, but it seems that cronos kernel and morokernel do not restart when the freeze happens, so getting KMSGs is going to be a pain if every restart is a forced restart. I sent the device to a repair shop two days ago and I'm expecting to get it back tomorrow. Looking at the logs, I suspect that the problem is faulty memory.

I wanted to point out that the logs do include shutdown reasons, and they are generally one of the following: NP, RP, DP or KP. NP is a user requested shutdown. RP seems to be a user requested restart. KP seems to be a kernel panic. DP happens with a freeze and a restart, and it happens more often than KP, but I never found its actual meaning online.

I set the debug mode to medium when retrieving these logs, but the device will boot to upload mode after every crash, so you'll find many user requested restarts next to the DP and KP occurrences, and there doesn't seem to be any way to disable upload mode on this device. The records of the freezes in the shutdown reasons are either DP or KP.

I don't know if cronos kernel not restarting is a bug, but this is an actual problem that may prevent logging kernel panics, so it should be examined. I will continue examining after I get the device back if the problem is not solved. I asked them not to repair anything beyond changing the battery in case that fixes it.
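Since the reason code is part of each dumpstate file name, the crashes can be tallied without opening the archives. The sketch below assumes the naming pattern of the files attached earlier in this thread (`dumpstate_lastkmsg_<date>_<time>_<code>.log.gz`). The meanings in the table are only the readings given in the comment above, not official definitions, and DP is left as unknown.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern assumed from the attachment names in this thread.
NAME_RE = re.compile(r"dumpstate_lastkmsg_(\d{8})_(\d{6})_([A-Z]+)\.log\.gz$")

# Readings from the comment above; not official definitions.
REASONS = {
    "NP": "user requested shutdown",
    "RP": "user requested restart (guess)",
    "KP": "kernel panic (guess)",
    "DP": "unknown, seen with freeze + restart",
}

FILES = [
    "dumpstate_lastkmsg_20220611_013329_DP.log.gz",
    "dumpstate_lastkmsg_20220611_151453_DP.log.gz",
    "dumpstate_lastkmsg_20220611_151828_DP.log.gz",
    "dumpstate_lastkmsg_20220611_152855_KP.log.gz",
    "dumpstate_lastkmsg_20220611_154756_DP.log.gz",
]


def parse_name(name):
    """Return (timestamp, reason code) for a dumpstate file name, or None."""
    m = NAME_RE.search(name)
    if m is None:
        return None
    date, time, code = m.groups()
    return datetime.strptime(date + time, "%Y%m%d%H%M%S"), code


def tally(names):
    """Count how often each reason code appears in the given file names."""
    parsed = (parse_name(n) for n in names)
    return Counter(code for item in parsed if item for _, code in [item])


if __name__ == "__main__":
    for code, count in tally(FILES).most_common():
        print(f"{code}: {count}  ({REASONS.get(code, 'not listed')})")
```

For the five attached files this gives four DP entries and one KP entry, in line with the observation that DP happens more often than KP.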

Thanks for all your efforts

ananjaser1211 commented 2 years ago

For future Freezing bugs please continue discussion in #262 if the issue is outside of hardware.