TrungNguyen1909 / qemu-t8030

iPhone 11 emulated on QEMU

Freezes and timeouts #65

Closed dom-lgtm closed 2 years ago

dom-lgtm commented 2 years ago

Hey,

I'm having trouble restoring because of timeouts and freezes. Any advice?

I also notice that the VMs (Linux or iOS) often freeze for no reason until I focus the GUI window.

dom-lgtm commented 2 years ago

My latest try:

restore-step-ids = {0x1103068B:124}
restore-step-names = {0x1103068B:seal_system_volume}
restore-step-uptime = 1287
restore-step-user-progress = 97
[07:41:05.0563-GMT]{3>6} CHECKPOINT BEGIN: RESTORED:[0x0648] cleanup_send_final_status
restore-step-ids = {0x1103068B:124;0x11030648:128}
restore-step-names = {0x1103068B:seal_system_volume;0x11030648:cleanup_send_final_status}
restore-step-uptime = 1287
restore-step-user-progress = 97

ERROR: Unable to successfully restore device
Checkpoint 1608 complete with code 0
FDR 0x55cc57d7d190 terminating...
ERROR: Unable to restore device
dom-lgtm commented 2 years ago

The VM then rebooted and panicked after some time:

panic(cpu 0 caller 0xfffffff02735ccc8): watchdog timeout: no checkins from watchdogd since boot (601 seconds ago)
Debugger message: panic
Memory ID: 0x0
OS release type: User
OS version: 19F77
Kernel version: Darwin Kernel Version 21.5.0: Thu Apr 21 21:51:27 PDT 2022; root:xnu-8020.122.1~1/RELEASE_ARM64_T8030
Kernel UUID: 8DA68E05-5BBB-3D3A-ACF5-7EA227ADDFD7
iBoot version: qemu-t8030
secure boot?: YES
Paniclog version: 13
Kernel slide:      0x000000001d908000
Kernel text base:  0xfffffff02490c000
mach_absolute_time: 0x3613ae4f0
Epoch Time:        sec       usec
  Boot    : 0x62bc031f 0x000ce543
  Sleep   : 0x00000000 0x00000000
  Wake    : 0x00000000 0x00000000
  Calendar: 0x62bc0579 0x00031619

Zone info:
  Zone map: 0xffffffe001eb8000 - 0xffffffe601eb8000
  . VM    : 0xffffffe001eb8000 - 0xffffffe0e851c000
  . RO    : 0xffffffe0e851c000 - 0xffffffe1351e8000
  . GEN0  : 0xffffffe1351e8000 - 0xffffffe21b84c000
  . GEN1  : 0xffffffe21b84c000 - 0xffffffe301eb0000
  . GEN2  : 0xffffffe301eb0000 - 0xffffffe3e8518000
  . GEN3  : 0xffffffe3e8518000 - 0xffffffe4ceb80000
  . DATA  : 0xffffffe4ceb80000 - 0xffffffe601eb8000
  Metadata: 0xffffffe72c498000 - 0xffffffe72dc98000
  Bitmaps : 0xffffffe72dc98000 - 0xffffffe72f1b4000

CORE 0 is the one that panicked. Check the full backtrace for details.
CORE 1: PC=0xfffffff0255b6328, LR=0xfffffff0255b6324, FP=0xffffffe73584be90
CORE 2: PC=0xfffffff0255b6328, LR=0xfffffff0255b6324, FP=0xffffffe7358ebe90
CORE 3: PC=0xfffffff0255b6328, LR=0xfffffff0255b6324, FP=0xffffffe73592be90
Compressor Info: 0% of compressed pages limit (OK) and 0% of segments limit (OK) with 0 swapfiles and OK swap space
Total cpu_usage: 48135040
Thread task pri cpu_usage
0xffffffe21b7c8dc0 kernel_task 0 418935
0xffffffe21b7c2940 kernel_task 0 0
0xffffffe21b7b94a0 kernel_task 0 5511560
0xffffffe21b7d94a0 kernel_task 0 0

Panicked task 0xffffffe21b7ec668: 0 pages, 111 threads: pid 0: kernel_task
Panicked thread: 0xffffffe21b7c8dc0, backtrace: 0xffffffe7352bf860, tid: 102
          lr: 0xfffffff02547472c  fp: 0xffffffe7352bf8a0
          lr: 0xfffffff025474444  fp: 0xffffffe7352bf910
          lr: 0xfffffff0255c26c4  fp: 0xffffffe7352bf930
          lr: 0xfffffff0255b34b0  fp: 0xffffffe7352bf9a0
          lr: 0xfffffff0255b22dc  fp: 0xffffffe7352bfa60
          lr: 0xfffffff025bb8610  fp: 0xffffffe7352bfa70
          lr: 0xfffffff025474120  fp: 0xffffffe7352bfe00
          lr: 0xfffffff025474120  fp: 0xffffffe7352bfe60
          lr: 0xfffffff0274f801c  fp: 0xffffffe7352bfe80
          lr: 0xfffffff02735ccc8  fp: 0xffffffe7352bfeb0
          lr: 0xfffffff02735c250  fp: 0xffffffe7352bfef0
          lr: 0xfffffff02735a670  fp: 0xffffffe7352bff10
          lr: 0xfffffff026dfd490  fp: 0xffffffe7352bffb0
          lr: 0xfffffff025b02b90  fp: 0xffffffe7352bffc0
          lr: 0xfffffff0255b3f0c  fp: 0xffffffe7352bffe0
          lr: 0xfffffff025bb8684  fp: 0xffffffe7352bfff0
          lr: 0xfffffff0255b6324  fp: 0xffffffe7358abe90
          lr: 0xfffffff0254a7fb8  fp: 0xffffffe7358abf00
          lr: 0xfffffff0254a8234  fp: 0xffffffe7358abf20
          lr: 0xfffffff025bc0920  fp: 0x0000000000000000

** Stackshot Succeeded ** Bytes Traced 49210 (Uncompressed 133568) **
IOPlatformPanicAction -> AppleANS2NVMeController
IOPlatformPanicAction -> AppleT8030PMGR
IOPlatformPanicAction -> AppleARMWatchdogTimer
IOPlatformPanicAction -> AppleNubSynopsysOTG3Device
IOPlatformPanicAction -> RTBuddyV2
IOPlatformPanicAction -> RTBuddyV2
IOPlatformPanicAction -> RTBuddyV2
IOPlatformPanicAction -> AppleSMC
IOPlatformPanicAction -> AppleANS2NVMeController
IOPlatformPanicAction -> AppleT8030PMGR
IOPlatformPanicAction -> AppleARMWatchdogTimer
IOPlatformPanicAction -> AppleNubSynopsysOTG3Device
IOPlatformPanicAction -> RTBuddyV2
IOPlatformPanicAction -> RTBuddyV2
IOPlatformPanicAction -> RTBuddyV2
IOPlatformPanicAction -> AppleSMC

Please go to https://panic.apple.com to report this panic
TrungNguyen1909 commented 2 years ago

For the second log, you can add wdt=-1 inside the quotes after -append as a workaround.

The first one is expected for newer iOS.

I don't have a solution yet for the freezes when the GUI window is not focused; macOS seems to throttle unfocused windows.
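For reference, this is roughly what the workaround looks like on the command line. Only the wdt=-1 boot-arg is the actual suggestion; the machine type and the rest of the invocation are placeholders for whatever command line you already use:

```shell
# Hypothetical sketch: keep your existing flags and only extend -append.
# wdt=-1 is the XNU boot-arg that disables the watchdog timer, so missed
# watchdogd check-ins no longer trigger the "watchdog timeout" panic.
qemu-system-aarch64 \
    -M t8030 \
    -append "wdt=-1"
```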

dom-lgtm commented 2 years ago

Thanks, I will play with it a bit more.

dom-lgtm commented 2 years ago

I had the same throttling issue on a project a few years back but I never managed to fix it :/

dom-lgtm commented 2 years ago

I fixed the freezes and timeouts with: https://github.com/TrungNguyen1909/qemu-t8030/pull/68/files

TrungNguyen1909 commented 2 years ago

@dom-lgtm Hello, I get that removing the mutex code can possibly work, but your PR looks very sketchy to me.

Can you first add --enable-debug-info to the configure command and rebuild the whole thing (make)?

Then reproduce the hangs, use gdb/lldb to attach to both the Linux VM and the iOS VM while they hang, run thread apply all bt (gdb) or bt all (lldb), and send me the FULL output of both (just what's shown in the debugger)?

Thanks.

dom-lgtm commented 2 years ago

With all due respect, it's your use of mutexes I find sketchy. Are you sure you understand mutexes?

  1. Can you explain why you need to lock iothread mutex in this code?
  2. Why is a thread basically locking the iothread mutex forever? https://github.com/TrungNguyen1909/qemu-t8030/blob/4de8baf00a70f1c612028c0886db7d98a1e8486d/hw/usb/dev-tcp-remote.c#L328
  3. Why do you need unsafe unlocks everywhere to fix 2? (e.g. https://github.com/TrungNguyen1909/qemu-t8030/blob/4de8baf00a70f1c612028c0886db7d98a1e8486d/hw/usb/dev-tcp-remote.c#L172)

/ cc @siguza

TrungNguyen1909 commented 2 years ago

Hello @dom-lgtm,

All the mutexes usage in hw/usb/hcd-tcp.c is based off here.

  1. This thread doesn't lock the mutex forever. If you follow the calls, you'll see that it is eventually unlocked in usb_tcp_remote_read_one() -> usb_tcp_remote_read(): it unlocks before the read and relocks afterwards (because read blocks). The reason for that design is to make sure that all the code in between (the processing) is synchronized.

Yes, you can say that all the other queue locks are useless (they are there for historical reasons). But I want all the processing code to run under the iothread lock, because EHCI runs inside the IOThread (implicitly, through qemu_bh).

And with all due respect, please send me the backtraces I requested so that we can figure this out.

Thanks.

dom-lgtm commented 2 years ago

@TrungNguyen1909 Do I understand correctly that if there is data available to read on the socket, the iothread mutex is essentially never unlocked, except for the few instructions during which another thread might, with lots of luck, manage to grab it for its own use? That could explain the lockups, and why my fix removes them.

TrungNguyen1909 commented 2 years ago

@dom-lgtm, without your backtrace, I am unable to explain the lockups for you.

Well yes, assuming read constantly has data coming in, the IOThread is mostly locked.

That said, I figure that maybe we should remove all the other locks except the IOThread one and make sure that no deadlocks are possible; as long as every processing bit stays under the IOThread lock, we can still be thread-safe.

dom-lgtm commented 2 years ago

Sorry, I was out for the weekend. Here is a bt from the Linux VM (both Linux and iOS hang).

(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007ff8097c797a libsystem_kernel.dylib`mach_msg_trap + 10
    frame #1: 0x00007ff8097c7ce8 libsystem_kernel.dylib`mach_msg + 56
    frame #2: 0x00007ff8098cb38d CoreFoundation`__CFRunLoopServiceMachPort + 319
    frame #3: 0x00007ff8098c9a18 CoreFoundation`__CFRunLoopRun + 1276
    frame #4: 0x00007ff8098c8e5c CoreFoundation`CFRunLoopRunSpecific + 562
    frame #5: 0x00007ff8125705e6 HIToolbox`RunCurrentEventLoopInMode + 292
    frame #6: 0x00007ff81257034a HIToolbox`ReceiveNextEventCommon + 594
    frame #7: 0x00007ff8125700e5 HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 70
    frame #8: 0x00007ff80c3081fd AppKit`_DPSNextEvent + 927
    frame #9: 0x00007ff80c3068ba AppKit`-[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 1394
    frame #10: 0x00007ff80c2f8f69 AppKit`-[NSApplication run] + 586
    frame #11: 0x000000010bf595cd qemu-system-aarch64`main(argc=<unavailable>, argv=<unavailable>) at cocoa.m:1989:5 [opt]
    frame #12: 0x0000000117d6751e dyld`start + 462
  thread #2
    frame #0: 0x00007ff8097ca3ea libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007ff809804a6f libsystem_pthread.dylib`_pthread_cond_wait + 1249
    frame #2: 0x000000010c7905a7 qemu-system-aarch64`qemu_event_wait [inlined] qemu_futex_wait(ev=0x000000010ccf8738, val=4294967295) at qemu-thread-posix.c:386:9 [opt]
    frame #3: 0x000000010c790587 qemu-system-aarch64`qemu_event_wait(ev=0x000000010ccf8738) at qemu-thread-posix.c:481:9 [opt]
    frame #4: 0x000000010c7999c0 qemu-system-aarch64`call_rcu_thread(opaque=0x0000000000000000) at rcu.c:261:21 [opt]
    frame #5: 0x000000010c79075e qemu-system-aarch64`qemu_thread_start(args=<unavailable>) at qemu-thread-posix.c:556:9 [opt]
    frame #6: 0x00007ff8098044e1 libsystem_pthread.dylib`_pthread_start + 125
    frame #7: 0x00007ff8097fff6b libsystem_pthread.dylib`thread_start + 15
  thread #3
    frame #0: 0x00007ff8097cfd5a libsystem_kernel.dylib`__select + 10
    frame #1: 0x000000010db3ae0f libglib-2.0.0.dylib`g_poll + 505
    frame #2: 0x000000010c7ac017 qemu-system-aarch64`main_loop_wait [inlined] os_host_main_loop_wait(timeout=61000) at main-loop.c:250:11 [opt]
    frame #3: 0x000000010c7abfd8 qemu-system-aarch64`main_loop_wait(nonblocking=<unavailable>) at main-loop.c:531:11 [opt]
    frame #4: 0x000000010bf96157 qemu-system-aarch64`qemu_main_loop at runstate.c:727:9 [opt]
    frame #5: 0x000000010bf28ede qemu-system-aarch64`qemu_main(argc=50, argv=0x00007ff7b3fdea20, envp=<unavailable>) at main.c:50:5 [opt]
    frame #6: 0x000000010bf59636 qemu-system-aarch64`call_qemu_main(opaque=0x0000000000000000) at cocoa.m:1939:14 [opt]
    frame #7: 0x000000010c79075e qemu-system-aarch64`qemu_thread_start(args=<unavailable>) at qemu-thread-posix.c:556:9 [opt]
    frame #8: 0x00007ff8098044e1 libsystem_pthread.dylib`_pthread_start + 125
    frame #9: 0x00007ff8097fff6b libsystem_pthread.dylib`thread_start + 15
  thread #4
    frame #0: 0x00007ff8097d0512 libsystem_kernel.dylib`__sigwait + 10
    frame #1: 0x00007ff809804c25 libsystem_pthread.dylib`sigwait + 29
    frame #2: 0x000000010c78dde9 qemu-system-aarch64`sigwait_compat(opaque=0x00006000010fc980) at compatfd.c:36:15 [opt]
    frame #3: 0x000000010c79075e qemu-system-aarch64`qemu_thread_start(args=<unavailable>) at qemu-thread-posix.c:556:9 [opt]
    frame #4: 0x00007ff8098044e1 libsystem_pthread.dylib`_pthread_start + 125
    frame #5: 0x00007ff8097fff6b libsystem_pthread.dylib`thread_start + 15
  thread #5
    frame #0: 0x00007ff8097ca3ea libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007ff809804a6f libsystem_pthread.dylib`_pthread_cond_wait + 1249
    frame #2: 0x000000010c78fe4a qemu-system-aarch64`qemu_cond_wait_impl(cond=0x00006000007c8cc0, mutex=0x000000010cce2c00, file="../softmmu/cpus.c", line=424) at qemu-thread-posix.c:195:11 [opt]
    frame #3: 0x000000010bf8fbc8 qemu-system-aarch64`qemu_wait_io_event(cpu=0x00007f8f1f7c5000) at cpus.c:424:9 [opt]
    frame #4: 0x000000010c640221 qemu-system-aarch64`mttcg_cpu_thread_fn(arg=0x00007f8f1f7c5000) at tcg-accel-ops-mttcg.c:124:9 [opt]
    frame #5: 0x000000010c79075e qemu-system-aarch64`qemu_thread_start(args=<unavailable>) at qemu-thread-posix.c:556:9 [opt]
    frame #6: 0x00007ff8098044e1 libsystem_pthread.dylib`_pthread_start + 125
    frame #7: 0x00007ff8097fff6b libsystem_pthread.dylib`thread_start + 15
  thread #6
    frame #0: 0x00007ff8097ca3ea libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007ff809804a6f libsystem_pthread.dylib`_pthread_cond_wait + 1249
    frame #2: 0x000000010c78fe4a qemu-system-aarch64`qemu_cond_wait_impl(cond=0x00006000007c9680, mutex=0x000000010cce2c00, file="../softmmu/cpus.c", line=424) at qemu-thread-posix.c:195:11 [opt]
    frame #3: 0x000000010bf8fbc8 qemu-system-aarch64`qemu_wait_io_event(cpu=0x00007f8f2000b000) at cpus.c:424:9 [opt]
    frame #4: 0x000000010c640221 qemu-system-aarch64`mttcg_cpu_thread_fn(arg=0x00007f8f2000b000) at tcg-accel-ops-mttcg.c:124:9 [opt]
    frame #5: 0x000000010c79075e qemu-system-aarch64`qemu_thread_start(args=<unavailable>) at qemu-thread-posix.c:556:9 [opt]
    frame #6: 0x00007ff8098044e1 libsystem_pthread.dylib`_pthread_start + 125
    frame #7: 0x00007ff8097fff6b libsystem_pthread.dylib`thread_start + 15
  thread #7
    frame #0: 0x00007ff8097ca3ea libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007ff809804a6f libsystem_pthread.dylib`_pthread_cond_wait + 1249
    frame #2: 0x000000010c78fe4a qemu-system-aarch64`qemu_cond_wait_impl(cond=0x00006000007e9580, mutex=0x000000010cce2c00, file="../softmmu/cpus.c", line=424) at qemu-thread-posix.c:195:11 [opt]
    frame #3: 0x000000010bf8fbc8 qemu-system-aarch64`qemu_wait_io_event(cpu=0x00007f8f200be000) at cpus.c:424:9 [opt]
    frame #4: 0x000000010c640221 qemu-system-aarch64`mttcg_cpu_thread_fn(arg=0x00007f8f200be000) at tcg-accel-ops-mttcg.c:124:9 [opt]
    frame #5: 0x000000010c79075e qemu-system-aarch64`qemu_thread_start(args=<unavailable>) at qemu-thread-posix.c:556:9 [opt]
    frame #6: 0x00007ff8098044e1 libsystem_pthread.dylib`_pthread_start + 125
    frame #7: 0x00007ff8097fff6b libsystem_pthread.dylib`thread_start + 15
  thread #8
    frame #0: 0x00007ff8097ca3ea libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007ff809804a6f libsystem_pthread.dylib`_pthread_cond_wait + 1249
    frame #2: 0x000000010c78fe4a qemu-system-aarch64`qemu_cond_wait_impl(cond=0x00006000007ca0c0, mutex=0x000000010cce2c00, file="../softmmu/cpus.c", line=424) at qemu-thread-posix.c:195:11 [opt]
    frame #3: 0x000000010bf8fbc8 qemu-system-aarch64`qemu_wait_io_event(cpu=0x00007f8f20171000) at cpus.c:424:9 [opt]
    frame #4: 0x000000010c640221 qemu-system-aarch64`mttcg_cpu_thread_fn(arg=0x00007f8f20171000) at tcg-accel-ops-mttcg.c:124:9 [opt]
    frame #5: 0x000000010c79075e qemu-system-aarch64`qemu_thread_start(args=<unavailable>) at qemu-thread-posix.c:556:9 [opt]
    frame #6: 0x00007ff8098044e1 libsystem_pthread.dylib`_pthread_start + 125
    frame #7: 0x00007ff8097fff6b libsystem_pthread.dylib`thread_start + 15
  thread #9, name = 'com.apple.NSEventThread'
    frame #0: 0x00007ff8097c797a libsystem_kernel.dylib`mach_msg_trap + 10
    frame #1: 0x00007ff8097c7ce8 libsystem_kernel.dylib`mach_msg + 56
    frame #2: 0x00007ff8098cb38d CoreFoundation`__CFRunLoopServiceMachPort + 319
    frame #3: 0x00007ff8098c9a18 CoreFoundation`__CFRunLoopRun + 1276
    frame #4: 0x00007ff8098c8e5c CoreFoundation`CFRunLoopRunSpecific + 562
    frame #5: 0x00007ff80c475c1e AppKit`_NSEventThread + 132
    frame #6: 0x00007ff8098044e1 libsystem_pthread.dylib`_pthread_start + 125
    frame #7: 0x00007ff8097fff6b libsystem_pthread.dylib`thread_start + 15
  thread #10
    frame #0: 0x00007ff8097fff48 libsystem_pthread.dylib`start_wqthread
(lldb)
dom-lgtm commented 2 years ago

Well yes, assuming read constantly has data coming in, the IOThread is mostly locked.

Could this by chance be the case when the firmware is being transmitted, which is coincidentally when the VMs most often lock up?

TrungNguyen1909 commented 2 years ago

I can't see anything strange here. Can you send me the bt for the iOS VM?

dom-lgtm commented 2 years ago

I can't claim I know much about QEMU or multi-threaded code, but I can't help wondering if spinning is the best choice here:

https://github.com/TrungNguyen1909/qemu-t8030/blob/4de8baf00a70f1c612028c0886db7d98a1e8486d/hw/usb/dev-tcp-remote.c#L630

TrungNguyen1909 commented 2 years ago

@dom-lgtm, You might be wondering, but there is a reason for that.

The QEMU USB stack does not accept USB_RET_NAK after a packet has returned USB_RET_ASYNC, and the only way to defer the completion of a USB packet is USB_RET_ASYNC.

So we have to wait for the other side to respond. It might not be the best way to work around this, but it's the only one we currently have.

dom-lgtm commented 2 years ago

This code is too complex for me. I'm going to work on my own fork so I don't mess with the mutexes.