OE4T / meta-tegra

BSP layer for NVIDIA Jetson platforms, based on L4T
MIT License

[jetson-nano-B01] import python-module face_recognition crashes Kernel #468

Closed elPrac closed 3 years ago

elPrac commented 3 years ago

I recently bought the new Jetson Nano dev kit B01 and tried the same image I already had working on the previous board, the Jetson Nano Developer Kit carrier board A02.

But any time I import face_recognition, I get the following error:

root@jetson-nano-qspi-sd:~# python3
Python 3.7.5 (default, Jul  5 2020, 03:04:45) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import face_recognition
[  382.131942] sched: RT throttling activated for rt_rq ffffffc0fefd8ea8 (cpu 1)
[  382.131942] potential CPU hogs:
[  382.131942]  irq/80-gk20a_st (2148)
[  382.132025] CPU1: SError detected, daif=140, spsr=0x40000045, mpidr=80000001, esr=bf000002
[  403.142401] INFO: rcu_preempt detected stalls on CPUs/tasks:
[  403.148121]  2-...: (0 ticks this GP) idle=4b1/140000000000000/0 softirq=8360/8360 fqs=0 
[  403.156331]  3-...: (1 GPs behind) idle=149/140000000000000/0 softirq=9056/9058 fqs=0 
[  403.164264]  (detected by 0, t=5252 jiffies, g=2309, c=2308, q=148)
[  403.170567] Task dump for CPU 2:
[  403.173810] kworker/u8:1    R  running task        0    34      2 0x00000002
[  403.180940] Workqueue: devfreq_wq devfreq_monitor
[  403.185670] Call trace:
[  403.188147] [<ffffff800808640c>] __switch_to+0x9c/0xc0
[  403.189334] CPU2: SError detected, daif=140, spsr=0x40000045, mpidr=80000002, esr=bf000002
[  403.201581] [<ffffff8009c09000>] bp_hardening_data+0x0/0x10
[  403.207167] Task dump for CPU 3:
[  403.210409] systemd-udevd   R  running task        0  4544   2100 0x00000202
[  403.217502] Call trace:
[  403.219970] [<ffffff800808640c>] __switch_to+0x9c/0xc0
[  403.225122] [<          (null)>]           (null)
[  445.337624] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 39s! [kworker/u8:1:34]
[  445.345307] Modules linked in: uvcvideo nfsd nfs_acl nvgpu
[  445.350889] 
[  445.352401] CPU: 2 PID: 34 Comm: kworker/u8:1 Not tainted 4.9.140-l4t-r32.3.1+g48c6aaffbe37 #1
[  445.361028] Hardware name: NVIDIA Jetson Nano Developer Kit (DT)
[  445.367074] Workqueue: devfreq_wq devfreq_monitor
[  445.371806] task: ffffffc0f9af8e00 task.stack: ffffffc0f9b60000
[  445.378722] PC is at __nvgpu_readl+0x38/0xc8 [nvgpu]
[  445.384647] LR is at nvgpu_mc_boot_0+0x34/0x78 [nvgpu]
[  445.389802] pc : [<ffffff80010fd410>] lr : [<ffffff80010e7694>] pstate: 60400045
[  445.397209] sp : ffffffc0f9b63b10
[  445.400537] x29: ffffffc0f9b63b10 x28: 0000000000000000 
[  445.405892] x27: 0000000000000064 x26: ffffffc0f5553c00 
[  445.411245] x25: 0000000000000010 x24: ffffffc0f6650000 
[  445.416597] x23: 20c49ba5e353f7cf x22: ffffffc0f6650000 
[  445.421947] x21: 0000000000000000 x20: ffffffc0f6650000 
[  445.427296] x19: 00000000ffffffff x18: 0000000000000000 
[  445.432646] x17: 0000000000000000 x16: 0000000000000004 
[  445.437995] x15: 0000000000000000 x14: 0000000000000001 
[  445.443347] x13: 0000000000003a60 x12: 0000000000002cff 
[  445.448696] x11: 0000000000000000 x10: 0000000000000a20 
[  445.454046] x9 : ffffffc0f9b63d10 x8 : ffffffc0f9af9880 
[  445.459397] x7 : 0000000002e9f800 x6 : 00000000ffffffff 
[  445.464745] x5 : 000000000010a548 x4 : ffffff80011d7248 
[  445.470097] x3 : 0000000000000000 x2 : 0000000000000000 
[  445.475447] x1 : 0000000000000000 x0 : ffffffc0f6658000 
[  445.480796] 
[  445.482302] Kernel panic - not syncing: softlockup: hung tasks
[  445.488155] CPU: 2 PID: 34 Comm: kworker/u8:1 Tainted: G             L  4.9.140-l4t-r32.3.1+g48c6aaffbe37 #1
[  445.497993] Hardware name: NVIDIA Jetson Nano Developer Kit (DT)
[  445.504033] Workqueue: devfreq_wq devfreq_monitor
[  445.508762] Call trace:
[  445.511233] [<ffffff800808bf58>] dump_backtrace+0x0/0x1a0
[  445.516653] [<ffffff800808c53c>] show_stack+0x24/0x30
[  445.521726] [<ffffff8008498e60>] dump_stack+0x98/0xc0
[  445.526797] [<ffffff80081c5db0>] panic+0x11c/0x298
[  445.531610] [<ffffff8008185294>] watchdog_timer_fn+0x2c4/0x2c8
[  445.537462] [<ffffff800813c85c>] __hrtimer_run_queues+0xd4/0x358
[  445.543484] [<ffffff800813d1b8>] hrtimer_interrupt+0xa8/0x1e0
[  445.549252] [<ffffff8008ca9f18>] tegra210_timer_isr+0x38/0x48
[  445.555014] [<ffffff8008124edc>] __handle_irq_event_percpu+0x64/0x288
[  445.561472] [<ffffff8008125128>] handle_irq_event_percpu+0x28/0x60
[  445.567670] [<ffffff80081251b0>] handle_irq_event+0x50/0x80
[  445.573260] [<ffffff8008129048>] handle_fasteoi_irq+0xc8/0x1b8
[  445.579110] [<ffffff8008123ebc>] generic_handle_irq+0x34/0x50
[  445.584870] [<ffffff80081245a8>] __handle_domain_irq+0x68/0xc0
[  445.590719] [<ffffff8008080d4c>] gic_handle_irq+0x5c/0xb0
[  445.596131] [<ffffff8008082be8>] el1_irq+0xe8/0x18c
[  445.601972] [<ffffff80010fd410>] __nvgpu_readl+0x38/0xc8 [nvgpu]
[  445.608927] [<ffffff80010e7694>] nvgpu_mc_boot_0+0x34/0x78 [nvgpu]
[  445.616107] [<ffffff800113d46c>] __nvgpu_check_gpu_state+0x2c/0xa8 [nvgpu]
[  445.623939] [<ffffff80010fd4f0>] nvgpu_readl+0x50/0x60 [nvgpu]
[  445.630764] [<ffffff800115d1b8>] gk20a_pmu_read_idle_counter+0x30/0x40 [nvgpu]
[  445.638979] [<ffffff800112f964>] nvgpu_pmu_busy_cycles_norm+0x6c/0x160 [nvgpu]
[  445.647175] [<ffffff8001111f58>] gk20a_scale_get_dev_status+0xa8/0xf8 [nvgpu]
[  445.654335] [<ffffff8008d69c88>] nvhost_pod_estimate_freq+0x90/0x808
[  445.660708] [<ffffff8008d66b04>] update_devfreq+0x44/0x230
[  445.666210] [<ffffff8008d66d24>] devfreq_monitor+0x34/0x90
[  445.671715] [<ffffff80080d5c8c>] process_one_work+0x1ec/0x4c0
[  445.677477] [<ffffff80080d5fb0>] worker_thread+0x50/0x4e0
[  445.682891] [<ffffff80080dcd54>] kthread+0xec/0xf0
[  445.687699] [<ffffff8008083850>] ret_from_fork+0x10/0x40
[  445.693029] SMP: stopping secondary CPUs
[  446.983103] SMP: failed to stop secondary CPUs 0-3
[  446.987920] Kernel Offset: disabled
[  446.991430] Memory Limit: none
[  447.022294] Rebooting in 5 seconds..

This error only happens on the B01 board; on the A02 it works just fine. I built this image using meta-tegra: zeus-l4t-r32.3.1, machine: jetson-nano-qspi-sd.

Could you please give me any clue on how I can start debugging this issue? I have never used kgdb, but this could be a good opportunity.

Thanks!

madisongh commented 3 years ago

The machine check (SError) doesn't look good... it's possible there's a hardware problem on that one board.
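
For anyone who wants to see what the `esr=bf000002` value in the log encodes, here is a minimal Python sketch that splits it into the top-level ESR_ELx fields (EC in bits [31:26], IL in bit [25], ISS in bits [24:0]). The function name `decode_esr` is made up for illustration. As far as I know, an exception class of 0x2f is the SError interrupt class, which matches the "SError detected" text; the ISS has bit 24 set, which I believe marks the rest of the syndrome as implementation defined, so it probably can't be decoded further without SoC documentation.

```python
def decode_esr(esr: int) -> dict:
    """Split an AArch64 ESR_ELx value into its top-level fields."""
    return {
        "ec": (esr >> 26) & 0x3F,   # exception class, bits [31:26]
        "il": (esr >> 25) & 0x1,    # instruction length bit, bit [25]
        "iss": esr & 0x1FFFFFF,     # instruction specific syndrome, bits [24:0]
    }


# esr value copied from the "SError detected" lines in the log above
fields = decode_esr(0xBF000002)
print({name: hex(value) for name, value in fields.items()})
# prints {'ec': '0x2f', 'il': '0x1', 'iss': '0x1000002'}
```

This only confirms that the kernel saw an asynchronous external abort; it doesn't say which device raised it.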

From the traces, it looks like maybe the GPU is having a problem. You might want to search around the Internet for some similar issues and things to try before trying to dig into kernel debugging. A quick search I did was inconclusive, but it did sound like something like this was possible if the power management stuff isn't set up right... make sure nvpmodel.service ran successfully, for instance.
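
A rough way to script the nvpmodel.service check, assuming the image uses systemd and has python3 on the target: the sketch below asks `systemctl show` for the unit's `ActiveState` and `Result` properties and flags a failure. The helper names (`query_unit`, `parse_properties`, `unit_failed`) are hypothetical, and this only detects a failed run. A unit that is `inactive` with `Result=success` may simply never have started, so it's worth checking `journalctl -u nvpmodel.service` as well.

```python
import subprocess


def query_unit(unit: str) -> str:
    """Return the raw 'Key=Value' lines systemctl reports for a unit."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=ActiveState,Result"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def parse_properties(text: str) -> dict:
    """Turn 'Key=Value' lines into a dict, skipping lines without '='."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def unit_failed(props: dict) -> bool:
    """True if systemd reports the unit as failed or with a non-success result."""
    return (
        props.get("ActiveState") == "failed"
        or props.get("Result", "success") != "success"
    )
```

On the board, something like `unit_failed(parse_properties(query_unit("nvpmodel.service")))` returning `True` would point at the power model setup rather than the hardware.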

You might also try dunfell-l4t-r32.4.3 and see if that makes a difference.

ichergui commented 3 years ago

Hey @elPrac, did you check nvpmodel.service as @madisongh suggested? Did you try our stable branch, dunfell-l4t-r32.4.3?

ichergui commented 3 years ago

I will close this issue because there have been no updates for at least 2 months. Open a new issue if needed.