OE4T / meta-tegra

BSP layer for NVIDIA Jetson platforms, based on L4T
MIT License

Crash in nvgpu module with 5.15 kernel 5.15.136-l4t-r36.3 #1613

Open kraj opened 2 weeks ago

kraj commented 2 weeks ago

Describe the bug

nvpmodel.service fails to start, and any other services that need OpenGL/EGL also fail to start.

To Reproduce

Build QtWebEngine for MACHINE=jetson-agx-orin-devkit

Additional context

Crash report as seen on the console:

Jul 10 16:36:36 jetson-agx-orin-devkit kernel: cma: cma_alloc: linux,cma: alloc failed, req-size: 62864 pages, ret: -12
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: ------------[ cut here ]------------
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: WARNING: CPU: 0 PID: 942 at mm/page_alloc.c:5396 __alloc_pages+0x304/0x320
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: Modules linked in: rtk_btusb(O) bluetooth ecdh_generic ecc rtl8822ce(O) snd_hda_codec_hdmi tegra_cactmon_mc_all(O) nvethernet(O) mttcan(O) tegra234_aon(O) nvpps(O) snd_hda_tegra can_dev at24 snd_hda_codec snd_hda_core spi_tegra114 pwm_tegra_tachometer(O) pwm_tegra mc_hwpm(O) host1x_fence(O) nvhost_isp5(O) nvhost_vi5(O) nvhost_nvcsi_t194(O) nvvrs_pseq_rtc(O) nvidia(O) i2c_nvvrs11(O) lm90 nvidia_vrs_pseq(O) tegra_camera(O) tegra_bpmp_thermal v4l2_dv_timings v4l2_fwnode tegra_dce(O) nvpmodel_clk_cap(O) tegra23x_perf_uncore(O) tegra234_oc_event(O) tegra_mce(O) thermal_trip_event(O) v4l2_async videobuf2_dma_contig videobuf2_memops nvhost_nvcsi(O) tegra_camera_platform(O) capture_ivc(O) governor_userspace cfg80211 tegra_camera_rtcpu(O) ivc_bus(O) hsp_mailbox_client(O) ivc_ext(O) rfkill tegra_drm(O) videobuf2_v4l2 nvhost_pva(O) nvhost_nvdla(O) videobuf2_common tegra_wmark(O) videodev nvhost_capture(O) nvhwpm(O) cec tegra_se(O) mc host1x_nvhost(O) crypto_engine tsecriscv(O)
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  drm_kms_helper pwm_fan nvgpu(O) governor_pod_scaling(O) nvmap(O) nvsciipc(O) host1x(O) mc_utils(O) ina3221 drm ipv6 nvme nvme_core tegra_xudc ucsi_ccg typec_ucsi typec pcie_tegra194 phy_tegra194_p2u
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: CPU: 0 PID: 942 Comm: nvpmodel Tainted: G           O      5.15.136-l4t-r36.3-1009.9+g46cdb595bebc #1
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: Hardware name: NVIDIA NVIDIA Jetson AGX Orin Developer Kit/Jetson, BIOS v36.3.0 01/08/2024
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: pc : __alloc_pages+0x304/0x320
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: lr : __alloc_pages+0x38/0x320
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: sp : ffff80000e6d31c0
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: x29: ffff80000e6d31f0 x28: ffffafaf94b69000 x27: ffffafaf94b10000
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: x26: 0000000000000000 x25: 0000000000000004 x24: c7b2afafd6f5aeb8
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: x23: 0000000000000000 x22: 0000000000000000 x21: 0000000000000010
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: x20: 0000000000000cc0 x19: 0000000000000cc0 x18: 00000000fffffffe
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: x17: 202c736567617020 x16: 3436383236203a65 x15: 7a69732d71657220
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: x14: 2c64656c69616620 x13: 32312d203a746572 x12: 202c736567617020
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: x11: 3436383236203a65 x10: 7a69732d71657220 x9 : 636f6c6c61203a61
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: x8 : 6d632c78756e696c x7 : 0000000000000000 x6 : 000000000000000c
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: x5 : 0000000000000000 x4 : ffff000fa2ba59f0 x3 : 0000000000000000
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: x2 : ffffafafd6e418d8 x1 : 0000000000000000 x0 : ffff000083fabe00
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: Call trace:
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  __alloc_pages+0x304/0x320
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  __dma_direct_alloc_pages.isra.0+0x194/0x250
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  dma_direct_alloc+0x90/0x358
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  dma_alloc_attrs+0x90/0xfc
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  nvgpu_dma_alloc_flags_sys+0xfc/0x4c4 [nvgpu]
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  nvgpu_cbc_alloc+0xe8/0x15c [nvgpu]
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  ga10b_cbc_alloc_comptags+0x1b8/0x2c4 [nvgpu]
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  nvgpu_cbc_init_support+0xcc/0x114 [nvgpu]
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  nvgpu_finalize_poweron+0x518/0x640 [nvgpu]
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  gk20a_pm_finalize_poweron+0x188/0x708 [nvgpu]
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  gk20a_pm_runtime_resume+0x58/0xa0 [nvgpu]
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  pm_generic_runtime_resume+0x44/0x68
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  __genpd_runtime_resume+0x40/0x90
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  genpd_runtime_resume+0xec/0x26c
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  __rpm_callback+0x54/0x200
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  rpm_callback+0x8c/0xa0
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  rpm_resume+0x49c/0x730
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  pm_runtime_forbid+0x74/0xa8
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  control_store+0xa4/0xa8
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  dev_attr_store+0x4c/0x68
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  sysfs_kf_write+0x64/0x78
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  kernfs_fop_write_iter+0x140/0x1f8
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  new_sync_write+0x11c/0x1e0
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  vfs_write+0x21c/0x284
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  ksys_write+0x88/0x120
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  __arm64_sys_write+0x2c/0x40
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  invoke_syscall+0x5c/0x120
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  el0_svc_common.constprop.0+0xf0/0x110
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  do_el0_svc+0x3c/0x9c
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  el0_svc+0x20/0x60
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  el0t_64_sync_handler+0x108/0x120
Jul 10 16:36:36 jetson-agx-orin-devkit kernel:  el0t_64_sync+0x1a4/0x1a8
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: ---[ end trace 4613cc04920f90c5 ]---
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu               nvgpu_dma_print_err:101  [INFO]  DMA alloc FAILED: [sysmem] size=257490944 aligned=257490944 flags:PHYSICALLY_ADDRESSED
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu            nvgpu_cbc_init_support:91   [ERR]  Failed to allocate comptags
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu            nvgpu_finalize_poweron:1095 [ERR]  Failed initialization for: g->ops.cbc.cbc_init_support
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.879784] cma: cma_alloc: linux,cma: alloc failed, req-size: 62864 pages, ret: -12
Jul 10 16:36:36 jetson-agx-orin-devkit sh[963]: 2024/07/10 16:36:35 SimpleIOT v0.16.1
Jul 10 16:36:36 jetson-agx-orin-devkit sh[963]: 2024/07/10 16:36:36 Server NATS client reconnect attempt # 1
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.879815] ------------[ cut here ]------------
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.879818] WARNING: CPU: 0 PID: 942 at mm/page_alloc.c:5396 __alloc_pages+0x304/0x320
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.879849] Modules linked in: rtk_btusb(O) bluetooth ecdh_generic ecc rtl8822ce(O) snd_hda_codec_hdmi tegra_cactmon_mc_all(O) nvethernet(O) mttcan(O) tegra234_aon(O) nvpps(O) snd_hda_tegra can_dev at24 snd_hda_codec snd_hda_core spi_tegra114 pwm_tegra_tachometer(O) pwm_tegra mc_hwpm(O) host1x_fence(O) nvhost_isp5(O) nvhost_vi5(O) nvhost_nvcsi_t194(O) nvvrs_pseq_rtc(O) nvidia(O) i2c_nvvrs11(O) lm90 nvidia_vrs_pseq(O) tegra_camera(O) tegra_bpmp_thermal v4l2_dv_timings v4l2_fwnode tegra_dce(O) nvpmodel_clk_cap(O) tegra23x_perf_uncore(O) tegra234_oc_event(O) tegra_mce(O) thermal_trip_event(O) v4l2_async videobuf2_dma_contig videobuf2_memops nvhost_nvcsi(O) tegra_camera_platform(O) capture_ivc(O) governor_userspace cfg80211 tegra_camera_rtcpu(O) ivc_bus(O) hsp_mailbox_client(O) ivc_ext(O) rfkill tegra_drm(O) videobuf2_v4l2 nvhost_pva(O) nvhost_nvdla(O) videobuf2_common tegra_wmark(O) videodev nvhost_capture(O) nvhwpm(O) cec tegra_se(O) mc host1x_nvhost(O) crypto_engine tsecriscv(O)
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880035]  drm_kms_helper pwm_fan nvgpu(O) governor_pod_scaling(O) nvmap(O) nvsciipc(O) host1x(O) mc_utils(O) ina3221 drm ipv6 nvme nvme_core tegra_xudc ucsi_ccg typec_ucsi typec pcie_tegra194 phy_tegra194_p2u
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880072] CPU: 0 PID: 942 Comm: nvpmodel Tainted: G           O      5.15.136-l4t-r36.3-1009.9+g46cdb595bebc #1
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880080] Hardware name: NVIDIA NVIDIA Jetson AGX Orin Developer Kit/Jetson, BIOS v36.3.0 01/08/2024
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880084] pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880090] pc : __alloc_pages+0x304/0x320
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880099] lr : __alloc_pages+0x38/0x320
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880107] sp : ffff80000e6d31c0
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880111] x29: ffff80000e6d31f0 x28: ffffafaf94b69000 x27: ffffafaf94b10000
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880119] x26: 0000000000000000 x25: 0000000000000004 x24: c7b2afafd6f5aeb8
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880125] x23: 0000000000000000 x22: 0000000000000000 x21: 0000000000000010
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880134] x20: 0000000000000cc0 x19: 0000000000000cc0 x18: 00000000fffffffe
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880140] x17: 202c736567617020 x16: 3436383236203a65 x15: 7a69732d71657220
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880148] x14: 2c64656c69616620 x13: 32312d203a746572 x12: 202c736567617020
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880155] x11: 3436383236203a65 x10: 7a69732d71657220 x9 : 636f6c6c61203a61
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880161] x8 : 6d632c78756e696c x7 : 0000000000000000 x6 : 000000000000000c
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880167] x5 : 0000000000000000 x4 : ffff000fa2ba59f0 x3 : 0000000000000000
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880174] x2 : ffffafafd6e418d8 x1 : 0000000000000000 x0 : ffff000083fabe00
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880183] Call trace:
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880189]  __alloc_pages+0x304/0x320
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880197]  __dma_direct_alloc_pages.isra.0+0x194/0x250
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880215]  dma_direct_alloc+0x90/0x358
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880222]  dma_alloc_attrs+0x90/0xfc
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880228]  nvgpu_dma_alloc_flags_sys+0xfc/0x4c4 [nvgpu]
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.880798]  nvgpu_cbc_alloc+0xe8/0x15c [nvgpu]
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.881128]  ga10b_cbc_alloc_comptags+0x1b8/0x2c4 [nvgpu]
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.881448]  nvgpu_cbc_init_support+0xcc/0x114 [nvgpu]
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.881755]  nvgpu_finalize_poweron+0x518/0x640 [nvgpu]
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882064]  gk20a_pm_finalize_poweron+0x188/0x708 [nvgpu]
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882365]  gk20a_pm_runtime_resume+0x58/0xa0 [nvgpu]
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882665]  pm_generic_runtime_resume+0x44/0x68
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882693]  __genpd_runtime_resume+0x40/0x90
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882704]  genpd_runtime_resume+0xec/0x26c
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882713]  __rpm_callback+0x54/0x200
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882717]  rpm_callback+0x8c/0xa0
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882723]  rpm_resume+0x49c/0x730
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882726]  pm_runtime_forbid+0x74/0xa8
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882730]  control_store+0xa4/0xa8
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882741]  dev_attr_store+0x4c/0x68
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882755]  sysfs_kf_write+0x64/0x78
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882773]  kernfs_fop_write_iter+0x140/0x1f8
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882777]  new_sync_write+0x11c/0x1e0
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882787]  vfs_write+0x21c/0x284
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882791]  ksys_write+0x88/0x120
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882796]  __arm64_sys_write+0x2c/0x40
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882801]  invoke_syscall+0x5c/0x120
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882818]  el0_svc_common.constprop.0+0xf0/0x110
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882825]  do_el0_svc+0x3c/0x9c
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882832]  el0_svc+0x20/0x60
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882854]  el0t_64_sync_handler+0x108/0x120
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882859]  el0t_64_sync+0x1a4/0x1a8
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   13.882870] ---[ end trace 4613cc04920f90c5 ]---
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   14.889227] nvgpu: 17000000.gpu               nvgpu_dma_print_err:101  [INFO]  DMA alloc FAILED: [sysmem] size=257490944 aligned=257490944 flags:PHYSICALLY_ADDRESSED
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   14.889259] nvgpu: 17000000.gpu            nvgpu_cbc_init_support:91   [ERR]  Failed to allocate comptags
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   14.889271] nvgpu: 17000000.gpu            nvgpu_finalize_poweron:1095 [ERR]  Failed initialization for: g->ops.cbc.cbc_init_support
Jul 10 16:36:36 jetson-agx-orin-devkit sh[963]: 2024/07/10 16:36:36 NATS server, port: 4222, http port: 8222, auth enabled: no
Jul 10 16:36:36 jetson-agx-orin-devkit sh[963]: 2024/07/10 16:36:36 NATS server WS enabled on port: 9222
Jul 10 16:36:36 jetson-agx-orin-devkit sh[963]: 2024/07/10 16:36:36 Open store: siot.sqlite?_pragma=foreign_keys(1)&_pragma=journal_mode(WAL)&_pragma=synchronous(NORMAL)&_pragma=busy_timeout(8000)&_pragma=journal_size_limit(100000000)
Jul 10 16:36:36 jetson-agx-orin-devkit sh[963]: 2024/07/10 16:36:36 store connecting to nats server: nats://127.0.0.1:4222
Jul 10 16:36:36 jetson-agx-orin-devkit sh[963]: 2024/07/10 16:36:36 SIOT started
Jul 10 16:36:36 jetson-agx-orin-devkit sh[963]: 2024/07/10 16:36:36 Starting http server, debug: false
Jul 10 16:36:36 jetson-agx-orin-devkit sh[963]: 2024/07/10 16:36:36 Starting portal on port: 8118
Jul 10 16:36:36 jetson-agx-orin-devkit yoe-kiosk-browser[965]: YOE_KIOSK_BROWSER_URL= "http://localhost:8118"
Jul 10 16:36:36 jetson-agx-orin-devkit yoe-kiosk-browser[965]: YOE_KIOSK_BROWSER_EXCEPTION_URL= "@EXCEPTION_URL@"
Jul 10 16:36:36 jetson-agx-orin-devkit yoe-kiosk-browser[965]: YOE_KIOSK_BROWSER_ROTATE= "0"
Jul 10 16:36:36 jetson-agx-orin-devkit yoe-kiosk-browser[965]: YOE_KIOSK_BROWSER_KEYBOARD_SCALE= "1"
Jul 10 16:36:36 jetson-agx-orin-devkit yoe-kiosk-browser[965]: YOE_KIOSK_BROWSER_RETRY_INTERVAL= "10"
Jul 10 16:36:36 jetson-agx-orin-devkit yoe-kiosk-browser[965]: libnvrm_gpu.so: NvRmGpuLibOpen failed, error=4
Jul 10 16:36:36 jetson-agx-orin-devkit audit[965]: ANOM_ABEND auid=4294967295 uid=0 gid=0 ses=4294967295 subj=kernel pid=965 comm="yoe-kiosk-brows" exe="/usr/bin/yoe-kiosk-browser" sig=11 res=1
Jul 10 16:36:36 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu                 gk20a_power_write:127  [ERR]  power_node_write failed at busy
Jul 10 16:36:36 jetson-agx-orin-devkit kernel[851]: [   15.126951] nvgpu: 17000000.gpu                 gk20a_power_write:127  [ERR]  power_node_write failed at busy
Jul 10 16:36:36 jetson-agx-orin-devkit sh[963]: 2024/07/10 16:36:36 Server NATS client: reconnected
Jul 10 16:36:36 jetson-agx-orin-devkit sh[963]: 2024/07/10 16:36:36 OS version: 2024.6.0
Jul 10 16:36:36 jetson-agx-orin-devkit systemd[1]: Created slice Slice /system/systemd-coredump.
Jul 10 16:36:36 jetson-agx-orin-devkit audit: BPF prog-id=20 op=LOAD
Jul 10 16:36:36 jetson-agx-orin-devkit audit: BPF prog-id=21 op=LOAD
Jul 10 16:36:36 jetson-agx-orin-devkit audit: BPF prog-id=22 op=LOAD
Jul 10 16:36:36 jetson-agx-orin-devkit systemd[1]: Started Process Core Dump (PID 991/UID 0).
Jul 10 16:36:36 jetson-agx-orin-devkit systemd-coredump[992]: elfutils disabled, parsing ELF objects not supported
Jul 10 16:36:36 jetson-agx-orin-devkit systemd-coredump[992]: [🡕] Process 965 (yoe-kiosk-brows) of user 0 dumped core.
Jul 10 16:36:36 jetson-agx-orin-devkit systemd[1]: systemd-coredump@0-991-0.service: Deactivated successfully.
Jul 10 16:36:36 jetson-agx-orin-devkit systemd[1]: yoe-kiosk-browser.service: Main process exited, code=dumped, status=11/SEGV
Jul 10 16:36:36 jetson-agx-orin-devkit systemd[1]: yoe-kiosk-browser.service: Failed with result 'core-dump'.
Jul 10 16:36:36 jetson-agx-orin-devkit audit: BPF prog-id=22 op=UNLOAD
Jul 10 16:36:36 jetson-agx-orin-devkit audit: BPF prog-id=21 op=UNLOAD
Jul 10 16:36:36 jetson-agx-orin-devkit audit: BPF prog-id=20 op=UNLOAD
Jul 10 16:36:37 jetson-agx-orin-devkit systemd[1]: yoe-kiosk-browser.service: Scheduled restart job, restart counter is at 1.
Jul 10 16:36:37 jetson-agx-orin-devkit systemd[1]: Started Yoe Kiosk Browser.
Jul 10 16:36:37 jetson-agx-orin-devkit yoe-kiosk-browser[999]: YOE_KIOSK_BROWSER_URL= "http://localhost:8118"
Jul 10 16:36:37 jetson-agx-orin-devkit yoe-kiosk-browser[999]: YOE_KIOSK_BROWSER_EXCEPTION_URL= "@EXCEPTION_URL@"
Jul 10 16:36:37 jetson-agx-orin-devkit yoe-kiosk-browser[999]: YOE_KIOSK_BROWSER_ROTATE= "0"
Jul 10 16:36:37 jetson-agx-orin-devkit yoe-kiosk-browser[999]: YOE_KIOSK_BROWSER_KEYBOARD_SCALE= "1"
Jul 10 16:36:37 jetson-agx-orin-devkit yoe-kiosk-browser[999]: YOE_KIOSK_BROWSER_RETRY_INTERVAL= "10"
Jul 10 16:36:37 jetson-agx-orin-devkit nvpmodel[942]: NVPM ERROR: Error opening /sys/devices/platform/17000000.gpu/devfreq_dev/available_frequencies: 2
Jul 10 16:36:37 jetson-agx-orin-devkit nvpmodel[942]: NVPM ERROR: failed to read PARAM GPU: ARG FREQ_TABLE: PATH /sys/devices/platform/17000000.gpu/devfreq_dev/available_frequencies
Jul 10 16:36:37 jetson-agx-orin-devkit nvpmodel[942]: NVPM ERROR: failed to set power mode!
Jul 10 16:36:37 jetson-agx-orin-devkit nvpmodel[942]: NVPM ERROR: optMask is 2, no request for power mode
Jul 10 16:36:37 jetson-agx-orin-devkit systemd[1]: nvpmodel.service: Main process exited, code=exited, status=255/EXCEPTION
Jul 10 16:36:37 jetson-agx-orin-devkit systemd[1]: nvpmodel.service: Failed with result 'exit-code'.
Jul 10 16:36:37 jetson-agx-orin-devkit systemd[1]: Failed to start NVIDIA power model daemon.
Jul 10 16:36:37 jetson-agx-orin-devkit kernel: tegra-hsierrrptinj: Callback for 0x0001 already registered
Jul 10 16:36:37 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu                nvgpu_cic_mon_init:75   [ERR]  Err inj callback registration failed: -22
Jul 10 16:36:37 jetson-agx-orin-devkit kernel[851]: [   16.143094] tegra-hsierrrptinj: Callback for 0x0001 already registered
Jul 10 16:36:37 jetson-agx-orin-devkit kernel[851]: [   16.143130] nvgpu: 17000000.gpu                nvgpu_cic_mon_init:75   [ERR]  Err inj callback registration failed: -22
Jul 10 16:36:37 jetson-agx-orin-devkit systemd[1]: nvpmodel.service: Consumed 1.060s CPU time.
Jul 10 16:36:37 jetson-agx-orin-devkit systemd[1]: Starting NVIDIA PHS daemon...
Jul 10 16:36:37 jetson-agx-orin-devkit systemd[1]: Started NVIDIA PHS daemon.
Jul 10 16:36:38 jetson-agx-orin-devkit systemd[1]: systemd-rfkill.service: Deactivated successfully.
Jul 10 16:36:38 jetson-agx-orin-devkit wpa_supplicant[957]: wlan0: CTRL-EVENT-REGDOM-CHANGE init=DRIVER type=WORLD
Jul 10 16:36:39 jetson-agx-orin-devkit kernel: rtl8822ce_interrupt: 13 callbacks suppressed
Jul 10 16:36:39 jetson-agx-orin-devkit kernel[851]: [   17.927024] rtl8822ce_interrupt: 13 callbacks suppressed
Jul 10 16:36:40 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu     nvgpu_timeout_expired_msg_cpu:94   [ERR]  Timeout detected @ nvgpu_pmu_wait_fw_ack_status+0xd8/0x164 [nvgpu]
Jul 10 16:36:40 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu             pmu_wait_message_cond:664  [ERR]  PMU wait timeout expired.
Jul 10 16:36:40 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu nvgpu_pmu_lsfm_bootstrap_ls_falcon:128  [ERR]  LSF Load failed
Jul 10 16:36:40 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu nvgpu_gr_falcon_load_secure_ctxsw_ucode:718  [ERR]  Unable to recover GR falcon
Jul 10 16:36:40 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu        nvgpu_gr_falcon_init_ctxsw:156  [ERR]  fail
Jul 10 16:36:40 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu nvgpu_cic_mon_report_err_safety_services:97   [ERR]  Error reporting is not supported in this platform
Jul 10 16:36:40 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu      gr_init_ctxsw_falcon_support:857  [ERR]  FECS context switch init error
Jul 10 16:36:40 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu            nvgpu_finalize_poweron:1095 [ERR]  Failed initialization for: g->ops.gr.gr_init_support
Jul 10 16:36:40 jetson-agx-orin-devkit kernel[851]: [   19.213838] nvgpu: 17000000.gpu     nvgpu_timeout_expired_msg_cpu:94   [ERR]  Timeout detected @ nvgpu_pmu_wait_fw_ack_status+0xd8/0x164 [nvgpu]
Jul 10 16:36:40 jetson-agx-orin-devkit kernel[851]: [   19.213886] nvgpu: 17000000.gpu             pmu_wait_message_cond:664  [ERR]  PMU wait timeout expired.
Jul 10 16:36:40 jetson-agx-orin-devkit kernel[851]: [   19.213901] nvgpu: 17000000.gpu nvgpu_pmu_lsfm_bootstrap_ls_falcon:128  [ERR]  LSF Load failed
Jul 10 16:36:40 jetson-agx-orin-devkit kernel[851]: [   19.213914] nvgpu: 17000000.gpu nvgpu_gr_falcon_load_secure_ctxsw_ucode:718  [ERR]  Unable to recover GR falcon
Jul 10 16:36:40 jetson-agx-orin-devkit kernel[851]: [   19.213922] nvgpu: 17000000.gpu        nvgpu_gr_falcon_init_ctxsw:156  [ERR]  fail
Jul 10 16:36:40 jetson-agx-orin-devkit kernel[851]: [   19.213954] nvgpu: 17000000.gpu nvgpu_cic_mon_report_err_safety_services:97   [ERR]  Error reporting is not supported in this platform
Jul 10 16:36:40 jetson-agx-orin-devkit kernel[851]: [   19.213961] nvgpu: 17000000.gpu      gr_init_ctxsw_falcon_support:857  [ERR]  FECS context switch init error
Jul 10 16:36:40 jetson-agx-orin-devkit kernel[851]: [   19.213972] nvgpu: 17000000.gpu            nvgpu_finalize_poweron:1095 [ERR]  Failed initialization for: g->ops.gr.gr_init_support
Jul 10 16:36:40 jetson-agx-orin-devkit yoe-kiosk-browser[999]: libnvrm_gpu.so: NvRmGpuLibOpen failed, error=5
Jul 10 16:36:40 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu                 gk20a_power_write:127  [ERR]  power_node_write failed at busy
Jul 10 16:36:40 jetson-agx-orin-devkit kernel: audit: type=1701 audit(1720629400.568:23): auid=4294967295 uid=0 gid=0 ses=4294967295 subj=kernel pid=999 comm="yoe-kiosk-brows" exe="/usr/bin/yoe-kiosk-browser" sig=11 res=1
Jul 10 16:36:40 jetson-agx-orin-devkit audit[999]: ANOM_ABEND auid=4294967295 uid=0 gid=0 ses=4294967295 subj=kernel pid=999 comm="yoe-kiosk-brows" exe="/usr/bin/yoe-kiosk-browser" sig=11 res=1
Jul 10 16:36:40 jetson-agx-orin-devkit kernel[851]: [   19.244453] nvgpu: 17000000.gpu                 gk20a_power_write:127  [ERR]  power_node_write failed at busy
Jul 10 16:36:40 jetson-agx-orin-devkit kernel[851]: [   19.245561] audit: type=1701 audit(1720629400.568:23): auid=4294967295 uid=0 gid=0 ses=4294967295 subj=kernel pid=999 comm="yoe-kiosk-brows" exe="/usr/bin/yoe-kiosk-browser" sig=11 res=1
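
The CMA failure and the nvgpu DMA failure appear to describe the same request. The Python sketch below is an editorial aid, not part of nvgpu or meta-tegra, and it assumes 4 KiB pages and a MAX_ORDER of 11, which are the usual values for a 5.15 arm64 kernel but are not stated anywhere in this report. Under those assumptions, 62864 pages × 4 KiB equals the 257490944 bytes (about 245.6 MiB) that nvgpu asked for as a physically contiguous comptag buffer, and once CMA fails, the fallback page allocation would be order 16, far past the buddy allocator's limit. That would be consistent with the WARNING in __alloc_pages.

```python
import re

# Two messages copied verbatim from the log in this report.
CMA_LINE = ("cma: cma_alloc: linux,cma: alloc failed, "
            "req-size: 62864 pages, ret: -12")
NVGPU_LINE = ("DMA alloc FAILED: [sysmem] size=257490944 "
              "aligned=257490944 flags:PHYSICALLY_ADDRESSED")

PAGE_SIZE = 4096  # assumption: 4 KiB pages on this arm64 kernel
MAX_ORDER = 11    # assumption: usual buddy-allocator limit for 4 KiB pages on 5.15


def cma_request_bytes(line: str) -> int:
    """Size of the failed CMA request, converted from pages to bytes."""
    pages = int(re.search(r"req-size: (\d+) pages", line).group(1))
    return pages * PAGE_SIZE


def nvgpu_request_bytes(line: str) -> int:
    """Size nvgpu tried to allocate for the comptag backing store."""
    return int(re.search(r"size=(\d+)", line).group(1))


def page_order(size: int) -> int:
    """Smallest order n such that 2**n pages cover `size` bytes."""
    pages = -(-size // PAGE_SIZE)  # ceiling division
    return (pages - 1).bit_length()


if __name__ == "__main__":
    cma = cma_request_bytes(CMA_LINE)
    gpu = nvgpu_request_bytes(NVGPU_LINE)
    order = page_order(gpu)
    print(f"CMA request:   {cma} bytes ({cma / 2**20:.1f} MiB)")
    print(f"nvgpu request: {gpu} bytes, same size: {cma == gpu}")
    print(f"page order:    {order} (allowed only if < {MAX_ORDER})")
```

With the values from the log, both sizes are 257490944 bytes and the order is 16. This only restates numbers already in the log; it does not explain why the CMA pool is too small for this request on this build.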

ichergui commented 2 weeks ago

Hi @kraj, which branch are you using, scarthgap or master? Are you using Qt6 or Qt5?

kraj commented 2 weeks ago

> Hi @kraj, which branch are you using, scarthgap or master? Are you using Qt6 or Qt5?

Qt6, and all layers are on their master branches.

ichergui commented 2 weeks ago

Thanks. I will try to reproduce the bug and will get back to you ASAP

kraj commented 2 weeks ago

BTW, I am not using Wayland or X11; it's using eglfs to launch the browser.

ichergui commented 2 weeks ago

I see. I haven't tried L4T R36.3.0 with eglfs. Maybe @kekiefer has already tried that. @kekiefer, any thoughts about the issue reported here?

madisongh commented 2 weeks ago

The problem looks similar to ones we've seen in the past: the power-management-related services have to start in a specific order, and before anything else touches the GPU; otherwise you get these kinds of tracebacks.

@kraj Are you using sysvinit or systemd as your init manager?
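
One way to check the ordering described above is to walk each unit's After= list. The Python sketch below is hypothetical: the unit graph in it is invented for illustration (only the unit names come from the logs in this thread), and on a real system the values would come from `systemctl show -p After --value <unit>`. It reports whether a GPU consumer such as yoe-kiosk-browser.service is ordered after nvpmodel.service, directly or transitively.

```python
from collections import deque


def parse_after(value: str) -> set[str]:
    """Split the output of `systemctl show -p After --value UNIT` into unit names."""
    return set(value.split())


def ordered_after(graph: dict[str, set[str]], unit: str, dependency: str) -> bool:
    """True if `unit` is ordered after `dependency`, directly or transitively.

    `graph` maps each unit name to the set of units in its After= list.
    """
    seen: set[str] = set()
    queue = deque([unit])
    while queue:
        current = queue.popleft()
        for earlier in graph.get(current, set()):
            if earlier == dependency:
                return True
            if earlier not in seen:  # guards against ordering cycles
                seen.add(earlier)
                queue.append(earlier)
    return False


if __name__ == "__main__":
    # Hypothetical After= values for illustration only; collect the real
    # ones on the target with `systemctl show -p After --value <unit>`.
    graph = {
        "yoe-kiosk-browser.service": parse_after("basic.target sysinit.target"),
        "nvpmodel.service": parse_after("nvpower.service sysinit.target"),
    }
    ok = ordered_after(graph, "yoe-kiosk-browser.service", "nvpmodel.service")
    print("kiosk ordered after nvpmodel:", ok)
```

A False result for a unit that opens the GPU would point at the kind of race described above. A True result only shows start ordering; whether nvpmodel has actually finished its work first also depends on its service Type=.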

kekiefer commented 2 weeks ago

I don't have a working r36 system with graphics quite yet (for unrelated reasons), so I can't say whether I would run into this issue.

I can say that I didn't run into this on a system with Qt and the eglfs GBM backend on r35.

Which Qt eglfs backend is being used? For GBM, since this looks like an allocation problem, you could explore using tegra-udrm-gbm as the RPROVIDER for tegra-gbm-backend: https://github.com/OE4T/meta-tegra/blob/master/recipes-graphics/mesa/tegra-udrm-gbm_1.1.0.bb

Edit: looks like Matt's message crossed with my own. His answer sounds more promising, but I'll leave mine here for posterity.

kraj commented 2 weeks ago

> The problem looks similar to ones we've seen in the past: the power-management-related services have to start in a specific order, and before anything else touches the GPU; otherwise you get these kinds of tracebacks.
>
> @kraj Are you using sysvinit or systemd as your init manager?

I am using systemd, and I have looked at another issue where you fixed some service sequencing; those changes are already in master. However, I do see this:

root@jetson-agx-orin-devkit:~# systemctl status nvpower.service
● nvpower.service - NVIDIA power management setup
     Loaded: loaded (/usr/lib/systemd/system/nvpower.service; enabled; preset: enabled)
     Active: active (exited) since Wed 2024-07-10 16:36:34 UTC; 2h 38min ago
    Process: 856 ExecStart=/usr/libexec/nvpower.sh (code=exited, status=0/SUCCESS)
   Main PID: 856 (code=exited, status=0/SUCCESS)
        CPU: 86ms

Jul 10 16:36:34 jetson-agx-orin-devkit systemd[1]: Starting NVIDIA power management setup...
Jul 10 16:36:34 jetson-agx-orin-devkit nvpower.sh[856]: /usr/libexec/nvpower.sh: line 473: /sys/class/hwmon/hwmon3/in1_label: No such…irectory
Jul 10 16:36:34 jetson-agx-orin-devkit systemd[1]: Finished NVIDIA power management setup.
Hint: Some lines were ellipsized, use -l to show in full.

and

root@jetson-agx-orin-devkit:~# ls -l /sys/class/hwmon/hwmon3/in*
-rw-r--r--    1 root     root          4096 Jul 10 16:54 /sys/class/hwmon/hwmon3/in1_enable
-r--r--r--    1 root     root          4096 Jul 10 16:54 /sys/class/hwmon/hwmon3/in1_input
-rw-r--r--    1 root     root          4096 Jul 10 16:54 /sys/class/hwmon/hwmon3/in2_enable
-r--r--r--    1 root     root          4096 Jul 10 16:36 /sys/class/hwmon/hwmon3/in2_input
-r--r--r--    1 root     root          4096 Jul 10 16:36 /sys/class/hwmon/hwmon3/in2_label
-rw-r--r--    1 root     root          4096 Jul 10 16:54 /sys/class/hwmon/hwmon3/in3_enable
-r--r--r--    1 root     root          4096 Jul 10 16:54 /sys/class/hwmon/hwmon3/in3_input
-r--r--r--    1 root     root          4096 Jul 10 16:54 /sys/class/hwmon/hwmon3/in4_input
-r--r--r--    1 root     root          4096 Jul 10 16:54 /sys/class/hwmon/hwmon3/in5_input
-r--r--r--    1 root     root          4096 Jul 10 16:54 /sys/class/hwmon/hwmon3/in6_input
-r--r--r--    1 root     root          4096 Jul 10 16:54 /sys/class/hwmon/hwmon3/in7_input
-r--r--r--    1 root     root          4096 Jul 10 16:54 /sys/class/hwmon/hwmon3/in7_label

I'm not sure how /sys/class/hwmon/hwmon3/in1_label is created; it's missing on my device.
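
hwmon indices are assigned in probe order, so hwmon3 is not guaranteed to be the same sensor on every kernel or board. One way to rule out a mismatch is to locate the hwmon directory by its `name` attribute and list which inN_input channels lack an inN_label. The Python sketch below assumes the sensor reports the name ina3221 (the module is in the loaded list above, but the name string is an assumption); whether nvpower.sh hard-codes the index is not visible in this thread.

```python
from pathlib import Path


def find_hwmon(name: str, root: Path = Path("/sys/class/hwmon")) -> list[Path]:
    """Return hwmon directories whose `name` attribute equals `name`."""
    matches = []
    for entry in sorted(root.glob("hwmon*")):
        name_file = entry / "name"
        if name_file.is_file() and name_file.read_text().strip() == name:
            matches.append(entry)
    return matches


def unlabeled_inputs(hwmon: Path) -> list[str]:
    """List inN_input channels that have no matching inN_label file."""
    missing = []
    for inp in sorted(hwmon.glob("in*_input")):
        label = inp.with_name(inp.name.replace("_input", "_label"))
        if not label.exists():
            missing.append(inp.name)
    return missing


if __name__ == "__main__":
    # Assumption: the INA3221 hwmon device reports its name as "ina3221".
    for hwmon in find_hwmon("ina3221"):
        print(hwmon, "unlabeled:", ", ".join(unlabeled_inputs(hwmon)) or "none")
```

Against the listing above, it would report in1_input, in3_input, in4_input, in5_input and in6_input as the channels without a label.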

kraj commented 2 weeks ago

> I don't have a working r36 system with graphics quite yet (for unrelated reasons), so I can't say whether I would run into this issue.
>
> I can say that I didn't run into this on a system with Qt and the eglfs GBM backend on r35.
>
> Which Qt eglfs backend is being used? For GBM, since this looks like an allocation problem, you could explore using tegra-udrm-gbm as the RPROVIDER for tegra-gbm-backend: https://github.com/OE4T/meta-tegra/blob/master/recipes-graphics/mesa/tegra-udrm-gbm_1.1.0.bb

❯ bitbake-getvar -r qtbase QT_QPA_DEFAULT_EGLFS_INTEGRATION
WARNING: Published ports are discarded when using host network mode
#
# $QT_QPA_DEFAULT_EGLFS_INTEGRATION
#   set? /mnt/b/yoe/master/sources/meta-tegra/external/qt6-layer/recipes-qt/qt6/qtbase_%.bbappend:3
#     "${@bb.utils.contains('PREFERRED_RPROVIDER_tegra-gbm-backend', 'tegra-libraries-gbm-backend', 'eglfs_kms_egldevice', 'eglfs_kms', d)}"
QT_QPA_DEFAULT_EGLFS_INTEGRATION="eglfs_kms_egldevice"
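
The bbappend expression printed above picks the eglfs integration from the preferred GBM backend provider. The Python sketch below is a simplified model of what `bb.utils.contains()` does in that expression (a whitespace-split subset test), not BitBake's own code. It shows that tegra-libraries-gbm-backend selects eglfs_kms_egldevice, while any other provider, such as tegra-udrm-gbm, falls through to eglfs_kms.

```python
def contains(value: str, checkvalues: str, truevalue: str, falsevalue: str) -> str:
    """Simplified model of bb.utils.contains(): return `truevalue` when every
    whitespace-separated word of `checkvalues` appears in `value`."""
    if set(checkvalues.split()).issubset(value.split()):
        return truevalue
    return falsevalue


def default_eglfs_integration(gbm_rprovider: str) -> str:
    """Mirror the qtbase bbappend expression for a given tegra-gbm-backend provider."""
    return contains(gbm_rprovider, "tegra-libraries-gbm-backend",
                    "eglfs_kms_egldevice", "eglfs_kms")


if __name__ == "__main__":
    for provider in ("tegra-libraries-gbm-backend", "tegra-udrm-gbm"):
        print(f"{provider:30} -> {default_eglfs_integration(provider)}")
```

So switching the provider, presumably with `PREFERRED_RPROVIDER_tegra-gbm-backend = "tegra-udrm-gbm"` in local.conf, should also change the default Qt integration; that is worth confirming with bitbake-getvar after the change.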

Does using tegra-udrm-gbm require Wayland?

> Edit: looks like Matt's message crossed with my own. His answer sounds more promising, but I'll leave mine here for posterity.

kekiefer commented 2 weeks ago

> Does using tegra-udrm-gbm require Wayland?

No, it is a Mesa GBM backend for use with DRM/KMS.

kraj commented 2 weeks ago

> Does using tegra-udrm-gbm require Wayland?
>
> No, it is a Mesa GBM backend for use with DRM/KMS.

Using tegra-udrm-gbm does not work either:

Jul 10 23:17:45 jetson-agx-orin-devkit yoe-kiosk-browser[1022]: MESA-LOADER: failed to open tegra: /usr/lib/dri/tegra_dri.so: cannot open shared object file: No such file or directory (search paths /usr/lib/dri, suffix _dri)
Jul 10 23:17:45 jetson-agx-orin-devkit yoe-kiosk-browser[1022]: MESA-LOADER: failed to open kms_swrast: /usr/lib/dri/kms_swrast_dri.so: cannot open shared object file: No such file or directory (search paths /usr/lib/dri, suffix _dri)
Jul 10 23:17:45 jetson-agx-orin-devkit yoe-kiosk-browser[1022]: MESA-LOADER: failed to open swrast: /usr/lib/dri/swrast_dri.so: cannot open shared object file: No such file or directory (search paths /usr/lib/dri, suffix _dri)
Jul 10 23:17:45 jetson-agx-orin-devkit yoe-kiosk-browser[1022]: Could not create GBM device (No such file or directory)
Jul 10 23:17:45 jetson-agx-orin-devkit yoe-kiosk-browser[1022]: Could not open DRM device

The nvgpu crash reported above still remains.

kekiefer commented 2 weeks ago

One more thought: Weston normally pulls in the nvidia-drm-loadconf recipe. You won't get it here with a KMS-only setup that doesn't include Weston, but you'll need it for KMS to work properly in the end.
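
A related observation from the original crash log: the "Modules linked in" list includes nvidia, tegra_drm and drm_kms_helper but not nvidia_drm. That list is only a snapshot taken at the warning, about 13.9 s into boot, so nvidia_drm could have loaded later; its absence is suggestive rather than conclusive. The hypothetical Python helper below checks for a module either in such a log line or in /proc/modules, treating - and _ in module names as equivalent, as modprobe does.

```python
import re


def modules_from_log(line: str) -> set[str]:
    """Parse a kernel 'Modules linked in:' list, dropping taint flags like '(O)'."""
    _, _, tail = line.partition("Modules linked in:")
    return {re.sub(r"\(.*\)$", "", tok) for tok in tail.split()}


def modules_from_proc(path: str = "/proc/modules") -> set[str]:
    """Read currently loaded module names (first column of /proc/modules)."""
    with open(path) as f:
        return {line.split()[0] for line in f if line.strip()}


def is_loaded(name: str, modules: set[str]) -> bool:
    """Loaded module names use '_'; accept either spelling in `name`."""
    return name.replace("-", "_") in modules


if __name__ == "__main__":
    try:
        loaded = modules_from_proc()
    except FileNotFoundError:
        loaded = set()
    print("nvidia_drm loaded:", is_loaded("nvidia-drm", loaded))
```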

kraj commented 2 weeks ago

> Does using tegra-udrm-gbm require Wayland?
>
> No, it is a Mesa GBM backend for use with DRM/KMS.
>
> Using tegra-udrm-gbm does not work either:
>
> Jul 10 23:17:45 jetson-agx-orin-devkit yoe-kiosk-browser[1022]: MESA-LOADER: failed to open tegra: /usr/lib/dri/tegra_dri.so: cannot open shared object file: No such file or directory (search paths /usr/lib/dri, suffix _dri)
> Jul 10 23:17:45 jetson-agx-orin-devkit yoe-kiosk-browser[1022]: MESA-LOADER: failed to open kms_swrast: /usr/lib/dri/kms_swrast_dri.so: cannot open shared object file: No such file or directory (search paths /usr/lib/dri, suffix _dri)
> Jul 10 23:17:45 jetson-agx-orin-devkit yoe-kiosk-browser[1022]: MESA-LOADER: failed to open swrast: /usr/lib/dri/swrast_dri.so: cannot open shared object file: No such file or directory (search paths /usr/lib/dri, suffix _dri)
> Jul 10 23:17:45 jetson-agx-orin-devkit yoe-kiosk-browser[1022]: Could not create GBM device (No such file or directory)
> Jul 10 23:17:45 jetson-agx-orin-devkit yoe-kiosk-browser[1022]: Could not open DRM device
>
> The nvgpu crash reported above still remains.

Then I added mesa-megadriver to the image, which left me needing to enable the tegra Gallium driver in Mesa. Once that was added as a PACKAGECONFIG option, all the needed prerequisites were available to run:

root@jetson-agx-orin-devkit:~# ls -l /usr/lib/dri/
-rwxr-xr-x    1 root     root      15216240 May  8 14:27 kms_swrast_dri.so
-rwxr-xr-x    5 root     root      81222560 May  8 14:27 nouveau_dri.so
-rwxr-xr-x    5 root     root      81222560 May  8 14:27 swrast_dri.so
-rwxr-xr-x    5 root     root      81222560 May  8 14:27 tegra_dri.so
-rwxr-xr-x    5 root     root      81222560 May  8 14:27 virtio_gpu_dri.so
-rwxr-xr-x    5 root     root      81222560 May  8 14:27 zink_dri.so

That gets past the problem above, but sadly it now fails with:

Jul 10 23:59:56 jetson-agx-orin-devkit yoe-kiosk-browser[1328]: libnvrm_gpu.so: NvRmGpuLibOpen failed, error=4
Jul 10 23:59:56 jetson-agx-orin-devkit yoe-kiosk-browser[1328]: Could not open egl display
kekiefer commented 2 weeks ago

Mesa should not be falling back on any DRI driver to initialize DRM. The DRM implementation is provided by NVIDIA's libdrm, and Mesa only gets used for buffer allocation in GBM.

Your last message looks suspicious, though. Can you verify that the nvidia-drm kernel driver is loaded with the option modeset=1?
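One way to answer that on the target is to read the parameter back from sysfs. A minimal Python sketch, assuming nvidia-drm exposes `modeset` as a read-only bool under /sys/module/nvidia_drm/parameters/ (as NVIDIA's open kernel modules do; the module appears there with an underscore). Reading it may need root, and the function name is illustrative.

```python
from pathlib import Path
from typing import Optional


def nvidia_drm_modeset(sysfs_root: str = "/sys") -> Optional[bool]:
    """Return True/False for nvidia_drm's 'modeset' parameter.

    Returns None if the parameter file is absent, i.e. the module is not
    loaded (or this build does not expose the parameter).
    """
    param = Path(sysfs_root) / "module" / "nvidia_drm" / "parameters" / "modeset"
    if not param.is_file():
        return None
    # Bool module parameters read back as "Y" or "N".
    return param.read_text().strip() in ("Y", "1")
```

On the target, `nvidia_drm_modeset()` returning `None` means nvidia-drm is not loaded at all, which is a different problem from it being loaded with `modeset=N`.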

kraj commented 2 weeks ago

> One more thought: Weston normally pulls in the nvidia-drm-loadconf recipe. You won't get it that way with a KMS-only setup that doesn't include Weston, but you'll need it for KMS to work properly in the end.

I built nvidia-drm-loadconf and installed the ipk; it did not help much.

Jul 11 00:16:42 jetson-agx-orin-devkit yoe-kiosk-browser[1157]: libnvrm_gpu.so: NvRmGpuLibOpen failed, error=4
Jul 11 00:16:42 jetson-agx-orin-devkit yoe-kiosk-browser[1157]: Could not open egl display
kekiefer commented 2 weeks ago

That recipe is supposed to load nvidia-drm with the option modeset=1. If you haven't restarted, you could do that, or you could try manually unloading and reloading the module with that option.
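A sketch of the manual unload/reload, wrapped in Python so it can be dry-run first. The two commands are plain `modprobe` calls and need root. `modprobe -r` will fail while anything (a compositor, a kiosk app) still holds the DRM device open, and it may also remove now-unused dependencies such as nvidia-modeset, which the second command loads again. The function names are hypothetical.

```python
import subprocess
from typing import List


def reload_commands(module: str = "nvidia-drm", option: str = "modeset=1") -> List[List[str]]:
    """Commands that unload a module and load it again with one option."""
    return [["modprobe", "-r", module], ["modprobe", module, option]]


def reload_module(module: str = "nvidia-drm", option: str = "modeset=1",
                  dry_run: bool = True) -> None:
    for cmd in reload_commands(module, option):
        if dry_run:
            # Only show what would run.
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if modprobe fails,
            # e.g. because the module is still in use.
            subprocess.run(cmd, check=True)
```

Run `reload_module()` first to see the commands, then `reload_module(dry_run=False)` as root.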

kraj commented 2 weeks ago

> That recipe is supposed to load nvidia-drm with the option modeset=1. If you haven't restarted, you could do that, or you could try manually unloading and reloading the module with that option.

Right, I have rebooted.

root@jetson-agx-orin-devkit:~# cat /etc/modprobe.d/nvidia-drm.conf
options nvidia-drm modeset=1
kekiefer commented 2 weeks ago

Do you have a /dev/dri/by-path/platform-13800000.display-card?

kraj commented 2 weeks ago

> Do you have a /dev/dri/by-path/platform-13800000.display-card?

Yep:

root@jetson-agx-orin-devkit:~# ls -l /dev/dri/by-path/platform-13800000.display-card
lrwxrwxrwx    1 root     root             8 May  9 08:07 /dev/dri/by-path/platform-13800000.display-card -> ../card1
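Since cardN numbering (card1 here) depends on probe order, the by-path name is the safer one to put in configs. A small Python sketch that follows the link and reports which node it resolves to; the function name is illustrative.

```python
import os


def resolve_display_card(
        link: str = "/dev/dri/by-path/platform-13800000.display-card") -> str:
    """Follow the udev by-path symlink to the DRM card node it points at."""
    if not os.path.islink(link):
        raise FileNotFoundError(f"{link} is missing or not a symlink")
    return os.path.realpath(link)
```

On this board it should return /dev/dri/card1, matching the `ls -l` output above.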
kekiefer commented 2 weeks ago

Set up a file /usr/share/tegra.conf or something like that with these contents:

{
    "device": "/dev/dri/by-path/platform-13800000.display-card"
}

Then export these variables before launching your Qt application:

export QT_QPA_PLATFORM=eglfs
export QT_QPA_EGLFS_KMS_CONFIG=/usr/share/tegra.conf
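The same setup as a Python sketch: it writes the one-key config and builds an environment for launching the app. /usr/share/tegra.conf is the placeholder path from the comment, and the helper names are hypothetical.

```python
import json
import os
from typing import Dict, Mapping, Optional


def write_kms_config(path: str, device: str) -> None:
    """Write a minimal eglfs_kms JSON config that pins the DRM device."""
    with open(path, "w") as f:
        json.dump({"device": device}, f, indent=4)
        f.write("\n")


def eglfs_env(config_path: str,
              base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy of the environment with the two Qt variables above set."""
    env = dict(os.environ if base is None else base)
    env["QT_QPA_PLATFORM"] = "eglfs"
    env["QT_QPA_EGLFS_KMS_CONFIG"] = config_path
    return env
```

For example, `subprocess.run(["/usr/bin/yoe-kiosk-browser"], env=eglfs_env("/usr/share/tegra.conf"))`, using the kiosk binary that appears in the journal later in this thread.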
kraj commented 2 weeks ago

> Set up a file /usr/share/tegra.conf or something like that with these contents:
>
> {
>   "device": "/dev/dri/by-path/platform-13800000.display-card"
> }
>
> Then export these variables before launching your Qt application:
>
> export QT_QPA_PLATFORM=eglfs
> export QT_QPA_EGLFS_KMS_CONFIG=/usr/share/tegra.conf

It's already doing this in the .service file:

...
Environment=QT_QPA_EGLFS_INTEGRATION=eglfs_kms
Environment=QT_QPA_EGLFS_KMS_CONFIG=/etc/default/eglfs.json

and /etc/default/eglfs.json is:

root@jetson-agx-orin-devkit:~# cat /etc/default/eglfs.json
{
  "device": "/dev/dri/by-path/platform-13800000.display-card",
  "hwcursor": false,
  "pbuffers": true,
  "outputs": [
    {
      "name": "LVDS-1",
      "mode": "1024x600"
    }
  ]
}
root@jetson-agx-orin-devkit:~#
kekiefer commented 2 weeks ago

I guess it's also worth asking whether you're still seeing the DMA allocation failures from the GPU in the kernel logs, because if you are, maybe this is all just a tangent.

kraj commented 2 weeks ago

> I guess it's also worth asking whether you're still seeing the DMA allocation failures from the GPU in the kernel logs, because if you are, maybe this is all just a tangent.

Yes, I am seeing those messages consistently, as mentioned.

root@jetson-agx-orin-devkit:~# systemctl --failed
  UNIT                                 LOAD   ACTIVE SUB    DESCRIPTION
● nvpmodel.service                     loaded failed failed NVIDIA power model daemon
● systemd-networkd-wait-online.service loaded failed failed Wait for Network to be Configured
● yoe-kiosk-browser.service            loaded failed failed Yoe Kiosk Browser
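To check this quickly, a Python sketch that pulls the `cma_alloc ... alloc failed` lines out of the kernel log and converts each request to MiB. It assumes 4 KiB pages; under that assumption the failing request in the original report, 62864 pages, is 62864 * 4096 bytes, about 245.6 MiB of physically contiguous memory, and `ret: -12` is -ENOMEM. The regex is written against that one log line and is an assumption about the general format.

```python
import re
from typing import Iterable, List, NamedTuple

# Matches lines such as:
#   cma: cma_alloc: linux,cma: alloc failed, req-size: 62864 pages, ret: -12
CMA_FAIL = re.compile(
    r"cma_alloc: (?P<area>[^:]+): alloc failed, "
    r"req-size: (?P<pages>\d+) pages, ret: (?P<ret>-?\d+)"
)


class CmaFailure(NamedTuple):
    area: str
    pages: int
    ret: int

    def mib(self, page_size: int = 4096) -> float:
        # Assumes 4 KiB pages unless told otherwise (see `getconf PAGESIZE`).
        return self.pages * page_size / (1024 * 1024)


def cma_failures(lines: Iterable[str]) -> List[CmaFailure]:
    """Collect every CMA allocation failure found in the given log lines."""
    out = []
    for line in lines:
        m = CMA_FAIL.search(line)
        if m:
            out.append(CmaFailure(m["area"], int(m["pages"]), int(m["ret"])))
    return out
```

Feed it `dmesg` or `journalctl -k` output, and compare the sizes against the `CmaTotal` and `CmaFree` lines in /proc/meminfo to see whether the CMA area is simply too small for the request.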
kraj commented 2 weeks ago

Here is a snippet of the journal showing this whenever yoe-kiosk-browser.service or nvpmodel.service is started or restarted:

Jul 11 00:48:59 jetson-agx-orin-devkit audit: BPF prog-id=47 op=UNLOAD
Jul 11 00:48:59 jetson-agx-orin-devkit audit: BPF prog-id=46 op=UNLOAD
Jul 11 00:48:59 jetson-agx-orin-devkit systemd[1]: yoe-kiosk-browser.service: Scheduled restart job, restart counter is at 4.
Jul 11 00:48:59 jetson-agx-orin-devkit systemd[1]: Started Yoe Kiosk Browser.
Jul 11 00:48:59 jetson-agx-orin-devkit yoe-kiosk-browser[1262]: YOE_KIOSK_BROWSER_URL= "http://localhost:8118"
Jul 11 00:48:59 jetson-agx-orin-devkit yoe-kiosk-browser[1262]: YOE_KIOSK_BROWSER_EXCEPTION_URL= "@EXCEPTION_URL@"
Jul 11 00:48:59 jetson-agx-orin-devkit yoe-kiosk-browser[1262]: YOE_KIOSK_BROWSER_ROTATE= "0"
Jul 11 00:48:59 jetson-agx-orin-devkit yoe-kiosk-browser[1262]: YOE_KIOSK_BROWSER_KEYBOARD_SCALE= "1"
Jul 11 00:48:59 jetson-agx-orin-devkit yoe-kiosk-browser[1262]: YOE_KIOSK_BROWSER_RETRY_INTERVAL= "10"
Jul 11 00:48:59 jetson-agx-orin-devkit kernel: tegra-hsierrrptinj: Callback for 0x0001 already registered
Jul 11 00:48:59 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu                nvgpu_cic_mon_init:75   [ERR]  Err inj callback registration failed: -22
Jul 11 00:48:59 jetson-agx-orin-devkit kernel[854]: [  282.985455] tegra-hsierrrptinj: Callback for 0x0001 already registered
Jul 11 00:48:59 jetson-agx-orin-devkit kernel[854]: [  282.985466] nvgpu: 17000000.gpu                nvgpu_cic_mon_init:75   [ERR]  Err inj callback registration fai
Jul 11 00:48:59 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu engine_fb_queue_set_element_use_state:144  [ERR]  FBQ last received queue element not processed yet
Jul 11 00:48:59 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu        nvgpu_engine_fb_queue_push:373  [ERR]  fb-queue element in use map is in invalid state
Jul 11 00:48:59 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu        nvgpu_engine_fb_queue_push:401  [ERR]  falcon id-0, queue id-1, failed
Jul 11 00:48:59 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu                     pmu_write_cmd:178  [ERR]  fail to write cmd to queue 1
Jul 11 00:48:59 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu             nvgpu_pmu_rpc_execute:727  [ERR]  Failed to execute RPC status=0xffffffea, func=0x3
Jul 11 00:48:59 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu gv100_pmu_lsfm_bootstrap_ls_falcon:100  [ERR]  Failed to execute RPC, status=0xffffffea
Jul 11 00:48:59 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu nvgpu_pmu_lsfm_bootstrap_ls_falcon:128  [ERR]  LSF Load failed
Jul 11 00:48:59 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu nvgpu_gr_falcon_load_secure_ctxsw_ucode:718  [ERR]  Unable to recover GR falcon
Jul 11 00:48:59 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu        nvgpu_gr_falcon_init_ctxsw:156  [ERR]  fail
Jul 11 00:48:59 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu nvgpu_cic_mon_report_err_safety_services:97   [ERR]  Error reporting is not supported in this platform
Jul 11 00:48:59 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu      gr_init_ctxsw_falcon_support:857  [ERR]  FECS context switch init error
Jul 11 00:48:59 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu            nvgpu_finalize_poweron:1095 [ERR]  Failed initialization for: g->ops.gr.gr_init_support
Jul 11 00:48:59 jetson-agx-orin-devkit kernel[854]: [  283.032139] nvgpu: 17000000.gpu engine_fb_queue_set_element_use_state:144  [ERR]  FBQ last received queue element not processed yet queue_pos 0
Jul 11 00:48:59 jetson-agx-orin-devkit kernel[854]: [  283.032152] nvgpu: 17000000.gpu        nvgpu_engine_fb_queue_push:373  [ERR]  fb-queue element in use map is in invalid state
Jul 11 00:48:59 jetson-agx-orin-devkit kernel[854]: [  283.032157] nvgpu: 17000000.gpu        nvgpu_engine_fb_queue_push:401  [ERR]  falcon id-0, queue id-1, failed
Jul 11 00:48:59 jetson-agx-orin-devkit kernel[854]: [  283.032162] nvgpu: 17000000.gpu                     pmu_write_cmd:178  [ERR]  fail to write cmd to queue 1
Jul 11 00:48:59 jetson-agx-orin-devkit kernel[854]: [  283.032167] nvgpu: 17000000.gpu             nvgpu_pmu_rpc_execute:727  [ERR]  Failed to execute RPC status=0xffffffea, func=0x3
Jul 11 00:48:59 jetson-agx-orin-devkit kernel[854]: [  283.032171] nvgpu: 17000000.gpu gv100_pmu_lsfm_bootstrap_ls_falcon:100  [ERR]  Failed to execute RPC, status=0xffffffea
Jul 11 00:48:59 jetson-agx-orin-devkit kernel[854]: [  283.032174] nvgpu: 17000000.gpu nvgpu_pmu_lsfm_bootstrap_ls_falcon:128  [ERR]  LSF Load failed
Jul 11 00:48:59 jetson-agx-orin-devkit kernel[854]: [  283.032179] nvgpu: 17000000.gpu nvgpu_gr_falcon_load_secure_ctxsw_ucode:718  [ERR]  Unable to recover GR falcon
Jul 11 00:48:59 jetson-agx-orin-devkit kernel[854]: [  283.032182] nvgpu: 17000000.gpu        nvgpu_gr_falcon_init_ctxsw:156  [ERR]  fail
Jul 11 00:48:59 jetson-agx-orin-devkit kernel[854]: [  283.032190] nvgpu: 17000000.gpu nvgpu_cic_mon_report_err_safety_services:97   [ERR]  Error reporting is not supported in this platform
Jul 11 00:48:59 jetson-agx-orin-devkit kernel[854]: [  283.032194] nvgpu: 17000000.gpu      gr_init_ctxsw_falcon_support:857  [ERR]  FECS context switch init error
Jul 11 00:48:59 jetson-agx-orin-devkit kernel[854]: [  283.032198] nvgpu: 17000000.gpu            nvgpu_finalize_poweron:1095 [ERR]  Failed initialization for: g->ops.gr.gr_init_support
Jul 11 00:48:59 jetson-agx-orin-devkit yoe-kiosk-browser[1262]: libnvrm_gpu.so: NvRmGpuLibOpen failed, error=4
Jul 11 00:48:59 jetson-agx-orin-devkit yoe-kiosk-browser[1262]: Could not open egl display
Jul 11 00:48:59 jetson-agx-orin-devkit audit[1262]: ANOM_ABEND auid=4294967295 uid=0 gid=0 ses=4294967295 subj=kernel pid=1262 comm="yoe-kiosk-brows" exe="/usr/bin/yoe-kiosk-browser" sig=6 res=1
Jul 11 00:48:59 jetson-agx-orin-devkit kernel: nvgpu: 17000000.gpu                 gk20a_power_write:127  [ERR]  power_node_write failed at busy
Jul 11 00:48:59 jetson-agx-orin-devkit kernel[854]: [  283.056999] nvgpu: 17000000.gpu                 gk20a_power_write:127  [ERR]  power_node_write failed at busy
Jul 11 00:48:59 jetson-agx-orin-devkit audit: BPF prog-id=49 op=LOAD
Jul 11 00:48:59 jetson-agx-orin-devkit audit: BPF prog-id=50 op=LOAD
Jul 11 00:48:59 jetson-agx-orin-devkit audit: BPF prog-id=51 op=LOAD
Jul 11 00:48:59 jetson-agx-orin-devkit systemd[1]: Started Process Core Dump (PID 1265/UID 0).
Jul 11 00:48:59 jetson-agx-orin-devkit systemd-coredump[1266]: elfutils disabled, parsing ELF objects not supported
Jul 11 00:48:59 jetson-agx-orin-devkit systemd-coredump[1266]: [🡕] Process 1262 (yoe-kiosk-brows) of user 0 dumped core.
Jul 11 00:48:59 jetson-agx-orin-devkit systemd[1]: systemd-coredump@9-1265-0.service: Deactivated successfully.
Jul 11 00:48:59 jetson-agx-orin-devkit systemd[1]: yoe-kiosk-browser.service: Main process exited, code=dumped, status=6/ABRT
Jul 11 00:48:59 jetson-agx-orin-devkit systemd[1]: yoe-kiosk-browser.service: Failed with result 'core-dump'.
Jul 11 00:49:00 jetson-agx-orin-devkit audit: BPF prog-id=51 op=UNLOAD
Jul 11 00:49:00 jetson-agx-orin-devkit audit: BPF prog-id=50 op=UNLOAD
Jul 11 00:49:00 jetson-agx-orin-devkit audit: BPF prog-id=49 op=UNLOAD
Jul 11 00:49:00 jetson-agx-orin-devkit systemd[1]: yoe-kiosk-browser.service: Scheduled restart job, restart counter is at 5.
Jul 11 00:49:00 jetson-agx-orin-devkit systemd[1]: yoe-kiosk-browser.service: Start request repeated too quickly.
Jul 11 00:49:00 jetson-agx-orin-devkit systemd[1]: yoe-kiosk-browser.service: Failed with result 'core-dump'.
Jul 11 00:49:00 jetson-agx-orin-devkit systemd[1]: Failed to start Yoe Kiosk Browser.
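Every kernel line in this journal appears twice (once as `kernel:` and once as `kernel[854]: [timestamp]`), so the order of failures is easier to follow once deduplicated. A Python sketch that extracts the `nvgpu ... [ERR]` lines in first-seen order; the regex is written against the lines above and is an assumption about nvgpu's log format in general. Read top to bottom, after the error-injection callback warning the chain appears to start in the PMU command queue (engine_fb_queue_*, pmu_write_cmd) before the LSF, GR falcon and FECS errors.

```python
import re
from typing import Iterable, List, Tuple

# "nvgpu: 17000000.gpu   <function>:<line>  [ERR]  <message>"
NVGPU_ERR = re.compile(
    r"nvgpu: (?P<dev>\S+)\s+(?P<func>\w+):(?P<line>\d+)\s+\[ERR\]\s+(?P<msg>.*)"
)


def nvgpu_error_chain(lines: Iterable[str]) -> List[Tuple[str, int, str]]:
    """Return unique (function, line, message) tuples in first-seen order.

    Entries are de-duplicated on (function, line) because the journal
    carries each kernel message twice.
    """
    seen = set()
    chain = []
    for text in lines:
        m = NVGPU_ERR.search(text)
        if not m:
            continue
        key = (m["func"], int(m["line"]))
        if key not in seen:
            seen.add(key)
            chain.append((m["func"], int(m["line"]), m["msg"].strip()))
    return chain
```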
kekiefer commented 2 weeks ago

@ichergui can you check on an Orin devkit whether the /sys/class/hwmon/ nodes are set up the way @kraj sees them? The one it's bailing out on looks like it is expected to be an ina3221, which I don't have on my custom board.
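For comparing boards, a Python sketch that inventories /sys/class/hwmon: each hwmonN's `name` and any `*_label` files it carries. Which label nvpower.sh actually expects isn't shown in this excerpt, so treat the output as something to diff between the devkit and @kraj's image; the function name is illustrative.

```python
from pathlib import Path
from typing import Dict, Tuple


def hwmon_inventory(root: str = "/sys/class/hwmon") -> Dict[str, Tuple[str, Dict[str, str]]]:
    """Map hwmonN -> (driver name, {label attribute: label text})."""
    inventory: Dict[str, Tuple[str, Dict[str, str]]] = {}
    base = Path(root)
    if not base.is_dir():
        return inventory
    for hw in sorted(base.iterdir()):
        name_file = hw / "name"
        name = name_file.read_text().strip() if name_file.is_file() else "?"
        labels = {p.name: p.read_text().strip() for p in sorted(hw.glob("*_label"))}
        inventory[hw.name] = (name, labels)
    return inventory
```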

kekiefer commented 2 weeks ago

For lack of anything better to try: it looks like the devkit sets up one of the ina3221s like this:

/hardware/nvidia/t23x/nv-public/nv-platform/tegra234-p3701-0000.dtsi

/ {
    bus@0 {
        i2c@c240000 {
...
            ina3221@41 {
                compatible = "ti,ina3221";
                reg = <0x41>;
                #address-cells = <1>;
                #size-cells = <0>;
                channel@0 {
                    reg = <0x0>;
                    status = "disabled";
                };
                channel@1 {
                    reg = <0x1>;
                    label = "VDDQ_VDD2_1V8AO";
                    shunt-resistor-micro-ohms = <2000>;
                };
                channel@2 {
                    reg = <0x2>;
                    status = "disabled";
                };
            };
...

That could explain the missing label nodes. You could perhaps try removing this i2c device before running nvpower.sh, or patch the device tree, which is built by the nvidia-kernel-oot recipe.
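To try the first option without rebuilding, the device can be detached from the ina3221 driver at runtime through sysfs. The i2c bus number behind i2c@c240000 isn't given here, so this Python sketch looks the device up by driver name and address (0x41, from the dtsi above) instead of hard-coding it. It needs root and only lasts until the next reboot, so the devicetree patch is the persistent route. Both helper names are hypothetical.

```python
import re
from pathlib import Path
from typing import List

# i2c client names have the form "<bus>-<4-digit hex address>", e.g. "1-0041".
I2C_DEV = re.compile(r"^\d+-[0-9a-f]{4}$")


def bound_i2c_devices(driver: str = "ina3221", addr: int = 0x41,
                      root: str = "/sys/bus/i2c/drivers") -> List[str]:
    """Devices bound to an i2c driver at the given address, on any bus."""
    drv = Path(root) / driver
    if not drv.is_dir():
        return []
    suffix = f"-{addr:04x}"
    return sorted(p.name for p in drv.iterdir()
                  if I2C_DEV.match(p.name) and p.name.endswith(suffix))


def unbind(driver: str, device: str, root: str = "/sys/bus/i2c/drivers") -> None:
    """Detach a device from its driver by writing its name to 'unbind'."""
    (Path(root) / driver / "unbind").write_text(device)
```

Usage as root: `for dev in bound_i2c_devices(): unbind("ina3221", dev)`, then rerun nvpower.sh.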

kekiefer commented 2 weeks ago

I tracked down an Orin devkit and installed demo-image-egl from our tegra demo distro, with a local modification setting PREFERRED_RPROVIDER_tegra-gbm-backend = "tegra-udrm-gbm".

It looks like the warning from nvpower.sh is "normal".

Next I installed kmscube and nvidia-drm-loadconf, and rebooted. On the next boot, the nvidia-drm module was not loaded automatically, which is a problem.

After modprobe nvidia-drm modeset=1, I get one warning from the kernel:

[   62.470329] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for aarch64  540.3.0  Release Build  (@duckhawk)  Thu 11 Jul 2024 07:07:12 PM UTC
[   62.477555] [drm] [nvidia-drm] [GPU ID 0x00020000] Loading driver
[   62.992520] NVRM nvAssertFailedNoLog: Assertion failed: minRequiredIsoBandwidthKBPS <= clientBwValues[DISPLAY_ICC_BW_CLIENT_EXT].minRequiredIsoBandwidthKBPS @ kern_disp_0402.c:111
[   62.992532] CPU: 2 PID: 1115 Comm: kworker/u24:6 Tainted: G           O      5.15.136-l4t-r36.3-1009.9+g46cdb595bebc #1
[   62.992537] Hardware name: NVIDIA NVIDIA Jetson AGX Orin Developer Kit/Jetson, BIOS v36.3.0 01/08/2024
[   62.992539] Workqueue: dce-async-ipc-wq dce_client_async_event_work [tegra_dce]
[   62.992563] Call trace:
[   62.992564]  dump_backtrace+0x0/0x1e0
[   62.992577]  show_stack+0x34/0x44
[   62.992582]  dump_stack_lvl+0x68/0x84
[   62.992588]  dump_stack+0x18/0x34
[   62.992590]  os_dump_stack+0x1c/0x28 [nvidia]
[   62.992721]  nvAssertFailedBacktrace.part.0+0x80/0xa0 [nvidia]
[   62.992830]  kdispArbAndAllocDisplayBandwidth_v04_02+0x240/0x260 [nvidia]
[   62.992938]  kdispInvokeDisplayModesetCallback_KERNEL+0xa8/0x100 [nvidia]
[   62.993042]  osTegraDceClientIpcCallback+0x84/0xc0 [nvidia]
[   62.993147]  dce_client_async_event_work+0x90/0x18c [tegra_dce]
[   62.993156]  process_one_work+0x208/0x4e0
[   62.993164]  worker_thread+0x74/0x4a0
[   62.993167]  kthread+0x180/0x198
[   62.993172]  ret_from_fork+0x10/0x20
[   63.116412] [drm] Initialized nvidia-drm 0.0.0 20160202 for 13800000.display on minor 1

Even so, running kmscube works after that, and I do not see any of the other GPU kernel traces mentioned earlier.

I noticed that we were looking at nvpower.service earlier, but your failure is in nvpmodel.service. Maybe we should proceed by looking into that. On this devkit it starts fine:

root@jetson-agx-orin-devkit:~# systemctl status nvpmodel.service
● nvpmodel.service - NVIDIA power model daemon
     Loaded: loaded (/usr/lib/systemd/system/nvpmodel.service; enabled; preset: enabled)
     Active: active (exited) since Thu 2024-07-11 20:27:25 UTC; 5min ago
    Process: 953 ExecStart=/usr/sbin/nvpmodel -f /etc/nvpmodel.conf (code=exited, status=0/SUCCES>
   Main PID: 953 (code=exited, status=0/SUCCESS)
        CPU: 96ms

Jul 11 20:27:23 jetson-agx-orin-devkit systemd[1]: Starting NVIDIA power model daemon...
Jul 11 20:27:25 jetson-agx-orin-devkit systemd[1]: Finished NVIDIA power model daemon.
kekiefer commented 2 weeks ago

Regarding nvidia-drm-loadconf, it turns out it's working fine. I installed a version from the wrong package feed that was missing the modules-load.d entry. Installing the correct version fixed that. The rest of my comment stands.
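Since the missing piece turned out to be the modules-load.d entry, a quick way to confirm it on an image is to list which modules-load.d files name nvidia-drm. Note that /etc/modprobe.d/nvidia-drm.conf (shown earlier) only supplies options; it does not cause the module to be loaded at boot. A Python sketch using the standard systemd search directories; the function name is illustrative.

```python
from pathlib import Path
from typing import Dict, Iterable, List

DEFAULT_DIRS = ("/etc/modules-load.d", "/run/modules-load.d", "/usr/lib/modules-load.d")


def modules_load_entries(dirs: Iterable[str] = DEFAULT_DIRS) -> Dict[str, List[str]]:
    """Map module name (dashes folded to underscores) -> files listing it.

    Simplification: systemd lets a file in /etc mask a same-named file in
    /usr/lib; this sketch just reports every *.conf file it finds.
    """
    entries: Dict[str, List[str]] = {}
    for d in dirs:
        base = Path(d)
        if not base.is_dir():
            continue
        for conf in sorted(base.glob("*.conf")):
            for raw in conf.read_text().splitlines():
                line = raw.strip()
                # Blank lines and lines starting with '#' or ';' are comments.
                if not line or line[0] in "#;":
                    continue
                entries.setdefault(line.replace("-", "_"), []).append(str(conf))
    return entries
```

If `"nvidia_drm"` is not a key of `modules_load_entries()`, nothing will load the module at boot.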

kraj commented 2 weeks ago

@kekiefer I tried disabling the i2c device in the DT; it does not help. The second part is about the failing nvpmodel.service; here is the service error message:

root@jetson-agx-orin-devkit:~# systemctl status nvpmodel
× nvpmodel.service - NVIDIA power model daemon
     Loaded: loaded (/usr/lib/systemd/system/nvpmodel.service; enabled; preset: enabled)
     Active: failed (Result: exit-code) since Tue 2024-06-11 21:42:11 UTC; 4 weeks 2 days ago
 Invocation: 55eef1780ca04d6b8790bd0d7f694df1
    Process: 1049 ExecStart=/usr/sbin/nvpmodel -f /etc/nvpmodel.conf (code=exited, status=255/EXCEPTION)
   Main PID: 1049 (code=exited, status=255/EXCEPTION)

Jun 11 21:42:07 jetson-agx-orin-devkit systemd[1]: Starting NVIDIA power model daemon...
Jun 11 21:42:11 jetson-agx-orin-devkit nvpmodel[1049]: NVPM ERROR: Error opening /sys/devices/platform/17000000.gpu/devfreq_dev/available_frequencies: 2
Jun 11 21:42:11 jetson-agx-orin-devkit nvpmodel[1049]: NVPM ERROR: failed to read PARAM GPU: ARG FREQ_TABLE: PATH /sys/devices/platform/17000000.gpu/devfr…frequencies
Jun 11 21:42:11 jetson-agx-orin-devkit nvpmodel[1049]: NVPM ERROR: failed to set power mode!
Jun 11 21:42:11 jetson-agx-orin-devkit nvpmodel[1049]: NVPM ERROR: optMask is 2, no request for power mode
Jun 11 21:42:11 jetson-agx-orin-devkit systemd[1]: nvpmodel.service: Main process exited, code=exited, status=255/EXCEPTION
Jun 11 21:42:11 jetson-agx-orin-devkit systemd[1]: nvpmodel.service: Failed with result 'exit-code'.
Jun 11 21:42:11 jetson-agx-orin-devkit systemd[1]: Failed to start NVIDIA power model daemon.

Not sure why /sys/devices/platform/17000000.gpu/devfreq_dev/available_frequencies is missing; in fact, the devfreq_dev directory itself is missing.

yoe-kiosk-browser is failing too, and the following messages appearing in the journal might be of interest:

Jul 12 04:26:04 jetson-agx-orin-devkit yoe-kiosk-browser[1321]: libnvrm_gpu.so: NvRmGpuLibOpen failed, error=4
Jul 12 04:26:04 jetson-agx-orin-devkit yoe-kiosk-browser[1321]: eglQueryDevicesEXT could not find any EGL devices
Jul 12 04:26:04 jetson-agx-orin-devkit yoe-kiosk-browser[1321]: Could not set up EGL device!
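To see whether the GPU registered with devfreq at all, a Python sketch that reads the node nvpmodel complains about and, as a fallback, lists everything under /sys/class/devfreq. If nothing GPU-related shows up in that list, the problem is likely in nvgpu's devfreq setup rather than in nvpmodel itself. The function names are hypothetical.

```python
from pathlib import Path
from typing import List, Optional

GPU_DEV = "/sys/devices/platform/17000000.gpu"


def gpu_available_frequencies(gpu_dir: str = GPU_DEV) -> Optional[List[int]]:
    """Frequencies listed by the node nvpmodel reads (devfreq reports Hz).

    Returns None when the devfreq_dev node is absent, as in the log above.
    """
    node = Path(gpu_dir) / "devfreq_dev" / "available_frequencies"
    if not node.is_file():
        return None
    return [int(tok) for tok in node.read_text().split()]


def registered_devfreq_devices(root: str = "/sys/class/devfreq") -> List[str]:
    """Names of all devfreq devices the kernel has registered."""
    base = Path(root)
    return sorted(p.name for p in base.iterdir()) if base.is_dir() else []
```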
kraj commented 2 weeks ago

BTW, I'm also seeing that 4 cores are marked offline. Do you see this as well?

kekiefer commented 2 weeks ago

Yes, this is part of the default 30W power model, unless changed from userspace with nvpmodel or via NVPMODEL_CONFIG_DEFAULT in the build.
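To see which cores the active power model has taken offline, the kernel's cpulist files can be read directly. A Python sketch that parses the cpulist format (for example "0-3,8") used by /sys/devices/system/cpu/online and /offline; the function names are illustrative.

```python
from pathlib import Path
from typing import Set, Tuple


def parse_cpulist(text: str) -> Set[int]:
    """Parse the kernel cpulist format, e.g. "0-3,8" -> {0, 1, 2, 3, 8}."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means "no CPUs"
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpu_state(root: str = "/sys/devices/system/cpu") -> Tuple[Set[int], Set[int]]:
    """Return (online, offline) CPU sets."""
    base = Path(root)
    online = parse_cpulist((base / "online").read_text())
    offline_file = base / "offline"
    offline = parse_cpulist(offline_file.read_text()) if offline_file.is_file() else set()
    return online, offline
```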

kekiefer commented 2 weeks ago

gk20a_scale_init in nvgpu/drivers/gpu/nvgpu/os/linux/scale.c (from the nvidia-kernel-oot package) seems to be what sets up that devfreq_governor node. There are quite a few paths through it that can result in the node not being set up, so it's not yet clear what might be failing.

kraj commented 2 weeks ago

CONFIG_GK20A_DEVFREQ seems to be unset in .config; could that be an issue?

kekiefer commented 2 weeks ago

That was my initial thought, but actually CONFIG_GK20A_DEVFREQ is just part of the out-of-tree configuration and should be enabled when the kernel has CONFIG_COMMON_CLK and CONFIG_PM_DEVFREQ.

Are you installing all kernel modules on the target, from both the out-of-tree collection and the kernel?
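Since CONFIG_GK20A_DEVFREQ comes from the out-of-tree build rather than the kernel's own .config, the thing to check in the kernel config is its prerequisites. A Python sketch that parses a .config (or /proc/config.gz, which only exists if the kernel was built with CONFIG_IKCONFIG_PROC) and reports whether CONFIG_COMMON_CLK and CONFIG_PM_DEVFREQ are enabled. The helper names are hypothetical.

```python
import gzip
import re
from pathlib import Path
from typing import Dict

CONFIG_LINE = re.compile(r"^(CONFIG_\w+)=(.*)$")
NOT_SET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_kconfig(text: str) -> Dict[str, str]:
    """Map option -> value; '# CONFIG_X is not set' is recorded as 'n'."""
    opts: Dict[str, str] = {}
    for line in text.splitlines():
        m = CONFIG_LINE.match(line)
        if m:
            opts[m.group(1)] = m.group(2)
            continue
        m = NOT_SET.match(line)
        if m:
            opts[m.group(1)] = "n"
    return opts


def read_running_config(path: str = "/proc/config.gz") -> str:
    """Config of the running kernel (needs CONFIG_IKCONFIG_PROC)."""
    return gzip.decompress(Path(path).read_bytes()).decode()


def devfreq_prereqs(opts: Dict[str, str]) -> Dict[str, bool]:
    wanted = ("CONFIG_COMMON_CLK", "CONFIG_PM_DEVFREQ")
    return {k: opts.get(k, "n") in ("y", "m") for k in wanted}
```

Usage: `devfreq_prereqs(parse_kconfig(read_running_config()))`, or pass the text of the build's .config instead.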

kraj commented 2 weeks ago

Seems to be OK. Here is my lsmod:

root@jetson-agx-orin-devkit:~# lsmod
Module                  Size  Used by
nvvrs_pseq_rtc         16384  0
rtk_btusb              77824  0
snd_hda_codec_hdmi     69632  1
mttcan                 69632  0
bluetooth             458752  2 rtk_btusb
tegra23x_perf_uncore    24576  0
nvethernet           1179648  0
snd_hda_tegra          16384  0
can_dev                40960  1 mttcan
ecdh_generic           16384  1 bluetooth
ecc                    36864  1 ecdh_generic
tegra234_aon           57344  1
tegra_mce              28672  1 tegra23x_perf_uncore
snd_hda_codec         139264  2 snd_hda_codec_hdmi,snd_hda_tegra
nvpmodel_clk_cap       16384  0
thermal_trip_event     16384  0
tegra234_oc_event      16384  0
nvpps                  32768  2 mttcan,nvethernet
rtl8822ce            3362816  0
tegra_cactmon_mc_all    16384  0
snd_hda_core          102400  3 snd_hda_codec_hdmi,snd_hda_codec,snd_hda_tegra
nvidia               1626112  0
pwm_tegra_tachometer    16384  0
at24                   24576  0
spi_tegra114           28672  0
i2c_nvvrs11            16384  0
pwm_tegra              20480  1
lm90                   28672  0
nvidia_vrs_pseq        16384  0
host1x_fence           20480  0
tegra_bpmp_thermal     16384  0
tegra_dce             110592  2 nvidia
mc_hwpm                16384  0
nvhost_isp5            16384  0
nvhost_vi5             20480  0
nvhost_nvcsi_t194      16384  0
tegra_camera          245760  3 nvhost_isp5,nvhost_nvcsi_t194,nvhost_vi5
v4l2_dv_timings        36864  1 tegra_camera
v4l2_fwnode            20480  1 tegra_camera
v4l2_async             24576  2 v4l2_fwnode,tegra_camera
videobuf2_dma_contig    24576  1 tegra_camera
videobuf2_memops       20480  1 videobuf2_dma_contig
nvhost_nvcsi           24576  1 tegra_camera
tegra_camera_platform    24576  4 nvhost_isp5,nvhost_nvcsi_t194,tegra_camera,nvhost_vi5
capture_ivc            28672  1 tegra_camera
cfg80211              856064  1 rtl8822ce
rfkill                 36864  4 bluetooth,cfg80211
governor_userspace     16384  0
tegra_camera_rtcpu    229376  2 capture_ivc,tegra_camera
ivc_bus                24576  2 capture_ivc,tegra_camera_rtcpu
hsp_mailbox_client     20480  2 ivc_bus,tegra_camera_rtcpu
ivc_ext                20480  2 ivc_bus,capture_ivc
videobuf2_v4l2         32768  1 tegra_camera
tegra_drm             372736  0
videobuf2_common       65536  4 videobuf2_dma_contig,videobuf2_v4l2,tegra_camera,videobuf2_memops
nvhost_pva            167936  0
nvhost_nvdla          110592  0
tegra_wmark            16384  0
videodev              266240  4 v4l2_async,videobuf2_v4l2,tegra_camera,videobuf2_common
mc                     61440  4 videodev,videobuf2_v4l2,tegra_camera,videobuf2_common
nvhost_capture         20480  2 nvhost_isp5,nvhost_vi5
nvhwpm                139264  4 mc_hwpm,tegra_drm,nvhost_nvdla,nvhost_pva
tegra_se               57344  0
cec                    57344  1 tegra_drm
crypto_engine          20480  1 tegra_se
tsecriscv              32768  1 nvidia
host1x_nvhost          40960  9 nvhost_isp5,nvhost_nvcsi_t194,nvidia,tegra_camera,nvhost_nvdla,nvhost_capture,nvhost_nvcsi,nvhost_pva,nvhost_vi5
drm_kms_helper        303104  1 tegra_drm
pwm_fan                20480  0
nvgpu                2793472  0
governor_pod_scaling    45056  0
nvmap                 237568  1 nvgpu
nvsciipc               24576  1 nvmap
host1x                208896  7 host1x_nvhost,host1x_fence,tegra_se,nvgpu,tegra_drm,nvhost_nvdla,nvhost_pva
mc_utils               16384  3 nvidia,nvgpu,tegra_camera_platform
ina3221                24576  0
drm                   630784  4 drm_kms_helper,nvidia,tegra_drm
ipv6                  503808  62
nvme                   49152  0
nvme_core             106496  1 nvme
tegra_xudc             45056  0
ucsi_ccg               28672  0
typec_ucsi             36864  1 ucsi_ccg
typec                  61440  1 typec_ucsi
pcie_tegra194          40960  0
phy_tegra194_p2u       16384  13

Installed modules:

root@jetson-agx-orin-devkit:~# opkg list-installed | grep nvidia-kernel
nvidia-kernel-oot-base - 36.3.0-r0.1.0
nvidia-kernel-oot-cameras - 36.3.0-r0.1.0
nvidia-kernel-oot-canbus - 36.3.0-r0.1.0
nvidia-kernel-oot-display - 36.3.0-r0.1.0
nvidia-kernel-oot-wifi - 36.3.0-r0.1.0
root@jetson-agx-orin-devkit:~# opkg list-installed | grep nv-kernel
nv-kernel-module-ar1335-common-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-arm64-ras-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-bmi088-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-cam-cdi-tsc-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-cam-fsync-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-camchar-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-capture-ivc-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-cdi-dev-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-cdi-gpio-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-cdi-mgr-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-cdi-pwm-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-cpuidle-debugfs-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-cpuidle-tegra-auto-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-fusb301-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-governor-pod-scaling-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-host1x-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-host1x-fence-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-host1x-nvhost-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-hsp-mailbox-client-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-i2c-nvvrs11-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-isc-dev-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-isc-gpio-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-isc-mgr-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-isc-pwm-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-ivc-bus-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-ivc-cdev-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-ivc-ext-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-lt6911uxc-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-max9295-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-max9296-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-max96712-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-maxim-gmsl-dp-serializer-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-maxim-gmsl-hdmi-serializer-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-mc-hwpm-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-mc-utils-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-mttcan-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nv-ar0234-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nv-hawk-owl-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nv-imx185-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nv-imx219-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nv-imx274-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nv-imx318-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nv-imx390-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nv-imx477-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nv-ov5693-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nvethernet-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nvgpu-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nvhost-capture-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nvhost-isp5-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nvhost-nvcsi-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nvhost-nvcsi-t194-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nvhost-nvdla-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nvhost-pva-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nvhost-vi-tpg-t19x-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nvhost-vi5-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nvhwpm-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nvidia-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nvidia-drm-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nvidia-modeset-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nvidia-vrs-pseq-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nvmap-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nvpmodel-clk-cap-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nvpps-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nvsciipc-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-nvvrs-pseq-rtc-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-pca9570-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-pinctrl-tegra194-pexclk-padctrl-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-pinctrl-tegra234-dpaux-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-pwm-tegra-tachometer-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-r8168-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-rtk-btusb-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-rtl8822ce-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-spi-tegra210-quad-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-tegra-aon-ivc-echo-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-tegra-bpmp-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-tegra-cactmon-mc-all-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-tegra-camera-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-tegra-camera-platform-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-tegra-camera-rtcpu-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-tegra-dce-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-tegra-drm-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-tegra-mce-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-tegra-se-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-tegra-se-nvrng-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-tegra-wmark-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-tegra234-aon-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-tegra234-oc-event-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-tegra23x-perf-uncore-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-tegra23x-psc-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-thermal-trip-event-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-tsecriscv-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-ufs-tegra-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-ufs-tegra-provision-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-virtual-i2c-mux-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
nv-kernel-module-watchdog-tegra-t18x-5.15.136-l4t-r36.3-1009.9+g46cdb595bebc - 36.3.0-r0.1.0
kekiefer commented 2 weeks ago

The problems loading the power-management pieces early on still seem likely to be the root cause here, and they lead to the later problems with the GPU. On my system the nvidia_modeset and nvidia_drm modules are loaded. It looks like you have them installed, so my only guess is that they fail to load because of that first problem.

root@jetson-agx-orin-devkit:~# lsmod
Module                  Size  Used by
bridge                266240  0
stp                    20480  1 bridge
llc                    20480  2 bridge,stp
usb_f_ecm              24576  2
usb_f_acm              16384  2
u_serial               20480  3 usb_f_acm
usb_f_rndis            32768  2
u_ether                28672  2 usb_f_rndis,usb_f_ecm
libcomposite           65536  14 usb_f_rndis,usb_f_ecm,usb_f_acm
rtk_btusb              77824  0
bluetooth             458752  22 rtk_btusb
ecdh_generic           16384  1 bluetooth
ecc                    36864  1 ecdh_generic
rtl8822ce            3362816  0
nvethernet           1179648  0
snd_hda_codec_hdmi     69632  1
mttcan                 69632  0
tegra_cactmon_mc_all    16384  0
can_dev                40960  1 mttcan
cfg80211              856064  1 rtl8822ce
tegra234_aon           57344  1
at24                   24576  0
nvpps                  32768  2 mttcan,nvethernet
rfkill                 36864  6 bluetooth,cfg80211
snd_hda_tegra          16384  0
snd_hda_codec         139264  2 snd_hda_codec_hdmi,snd_hda_tegra
snd_hda_core          102400  3 snd_hda_codec_hdmi,snd_hda_codec,snd_hda_tegra
host1x_fence           20480  0
pwm_tegra_tachometer    16384  0
spi_tegra114           28672  0
pwm_tegra              20480  1
mc_hwpm                16384  0
nvhost_vi5             20480  0
nvhost_isp5            16384  0
nvhost_nvcsi_t194      16384  0
nvvrs_pseq_rtc         16384  0
tegra_camera          245760  3 nvhost_isp5,nvhost_nvcsi_t194,nvhost_vi5
v4l2_dv_timings        36864  1 tegra_camera
v4l2_fwnode            20480  1 tegra_camera
v4l2_async             24576  2 v4l2_fwnode,tegra_camera
videobuf2_dma_contig    24576  1 tegra_camera
videobuf2_memops       20480  1 videobuf2_dma_contig
nvhost_nvcsi           24576  1 tegra_camera
lm90                   28672  0
i2c_nvvrs11            16384  0
nvidia_vrs_pseq        16384  0
snd_soc_tegra_machine_driver    16384  0
tegra_bpmp_thermal     16384  0
capture_ivc            28672  1 tegra_camera
snd_soc_tegra_utils    32768  1 snd_soc_tegra_machine_driver
snd_soc_simple_card_utils    28672  1 snd_soc_tegra_utils
tegra_camera_platform    24576  4 nvhost_isp5,nvhost_nvcsi_t194,tegra_camera,nvhost_vi5
tegra234_oc_event      16384  0
tegra23x_perf_uncore    24576  0
tegra_mce              28672  1 tegra23x_perf_uncore
nvpmodel_clk_cap       16384  0
thermal_trip_event     16384  0
tegra_camera_rtcpu    229376  2 capture_ivc,tegra_camera
ivc_bus                24576  2 capture_ivc,tegra_camera_rtcpu
hsp_mailbox_client     20480  2 ivc_bus,tegra_camera_rtcpu
ivc_ext                20480  2 ivc_bus,capture_ivc
videobuf2_v4l2         32768  1 tegra_camera
pwm_fan                20480  0
videobuf2_common       65536  4 videobuf2_dma_contig,videobuf2_v4l2,tegra_camera,videobuf2_memops
videodev              266240  4 v4l2_async,videobuf2_v4l2,tegra_camera,videobuf2_common
mc                     61440  4 videodev,videobuf2_v4l2,tegra_camera,videobuf2_common
tegra_se               57344  0
nvhost_pva            167936  0
crypto_engine          20480  1 tegra_se
nvhost_capture         20480  2 nvhost_isp5,nvhost_vi5
nvhost_nvdla          110592  0
nvidia_drm             90112  0
governor_userspace     16384  0
tegra_drm             372736  0
tegra_wmark            16384  0
nvhwpm                139264  4 mc_hwpm,tegra_drm,nvhost_nvdla,nvhost_pva
cec                    57344  1 tegra_drm
nvidia_modeset       1310720  1 nvidia_drm
nvidia               1626112  1 nvidia_modeset
tegra_dce             110592  2 nvidia
tsecriscv              32768  1 nvidia
drm_kms_helper        303104  2 tegra_drm,nvidia_drm
host1x_nvhost          40960  10 nvhost_isp5,nvhost_nvcsi_t194,nvidia,tegra_camera,nvhost_nvdla,nvhost_capture,nvhost_nvcsi,nvhost_pva,nvhost_vi5,nvidia_modeset
nvgpu                2793472  0
governor_pod_scaling    45056  0
nvmap                 237568  1 nvgpu
nvsciipc               24576  1 nvmap
host1x                208896  9 host1x_nvhost,host1x_fence,tegra_se,nvgpu,tegra_drm,nvhost_nvdla,nvidia_drm,nvhost_pva,nvidia_modeset
mc_utils               16384  3 nvidia,nvgpu,tegra_camera_platform
ina3221                24576  0
fuse                  139264  1
drm                   630784  5 drm_kms_helper,nvidia,tegra_drm,nvidia_drm
ipv6                  503808  55 bridge
nvme                   49152  0
nvme_core             106496  1 nvme
tegra_xudc             45056  0
ucsi_ccg               28672  0
typec_ucsi             36864  1 ucsi_ccg
typec                  61440  1 typec_ucsi
pcie_tegra194          40960  0
phy_tegra194_p2u       16384  13

For what it's worth, these are the git hashes I used to build the tegrademo demo-image-egl, where it is working on an orin devkit for me:

meta                 = "HEAD:0df57c2c739c09f6c128515e03f0c2c8758ef905"
meta-tegra           = "master:2c972e80d9715fd22022e1d95c8b4c192b7b1f7a"
meta-oe              
meta-python          
meta-networking      
meta-filesystems     = "HEAD:9363162b5147e2ecc21796047aefc7a10e0d999a"
meta-qt6             = "dev:bdc5526f0ea5fc79c05dc26ebb0d6ab4f42b484a"
meta-virtualization  = "HEAD:e96da98e4038f5388596b4294ac3d8425b2dacb2"
meta-tegra-community = "HEAD:84ef4249ae938c9065811e2c242655471dcc4bdf"
meta-tegra-support   
meta-demo-ci         
meta-tegrademo       
kraj commented 2 weeks ago

yoe-kiosk-browser (which is based on QtWebEngine) gets a SIGSEGV, and I was able to get a usable backtrace now.

(gdb) bt
#0  0x0000ffff7d43149c in ?? () from /usr/lib/gbm/tegra_gbm.so
#1  0x0000ffff7d431754 in ?? () from /usr/lib/gbm/tegra_gbm.so
#2  0x0000ffff7d452d8c in backend_create_device (bd=0xaaaab94e7f20, fd=5) at /usr/src/debug/mesa/24.0.7/src/gbm/main/backend.c:105
#3  load_backend (lib=0xaaaab94e8250, fd=fd@entry=5, name=0xaaaab94e3ea0 "tegra") at /usr/src/debug/mesa/24.0.7/src/gbm/main/backend.c:137
#4  0x0000ffff7d452ff0 [PAC] in backend_from_driver_name (fd=5) at /usr/src/debug/mesa/24.0.7/src/gbm/main/backend.c:211
#5  _gbm_create_device (fd=fd@entry=5) at /usr/src/debug/mesa/24.0.7/src/gbm/main/backend.c:226
#6  0x0000ffff7d4530d4 [PAC] in gbm_create_device (fd=5) at /usr/src/debug/mesa/24.0.7/src/gbm/main/gbm.c:138
#7  0x0000ffff7d510fa0 [PAC] in QEglFSKmsGbmDevice::open (this=0xaaaab94d9990) at /usr/src/debug/qtbase/6.7.3/src/plugins/platforms/eglfs/deviceintegration/eglfs_kms/qeglfskmsgbmdevice.cpp:40
#8  0x0000ffff7d48a02c [PAC] in QEglFSKmsIntegration::platformInit (this=0xaaaab94e3e80) at /usr/src/debug/qtbase/6.7.3/src/plugins/platforms/eglfs/deviceintegration/eglfs_kms_support/qeglfskmsintegration.cpp:37
#9  0x0000ffff7dfcc728 [PAC] in QEglFSIntegration::initialize (this=0xaaaab94d2e40) at /usr/src/debug/qtbase/6.7.3/src/plugins/platforms/eglfs/api/qeglfsintegration.cpp:87
#10 0x0000ffff81b9c938 [PAC] in QCoreApplicationPrivate::init (this=this@entry=0xaaaab94d34f0) at /usr/src/debug/qtbase/6.7.3/src/corelib/kernel/qcoreapplication.cpp:914
#11 0x0000ffff822ab68c [PAC] in QGuiApplicationPrivate::init (this=0xaaaab94d34f0) at /usr/src/debug/qtbase/6.7.3/src/gui/kernel/qguiapplication.cpp:1585
#12 0x0000ffff822acd9c [PAC] in QGuiApplication::QGuiApplication (this=this@entry=0xffffedfdd700, argc=@0xffffedfdd6ec: 1, argv=argv@entry=0xffffedfdd9e8) at /usr/src/debug/qtbase/6.7.3/src/gui/kernel/qguiapplication.h:172
#13 0x0000aaaab0f32718 [PAC] in main (argc=<optimized out>, argv=0xffffedfdd9e8) at /usr/src/debug/yoe-kiosk-browser/1.0.0+git/main.cpp:64
(gdb)
kekiefer commented 2 weeks ago

I really think you need to solve the earlier problems with setting up GPU power management in the kernel before diving into the details of the graphics stack.

kekiefer commented 2 weeks ago

One note though - without nvidia_modeset and nvidia_drm, you won't be able to load a graphics device with gbm.
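As a quick way to check that precondition, here is a minimal sketch of a helper (the name `missing_modules` is hypothetical) that reads `lsmod` output on stdin and prints any of the named modules that are not loaded. It assumes the standard `lsmod` layout shown in this thread: one header line, then the module name in the first column. Note that `lsmod` reports names with underscores (`nvidia_drm`), even though `modprobe nvidia-drm` also works.

```shell
#!/bin/sh
# missing_modules MOD...: reads `lsmod` output on stdin and prints every named
# module that does not appear in its first column. Returns 0 if none are missing.
missing_modules() {
    # Skip the "Module Size Used by" header and keep only the module names.
    loaded=$(awk 'NR > 1 { print $1 }')
    status=0
    for m in "$@"; do
        # -x: whole-line match, -F: treat the name as a fixed string
        if ! printf '%s\n' "$loaded" | grep -qxF "$m"; then
            echo "$m"
            status=1
        fi
    done
    return "$status"
}

# Usage on the target:
#   lsmod | missing_modules nvidia_modeset nvidia_drm || echo "load nvidia-drm first"
```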

kraj commented 2 weeks ago

Yeah, I was putting it here for reference, to see whether the code path for an eglfs-based image is still OK or whether it is picking up the wrong libraries, etc.

kraj commented 2 weeks ago

> One note though - without nvidia_modeset and nvidia_drm, you won't be able to load a graphics device with gbm.

Can you share your kernel .config so I can compare it with mine and look for differences?

kekiefer commented 2 weeks ago

Regarding nvidia_drm (from the out-of-tree (OOT) modules recipe): it looks like you have it installed, but it wasn't loaded in your lsmod output, despite nvidia-drm-loadconf being installed. No kernel module depends on nvidia_drm, so nothing pulls it in automatically. Does module autoloading work in your distro, or are you using an older version of this package that didn't install the modules-load.d entry? You can always load it manually with modprobe nvidia-drm modeset=1.
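For a quick test on a rootfs that lacks the package, the effect of that modules-load.d entry plus the `modeset=1` option can be approximated by hand. The sketch below is a hypothetical helper (`install_nvidia_drm_conf` is not part of meta-tegra); the file contents are an assumption about what such entries typically look like, not a copy of the nvidia-drm-loadconf recipe. It also assumes a systemd-based image, since `modules-load.d` is read by `systemd-modules-load.service`.

```shell
#!/bin/sh
# install_nvidia_drm_conf [ROOT]: hypothetical helper that writes a
# modules-load.d entry and a modprobe.d options line for nvidia-drm under ROOT
# (default: the running system). Contents are assumed, not taken from the recipe.
install_nvidia_drm_conf() {
    root=${1:-}
    mkdir -p "$root/etc/modules-load.d" "$root/etc/modprobe.d" || return 1
    # systemd-modules-load.service loads each module named here at boot
    printf '%s\n' nvidia-drm > "$root/etc/modules-load.d/nvidia-drm.conf"
    # modprobe applies these options whenever nvidia-drm is loaded
    printf '%s\n' 'options nvidia-drm modeset=1' > "$root/etc/modprobe.d/nvidia-drm.conf"
}

# Usage on the target (as root), then reboot:
#   install_nvidia_drm_conf
```

In a real build, installing the nvidia-drm-loadconf package in the image is the cleaner fix; this only helps confirm whether autoloading is the missing piece.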

linux-jammy-nvidia-tegra-dot-config.txt

kraj commented 2 weeks ago

Hmm, the nvidia-drm-loadconf ipk is missing from the rootfs, but I have now built it separately for testing and installed it. Now:

root@jetson-agx-orin-devkit:~# opkg files nvidia-drm-loadconf
Package nvidia-drm-loadconf (1.0-r0.7) is installed on root and has the following files:
/etc/modules-load.d/nvidia-drm.conf
/etc/modprobe.d/nvidia-drm.conf
/etc
/etc/modules-load.d
/etc/modprobe.d

After a reboot, I do see these modules:

root@jetson-agx-orin-devkit:~# lsmod | grep nvidia_[dm+]
nvidia_drm             90112  0
nvidia_modeset       1310720  1 nvidia_drm
nvidia               1626112  1 nvidia_modeset
drm_kms_helper        303104  2 tegra_drm,nvidia_drm
host1x_nvhost          40960  10 nvhost_isp5,nvhost_nvcsi_t194,nvidia,tegra_camera,nvhost_nvdla,nvhost_capture,nvhost_nvcsi,nvhost_pva,nvhost_vi5,nvidia_modeset
host1x                208896  9 host1x_nvhost,host1x_fence,tegra_se,nvgpu,tegra_drm,nvhost_nvdla,nvidia_drm,nvhost_pva,nvidia_modeset
drm                   630784  5 drm_kms_helper,nvidia,tegra_drm,nvidia_drm

The SEGV seen earlier still occurs.

Comparing the two .config files, there are no differences that would matter:

❯ diff .config /tmp/linux-jammy-nvidia-tegra-dot-config.txt -u
--- .config     2024-07-11 23:20:19.665470855 -0700
+++ /tmp/linux-jammy-nvidia-tegra-dot-config.txt        2024-07-12 10:42:05.133743503 -0700
@@ -2,7 +2,7 @@
 # Automatically generated file; DO NOT EDIT.
 # Linux/arm64 5.15.136 Kernel Configuration
 #
-CONFIG_CC_VERSION_TEXT="aarch64-yoe-linux-gcc (GCC) 14.1.0"
+CONFIG_CC_VERSION_TEXT="aarch64-oe4t-linux-gcc (GCC) 14.1.0"
 CONFIG_CC_IS_GCC=y
 CONFIG_GCC_VERSION=140100
 CONFIG_CLANG_VERSION=0
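A raw `diff -u` also reports the toolchain banner and any ordering differences. A minimal sketch of an option-level comparison is below; the helper names are hypothetical, and the only line it deliberately ignores is `CONFIG_CC_VERSION_TEXT` (the one that differs above). The kernel tree's own `scripts/diffconfig old new` is an alternative that also summarizes changed options.

```shell
#!/bin/sh
# normalize_config FILE: keep only option lines (set, or "is not set"),
# drop the compiler banner, and sort so line ordering does not matter.
normalize_config() {
    grep -E '^CONFIG_|^# CONFIG_[A-Za-z0-9_]+ is not set$' "$1" \
        | grep -v '^CONFIG_CC_VERSION_TEXT=' \
        | sort
}

# config_diff A B: prints option-level differences; exit status 0 means none.
config_diff() {
    a=$(mktemp) && b=$(mktemp) || return 2
    normalize_config "$1" > "$a"
    normalize_config "$2" > "$b"
    diff -u "$a" "$b"
    rc=$?
    rm -f "$a" "$b"
    return "$rc"
}

# Usage:
#   config_diff .config /tmp/linux-jammy-nvidia-tegra-dot-config.txt
```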
kraj commented 2 weeks ago

Here are my dmesg logs: dmesg.txt

The nvgpu messages seem to be the key, since they appear both with yoe-kiosk-browser and with the nvpmodel service.

kekiefer commented 2 weeks ago

Here are a journal and dmesg from a run where I interactively log in and run kmscube. The kernel logs look substantially the same on quick review, up until the errors, so maybe there are some clues in the journal? Does it make a difference if you delay starting yoe-kiosk-browser until much later?

journal.txt dmesg.txt

madisongh commented 2 weeks ago

From the dmesg log, it looks like you have a 64GiB module (P3701-0005) in there, rather than the 32GiB one (P3701-0000). You'll need to use MACHINE = "p3737-0000-p3701-0005" for that hardware.

kraj commented 2 weeks ago

> From the dmesg log, it looks like you have a 64GiB module (P3701-0005) in there, rather than the 32GiB one (P3701-0000). You'll need to use MACHINE = "p3737-0000-p3701-0005" for that hardware.

Ha! That could be the root of it all. I must say the machine names are a bit confusing, and I got tripped up. It would be good if there were some way to name them so they are more revealing. Are the SKU numbers readable from the machine in some form, via an NVRAM read or similar?

kraj commented 2 weeks ago

> From the dmesg log, it looks like you have a 64GiB module (P3701-0005) in there, rather than the 32GiB one (P3701-0000). You'll need to use MACHINE = "p3737-0000-p3701-0005" for that hardware.

Thanks a lot @madisongh, this really helped and nailed the problem. A second, minor issue was that I have to use /dev/dri/card1 instead of /dev/dri/card0, which is yoe-kiosk-browser's default. Now I can launch the browser on an EGL surface. On to some OpenCV work and testing CUDA acceleration.
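The backtrace above shows the browser going through Qt's eglfs_kms backend (`QEglFSKmsGbmDevice`). That backend can read a JSON file named by the `QT_QPA_EGLFS_KMS_CONFIG` environment variable, whose `"device"` key selects the DRM node. A minimal sketch for pointing it at `card1` is below; the helper name and the config path are hypothetical, and whether yoe-kiosk-browser or its service unit overrides this setting is an assumption we have not checked.

```shell
#!/bin/sh
# write_kms_config FILE DEVICE: writes a minimal eglfs_kms configuration that
# tells Qt which DRM card node to open.
write_kms_config() {
    cat > "$1" <<EOF
{
  "device": "$2"
}
EOF
}

# Usage (path is only an example):
#   write_kms_config /etc/eglfs-kms.json /dev/dri/card1
#   export QT_QPA_EGLFS_KMS_CONFIG=/etc/eglfs-kms.json
```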

madisongh commented 2 weeks ago

> Are the SKU numbers in some form readable from machine via some NVRAM read etc ?

The full part number is stored in an EEPROM on the module. The setup-nv-boot-control recipe installs a script for reading that info and programming a couple of EFI variables from it. It's also read by the flashing tools/scripts.

> I must say the machine names are a bit confusing and I got tripped.

Yep, that's a problem, and it's worse now than with earlier L4T versions due to device trees being different between SKUs in the same family. It's less of a problem for NVIDIA, since everything's pre-built, and their flashing scripts read the module info before constructing the rootfs, so they can get away with using the same config name for all of the variants. That's harder for us, since we have to know some of these differences at build time.

Still, I think there's something we could do to at least catch these mismatches earlier in the process.
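One possible shape for such a check is sketched below. It assumes the module part number has already been obtained (on the device, the setup-nv-boot-control script reads it from the module EEPROM; how to extract it is not shown here). The helper names are hypothetical, and the lookup table contains only the two AGX Orin entries discussed in this thread, so it is illustrative rather than a complete mapping.

```shell
#!/bin/sh
# machine_for_module MODULE: hypothetical lookup from module part number to
# MACHINE. Only the two AGX Orin modules discussed in this thread are listed.
machine_for_module() {
    case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
        p3701-0000) echo "jetson-agx-orin-devkit" ;;   # 32GiB module
        p3701-0005) echo "p3737-0000-p3701-0005" ;;    # 64GiB module
        *) return 1 ;;
    esac
}

# check_machine MODULE MACHINE: returns 1 with a message on a mismatch,
# 2 if the module is not in the table, 0 if they match.
check_machine() {
    expected=$(machine_for_module "$1") || {
        echo "unknown module $1" >&2
        return 2
    }
    if [ "$expected" != "$2" ]; then
        echo "MACHINE=$2 does not match module $1 (expected $expected)" >&2
        return 1
    fi
}

# Usage:
#   check_machine P3701-0005 jetson-agx-orin-devkit   # reports the mismatch
```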