Sculpt 23.04 consolidation

m-stein commented 1 year ago

This may serve as a top level issue for all issues that arise while testing preliminary Sculpt 23.04 on staging.

m-stein commented 1 year ago

I just ran into a problem with the Wifi driver. When I'm on my native Xubuntu 20.04.6 on my Lenovo T490 and then switch to Sculpt 23.05 (e8d40b11d390d458e967beba7217b371ac2c527f) and turn on Wifi in the network dialog, the driver starts but ends up in this:

log_2023-04-20.txt

However, if I reboot directly into Sculpt after that, the problem disappears completely. I've tried the transitions Linux -> Sculpt, Sculpt -> Sculpt multiple times and, on each run, restarted Wifi multiple times as well. The result was always the same and restarting the driver didn't change it neither.

nfeske commented 1 year ago

I'm just leaving a note that I briefly mentioned to @alex-ab yesterday.

When booting Sculpt on my x201, the internal display is very dark. The brightness as visible in the connectors report is "1". When manually changing the fb_drv config, I could change the brightness just fine. But the default value should better somewhere around 70%.

chelmuth commented 1 year ago

I'm still investigating measures to fix the network TX performance regression in VirtualBox. The cause of the regression is a faulty implementation of GSO segment handling in our vbox network code. Disabling GSO prevents the regression.

ssumpf commented 1 year ago

While playing around with the new Sculpt version, I observed that output in the log terminal/report log was sometimes delayed significantly. 50d12af fixes this behavior (thanks to @nfeske).

nfeske commented 1 year ago

Thanks @ssumpf! I merged your fix to staging.

ssumpf commented 1 year ago

With the recent changes in gpu/intel and libdrm (16MB chunk allocations), the RAM quota for the gears demo became insufficient (3042fc5).

nfeske commented 1 year ago

Merged the gears-quota fix to staging.

chelmuth commented 1 year ago

Disabling GSO prevents the regression.

Commits f372b58216ed4f4a3424df173b9d434621a3b872 and d7bafb048002f3d4b68bd1dd08fcb68ed26f2a8c (on staging) prevent the regression (see also #4821).

cproc commented 1 year ago

I built new Falkon packages (currently in cproc_test depot) and according to the log output the browser starts, but no window appears. The Qt5 textedit component seems to have a similar problem.

ssumpf commented 1 year ago

I built new Falkon packages (currently in cproc_test depot) and according to the log output the browser starts, but no window appears. The Qt5 textedit component seems to have a similar problem.

@cproc: Make sure to have 6f3a098 applied, otherwise you might not see (or see very late) a possible resource request. Same happened to us with mesa_gears.

m-stein commented 1 year ago

Regarding wifi on T490: I tried 1145740706799b156452a6fda8922a26f332c9b7 by @cnuke but it didn't change the results. However, we found that when doing the transition Linux -> Sculpt hard by power button or by shutting down and cutting power, Sculpt wifi worked right away. In my initial tests I always did a warm reboot or shutdown with power plugged.

jschlatow commented 1 year ago

@m-stein I've never tried this with wifi but I've had the same effect (soft restart from Linux -> Sculpt breaks driver) with the nic driver on the T490s ever since.

alex-ab commented 1 year ago

I'm just leaving a note that I briefly mentioned to @alex-ab yesterday.

When booting Sculpt on my x201, the internal display is very dark. The brightness as visible in the connectors report is "1". When manually changing the fb_drv config, I could change the brightness just fine. But the default value should better somewhere around 70%.

Commit a03fc38e80c61a1fc5cdeafb3a1357e8f646bd3e sets the default brightness to 70%.

chelmuth commented 1 year ago

I'm just leaving a note that I briefly mentioned to @alex-ab yesterday. When booting Sculpt on my x201, the internal display is very dark. The brightness as visible in the connectors report is "1". When manually changing the fb_drv config, I could change the brightness just fine. But the default value should better somewhere around 70%.

Commit a03fc38 sets the default brightness to 70%.

Thanks, merged to staging.

m-stein commented 1 year ago

@jschlatow Sounds related, probably a platform driver issue. Nonetheless, with T490, I had no problems with the NIC driver so far.

m-stein commented 1 year ago

I just started nfeske/mesa_gears, nfeske_test/nano3d, and ssumpf/glmark2 at the same time. In the beginning, all worked fine with mesa_gears showing ~1K fps. But after a while the output of mesa_gears and glmark2 froze and the intel gpu driver ended up in a fault. I removed each of the three components and tried to start only one of glmark2/mesa_gears. The window shows up but it remains black and causes a similar fault in the intel gpu driver again. Restarting mesa_gpu-intel and then starting a single client component gives the same result

This is the log (the interesting part is at the end; it shows two driver faults, one for the initial 3-component attempt and a second for the single-component attempt):

log_2023-04-21.txt

cproc commented 1 year ago

I cleaned my local depot and version hash changes and rebuilt everything and the Qt applications are working now.

nfeske commented 1 year ago

On the 12th Gen Framewo Laptop, I encountered two problems.

The Laptop display stays dark. Here is the log gathered via report_dump:

[drivers -> dynamic -> intel_gpu_drv] MGGC_0_2_0_PCI
[drivers -> dynamic -> intel_gpu_drv]   Graphics_mode_select:               0x0
[drivers -> dynamic -> intel_gpu_drv]   Gtt_graphics_memory_size:           0x3
[drivers -> dynamic -> intel_gpu_drv]   Versatile_acceleration_mode_enable: 0x0
[drivers -> dynamic -> intel_gpu_drv]   Igd_vga_disable:                    0x0
[drivers -> dynamic -> intel_gpu_drv]   Ggc_lock:                           0x1
...
[drivers -> dynamic -> intel_gpu_drv] [34mWarning: No supported Intel GPU detected - no GPU service
[drivers -> dynamic -> intel_fb_drv] --- Intel framebuffer driver started ---
[drivers -> dynamic -> intel_fb_drv] NR_IRQS: 4352, nr_irqs: 4352, preallocated irqs: 256
[drivers -> dynamic -> intel_fb_drv] clocksource: dde_counter: mask: 0xffffffffffffff max_cycles: 0x1d854df40, max_idle_ns: 3526361616960 ns
[drivers -> dynamic -> intel_fb_drv] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[drivers -> dynamic -> intel_fb_drv] clocksource: Switched to clocksource dde_counter
[drivers -> dynamic -> intel_fb_drv] i915 00:02.0: [drm] Incompatible option enable_guc=3 - GuC submission is N/A
[drivers -> dynamic -> intel_fb_drv] disabling PPGTT to avoid GPU code paths
[drivers -> dynamic -> intel_fb_drv] i915 00:02.0: [drm] VT-d active for gfx access
[drivers -> dynamic -> intel_fb_drv] [31mError: could not read I/O port resource 0x3cc
[drivers -> dynamic -> intel_fb_drv] [31mError: could not write I/O port resource 0x3c2
[drivers -> dynamic -> intel_fb_drv] i915 00:02.0: [drm] Failed to load DMC firmware i915/adlp_dmc_ver2_16.bin. Disabling runtime power management.
...
[drivers -> dynamic -> intel_fb_drv] i915 00:02.0: [drm] DMC firmware homepage: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915[drm] Initialized i915 1.6.0 20201103 for 00:02.0 on minor 0
[drivers -> dynamic -> intel_fb_drv] sched_clock: Marking stable (1217876000, 7429000)->(1229630000, -4325000)
[drivers -> dynamic -> intel_fb_drv] i915 00:02.0: [drm] Cannot find any crtc or sizes
[drivers -> dynamic -> intel_fb_drv] i915 00:02.0: [drm] Cannot find any crtc or sizes

The log also shows that the Wifi driver fails:

[runtime -> wifi_drv] NR_IRQS: 4352, nr_irqs: 4352, preallocated irqs: 256
[runtime -> wifi_drv] clocksource: dde_counter: mask: 0xffffffffffffff max_cycles: 0x1d854df40, max_idle_ns: 3526361616960 ns
[runtime -> wifi_drv] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[runtime -> wifi_drv] NET: Registered PF_NETLINK/PF_ROUTE protocol family
[runtime -> wifi_drv] clocksource: Switched to clocksource dde_counter
[runtime -> wifi_drv] Intel(R) Wireless WiFi driver for Linux
[runtime -> wifi_drv] iwlwifi 00:14.3: pci_enable_msi failed - -38
[runtime -> wifi_drv] task_work_add: task: 0000000017404e79 work: 00000000ba035a3b notify: 0
[runtime -> wifi_drv] iwlwifi 00:14.3: api flags index 2 larger than supported by driver
[runtime -> wifi_drv] iwlwifi 00:14.3: TLV_FW_FSEQ_VERSION: FSEQ Version: 0.0.2.36
[runtime -> wifi_drv] iwlwifi 00:14.3: loaded firmware version 68.01d30b0c.0 so-a0-hr-b0-72.ucode op_mode iwlmvm
[runtime -> wifi_drv] iwlwifi 00:14.3: Detected Intel(R) Wi-Fi 6 AX201 160MHz, REV=0x370
[runtime -> wifi_drv] iwlwifi 00:14.3: Detected RF HR B3, rfid=0x10a100
[runtime -> wifi_drv] iwlwifi 00:14.3: base HW address: 2c:db:07:e8:5a:35
[runtime -> wifi_drv] iwlwifi 00:14.3: firmware didn't ACK the reset - continue anyway
[runtime -> wifi_drv] iwlwifi 00:14.3: Start IWL Error Log Dump:
[runtime -> wifi_drv] iwlwifi 00:14.3: Transport status: 0x0000004A, valid: 6
[runtime -> wifi_drv] iwlwifi 00:14.3: Loaded firmware version: 68.01d30b0c.0 so-a0-hr-b0-72.ucode
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000084 | NMI_INTERRUPT_UNKNOWN       
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00A082F0 | trm_hw_status0
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000000 | trm_hw_status1
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x004D89A2 | branchlink2
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x004CF022 | interruptlink1
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x004CF022 | interruptlink2
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00015142 | data1
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x01000000 | data2
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000000 | data3
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000000 | beacon time
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x0001F37D | tsf low
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000000 | tsf hi
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000000 | time gp1
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x0002ACE7 | time gp2
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000001 | uCode revision type
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000044 | uCode version major
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x01D30B0C | uCode version minor
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000370 | hw version
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00489002 | board version
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x802FFF00 | hcmd
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x20020000 | isr0
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000000 | isr1
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x48F00002 | isr2
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00C0000C | isr3
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000000 | isr4
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000000 | last cmd Id
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00015142 | wait_event
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000000 | l2p_control
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000000 | l2p_duration
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000000 | l2p_mhvalid
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000000 | l2p_addr_match
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000009 | lmpm_pmg_sel
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000000 | timestamp
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000024 | flow_handler
[runtime -> wifi_drv] iwlwifi 00:14.3: Start IWL Error Log Dump:
[runtime -> wifi_drv] iwlwifi 00:14.3: Transport status: 0x0000004A, valid: 7
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x20000074 | ADVANCED_SYSASSERT
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000000 | umac branchlink1
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x8045D964 | umac branchlink2
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x8047E4BE | umac interruptlink1
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x80479176 | umac interruptlink2
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x01000000 | umac data1
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x80479176 | umac data2
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000000 | umac data3
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000044 | umac major
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x01D30B0C | umac minor
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x0002AF59 | frame pointer
[runtime -> wifi_drv] iwlwifi 00:14.3: 0xC0886244 | stack pointer
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00040F07 | last host cmd
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000400 | isr status reg
[runtime -> wifi_drv] iwlwifi 00:14.3: IML/ROM dump:
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000B03 | IML/ROM error/state
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x0000500F | IML/ROM data1
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000080 | IML/ROM WFPM_AUTH_KEY_0
[runtime -> wifi_drv] iwlwifi 00:14.3: Fseq Registers:
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x20000000 | FSEQ_ERROR_CODE
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x80350002 | FSEQ_TOP_INIT_VERSION
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00150000 | FSEQ_CNVIO_INIT_VERSION
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x0000A482 | FSEQ_OTP_VERSION
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00000003 | FSEQ_TOP_CONTENT_VERSION
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x4552414E | FSEQ_ALIVE_TOKEN
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00080400 | FSEQ_CNVI_ID
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x01300504 | FSEQ_CNVR_ID
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x00080400 | CNVI_AUX_MISC_CHIP
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x01300504 | CNVR_AUX_MISC_CHIP
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x05B0905B | CNVR_SCU_SD_REGS_SD_REG_DIG_DCDC_VTRIM
[runtime -> wifi_drv] iwlwifi 00:14.3: 0x0000025B | CNVR_SCU_SD_REGS_SD_REG_ACTIVE_VDIG_MIRROR
[runtime -> wifi_drv] iwlwifi 00:14.3: WRT: Collecting data: ini trigger 4 fired (delay=0ms).
[runtime -> wifi_drv] iwlwifi 00:14.3: WRT: Failed to dump region: id=19, type=16
[runtime -> wifi_drv] NET: Registered PF_PACKET protocol family
[runtime -> wifi_drv] sched_clock: Marking stable (944961000, 13001000)->(963003000, -5041000)
[runtime -> wifi_drv] reg: Loading compiled-in X.509 certificates for regulatory database
[runtime -> wifi_drv] reg: loaded regulatory.db is malformed or signature is missing/invalid

nfeske commented 1 year ago

BTW, in order to boot the Framework laptop, I had to disable USB bios handoff (https://github.com/nfeske/genode/commit/06c9bd7c87eb020cfb750dd169d43e47a36822ea) but that's a known issue.

ssumpf commented 1 year ago

I just started nfeske/mesa_gears, nfeske_test/nano3d, and ssumpf/glmark2 at the same time. In the beginning, all worked fine with mesa_gears showing ~1K fps. But after a while the output of mesa_gears and glmark2 froze and the intel gpu driver ended up in a fault. I removed each of the three components and tried to start only one of glmark2/mesa_gears. The window shows up but it remains black and causes a similar fault in the intel gpu driver again. Restarting mesa_gpu-intel and then starting a single client component gives the same result

This is the log (the interesting part is at the end; it shows two driver faults, one for the initial 3-component attempt and a second for the single-component attempt):

log_2023-04-21.txt

@m-stein: I can reproduce this on my Sculpt system (GPU Gen9). On Fuji5 (GPU Gen12) it works fine. So I will have to convert my laptop into a test machine and the test machine into the machine I use for work. @alex-ab: In the fault case there are no DMAR faults to be observed here. So this might be a different issue.

ssumpf commented 1 year ago

@nfeske: In your log I see that the GPU is not detected. Since you have Alderlake with a GEN12 GPU would you mind testing the GPU driver by adding your card to repos/gems/sculpt/gpu_drv/intel once the framebuffer is working?

nfeske commented 1 year ago

The Laptop display [... on the Framework laptop...] stays dark.

Thanks to @alex-ab for the offline hint about the max_width/height values. The Framework has a vertical resolution of 1504 pixels, which is higher than the currently configured maximum of 1440. with commit 944867c6aa4f19e84ac11537a9653e5aa83e782a, the display works. :-)

alex-ab commented 1 year ago

@m-stein: I can reproduce this on my Sculpt system (GPU Gen9). On Fuji5 (GPU Gen12) it works fine. So I will have to convert my laptop into a test machine and the test machine into the machine I use for work. @alex-ab: In the fault case there are no DMAR faults to be observed here. So this might be a different issue.

@ssumpf: I will try to figure out the reasons of the DMA faults on my working machine.

jschlatow commented 1 year ago

@m-stein I've noticed that the file vault is not working properly anymore. I tried creating a new vault which fails with the following log messages:

[runtime -> file_vault -> file_vault] [31mError: failed to deliver Report session[0m
[runtime -> file_vault -> file_vault] [31mError: failed to deliver Report session[0m
[runtime -> file_vault -> file_vault] child "menu_view" requests resources: ram_quota=151628[0m
[runtime -> file_vault -> file_vault] child "truncate_file" exited with exit value 0[0m
[runtime -> file_vault -> file_vault] child "cbe_init_trust_anchor" exited with exit value 0[0m
[runtime -> file_vault -> file_vault] child "cbe_init" exited with exit value 0[0m
[runtime -> file_vault -> file_vault] child "sync_to_cbe_vfs_init" exited with exit value 0[0m
[runtime -> file_vault -> file_vault -> mke2fs] mke2fs 1.46.5 (30-Dec-2021)
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: Could not read block(s)[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: error while reading block:0 from block device[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: Could not read block(s)[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: error while reading block:256 from block device[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: Could not read block(s)[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: error while reading block:256 from block device[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: Could not read block(s)[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: error while reading block:264 from block device[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: Could not read block(s)[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: error while reading block:264 from block device[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: Could not read block(s)[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: error while reading block:272 from block device[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: Could not read block(s)[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: error while reading block:272 from block device[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: Could not read block(s)[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: error while reading block:768 from block device[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: Could not read block(s)[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: error while reading block:768 from block device[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: Could not read block(s)[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: error while reading block:776 from block device[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: Could not read block(s)[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: error while reading block:776 from block device[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: Could not read block(s)[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: error while reading block:784 from block device[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: Could not read block(s)[0m
[runtime -> file_vault -> file_vault -> mke2fs] [31mError: error while reading block:784 from block device[0m
[runtime -> file_vault -> file_vault -> mke2fs] /dev/block: Not enough space to build proposed filesystem while setting up superblock
[runtime -> file_vault -> file_vault] child "mke2fs" exited with exit value 1[0m
[runtime -> file_vault -> file_vault] [31mError: Uncaught exception of type 'File_vault::Main::_child_succeeded(Genode::Xml_node const&, File_vault::Child_state const&)::Child_exited_with_error'[0m
[runtime -> file_vault -> file_vault] [34mWarning: abort called - thread: ep[0m

Do you have an idea what might be wrong here?

skalk commented 1 year ago

@m-stein @jschlatow I've a similar problem using the file_vault, but in contrast on existing data, not a new vault:

[runtime -> file_vault -> file_vault] Error: failed to deliver Report session
[runtime -> file_vault -> file_vault] Error: failed to deliver Report session
[runtime -> file_vault -> file_vault] child "cbe_init_trust_anchor" exited with exit value 0
[runtime -> file_vault -> file_vault] child "sync_to_cbe_vfs_init" exited with exit value 0
[runtime -> file_vault -> file_vault -> client_fs_fs_query] Error: failed to watch '/'
[runtime -> file_vault -> file_vault -> client_fs_fs_query] Error: failed to watch '/data'
[runtime -> file_vault -> file_vault] child "rump_vfs" requests resources: ram_quota=4198400
[runtime -> file_vault -> file_vault -> rump_vfs] rump: /genode: file system not clean; please fsck(8)
[runtime -> file_vault -> file_vault -> rump_vfs] rump: panic: mount: lost mount
[runtime -> file_vault -> file_vault -> rump_vfs] rump: rump kernel halting...
[runtime -> file_vault -> file_vault -> rump_vfs] rump: halted
[runtime -> file_vault -> file_vault -> rump_vfs] Warning: rumpuser_sp_fini called (from 0x109b103) not implemented
[runtime -> file_vault -> file_vault -> rump_vfs] Error: Rump panic
[runtime -> file_vault -> file_vault] child "rump_vfs" exited with exit value -1

m-stein commented 1 year ago

@skalk @jschlatow Thanks for the detailed reports! I'm trying to bisect the problem.

m-stein commented 1 year ago

Regarding the File Vault problems: The initialization of a new file vault breaks with cc9368ccb45e7f9c972898d00c89bb7b5daa3b6a.

cnuke commented 1 year ago

@m-stein: I can reproduce this on my Sculpt system (GPU Gen9). On Fuji5 (GPU Gen12) it works fine. So I will have to convert my laptop into a test machine and the test machine into the machine I use for work. @alex-ab: In the fault case there are no DMAR faults to be observed here. So this might be a different issue.

@ssumpf FWIW, Doom3 leads to DMAR faults on my T470p:

[…]
^M[core] [init -> runtime -> doom3] WARNING: Couldn't load image: models/weapons/plasmagun/pl_can_s
^M[core] [init -> runtime -> doom3] WARNING: Couldn't load image: models/weapons/plasmagun/pl_can_blue
^M[kernel] [ 0] IOMMU:0xffffffff817ec060 FRR:0 FR:0x7 BDF:0:2:0 FI:0x4bf0e9000 (0)
^M[kernel] [ 0] IOMMU:0xffffffff817ec060 FRR:0 FR:0x7 BDF:0:2:0 FI:0x4bf0f8000 (0)
^M[kernel] [ 0] IOMMU:0xffffffff817ec060 FRR:0 FR:0x7 BDF:0:2:0 FI:0x4bf107000 (0)
^M[kernel] [ 0] IOMMU:0xffffffff817ec060 FRR:0 FR:0x7 BDF:0:2:0 FI:0x4bf115000 (1)
^M[kernel] [ 0] IOMMU:0xffffffff817ec060 FRR:0 FR:0x7 BDF:0:2:0 FI:0x4bf122000 (2)
^M[kernel] [ 0] IOMMU:0xffffffff817ec060 FRR:0 FR:0x7 BDF:0:2:0 FI:0x4bf12f000 (0)
^M[kernel] [ 0] IOMMU:0xffffffff817ec060 FRR:0 FR:0x7 BDF:0:2:0 FI:0x4bf13c000 (0)
^M[kernel] [ 0] IOMMU:0xffffffff817ec060 FRR:0 FR:0x7 BDF:0:2:0 FI:0x4bf14a000 (0)
^M[kernel] [ 0] IOMMU:0xffffffff817ec060 FRR:0 FR:0x7 BDF:0:2:0 FI:0x4bf157000 (0)
^M[kernel] [ 0] IOMMU:0xffffffff817ec060 - disabling fault reporting
^M[core] [init -> runtime -> doom3]     0 purged from previous
^M[core] [init -> runtime -> doom3]   159 kept from previous
^M[core] [init -> runtime -> doom3]  1957 new loaded
^M[core] [init -> runtime -> doom3] all images loaded in  13.8 seconds
^M[core] [init -> runtime -> doom3] ----- idSoundCache::EndLevelLoad -----
^M[core] [init -> runtime -> doom3] 13353k referenced
^M[core] [init -> runtime -> doom3]   874k purged
^M[core] [init -> runtime -> doom3] sound: missing efxs/demo_mars_city1.efx
^M[core] [init -> runtime -> doom3]  29472 msec to load game/demo_mars_city1
^M[core] [init -> runtime -> doom3] idRenderWorld::GenerateAllInteractions, msec = 9, staticAllocCount = 0.
^M[core] [init -> runtime -> doom3] interactionTable size: 5639928 bytes
^M[core] [init -> runtime -> doom3] 29864 interaction take 4539328 bytes
^M[core] [init -> runtime -> doom3] ----- Warnings -----
^M[core] [init -> runtime -> doom3] during game/demo_mars_city1...
^M[core] [init -> runtime -> doom3] WARNING: Couldn't load image: makealpha( models/weapons/bfg/bfg_wallmark_dirt)
^M[core] [init -> runtime -> doom3] WARNING: Couldn't load image: models/weapons/bfg/bfg_wallmark
^M[core] [init -> runtime -> doom3] WARNING: Couldn't load image: models/weapons/bfg/bfg_world
^M[core] [init -> runtime -> doom3] WARNING: Couldn't load image: models/weapons/bfg/bfg_world_local
^M[core] [init -> runtime -> doom3] WARNING: Couldn't load image: models/weapons/bfg/bfg_world_s
^M[core] [init -> runtime -> doom3] WARNING: Couldn't load image: models/weapons/plasmagun/pl_can_blue
^M[core] [init -> runtime -> doom3] WARNING: Couldn't load image: models/weapons/plasmagun/pl_can_d
^M[core] [init -> runtime -> doom3] WARNING: Couldn't load image: models/weapons/plasmagun/pl_can_local
^M[core] [init -> runtime -> doom3] WARNING: Couldn't load image: models/weapons/plasmagun/pl_can_s
^M[core] [init -> runtime -> doom3] WARNING: Couldn't load sound 'guisounds.wav' using default
^M[core] [init -> runtime -> doom3] WARNING: file def/weapon_bfg.def, line 102: Unknown entityDef 'damage_bfgSplash' inherited by 'damage_bfgsplash_cinematic'
^M[core] [init -> runtime -> doom3] WARNING: file def/weapon_bfg.def, line 92: Unknown entityDef 'damage_bfg' inherited by 'damage_bfg_cinematic'
^M[core] [init -> runtime -> doom3] WARNING: file def/weapon_bfg.def, line 97: Unknown entityDef 'damage_bfgFreq' inherited by 'damage_bfgfreq_cinematic'
^M[core] [init -> runtime -> doom3] WARNING: Unknown classname 'weapon_plasmagun' on 'weapon_plasmagun_1'.
^M[core] [init -> runtime -> doom3] WARNING: Unknown string id #str_07183
^M[core] [init -> runtime -> doom3] 15 warnings
^M[core] [init -> runtime -> doom3] WARNING: Failed to open save file 'savegames/Autosave_game_demo_mars_city1.save'
^M[core] [init -> runtime -> doom3] 
^M[core] 8 0x00000000            
^M[core] [init -> drivers -> dynamic -> intel_gpu_drv] 0x0000000000003fcc 0x00000000            
^M[core] [init -> drivers -> dynamic -> intel_gpu_drv] 0x0000000000003fd0 0x00000000            
^M[core] [init -> drivers -> dynamic -> intel_gpu_drv] 0x0000000000003fd4 0x00000000            
^M[core] [init -> drivers -> dynamic -> intel_gpu_drv] 0x0000000000003fd8 0x00000000            
^M[core] [init -> drivers -> dynamic -> intel_gpu_drv] 0x0000000000003fdc 0x00000000            
^M[core] [init -> drivers -> dynamic -> intel_gpu_drv] 0x0000000000003fe0 0x00000000            
^M[core] [init -> drivers -> dynamic -> intel_gpu_drv] 0x0000000000003fe4 0x00000000            
^M[core] [init -> drivers -> dynamic -> intel_gpu_drv] 0x0000000000003fe8 0x00000000            
^M[core] [init -> drivers -> dynamic -> intel_gpu_drv] 0x0000000000003fec 0x00000000            
^M[core] [init -> drivers -> dynamic -> intel_gpu_drv] 0x0000000000003ff0 0x00000000            
^M[core] [init -> drivers -> dynamic -> intel_gpu_drv] 0x0000000000003ff4 0x00000000            
^M[core] [init -> drivers -> dynamic -> intel_gpu_drv] 0x0000000000003ff8 0x00000000            
^M[core] [init -> drivers -> dynamic -> intel_gpu_drv] 0x0000000000003ffc 0x00000000            
^M[core] [init -> drivers -> dynamic -> intel_gpu_drv] Error: could not stop engine
^M[core] [init -> drivers -> dynamic -> intel_gpu_drv] Warning: engine 0 reset failed
^M[core] [init -> drivers -> dynamic -> intel_gpu_drv] Warning: cannot reset device, engines not ready
[…]

Due to the logging nature it appears that some log entries are overwritten (basically the rest of the gpu multiplexer dump).

nfeske commented 1 year ago

The log also shows that the Wifi driver [... on the Framework laptop ...] fails:

The log messages turned out to be non-critical. I'm posting this from Sculpt 23.04 on the Framework laptop. :-)

Since you have Alderlake with a GEN12 GPU would you mind testing the GPU driver by adding your card to repos/gems/sculpt/gpu_drv/intel once the framebuffer is working?

@ssumpf, I added the gpu_drv entry. The GPU driver looks happy now.

[drivers -> dynamic -> intel_gpu_drv] MGGC_0_2_0_PCI
[drivers -> dynamic -> intel_gpu_drv]   Graphics_mode_select:               0x0
[drivers -> dynamic -> intel_gpu_drv]   Gtt_graphics_memory_size:           0x3
[drivers -> dynamic -> intel_gpu_drv]   Versatile_acceleration_mode_enable: 0x0
[drivers -> dynamic -> intel_gpu_drv]   Igd_vga_disable:                    0x0
[drivers -> dynamic -> intel_gpu_drv]   Ggc_lock:                           0x1
[drivers -> dynamic -> intel_gpu_drv] clock frequency: 38400000 Hz from crystal logic
[drivers -> dynamic -> intel_gpu_drv] GGTT
[drivers -> dynamic -> intel_gpu_drv]   vaddr:0x11800000 size:0x800000 entries:1048575 used:16384 aperture_size:0x10000000
[drivers -> dynamic -> intel_gpu_drv]   scratch_page:0x1000 (PA)
[drivers -> dynamic -> intel_gpu_drv] [34mWarning: no clock gating

However, when starting the gears demo, Mesa complains:

[runtime -> mesa_gears] libEGL debug: Native platform type: genode (build-time configuration)
[runtime -> mesa_gears] libEGL debug: MESA-LOADER: dlopen(/drivers/mesa_gpu_drv.lib.so)
[runtime -> mesa_gears] 
[runtime -> mesa_gears] libEGL debug: found extension `DRI_Core'
[runtime -> mesa_gears] libEGL info: found extension DRI_Core version 2
[runtime -> mesa_gears] libEGL debug: found extension `DRI_IMAGE_DRIVER'
[runtime -> mesa_gears] libEGL debug: found extension `DRI_DRI2'
[runtime -> mesa_gears] libEGL info: found extension DRI_DRI2 version 4
[runtime -> mesa_gears] libEGL debug: found extension `DRI_ConfigOptions'
[runtime -> mesa_gears] libEGL debug: found extension `DRI_Core'
[runtime -> mesa_gears] libEGL debug: found extension `DRI_IMAGE_DRIVER'
[runtime -> mesa_gears] libEGL debug: found extension `DRI_DRI2'
[runtime -> mesa_gears] libEGL debug: found extension `DRI_ConfigOptions'
[runtime -> mesa_gears] libEGL info: found extension DRI_ConfigOptions version 2
[runtime -> mesa_gears] Driver does not support the 0x46a6 PCI ID.
[runtime -> mesa_gears] libEGL warning: DRI2: failed to create dri screen
[runtime -> mesa_gears] EGL_VERSION = 1.4
[runtime -> mesa_gears] EGLUT: failed to choose a config
[runtime] child "mesa_gears" exited with exit value 1[0m

ssumpf commented 1 year ago

@nfeske: Looks like Alderlake will require a Mesa update.

ssumpf commented 1 year ago

@cnuke: I think that is what @alex-ab is seeing. I am looking into the one from @m-stein at the moment.

nfeske commented 1 year ago

BTW, in order to boot the Framework laptop, I had to disable USB bios handoff (https://github.com/nfeske/genode/commit/06c9bd7c87eb020cfb750dd169d43e47a36822ea) but that's a known issue.

Since this is the only (blocking) issue for booting the unpatched version of Sculpt on the Framework laptop, I would very much appreciate if could adjust the timeout of the USB-handover code to only a very few (like two or three?) seconds.

cnuke commented 1 year ago

Regarding the File Vault problems: The initialization of a new file vault breaks with cc9368c.

@m-stein commit 8056729 fixes the file_vault.run run-script for me. I have not checked why the commit referenced by you appears to break the file_vault but I noticed that max_vba() is 0 because the sync-operation is not performed and thus the wrong size is reported after the initialization is done.

ssumpf commented 1 year ago

@m-stein: 50939ad fixes the GPU hang for me on GEN9 GPUs. @cnuke: The DMAR thing I could still not trigger.

chelmuth commented 1 year ago

With the recent changes in gpu/intel and libdrm (16MB chunk allocations), the RAM quota for the gears demo became insufficient (3042fc5).

This is still insufficient on my x250 - setting ram="80M" in /config/deploy let it run without flaws. Could it be this isn't just an Iris thing as stated in the commit message?

skalk commented 1 year ago

@cnuke @m-stein I can confirm that with commit 8056729 I can also re-access my already existing file_vault data, it works. Thank you for your commitment!

chelmuth commented 1 year ago

@m-stein: 50939ad fixes the GPU hang for me on GEN9 GPUs. @cnuke: The DMAR thing I could still not trigger.

read execlist status multiple times in this commit means 100 times, right? As I can't assess the impact of this approach I'd like to ask if this may increase the CPU usage of the multiplexer noticeably.

ssumpf commented 1 year ago

@m-stein: 50939ad fixes the GPU hang for me on GEN9 GPUs. @cnuke: The DMAR thing I could still not trigger.

read execlist status multiple times in this commit means 100 times, right? As I can't assess the impact of this approach I'd like to ask if this may increase the CPU usage of the multiplexer noticeably.

In most cases goes through at once. In the case where it triggers it takes two times. The 100 is just there so we can detect the case when there is something severely broken and don't end up with an endless loop. So I am open to any other number. Btw. before we did not have this check and everything worked fine, I just added it because the documentation states it could lead to preemption. Without the loop or removing the check altogether you will get a GPU watchdog timer trigger. Which is what @mstein experienced.

jschlatow commented 1 year ago

@cnuke @m-stein I can confirm that with commit 8056729 I can also re-access my already existing file_vault data, it works. Thank you for your commitment!

The commit also fixes the issue I had with the file vault.

chelmuth commented 1 year ago

In most cases goes through at once. In the case where it triggers it takes two times. The 100 is just there so we can detect the case when there is something severely broken and don't end up with an endless loop. So I am open to any other number. Btw. before we did not have this check and everything worked fine, I just added it because the documentation states it could lead to preemption.

Thanks for the explanation (which may be worth to record in a comment at the code place in question). The 100 is completely fine with the knowledge that common cases succeed at iteration 1 or 2.

ssumpf commented 1 year ago

With the recent changes in gpu/intel and libdrm (16MB chunk allocations), the RAM quota for the gears demo became insufficient (3042fc5).

This is still insufficient on my x250 - setting ram="80M" in /config/deploy let it run without flaws. Could it be this isn't just an Iris thing as stated in the commit message?

Iris surely contributes, there are many other contributing factors from the multiplexer (heap upgrades, page table upgrades, platform driver upgrades) that may sometimes be accounted to an arbitrary client causing the situation. Until now I have tried to minimize allocations in order to safe capabilities, which obviously comes at the cost of RAM. When reducing the junk size from 16 to 4MB in Iris, I can start gears with 36MB. So as you said, it's just one factor.

ssumpf commented 1 year ago

In most cases goes through at once. In the case where it triggers it takes two times. The 100 is just there so we can detect the case when there is something severely broken and don't end up with an endless loop. So I am open to any other number. Btw. before we did not have this check and everything worked fine, I just added it because the documentation states it could lead to preemption.

Thanks for the explanation (which may be worth to record in a comment at the code place in question). The 100 is completely fine with the knowledge that common cases succeed at iteration 1 or 2.

Okay, will do.

atopia commented 1 year ago

On my X260, @nfeske's bsd_audio_drv fails with the following:

[runtime] child "mixer"
[runtime]   RAM quota:  61192K
[runtime]   cap quota:  66
[runtime]   ELF binary: mixer
[runtime]   priority:   2
[runtime]   provides service Audio_out
[runtime -> bsd_audio_drv] Error: output channel 'left' is unavailable, denying 'mixer -> '
[runtime -> mixer] Error: Audio_out-session creation failed (label="", ram_quota=535560, cap_quota=4, channel="left")
[runtime -> mixer] Error: Uncaught exception of type 'Genode::Service_denied'
[runtime -> mixer] Warning: abort called - thread: ep

I didn't actually use audio in 22.10, but at least the mixer connection didn't fail.

jschlatow commented 1 year ago

@m-stein @cnuke I might have broken the file vault again. I pressed shutdown while a client was still connected, which triggered the following exception:

[runtime -> file_vault -> file_vault] [31mError: Uncaught exception of type 'Genode::Id_space<Genode::Parent::Client>::Unknown_id'[0m
[runtime -> file_vault -> file_vault] [34mWarning: abort called - thread: ep[0m

After restarting the file vault and entering the password again, it fails with the following errors:

[runtime]   RAM quota:  225032K[0m
[runtime]   cap quota:  2166[0m
[runtime]   ELF binary: init[0m
[runtime]   priority:   3[0m
[runtime]   provides service File_system[0m
[runtime -> file_vault -> file_vault] [31mError: failed to deliver Report session[0m
[runtime -> file_vault -> file_vault] [31mError: failed to deliver Report session[0m
[runtime -> file_vault -> file_vault] child "cbe_init_trust_anchor" exited with exit value 0[0m
[runtime -> file_vault -> file_vault] child "sync_to_cbe_vfs_init" exited with exit value 0[0m
[runtime -> file_vault -> file_vault -> client_fs_fs_query] [31mError: failed to watch '/'[0m
[runtime -> file_vault -> file_vault -> client_fs_fs_query] [31mError: failed to watch '/data'[0m
[runtime -> file_vault -> file_vault] child "rump_vfs" requests resources: ram_quota=4198400[0m
[runtime -> file_vault -> file_vault -> rump_vfs] rump: /genode: file system not clean; please fsck(8)[0m
[runtime -> file_vault -> file_vault -> cbe_vfs] [31mError: Exception raised: explicit raise in cbe-meta_tree.adb:148[0m
[runtime -> file_vault -> file_vault -> cbe_vfs] [31mError: Uncaught exception of type 'Ada::Exception::Program_Error'[0m
[runtime -> file_vault -> file_vault -> cbe_vfs] [34mWarning: abort called - thread: ep[0m

chelmuth commented 1 year ago

Could it be this isn't just an Iris thing as stated in the commit message?

Iris surely contributes [...]

@cnuke clarified my misconception at the coffee break: You meant the Mesa driver, which is named Iris, not the Intel GPU platform :grin:

cnuke commented 1 year ago

[…] [runtime -> bsd_audio_drv] Error: output channel 'left' is unavailable, denying 'mixer -> ' [runtime -> mixer] Error: Audio_out-session creation failed (label="", ram_quota=535560, cap_quota=4, channel="left") [runtime -> mixer] Error: Uncaught exception of type 'Genode::Service_denied' [runtime -> mixer] Warning: abort called - thread: ep
I didn't actually use audio in 22.10, but at least the mixer connection didn't fail.

Judging by the error message it appears that the driver already has a session attached to left channel. Is there by any chance already an Audio_out client using the driver when you try to start the mixer?

m-stein commented 1 year ago

@cnuke Regarding 80567296f16f272d0c295bd34c5ba7da40dcdc63, I can confirm that I'm now able to use the File Vault in Sculpt 23.04 without the problems previously observed. Thanks for your quick acting! However, I'm left a bit puzzled on why the patch actually fixes the problem and whether it fixes the core problem or merely circumvents it. Could you please provide some clarification and add a commit message?

m-stein commented 1 year ago

@ssumpf Thanks for https://github.com/genodelabs/genode/commit/50939adeed7a7ad16858bfc47d93db0a43e84f14 ! I just stressed it for over an hour in Sculpt 23.04 with glmark, mesa gears, morph browser and doom3 and could not provoke a driver fault anymore!

m-stein commented 1 year ago

@jschlatow Thanks for reporting this! However, if no one considers it a pressing issue, I'd skip debugging this rather rare and avoidable use pattern because the problem seems to be connected to cbe library internals which have been completely re-worked anyway.

The same goes for an issue I observed right now that closing and re-starting the File Vault during a re-keying leads to a fault while opening the vault.

Any opinions?

cnuke commented 1 year ago

@cnuke Regarding 8056729, I can confirm that I'm now able to use the File Vault in Sculpt 23.04 without the problems previously observed. Thanks for your quick acting! However, I'm left a bit puzzled on why the patch actually fixes the problem and whether it fixes the core problem or merely circumvents it. Could you please provide some clarification and add a commit message?

@m-stein To be honest I have not further checked which commit introduced the breaking change so I cannot answer why it is necessary. I merely noticed that the VBD size was not correct and stumpled upon the sync band-aid. I am bit staggered on how that could have worked anyway as the libc (fsync) only executes the flushing operation if the fd was modfied (*), which was not the case with the previous implementation of the sync component. So I changed it to reflect what I believe is the intend of the original implementation (and mis-used Append_file along the way as its destructor implicitly triggers the sync-operation).

(*) from what I have gathered by browsing the WWW executing fsync on an r/o fd is allowed on at least Linux.

genodelabs / genode

Sculpt 23.04 consolidation #4820