noskb opened this issue 8 months ago (status: Open)
Ouch.
Can you provide a way to reproduce this with a completely fresh AppVM?
I have the same issue. It started happening recently, maybe a week ago, and I'm having multiple FF crashes daily.
Isolated Web Co[1205]: segfault at 1a39be38a0d8 ip 00001a39be38a0d8 sp 00007ffc3eea83b8 error 15 likely on CPU 2 (core 2, socket 0)
Code: fe ff 00 0d 3c be 39 1a 00 00 90 d0 59 1a 92 7b 00 00 c0 d1 59 1a 92 7b 00 00 e8 91 38 be 39 1a fe ff e8 91 38 be 39 1a fe ff <c0> 6f 3c be 39 1a 00 00 90 d0 59 1a 92 7b 00 00 c0 d1 59 1a 92 7b
It happens with a single tab open, when streaming video or using JS heavy sites, and it seems to happen randomly.
@renehoj Are you using the Fedora Firefox package? I suspect this is a Fedora packaging bug.
No, I'm using Debian 12 minimal with Firefox-ESR. I tried giving my browser qubes 6GB memory, it didn't stop the crashes.
I updated the steps to reproduce section.
I too noticed it at first as random crashes. I then recreated Firefox profiles, tried older versions and Flatpaks, and suspected hardware failure, but what I finally tracked it down to was memory hotplug.
Disabling the memory hotplug fixes the problem like a charm, and I can reproduce it on another laptop with R4.2 installed, which is why I'm reporting the problem.
Could this be due to a memory allocation failure?
@noskb Is disabling hotplug the same as memory balancing?
Your test passes on my system with 8 GB initial memory and balancing enabled, but fails with low values like 800 MB.
@renehoj Does that mean that even with memory hotplug feature disabled, it still fails if the init memory value is low?
No, disabling hotplug also solves the issue with low init memory settings, the system seems fully stable with the feature turned off.
I just didn't know whether memory hotplug and memory balancing do something similar. Turning off memory balancing and/or increasing the init memory also seems to improve stability when running your test.
I'm still having crashes, even with qvm-features memory-hotplug ''
During the weekend, I've had Firefox fully crash 3 times, not just a single tab.
It doesn't leave any information in the logs except for
Feb 25 09:49:08 browser-streaming qubes.StartApp+firefox-esr-dom0[882]: ExceptionHandler::SendContinueSignalToChild sent continue signal to child
Feb 25 09:49:08 browser-streaming qubes.StartApp+firefox-esr-dom0[882]: ExceptionHandler::GenerateDump cloned child 2312
Feb 25 09:49:08 browser-streaming qubes.StartApp+firefox-esr-dom0[882]: ExceptionHandler::WaitForContinueSignal waiting for continue signal...
Even with memory balancing disabled and memory allocated statically to the AppVM, does Firefox still crash during normal use? If so, the most likely cause is a Firefox problem or a hardware failure.
I only had memory-hotplug disabled, now I'm trying with memory balancing disabled as well.
My guess is that it started after an update this month, I didn't use to have any issues with Firefox, and suddenly it becomes noticeably unstable. It is a problem specifically with Firefox, no other application is crashing, but it could have started after updates to the Linux kernel or Xen.
Disabling both memory-hotplug and memory balancing didn't stop the crashes.
I ended up downgrading the kernel in my browser qubes to 6.6.2-1, and now the crashes seem to have stopped.
@renehoj Okay, so a kernel regression.
Can you (in a test standalone VM) try doing a kernel bisection to see which upstream commit broke things?
@DemiMarie I spoke too soon, I just had libxul.so crash again.
Changing the kernel, just like disabling memory_hotplug, will allow the browser to pass noskb's test, but it doesn't stop the crashes.
@renehoj Ouch.
The usual advice for this kind of problem is “record an rr trace” but that:
The cause seems to be that domU detects initial memory instead of maxmem when memory hotplug is enabled.
A domU with an initial memory of 800 and max memory of 8000:
hotplug enabled
[ 0.242574] Memory: 731012K/818812K available (18432K kernel code, 3241K rwdata, 8924K rodata, 5132K init, 6172K bss, 87544K reserved, 0K cma-reserved)
hotplug disabled
[ 4.565233] Memory: 7974216K/8191612K available (18432K kernel code, 3241K rwdata, 8924K rodata, 5132K init, 6172K bss, 217140K reserved, 0K cma-reserved)
This changes the kernel parameters whose initial values are calculated from the amount of memory.
It seems that the low value of kernel.threads-max when hotplug is enabled causes resource exhaustion, and Firefox crashes.
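The link between detected memory and kernel.threads-max can be checked with some quick arithmetic. The sketch below assumes the default is roughly usable RAM divided by eight kernel stacks, and assumes a 16 KiB kernel stack as is typical on x86_64. Neither figure comes from this thread, and the helper name estimate_threads_max is made up for illustration.

```shell
#!/bin/sh
# Rough estimate of the kernel's default threads-max for a given amount of RAM.
# Assumption (not from this thread): threads-max ~= RAM / (8 * kernel stack size),
# with a 16 KiB kernel stack, which is typical for x86_64.

estimate_threads_max() {
    mem_kib=$1            # usable memory in KiB, e.g. from the dmesg "Memory:" line
    stack_kib=${2:-16}    # assumed kernel stack size in KiB
    echo $(( mem_kib / (8 * stack_kib) ))
}

# Values taken from the two dmesg lines quoted above
echo "hotplug enabled:  $(estimate_threads_max 731012)"
echo "hotplug disabled: $(estimate_threads_max 7974216)"
```

Under these assumptions the hotplug-enabled qube ends up with a limit roughly a tenth of the hotplug-disabled one. The real value in a running qube can be read with cat /proc/sys/kernel/threads-max and compared against the estimate.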
Whoops, I initially blamed this on Mozilla and ended up migrating to a Chromium browser. Was there an actual fix in the end?
Worst case you could change kernel.threads-max with a sysctl call? e.g. sysctl kernel.threads-max=256000
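A minimal sketch of that workaround, for anyone who wants to try it. The value 256000 is the one suggested above. The configuration file name is hypothetical, and the commands that change the system are left commented out so that running the script only prints what would be written.

```shell
#!/bin/sh
# Sketch: raise kernel.threads-max, now and (optionally) on later boots.
# 256000 is the value suggested above; the file name below is hypothetical.

THREADS_MAX=256000
CONF_FILE=/etc/sysctl.d/99-threads-max.conf   # hypothetical file name

# Print the sysctl.d line for a given limit.
render_threads_max_conf() {
    printf 'kernel.threads-max = %s\n' "$1"
}

# Show what would be written; nothing is changed by this line.
render_threads_max_conf "$THREADS_MAX"

# Apply immediately (needs root, lost on reboot):
#   sudo sysctl -w kernel.threads-max="$THREADS_MAX"
# Persist on systems that read /etc/sysctl.d at boot:
#   render_threads_max_conf "$THREADS_MAX" | sudo tee "$CONF_FILE"
#   sudo sysctl --system
```

In an AppVM whose root filesystem is reset on restart, a file under /etc would not survive a reboot, so the setting would presumably have to live in the template or be applied from a startup script instead.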
Would anyone here be willing to enable Firefox crash reporting and report this as a bug to Mozilla?
It could be a Qubes OS issue, a Linux kernel issue, or a Xen issue, but I think the Mozilla developers are in a better position to decide who is to blame.
Given the above, I will likely just reply to that bug report pointing out the system is misconfigured 😊
Running ps -eo nlwp | tail -n +2 | awk '{ num_threads += $1 } END { print num_threads }' on my current desktop system already shows over 2000 threads in use. It seems entirely plausible that opening up a large number of tabs at once could cause it to spike over 5000.
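The same measurement can be set against the configured limit directly. This is a sketch only: the helper names sum_threads and percent_used are made up for illustration, and it assumes a procps-style ps that accepts nlwp= to suppress the header line.

```shell
#!/bin/sh
# Sketch: compare the number of threads currently in use with kernel.threads-max.

# Sum the per-process thread counts (one NLWP value per line) read from stdin.
sum_threads() {
    awk '{ total += $1 } END { print total + 0 }'
}

# Integer percentage of the limit that is in use.
percent_used() {
    echo $(( $1 * 100 / $2 ))
}

# Only report when both ps and the sysctl file are available.
if command -v ps >/dev/null 2>&1 && [ -r /proc/sys/kernel/threads-max ]; then
    used=$(ps -eo nlwp= | sum_threads)
    limit=$(cat /proc/sys/kernel/threads-max)
    echo "threads in use: $used of $limit ($(percent_used "$used" "$limit")%)"
fi
```

Running this in the affected qube just before and just after opening many tabs would show how close the system gets to the limit.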
@noskb's diagnosis, that using the minimal memory to set kernel limits causes the problem, seems correct.
Also possible would be to just apply that kernel setting to a normal system and observe the results. I'll give this a shot tomorrow, but again, at this point it seems very likely to just confirm the above.
Firefox still should not segfault in this case. Signal 15 is SIGSEGV, meaning that Firefox accessed invalid memory.
I agree, I would've expected an internal assertion. I'll have a look but from the user perspective it won't make a difference obviously how we crash.
The main advantage of an assertion failure is that it makes the problem obvious. My first thought looking at this bug report was that there was some low-level bug causing unexplained memory corruption, which was quite worrying. A misconfigured thread limit is much less concerning and much easier to fix.
It would be great if you could have the crash message include something like: Failed to create new thread. Check that /proc/sys/kernel/threads-max is set correctly and that cgroups or other resource limitations are correctly configured.
> Signal 15 is SIGSEGV, meaning that Firefox accessed invalid memory.
Signal 15 is SIGTERM, not SIGSEGV. Looking at the reports above, you're either getting this, or Signal 6, i.e. SIGABRT. This is consistent with my testing that shows that we seem to correctly catch these errors with internal assertions, e.g.
[3314006] Hit MOZ_CRASH(Failed creating SwComposite thread: Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }) at gfx/wr/webrender/src/compositor/sw_compositor.rs:530
[3314676] Assertion failure: ((bool)(__builtin_expect(!!(!NS_FAILED_impl(rv)), 1))) && mIOThread (Should successfully create image I/O thread), at /home/morbo/hg/firefox/image/DecodePool.cpp:104
[3315719] Assertion failure: thread->mThread (Failed to create TaskController pool thread), at /home/morbo/hg/firefox/xpcom/threads/TaskController.cpp:286
Now I was wondering why you're not getting a message to the screen, but that appears to be intentional: https://searchfox.org/mozilla-central/source/mfbt/Assertions.h#275. Interesting question here whether our internal assertion is too coarse-grained, but I imagine there's some rather-safe-than-exploitable here.
That said, I sometimes get a:
[3393222] Sandbox: SandboxBroker: thread creation failed: EAGAIN
or other errors which highlight the problem.
However! I do actually see a lot of crashes with signal 11 when testing this in a non-debug build. They correctly go to the crash reporter like here: https://crash-stats.mozilla.org/report/index/39faa341-8fde-46cf-8bc8-f778f0240731
Which correctly shows MOZ_CRASH Reason (Sanitized): MOZ_RELEASE_ASSERT(thread->mThread) (Failed to create TaskController pool thread). That's strange: we hit the assertion, which should generate the report and then cause a SIGABRT or SIGTRAP or similar, but the crash reason is actually a SIGSEGV. The crash report came through, so we somehow crash somewhere between generating the report and trying to intentionally crash. Unfortunately I got nowhere in the debugger, which claims various syscalls are generating the SIGSEGV (maybe that's right?).
tl;dr Enabling crash reporting would've immediately pointed to the underlying issue.
I tested how Chrome copes with this problem, because I'm sure they use threads too 😉. From my testing it's basically the same as with Firefox: you just get empty tabs that don't load or that crash on loading. To their credit, they consistently print an error like [3392571:27:0731/150714.608380:ERROR:platform_thread_posix.cc(155)] pthread_create: Resource temporarily unavailable (11), which would've been helpful.
> That's strange: we hit the assertion, which should generate the report and then cause a SIGABRT or SIGTRAP or similar, but the crash reason is actually a SIGSEGV.
Okay, this too is intentional, at least on Linux: https://searchfox.org/mozilla-central/source/mfbt/Assertions.h#242
I'll need to do some archeology to understand why we prefer to force SIGSEGV over a SIGTRAP/SIGABRT here, but in any case this appears to be working as intended. About the only thing actionable would be the blanket assumption that if we hit one of those asserts, we're in a corrupted memory situation and shouldn't try to print the error. But again, enabling crash reporting would've gotten you the full diagnostic and is arguably much safer, especially as that's out-of-process.
Edit: Result of the archeology: https://bugzilla.mozilla.org/show_bug.cgi?id=1858670#c8. It's a trick to recover the line number even in the most dire circumstances, but we're considering changing this.
Does this mean that crash reporting being disabled should be considered a Fedora packaging bug?
Which package are you using? I'd be surprised if Fedora disabled crash reporting; we'd be blind to serious bugs on a major distro then. From the discussion here I thought this was a custom build.
Could the thread limit just be breaking crash reporting too? When I lowered the limit at some point even bash started throwing errors.
I don’t use Firefox (due to its lack of per-site sandboxing), but I would not at all be surprised if the thread limit broke crash reporting. In that case the best that can be done is to write a literal string to stderr.
> due to its lack of per-site sandboxing
If you refer to what Chrome calls Site Isolation, do note Firefox shipped this 3 years ago.
> In that case the best that can be done is to write a literal string to stderr.
Filed this as https://bugzilla.mozilla.org/show_bug.cgi?id=1911044.
I meant Fission Site Sandboxing, needed to prevent universal XSS by a compromised renderer.
Thank you!
I still observe the issue even with:
$ cat /proc/sys/kernel/threads-max
256000
Crash:
[Parent 42716, IPC I/O Parent] WARNING: process 54174 exited on signal 11: file /builddir/build/BUILD/firefox-131.0.2/ipc/chromium/src/base/process_util_posix.cc:335
ExceptionHandler::GenerateDump attempting to generate:/home/user/.mozilla/firefox/tnttawhx.default-release/minidumps/4018e50c-0e8e-44f2-8ba6-07122d9af45f.dmp
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Is there some other limit to increase? Maybe some of those related to epoll/watches?
Are you getting a crash report out of those crashes?
The file mentioned in the error message doesn't exist. And I don't get a crash message window either (some earlier comment here says it may be disabled in Fedora's build?)
> The file mentioned in the error message doesn't exist.
That seems to support the idea that there's some limit that just stops everything from working :-/
> some earlier comment here says it may be disabled in Fedora's build?
As already said, I doubt this is true. (I'm not sure you'd even get the message about the minidump then)
Someone above reported the settings that seem to change due to the wrong configuration, so that's a good starting point for finding out what the issue is: https://github.com/QubesOS/qubes-issues/issues/8960#issuecomment-2133190388
Yes, and I'm trying to figure out what other limit is relevant here, in addition to threads-max.
Raising fs.fanotify.max_user_marks too is not enough either.
Not vm.user_reserve_kbytes either.
And not fs.epoll.max_user_watches.
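Rather than trying limits one at a time, it may help to dump a set of candidates in one go and diff the output between a boot with memory hotplug enabled and one with it disabled. The sketch below is only a starting point: the list of sysctl names is a guess at what might scale with detected memory, not a confirmed cause, and the helper name sysctl_path is made up for illustration.

```shell
#!/bin/sh
# Sketch: print a set of sysctl values so they can be diffed between a qube
# booted with memory hotplug enabled and the same qube with it disabled.
# The list of names is a guess at what might matter, not a confirmed cause.

SYSCTLS="kernel.threads-max kernel.pid_max vm.max_map_count vm.user_reserve_kbytes fs.file-max fs.epoll.max_user_watches fs.fanotify.max_user_marks fs.inotify.max_user_watches"

# Map a sysctl name to its /proc/sys path (dots become slashes).
sysctl_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr . /)"
}

for name in $SYSCTLS; do
    path=$(sysctl_path "$name")
    if [ -r "$path" ]; then
        printf '%s = %s\n' "$name" "$(cat "$path")"
    else
        printf '%s = (not available)\n' "$name"
    fi
done
```

Redirecting the output to a file in each configuration and running diff on the two files would show every listed limit that changed, which narrows down what to raise next.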
Qubes OS release
r4.2
Brief summary
As title. In my case, it crashes when opening more than 30 tabs from a bookmark at once. If memory-hotplug is disabled, this will not occur.
The following message appears in dmesg:
Steps to reproduce
In r4.2 with the latest update, the memory hotplug feature is enabled by default, so additional configuration is not needed.
Create an appvm with sufficient RAM space by running the following in dom0 terminal:
qvm-create ff-crash -l red --prop memory=800 --prop maxmem=8000
then, run the following in ff-crash terminal:
firefox -- google.com facebook.com youtube.com baidu.com yahoo.com amazon.com wikipedia.org qq.com twitter.com slashdot.org google.co.in taobao.com live.com sina.com.cn yahoo.co.jp linkedin.com weibo.com ebay.com google.co.jp yandex.ru bing.com vk.com hao123.com google.de instagram.com t.co msn.com amazon.co.jp tmall.com google.co.uk pinterest.com ask.com reddit.com wordpress.com mail.ru google.fr blogspot.com paypal.com onclickads.net google.com.br
To disable memory-hotplug, run the following in dom0 then restart ff-crash:
qvm-features ff-crash memory-hotplug ''
Expected behavior
No segfaults occur.
Actual behavior
Firefox crashes