Open varumugam123 opened 1 month ago
Another reliable way to reproduce this issue is to send SIGSTOP to WPEWebProcess and resume it with SIGCONT after (2 * watchdoghangthresholdtinseconds) seconds.
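The stop/resume step can be scripted. Below is a minimal POSIX sketch, not part of the plugin: `pauseProcessFor` and `pauseChildBriefly` are hypothetical helper names, and the PID of the real WPEWebProcess would have to be looked up separately. The self-check pauses a throwaway child for 100 ms; for the actual reproduction the duration would be longer than 2 * watchdoghangthresholdtinseconds.

```cpp
#include <cerrno>
#include <chrono>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <thread>
#include <unistd.h>

// Freeze a process with SIGSTOP for the given duration, then resume it with SIGCONT.
// Returns 0 on success, or the errno reported by the failing kill() call.
int pauseProcessFor(pid_t pid, std::chrono::milliseconds duration)
{
    if (kill(pid, SIGSTOP) != 0)
        return errno;
    std::this_thread::sleep_for(duration);
    if (kill(pid, SIGCONT) != 0)
        return errno;
    return 0;
}

// Self-check on a throwaway child process instead of a real WPEWebProcess.
int pauseChildBriefly()
{
    pid_t child = fork();
    if (child < 0)
        return errno;
    if (child == 0) {
        for (;;)
            pause(); // the child idles until it is killed
    }
    int result = pauseProcessFor(child, std::chrono::milliseconds(100));
    kill(child, SIGKILL);       // clean up the helper child
    waitpid(child, nullptr, 0); // reap it so no zombie is left behind
    return result;
}
```

Sending the signals requires permission to signal the target process, so on a device this would typically run as the same user as WPEWebProcess or as root.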
@varumugam123: SIGFPE is sent by https://github.com/WebPlatformForEmbedded/ThunderNanoServicesRDK/blob/rdkservices/WebKitBrowser/WebKitImplementation.cpp#L3461, and it is sent after:
_config.WatchDogCheckTimeoutInSeconds.Value() * _config.WatchDogHangThresholdInSeconds.Value() / _config.WatchDogCheckTimeoutInSeconds.Value() = _config.WatchDogHangThresholdInSeconds.Value()
so it is sent before DeactivateBrowser(PluginHost::IShell::WATCHDOG_EXPIRED); is called: https://github.com/WebPlatformForEmbedded/ThunderNanoServicesRDK/blob/rdkservices/WebKitBrowser/WebKitImplementation.cpp#L3465
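One possible reading of the formula above, as a sketch only: assume the watchdog wakes up every WatchDogCheckTimeoutInSeconds and sends SIGFPE once enough consecutive checks have failed to cover WatchDogHangThresholdInSeconds. The function name and the counting model are assumptions for illustration, not taken from WebKitImplementation.cpp.

```cpp
#include <cassert>

// Hypothetical model: the watchdog checks every checkTimeoutSeconds and sends
// SIGFPE after (hangThresholdSeconds / checkTimeoutSeconds) failed checks.
// Returns the number of seconds that pass before SIGFPE would be sent.
int secondsUntilSigfpe(int checkTimeoutSeconds, int hangThresholdSeconds)
{
    assert(checkTimeoutSeconds > 0);
    const int checksNeeded = hangThresholdSeconds / checkTimeoutSeconds; // integer division
    return checkTimeoutSeconds * checksNeeded;
}
```

With a check interval of 10 and a threshold of 60 this gives 60, matching the formula. In this model a threshold that is not a multiple of the check interval is rounded down by the integer division (7 and 60 give 56).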
I am not able to reproduce the deadlock. Each time I stop WPEWebProcess (SIGSTOP) for more than 2 * watchdoghangthresholdtinseconds and then resume it (SIGCONT), WPEWebProcess receives SIGFPE (crash) and the WebKitBrowser plugin is deactivated. No WPEProcess (UI process) is left hanging: it is destroyed (deactivated) properly, without a crash.
Here are some standard logs together with some of my own additional ones:
[13:23:21]:[SysLog]:[Fatal]: CRASH: WebProcess crashed: exiting ...
[PGPG] ~WebKitImplementation() START
[PGPG] ~WebKitImplementation() END
[13:23:22]:[SysLog]:[Fatal]: FORCED Shutdown: WebKitBrowser by reason: Failure.
[13:23:22]:[SysLog]:[Crash]: -== /proc/meminfo ==-
...
[13:23:22]:[SysLog]:[Shutdown]: Deactivated plugin [WebKitBrowser]:[WebKitBrowser]
I tested it on RPi (WPE 2.38).
@varumugam123: is the reproduction rate 100%?
One question: when WPEWebProcess hangs for some reason, the Browser plugin is supposed to kill the WPEWebProcess, which causes WPE to launch a new web process when a new request is made (or on a reload). AFAIK this is how it has always worked. But the browser plugin wasn't meant to be killed in that situation. Is this a change in how it should work? Or is the problem that, after WPEWebProcess is killed and before the creation of a new WPEWebProcess is triggered, the browser plugin needs to be manually deactivated in order to reproduce this situation?
Is it the case now that the Browser plugin gets killed as well? From the trace, it seems that the Browser plugin gets deadlocked because it's trying to terminate, and in the process it tries to kill a WPEWebProcess that was killed before, so the call to kill it doesn't seem to return.
It seems that the deadlock is caused by taking a Locker on m_nextIterationLock twice on the same call stack.
This is the first function on the call stack of Thread 5 where Locker is used:
#45 WTF::RunLoop::threadWillExit () at ../git/Source/WTF/wtf/RunLoop.cpp:189
https://github.com/WebPlatformForEmbedded/WPEWebKit/blob/wpe-2.38/Source/WTF/wtf/RunLoop.cpp#L188
void RunLoop::threadWillExit()
{
    m_currentIteration.clear();
    {
        Locker locker { m_nextIterationLock };
        m_nextIteration.clear();
    }
}
and this is the second one which causes the observed deadlock:
#11 WTF::RunLoop::dispatch(WTF::Function<void ()>&&) () at ../git/Source/WTF/wtf/RunLoop.cpp:151
https://github.com/WebPlatformForEmbedded/WPEWebKit/blob/wpe-2.38/Source/WTF/wtf/RunLoop.cpp#L151
void RunLoop::dispatch(Function<void()>&& function)
{
    RELEASE_ASSERT(function);
    bool needsWakeup = false;
    {
        Locker locker { m_nextIterationLock };
        needsWakeup = m_nextIteration.isEmpty();
        m_nextIteration.append(WTFMove(function));
    }
    if (needsWakeup)
        wakeUp();
}
The problem can probably be reproduced when there is a pending sendWithAsyncReply(Messages::AuxiliaryProcess::MainThreadPing(), ...) (which is started in AuxiliaryProcessProxy::checkForResponsiveness) and destruction of the UI begins.
The main thread (Thread 1) is waiting for the other threads to stop (one of them is Thread 5).
Stopping Thread 5 cancels all pending async replies, which means that their callbacks are invoked. In our case the callback calls RunLoop::main().dispatch (and here we have the deadlock):
void AuxiliaryProcessProxy::checkForResponsiveness(CompletionHandler<void()>&& responsivenessHandler, UseLazyStop useLazyStop)
{
    startResponsivenessTimer(useLazyStop);
    sendWithAsyncReply(Messages::AuxiliaryProcess::MainThreadPing(), [weakThis = WeakPtr { *this }, responsivenessHandler = WTFMove(responsivenessHandler)]() mutable {
        // Schedule an asynchronous task because our completion handler may have been called as a result of the AuxiliaryProcessProxy
        // being in the middle of destruction.
        RunLoop::main().dispatch([weakThis = WTFMove(weakThis), responsivenessHandler = WTFMove(responsivenessHandler)]() mutable {
            if (weakThis)
                weakThis->stopResponsivenessTimer();
            if (responsivenessHandler)
                responsivenessHandler();
        });
    });
}
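The re-entrancy described above can be modelled outside WebKit. The following is a minimal sketch with hypothetical types (`SimpleLock`, `CancelledReply`, `MiniRunLoop`), not WTF code: the lock reports failure instead of blocking, so the point where the real code would wait forever becomes observable without hanging.

```cpp
#include <atomic>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical non-recursive lock. tryLock() fails instead of blocking.
class SimpleLock {
public:
    bool tryLock() { return !m_locked.exchange(true, std::memory_order_acquire); }
    void unlock() { m_locked.store(false, std::memory_order_release); }
private:
    std::atomic<bool> m_locked { false };
};

// Runs its handler on destruction, mimicking an async reply that is
// cancelled (and therefore completed) when the pending task is destroyed.
struct CancelledReply {
    std::function<void()> onCancel;
    ~CancelledReply() { if (onCancel) onCancel(); }
};

class MiniRunLoop {
public:
    // Returns false where a blocking lock would wait forever.
    bool dispatch(std::function<void()> task)
    {
        if (!m_lock.tryLock())
            return false;
        m_nextIteration.push_back(std::move(task));
        m_lock.unlock();
        return true;
    }

    // Same shape as threadWillExit(): pending tasks are destroyed under the lock.
    void threadWillExit()
    {
        while (!m_lock.tryLock()) { } // uncontended in this demo
        m_nextIteration.clear();      // task destructors run while the lock is held
        m_lock.unlock();
    }

private:
    SimpleLock m_lock;
    std::vector<std::function<void()>> m_nextIteration;
};

// Returns true if a dispatch() attempted during threadWillExit() found the lock held.
bool reentrantDispatchWouldBlock()
{
    MiniRunLoop loop;
    bool blocked = false;
    auto reply = std::make_shared<CancelledReply>();
    reply->onCancel = [&loop, &blocked] {
        blocked = !loop.dispatch([] { });
    };
    loop.dispatch([reply] { }); // the queued task keeps the reply alive
    reply.reset();              // now only the queued task owns it
    loop.threadWillExit();      // clear() destroys the task, which fires onCancel
    return blocked;
}
```

In this model `reentrantDispatchWouldBlock()` returns true: the cancellation handler runs inside `clear()` while the lock is still held, which is the same ordering as frames #45 and #11 in the trace.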
We should probably add a condition so that RunLoop::main().dispatch is not called if we are in the middle of destruction.
My proposed fix:
diff --git a/Source/WebKit/UIProcess/AuxiliaryProcessProxy.cpp b/Source/WebKit/UIProcess/AuxiliaryProcessProxy.cpp
index e660f7ef5f3a..d5fb2ea5ac6d 100644
--- a/Source/WebKit/UIProcess/AuxiliaryProcessProxy.cpp
+++ b/Source/WebKit/UIProcess/AuxiliaryProcessProxy.cpp
@@ -423,6 +423,9 @@ void AuxiliaryProcessProxy::checkForResponsiveness(CompletionHandler<void()>&& r
 {
     startResponsivenessTimer(useLazyStop);
     sendWithAsyncReply(Messages::AuxiliaryProcess::MainThreadPing(), [weakThis = WeakPtr { *this }, responsivenessHandler = WTFMove(responsivenessHandler)]() mutable {
+        if (!weakThis || !weakThis->connection()->isValid())
+            return;
+
         // Schedule an asynchronous task because our completion handler may have been called as a result of the AuxiliaryProcessProxy
         // being in the middle of destruction.
         RunLoop::main().dispatch([weakThis = WTFMove(weakThis), responsivenessHandler = WTFMove(responsivenessHandler)]() mutable {
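The guard in the patch can be illustrated in isolation. This is a sketch using standard-library stand-ins (`FakeProxy`, `FakeConnection`, and a plain vector as the main queue are all hypothetical), not WebKit's WeakPtr or RunLoop. It also checks for a null connection, which is an assumption about what connection() may return during teardown.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the connection and the proxy.
struct FakeConnection {
    bool valid = true;
    bool isValid() const { return valid; }
};

struct FakeProxy {
    std::shared_ptr<FakeConnection> connection = std::make_shared<FakeConnection>();
};

using TaskQueue = std::vector<std::function<void()>>;

// Builds a reply handler that only schedules follow-up work while the proxy
// and its connection are still usable.
std::function<void()> makeReplyHandler(std::weak_ptr<FakeProxy> weakProxy, TaskQueue& mainQueue, std::function<void()> followUp)
{
    return [weakProxy = std::move(weakProxy), &mainQueue, followUp = std::move(followUp)] {
        auto proxy = weakProxy.lock();
        if (!proxy || !proxy->connection || !proxy->connection->isValid())
            return; // teardown in progress: leave the run loop alone
        mainQueue.push_back(followUp);
    };
}

// Returns how many tasks were queued after the reply handler ran.
std::size_t tasksQueuedAfterReply(bool proxyAlive, bool connectionValid)
{
    TaskQueue mainQueue;
    auto proxy = std::make_shared<FakeProxy>();
    proxy->connection->valid = connectionValid;
    auto handler = makeReplyHandler(proxy, mainQueue, [] { });
    if (!proxyAlive)
        proxy.reset(); // simulate the proxy having been destroyed
    handler();
    return mainQueue.size();
}
```

Note that when the guard fires, the follow-up handler is dropped without being called. Whether that is acceptable for the real CompletionHandler is worth checking when testing the patch.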
@varumugam123, @modeveci: can you test my proposed fix?
This issue was observed with wpe-2.38 (6448608056) + the WebKitBrowser implementation. If the WPEWebProcess is made to hang by running a busy loop from the RWI console (or by any equivalent means), SIGFPE is issued to WPEWebProcess after the configured unresponsive timeout (i.e. watchdoghangthresholdtinseconds).
Below is the stack trace of the main() function thread as well as the Core::Thread that runs g_main_loop_run() for WebKit. Before Crash
After the steps I outlined to reproduce the issue, below are the states of those two threads