Integration help: Image hangs when initializing CEF library and doesn't respond to break key

dolphinsmalltalk / Dolphin

Dolphin Smalltalk Core Image

MIT License

294 stars 58 forks source link

Integration help: Image hangs when initializing CEF library and doesn't respond to break key #1278

Closed JBetz closed 4 months ago

JBetz commented 5 months ago

Describe the bug An external call hangs when non-overlapping, and the only way I can get the call to complete is by making it overlapping, hitting the break key when it hangs, and then pressing Resume in the Interrupt dialog.

The external call that hangs is cef_initialize. Fortunately, there are some callbacks that get triggered inside of the method which provide some visibility into how much progress is being made.

Non-overlapping cef_initialize call:

A few callbacks are triggered (visible by transcript logs)
The image hangs
Press break key, nothing happens
Kill Dolphin in task manager

Overlapping cef_initialize call:

The image hangs
Press break key, interrupt dialog appears
A few callbacks are triggered (same ones as in the non-overlapping case)
Press resume, and the external method returns successfully

Expected behavior External method doesn't hang and break key opens up the Interrupted dialog.

Additional context I've asked the CEF support forum about the issue, and the only advice I was given was to make sure the call to cef_initialize happens on the main application thread. That probably means that it shouldn't be an overlapping call, even though it currently behaves better in that case.

My best guess is that some thread created by CEF is getting starved, and only by triggering the interrupt does the thread somehow get the resource it needs to unblock the cef_initialize call. That makes it seem like Dolphin is capable of resolving the hang but I have no idea what the correct way to do this would be.

For what it's worth, I tried using Win32's SetPriorityClass method to increase the priority of the Dolphin process but it had no visible effect. Nor did SetProcessDEPPolicy which was used in a previous implemention of Dolphin CEF bindings, though I don't know why.

blairmcg commented 5 months ago

Firstly, it is not really playing the game to raise issues using some arbitrary API as Dolphin bugs!

Secondly, I know very little about the CEF and don't really see the point of attempting to integrate it now anyway, given that for a few years now there has been a functioning integration of the chromium-based WebView2 SDK in the standard image.

However, reading further down that header file, CEF requires that you either use its message loop, or that you call cef_do_message_loop_work from the main message loop. It sounds to me like that is more likely the source of the problem. Since the initialize call is sending callbacks, it could well be doing some synchronisation through the message queue (e.g. to dispatch back to the main thread from a subsidiary thread), and may require that the message loop is calling that function even when cef_initialize is called. But really that is just speculation.

That probably means that it shouldn't be an overlapping call, even though it currently behaves better in that case.

Using an overlapped call would definitely be wrong - the header file is pretty clear that the CEF APIs there should be called from the main OS thread only. Likely windows are created under the covers, and these are specific to the thread that creates them. Any windows not created on the main thread will not be serviced by the main thread message loop. The only reason the overlapped call works at all is that Dolphin has to redirect any callbacks that are not on the main thread back to the main thread via the Windows message dispatch mechanism. This can't be guaranteed to work in many cases (and is quite likely to result in a deadlock), but the VM cannot process Smalltalk code on any OS thread other than the main thread so there is no other choice than just crashing.

For what it's worth, I tried using Win32's SetPriorityClass method to increase the priority of the Dolphin process but it had no visible effect.

I'm not surprised that changing the process priority has no effect on a lock up.

Nor did SetProcessDEPPolicy which was used in a previous implemention of Dolphin CEF bindings, though I don't know why.

Studying the SetProcessDEPPolicy API documentation should explain what that is for. It is just enabling Data Execution Protection to prevent code being run in memory allocated for data. The call is increasing the security from standard and opting the application into DEP. It seems unlikely that could be at all related.

If you want to persist with it, maybe just inserting a cef_do_message_loop_work into the main message loop will do what you want, although I rather doubt it.

To do this sort of thing you really need to be prepared to work at the C level and run Dolphin under the VS debugger and break in when it locks up to see what is going on. I don't have either the time or inclination to help you with that, especially as WebView2 would seem to be a perfectly good alternative. Debugging at the C level will be easier if you build the CEF DLL (I assume there is one) in debug mode, and may be easier if you do the same with Dolphin. Once you break, you can see what the CEF code is waiting on. I think your time would probably be better spent migrating to the WebView2 integration instead. This will allow you to embed a chromium based browser, and Microsoft has done most of the hard work for you in creating a wrapper that fits that more easily into a typical windows app.

JBetz commented 5 months ago

Firstly, it is not really playing the game to raise issues using some arbitrary API as Dolphin bugs!

Fair point! The reason I thought it was an issue with Dolphin was:

Break key doesn't work when the external call is non-overlapping, which is definitely a limitation if not a bug of Dolphin.
I could eventually get it to work with the Overlapping + Interrupt + Resume trick, which made it look like Dolphin was at least capable of resolving it, if not the cause of the problem.

I also already talked to the CEF support forum and wasn't able to get much help since it caters to C/C++ projects.

In the future, I'll put library-specific issues in this repo's discussions.

Secondly, I know very little about the CEF and don't really see the point of attempting to integrate it now anyway, given that for a few years now there has been a functioning integration of the chromium-based WebView2 SDK in the standard image.

You're assuming I don't have reasons for using CEF and haven't already tried WebView2 and found it lacking.

WebView2 doesn't have off-screen rendering, which means that it's impossible to draw widgets on top of a web page, and requires using some sort of window or screen capture tool for a task as simple as generating a thumbnail for web page previews. Additionally, I also need to be able to hook into DOM events in order to detect what type of element is currently under the mouse. Finally, I'm using CEF to build a browser with more advanced controls than any existing browser, so can't afford to give up lower level control for convenience. Even CEF is annoyingly limited in some areas.

Since the initialize call is sending callbacks, it could well be doing some synchronisation through the message queue (e.g. to dispatch back to the main thread from a subsidiary thread), and may require that the message loop is calling that function even when cef_initialize is called. But really that is just speculation.

How do I run Dolphin code concurrently with an external call? As far as I can tell, the entire image is blocked waiting for cef_initialize, even if I call it from a forked process.

I do think you're right about it being related to the message loop.

There are at least three different ways to handle the message loop in CEF:

Tell CEF to create a separate thread for the message loop using the multi_threaded_message_loop settings parameter sent to cef_initialize.
Start the message loop from the main thread using cef_run_message_loop.
Manually pump the message loop using cef_do_message_loop_work.

So far I've been using the first and it's worked fine as long as I use a workaround to avoid the hang in the call to cef_initialize. If I try using the second, the VM throws a divide by zero exception on the call to cef_run_message_loop. And I haven't spent too much time on the third one yet since the documentation anti-recommends it and the first call to cef_do_message_loop_work throws a memory error.

Re-reading the documentation now, option 3 is supposed to be used when "CEF's message loop must be integrated into an existing application message loop". Is there any reason why this would be the case in Dolphin?

To do this sort of thing you really need to be prepared to work at the C level and run Dolphin under the VS debugger and break in when it locks up to see what is going on.

I'll bite the bullet eventually. This does make me yearn for [Klein's VM](https://www.researchgate.net/publication/200040544_Constructing_a_Metacircular_Virtual_Machine_in_an_Exploratory_Programming_Environment). I hate leaving Smalltalk!

blairmcg commented 5 months ago

Break key doesn't work when the external call is non-overlapping, which is definitely a limitation if not a bug of Dolphin.

Yes, it is a limitation. It's not really possible to interrupt some arbitrary call safely unless you are a debugger debugging the process. Imagine you interrupt something inside, for example, a critical section, leaving it locked. The chances of killing the process are high. Whilst that might not matter if your process is locked up anyway, it wouldn't be helpful otherwise. Consequently the interrupt sequence is only intended for interrupting the execution of Smalltalk code where the VM knows the safe interrupt points.

You're assuming I don't have reasons for using CEF and haven't already tried WebView2 and found it lacking.

WebView2 doesn't have off-screen rendering, which means that it's impossible to draw widgets on top of a web page, ...

Fair enough.

and requires using some sort of window or screen capture tool for a task as simple as generating a thumbnail for web page previews.

There is ICoreWebView2::CapturePreview

Additionally, I also need to be able to hook into DOM events in order to detect what type of element is currently under the mouse. Finally, I'm using CEF to build a browser with more advanced controls than any existing browser, so can't afford to give up lower level control for convenience. Even CEF is annoyingly limited in some areas.

Yes, for that sort of thing WebView2 expects you to inject javascript that runs out of proc in the browser process.

How do I run Dolphin code concurrently with an external call? As far as I can tell, the entire image is blocked waiting for cef_initialize, even if I call it from a forked process.

I do think you're right about it being related to the message loop.

There are at least three different ways to handle the message loop in CEF:

Tell CEF to create a separate thread for the message loop using the multi_threaded_message_loop settings parameter sent to cef_initialize.

Start the message loop from the main thread using cef_run_message_loop.

Manually pump the message loop using cef_do_message_loop_work.

So far I've been using the first and it's worked fine as long as I use a workaround to avoid the hang in the call to cef_initialize. If I try using the second, the VM throws a divide by zero exception on the call to cef_run_message_loop. And I haven't spent too much time on the third one yet since the documentation anti-recommends it and the first call to cef_do_message_loop_work throws a memory error.

Re-reading the documentation now, option 3 is supposed to be used when "CEF's message loop must be integrated into an existing application message loop". Is there any reason why this would be the case in Dolphin?

An overlapped call will still block the calling Smalltalk process but allowing other Smalltalk processes to continue running. The call is marshalled to another thread for the duration of the call, and then the result marshalled back when it returns. While that happens, the calling process is waiting. This keeps the programming model simple (synchronous), and really it is designed to be used from background processes. In most cases there is nothing to be gained by making overlapped calls from the main Smalltalk UI process. Doing so will block the main process and so stop the main message loop. Overlapped calls from background processes allowing the message loop to keep pumping.

I mention this because if the cef_initialize method is sending callbacks on threads other than the main Dolphin thread, then as I described above Dolphin must marshal them over to the main thread. To do this it sends a (windows) message to the main thread. For the message to be received, the message loop needs to be pumping. If your main Dolphin process is blocked on the call to the cef_initialize, then it won't be able to receive any marshalled callbacks from foreign threads. However, while an overlapped call is waiting because the Dolphin thread is still processing Smalltalk, the VM is able to interrupt the wait side of the call in the main thread. Note that it doesn't interrupt the overlapped thread - it will queue an async procedure call to raise an exception if the calling process is terminated, but this will only be processed if/when the overlapped call thread enters an alertable wait state. It will definitely enter an alertable wait state when the called code returns, but most functions one calls do not enter alertable wait states (reading the Microsoft docs will tell you what windows functions wait in an alertable way). Anyway, the attempt to handle the user break by displaying a walkback will cause the Dolphin session manager to start another UI process, resuming the pumping of messages. This might explain the behaviour you see - i.e. if you press break, it wakes up and allows the call to complete. If that is what is happening then just making an overlapped call to cef_initialize from a background process other than the UI process should work. You'll have to design a mechanism to handle the asynchronous initialization so that you can continue with whatever else you have done, but there are a number of existing ways to achieve that, e.g. by queuing a deferred action from the background process when the cef_initialize call returns. The deferred actions are processed in the main UI process.

I'd certainly expect that calling cef_initialize with an overlapped call would send callbacks on a foreign thread - even if it isn't multi-threading, it has been invoked on the overlapped call thread, so would be expected to send callbacks on that thread. Dolphin can't handle these directly, and has to marshall them back as described. It seems likely that when configured to run in the multi-threaded mode you describe, that it might send callbacks on still other threads that it creates. This could be established by running Dolphin under the VS debugger and then breaking in when everything locks up. You can check the running threads and look at the stacks to establish whether there are any callbacks into Dolphin that are being marshalled.

Alternatively I would use the cef_do_message_loop_work approach. Calling cef_run_message_loop will mean that Dolphin will lose the ability to do a number of things that rely on it having control of the message loop, such as:

quiescing properly when idle (pretty important when running your laptop on the battery)
processing accelerator/shortcut keys
scheduling background processes will be choppy, and may not work at all if idle (WM_ENTERIDLE) messages are not received.

You can get an idea of this by opening up a file open dialog from a workspace. The other Dolphin windows will continue to function, but accelerator keys won't work. Also if you fire off a background process that makes a sound at regular intervals or prints something to the Transcript, you'll be able to see some of the effect on process scheduling, although the Windows system dialogs are well behaved and send WM_ENTERIDLE, etc.

The divide by zero, if a floating point division by zero, might be caused by an actual FP zero divide in the CEF code. Dolphin unmasks most FP exceptions by default, but this can be changed. Since FP exceptions are masked by default in C/C++programs, it isn't unusual to encounter code that generates FP exceptions. The programmers may not even have noticed, or they may not care because they expect the exceptions to be masked.

The memory error sounds like you have something wrong in your FFI, either an incorrectly sized or initialized structure, or you are passing incorrect parameters.

I'll bit the bullet eventually. This does make me yearn for Klein's VM. I hate leaving Smalltalk!

For some complex C/C++ APIs that are multi-threaded you really needed a multi-threaded VM. Most Smalltalks do not have properly multi-threaded VMs. You could probably use C#/.NET successfully. It is multi-threaded and high performance. Personally I'd probably create a C/C++ layer to do the low-level interfacing and which exposed a higher level abstraction into Smalltalk. COM is quite good for the interface between the two. Back in the 90's we built a plugin to Netscape that worked exactly this way to enable us to run Smalltalk applets (at the time everyone was enamoured of java applets).

JBetz commented 5 months ago

If that is what is happening then just making an overlapped call to cef_initialize from a background process other than the UI process should work.

Yes, that worked. Thanks!

Although I am confused since your previous comment said that using an overlapped call would definitely be wrong.

Can you clarify whether there is anything wrong with this solution?

I'd certainly expect that calling cef_initialize with an overlapped call would send callbacks on a foreign thread - even if it isn't multi-threading, it has been invoked on the overlapped call thread, so would be expected to send callbacks on that thread. Dolphin can't handle these directly, and has to marshall them back as described. It seems likely that when configured to run in the multi-threaded mode you describe, that it might send callbacks on still other threads that it creates.

What I still don't understand is why multiple callbacks come through with a non-overlapping call to cef_initialize rather than zero. How would those have managed to make it to Dolphin if the message loop is stalled waiting for a response? Is there a mechanism for resuming the message loop when callbacks are made on the main thread? And if so, why doesn't the same thing happen for threads created by CEF?

Update: I reread your comment and I think I understand now. When the callbacks are made on the main thread, there is no need for marshalling, meaning no need to go through the message loop. So that's why it's only a problem for callbacks on non-main threads.

blairmcg commented 5 months ago

Yes, that worked. Thanks!

Great.

But I'm a bit confused since your previous comment said that using an overlapped call would definitely be wrong.

Yes, I did say that didn't I. This was based on...

the only advice I was given was to make sure the call to cef_initialize happens on the main application thread

...which would not be the case if you overlapped it. However, since you are using the multi-threaded message loop, the advice you received from CEF may not be relevant.

Can you clarify whether there is anything wrong with this solution?

I think the answer to that hinges on whether it actually matters that the cef_initialize call happens on the main app thread when you are using the multi-threaded loop. I can't say for sure (you'd have to ask them, or spend some time inspecting/debugging the cef code), but obviously it is designed to handle the fact that the main apps windows will be serviced by a separate message loop, so I suspect it is OK.

Having the parent and child windows owned by separate threads does create some synchronisation complexities (and additional overhead), but hopefully this is something they've worked through - it is after all the whole point of that option to run independently. And in any case, the Chromium code is running in entirely separate processes.

From a Dolphin perspective it should be OK, although I'd be on the lookout for, e.g., things failing to paint correctly due to timing issues.

Update: I reread your comment and I think I understand now. When the callbacks are made on the main thread, there is no need for marshalling, meaning no need to go through the message loop. So that's why it's only a problem for callbacks on non-main threads.

Correct.

JBetz commented 5 months ago

I'm running into some issues with the multi-threaded message loop that's making me want to investigate adding the C/C++ layer.

Namely, I get errors when trying to call methods on certain CEF objects from Dolphin if it's being executed on the "wrong" thread. And apparently even the CEF sample application uses a trick of creating a hidden Window to get around this problem. Which sounds horrible and not at all appealing as a solution.

That being said, I'm not really understanding how a C/C++ layer would help.

For some complex C/C++ APIs that are multi-threaded you really needed a multi-threaded VM. Most Smalltalks do not have properly multi-threaded VMs. You could probably use C#/.NET successfully. It is multi-threaded and high performance. Personally I'd probably create a C/C++ layer to do the low-level interfacing and which exposed a higher level abstraction into Smalltalk.

Would it mean that all direct CEF calls would need to happen in C/C++? The surface area of CEF's API is huge, hundreds of methods and callbacks, so that doesn't sound like it would scale.

I also asked the CEF developer about this and he said:

[The C/++ layer] might (for example) marshal all calls to the main smalltalk thread instead of calling directly into smalltalk from those threads. This should be fine for callbacks that have no completion or complete asynchronously. Anything that completes synchronously (e.g. needs to immediately return a value) will need some careful thinking/planning around.

Which sounds like a much more focused solution, but I'm not seeing how it solves the problem since you already said that Dolphin marshals all callbacks to the main thread anyway.

Creating a bottleneck for all callbacks sounds like the right idea, but I'd think it'd need to be the other way, for all Dolphin calls to be forced to run on whatever CEF's definition of a main thread is. Though again, don't really see how to achieve this short of replicating the entire CEF API and exposing it to Dolphin via an executable.

JBetz commented 5 months ago

Having the parent and child windows owned by separate threads does create some synchronisation complexities (and additional overhead), but hopefully this is something they've worked through - it is after all the whole point of that option to run independently. And in any case, the Chromium code is running in entirely separate processes.

I think the hidden Window trick is how they worked through it? Or around it, rather.

I guess if you're just extending the C++ code it's a ready made solution, but trying to replicate that behavior in Dolphin sounds like a bad idea.

blairmcg commented 5 months ago

I get errors when trying to call methods on certain CEF objects from Dolphin if it's being executed on the "wrong" thread

That is probably because you are calling the cef_initialize method from an overlapped call (i.e. not from the main thread), and there is clearly some thread affinity in CEF. It may explain the advice to ensure you call cef_initialize from your main app thread.

I'm not seeing how it solves the problem since you already said that Dolphin marshals all callbacks to the main thread anyway.

Yes. It is suggesting something similar to what you already have. Dolphin's marshalling protects the Smalltalk code from executing on a foreign thread. However it sounds like the suggestion is to "post" asynchronous messages, and presumably only call back into CEF from the C++ code. With the Dolphin arrangement I think you have now if you call back into CEF from one of those callbacks into Smalltalk that was marshalled over by the VM (i.e. re-entrantly), you will be calling CEF from a different thread than it expects.

Referring back to:

even the CEF sample application uses a trick of creating a hidden Window to get around this problem. Which sounds horrible and not at all appealing as a solution.

It may sound ugly, but actually it is a common pattern for marshalling between threads in a Windows application, especially where you need to synchronise with a UI. Actually it is such a common pattern that there is a specific type of non-visible window you can create for this purpose, message-only windows. This is the purpose of the HWND_MESSAGE constant in their example code. The window that the Dolphin VM uses for marshalling calls back to the main thread is created the same way.

Where you have a main app thread that is running a message loop, a message-only window provides you with much of the mechanism you need to be able to synchronise calls, A SendMessage call will deliver a message to the thread owning the target window when it is ready to receive them (i.e. it calls one of the functions to retrieve a message from its queue), and to wait synchronously for the result. Some of COMs threading models work this way, and it is the way that Dolphin marshals callbacks back to its main thread. In other words this is already in place, and as you suspected is similar to the suggestion you received from the CEF developer, except that it is happening synchronously. SendMessageCallback, SendMessageNotify, and PostMessage are various non-blocking/asynchronous alternatives for different situations.

It is difficult to make a judgement as to the best approach without a full understanding of the thread affinity that exists in CEF. One of the nice things about COM is that it provides a set of explicit threading models and the necessary plumbing to support marshalling and thread synchronisation for those. In practice this means that if you are an apartment threaded COM object the framework ensures you are called only on the correct thread. This is probably one of the reasons that Microsoft's integration of chromium in WebView2 is exposed as a COM object model (the other being that this allows for any language with a good COM binding to use it). Most frameworks/libraries don't have that level of explicit up-front design of their threading model. In the absence of any obvious documented threading model for CEF, safest would be to just assume you always need to call it on the same thread as the thread you initialized it from (i.e. treat it as apartment threaded, but where you are responsible for ensuring the calls are only made from the apartment). If it calls you from other threads it sounds as though you need to ensure that any re-entrant calls to it are from that thread. That may not be possible to achieve directly in Smalltalk in Dolphin and may require an intervening layer with the ability to run code in other threads.

Three potential solutions would be:

Stop using the multi-threaded loop option in the initialize call and use one of the others. As I said before there is really only one of the other choices you mentioned that will not upset background processing, quiesce on idle, keyboard shortcuts, etc, in Dolphin, i.e. cef_do_message_loop_work.
Continue with the multi-threaded message loop option, but vector all calls that you do to CEF through the same Smalltalk process that made the original overlapped cef_initialize call. All the CEF calls would need to be overlapped. This will ensure that all the CEF calls are made on the same overlapped call thread (because when a Smalltalk Process first makes an overlapped call it is associated with a real OS thread that it owns for making further overlapped calls for the rest of its life). It is perhaps unfortunate that you can't make the overlapped cef_initialize call from the main thread, as this would avoid the need to use multiple Processes in Dolphin, but likely even if that did work you would encounter similar issues where CEF calls back from some overlapped call you make into it, but the main Dolphin process is then blocked on the overlapped call and not processing messages, so it can't receive the marshalled callback (i.e. deadlock).
Build a COM binding around it using something that is OS multi-thread capable, e.g. C++. The COM objects themselves would be apartment threaded (i.e. must always be called from the thread that created them, which COM enforces). Any callbacks would need to be defined as COM interfaces that Dolphin would implement. Dolphin's COM framework effectively declares all the COM objects it implements as apartment threaded and on the main thread, so COM will marshals call-ins to those objects through the main thread message queue, regardless of which thread they originate from.
An entirely out-of-proc solution using an IPC mechanism of your choice. You could find some existing working integration in C++, e.g. their sample, and add in your own API to talk to it, e.g. through a socket. Iterate over this to remove the stuff you don't need (e.g. visible UI) and add the stuff you do (e.g. off-screen rendering).

Considering these in reverse order:

(4) This is appealing in the sense that it keeps things isolated and you could start with an existing standalone sample application and increment from there. Ultimately this is arguably just a variant of (3) - if COM were used for the APIs it really could be a variant. Marshalling across processes is of course expensive, and that is already happening with chromium so you'd just be adding a further level. In the end it seems like just a simpler way to start on (3), but perhaps that is useful for approachability. (3) would allow for a robust solution, but would also be a lot of work even to get off the ground, and a steep learning curve too. You'd need a deep understanding of COM and CEF to be successful (although the latter may be needed anyway). Given sufficient time and expertise, this would probably result in the best solution. It's like your own version of WebView2 or WebKitX but with the features you need, and presumably not those you don't. (2) seems like an unreliable solution to me. It may beguile with initial success, but turn out to have insurmountable problems. Given the need to marshal back and forth between the Dolphin main thread, overlapped threads, and potentially other threads created by CEF, I'd not be surprised to encounter deadlocks and further multi-threading issues. (1) keeps things simple, and is really the best way to integrate into an app that is essentially single (OS) threaded and which needs to keep control of the message loop, i.e. Dolphin.

There may not be a feasible solution, either technically or at reasonable cost. To proceed and either create a solution, or establish that it is either impossible, too hard, or too expensive, you will need to be prepared to put in the effort to debug through the CEF code yourself at the C/C++ level using the VS debugger. If you build the CEF code in debug mode and ensure that version of the DLL is loaded by Dolphin, if you then just attach the VS debugger to the running Dolphin process you should be able to break into the CEF code, set breakpoints in it, etc. That will enable you to describe the problem you are having to the CEF developers specifically enough that they (or I) may be able to give a more actionable answer. Given the apparent complexity of CEF and that it is natively C++, this will be unavoidable eventually.

P.S. (5) use Smalltalk MT. For commercial purposes this might allow a true end-to-end multi-threaded Smalltalk solution.

JBetz commented 5 months ago

Okay, that helps. Thank you.

Admittedly, I've been too focused on the details of CEF's implementation and not on what the best architecture would be. But after talking and thinking it over with my business partner, I'm going to try get solution 1 working.

The reason being that I'm using both SDL and CEF, and both need to be assigned work, so the whole problem makes a lot more sense when thinking of it from the perspective of a CPU scheduler. And this is the only one that enables Dolphin to play that role.

Which brings me back to the divide by zero exceptions.

The divide by zero, if a floating point division by zero, might be caused by an actual FP zero divide in the CEF code. Dolphin unmasks most FP exceptions by default, but this can be changed. Since FP exceptions are masked by default in C/C++programs, it isn't unusual to encounter code that generates FP exceptions. The programmers may not even have noticed, or they may not care because they expect the exceptions to be masked.

Here's what I'm currently doing:

CefBrowserProcessRunner>>iterate
    | cefMask currentMask |
    cefMask := Float exceptionMask maskSet: CRTConstants._EM_ZERODIVIDE.
    currentMask := Float exceptionMask: cefMask.
    [self library doMessageLoopWork] ensure: [Float exceptionMask: currentMask].
    [self iterate] postToInputQueue

And I no longer get divide by zero exceptions. Also, the memory error I mentioned previously also seems to be have vanished, which I think was an issue with the timing of the call rather than any of the ExternalStructure definitions.

But yeah, while setting and restoring the mask does work, it's not thorough since there might be other CEF methods that throw FP exceptions, not just cef_do_message_loop_work.

Is it feasible to set the ZERODIVIDE mask for the entire process whenever CEF is used? Or would I be better off finding some way to configure my CefLibrary class such that all method calls do the equivalent of the excerpt above?

blairmcg commented 5 months ago

Is it feasible to set the ZERODIVIDE mask for the entire process whenever CEF is used? Or would I be better off finding some way to configure my CefLibrary class such that all method calls do the equivalent of the excerpt above?

Yes. By default invalid operation and zero divide FP exceptions are unmasked partly for historical reasons, but mostly because that can help to detect bugs in code under development. However, Dolphin 7+ should work fine using FP continuations (replacement values) instead. This is already the case for overflow, underflow, and inexact results (since those exceptions are masked), but will also work for divide-by-zero and invalid operations E.g.

Processor activeProcess setFpControl: CRTConstants._EM_ZERODIVIDE mask: CRTConstants._EM_ZERODIVIDE.
1.0/0.0 "-> Float.Infinity"

You'll get +/- infinity for zero divide, and (if also masked) NaNs for invalid operations.

The exception mask is held on a per-process basis, and the VM transfers to the CPU as part of Process switches. When set on the main UI process the value should persist if you start a new main process (e.g. to debug the old one). You can also change the Process.DefaultFpControl constant to apply universally if you want. See the class comment of FloatingPointException for a detailed explanation.

For hooking in the call to cef_do_message_loop_work, you might want to look at hooking this off the idle mechanism (browse defs/refs to Presenter>>onIdleEntered). This should allow the system to quiesce properly, but might not call the hook often enough. You'd need to experiment to see.

JBetz commented 5 months ago

You'll get +/- infinity for zero divide, and (if also masked) NaNs for invalid operations.

Okay, I masked ZERODIVIDE and INVALID exceptions, but now I'm getting some weird and seemingly disconnected IDE errors.

In IdeaSpace tabs, the toolbar with the close button is sometimes greyed out and can't be selected.
Searching by text becomes incredibly slow, or more likely just hangs.
Various dialogs are broken. For example, the Ok button in the rename method dialog is greyed out and can't be selected even after changing the name. And a similar problem when creating a new class.

The first only seems to happen after closing and reopening the image. Any existing IdeaSpaces will have disabled toolbars, and any newly created tabs will have a working toolbar. However, the refactoring and class creation dialogs will still have the aforementioned problems even in tabs with working toolbars.

Any idea what the root cause of all this could be?

JBetz commented 5 months ago

Actually, they all seem to only happen after closing and reopening the image, which makes me think that something is going wrong in the CEF shutdown procedure.

JBetz commented 5 months ago

Okay, I fixed CEF's shutdown handling and the IDE issues seem to be resolved. The CefBrowserProcessRunner>>iterate loop was not getting stopped correctly on image shutdown and would resume when the image was reopened.

In the process, I also refactored both the CEF and SDL message loops to use a separate Dolphin process rather than BlockClosure>>postToInputQueue. That enables nice()-esque behavior since now they can raise and lower their priority as needed. Also, more visibility since they both show up in the Process Monitor. And from CEF's perspective, this shouldn't look any different since all Dolphin processes run on the same Win32 thread anyway. Nice.

blairmcg commented 5 months ago

Okay, I fixed CEF's shutdown handling and the IDE issues seem to be resolved. The CefBrowserProcessRunner>>iterate loop was not getting stopped correctly on image shutdown and would resume when the image was reopened.

That makes sense because the symptoms you were seeing of toolbar buttons not enabling is caused by UI idle starvation. Revalidation of toolbar commands requires that the UI get some time to run "idle" tasks when the queue is empty. If the queue doesn't empty, because you keep pushing things to it, that won't happen. BTW: This doesn't mean you would be able to run a command that shouldn't be enabled, because enablement is always checked before running a command. It just means that the visible UI cues are not updating.

In the process, I also refactored both the CEF and SDL message loops to use a separate Dolphin process rather than BlockClosure>>postToInputQueue. That enables nice()-esque behavior since now they can raise and lower their priority as needed. Also, more visibility since they both show up in the Process Monitor. And from CEF's perspective, this shouldn't look any different since all Dolphin processes run on the same Win32 thread anyway. Nice.

Yes, it will be the same OS thread, but be aware that if you call that CEF "do work" function from other than the Smalltalk UI process, you'll be doing it out of sync with the message loop, and potentially in the middle of processing some other windows message, which may cause unpredictable behaviour. It is better to push UI activity into the message queue so that it happens interleaved with other UI activity and not in the middle of processing some other message. Posting to the input queue from your background thread would be fine. You have two options there:

Deferred actions, e.g. using BlockClosure>>postToInputQueue. These are processed only when the Windows message queue is empty, so are effectively lower priority that windows messages.
Interleaved actions, e.g. using BlockClosure>>postToMessageQueue. These are processed in the order they are inserted because they do actually post a message into the Windows queue. Any subsequently arriving Windows messages will be processed later, so these actions are effectively the same priority as other Windows messages.

(2) may be more suitable for your purposes. Either way if you continually queue requests to run the CEF work you will prevent the system quiescing so that it never properly goes idle. They do include a caution about this in the comment:

When using this function care must be taken to balance performance against excessive CPU usage

Unfortunately it may be difficult to get the balance right if they don't provide any kind of indication both synchronously and asynchronously as to when it has work to do, and it looks like they don't. Dolphin calls the Windows API MsgWaitForMultipleObjectsEx to wait for the arrival of new windows messages when it finds the message queue empty and it has no other work to do (no other Processes want to run), or signalling of an event that the VM uses to wake thinks up when one of its background OS thread requires (e.g. timer thread signals the end of a delay). This happens from InputState>>#idle, which is called from the idle process, which runs at priority 1 specifically so it only runs if nothing else wants to. This effectively puts the process to sleep at the OS level so it is a good citizen and doesn't consume power. It is difficult to see how you would integrate with this without a return value that tells you there is more work to do, and a Windows kernel synchronisation object they signal to indicate the arrival of work when you are sleeping. The instruction that "care must be taken to balance performance against excessive CPU usage" is not very helpful if there is no option but to poll in a loop.

I don't know whether it would matter if you queued a bunch of requests to the UI process when it is busy and not servicing the queue for a while. These would then "arrive" (get processed) all at once like a bunch of late buses when the UI process unblocks. If it does matter (and it wouldn't if repeated calls to the CEF function handled quickly if it has no work and just returns immediately) you could either avoid queueing a new request when the last is not processed yet, or discard all but the latest request when you process them by capturing the value of some monotonically increasing counter when queueing the request and checking whether the counter has advanced when actioning the request. Something like that.

JBetz commented 5 months ago

(2) may be more suitable for your purposes. Either way if you continually queue requests to run the CEF work you will prevent the system quiescing so that it never properly goes idle. They do include a caution about this in the comment:

Good to know. I'm expecting it will take some time to find the perfect balance of thread priority, call rate, and where to queue actions, but at least now I'm aware of the configuration space.

Unfortunately it may be difficult to get the balance right if they don't provide any kind of indication both synchronously and asynchronously as to when it has work to do, and it looks like they don't.

There is a callback named on_schedule_message_pump_work that CEF uses to tell the application when it has work to do, whether immediately or in the future.

The callback includes a delay value that represents the number of milliseconds before the application should call cef_do_message_loop_work . If it's zero, then it needs to happen immediately, otherwise, the application can wait for however long before the next call. The example application also uses a timer to ensure that it's called at least 30 frames per second, so it doesn't eliminate the need for polling altogether.

Problem is that every time I've enabled this callback, the image locks up almost immediately after opening a browser and doesn't respond to the break key. And if I add a transcript log inside the callback, it only gets called about a dozen times before locking, each with a delay value of 0. So the impression is that it does more harm than good, assuming that the lock is happening due to Dolphin being overwhelmed by callbacks and not some other reason.

The only unusual thing about the callback is that it has a int64_t / sqword parameter, but I've never had any issues with them in Dolphin before, and they seem to be used frequently in the WebView2 bindings.

All this being said, I'm not entirely sure I want to use this callback, even if it worked correctly. Really, I'd rather it only be called when the delay is over some meaningful threshold, and not constantly like an overbearing child. It ends up dragging the architecture to something more resembling cef_run_message_loop, where CEF takes the reins. And that was always the worst of the three possibilities offered by CEF.

blairmcg commented 5 months ago

Problem is that every time I've enabled this callback, the image locks up almost immediately after opening a browser and doesn't respond to the break key. ...

The comment does say that it can be called on any thread. There is deadlock potential caused by the marshalling from those foreign threads. You'd have to be very careful not to call back to CEF while responding to this callback, not just from within the callback itself, but from any other activity you having going on in other Processes that might for some reason run before the callback returns. Let's say you receive one of those callbacks, and before returning you call some cef function that causes another such callback to be triggered. There is a high probability of a deadlock, or just blocking all the cef threads in turn, because the current cef callback thread is waiting on the original callback to return.

And if I add a transcript log inside the callback, it only gets called about a dozen times before locking, each with a delay value of 0. So the impression is that it does more harm than good, assuming that the lock is happening due to Dolphin being overwhelmed by callbacks and not some other reason.

That does sound rather like it has a pool of threads that it sends the callback on and they get progressively blocked, but that is just a guess. It may just be that there is an overwhelming flood. Callbacks into Dolphin are relatively expensive anyway. Marshalling between threads even more so.

The only unusual thing about the callback is that it has a int64_t / sqword parameter, but I've never had any issues with them in Dolphin before, and they seem to be used frequently in the WebView2 bindings.

Correct.

All this being said, I'm not entirely sure I want to use this callback, even if it worked correctly. Really, I'd rather it only be called when the delay is over some meaningful threshold, and not constantly like an overbearing child. It ends up dragging the architecture to something more resembling cef_run_message_loop, where CEF takes the reins. And that was always the worst of the three possibilities offered by CEF.

Depends how you handle it. The right solution here would probably be to write a little C++ DLL to receive the multi-threaded calls and translate them into either APCs posted to the Dolphin thread, or (simpler) post windows messages, or (lowest overhead, but loses the timing argument so you'd have to find another way to transmit that) signal a Win32 IPC sync object such as an event that can be added to the set of events that Dolphin waits on when it goes idle. Some hybrid of the Win32 event to wake up from slumber and a postmessage to pass through the desired delay might work. Seems like a detail for later anyway.

JBetz commented 4 months ago

The comment does say that it can be called on any thread. There is deadlock potential caused by the marshalling from those foreign threads. You'd have to be very careful not to call back to CEF while responding to this callback, not just from within the callback itself, but from any other activity you having going on in other Processes that might for some reason run before the callback returns. Let's say you receive one of those callbacks, and before returning you call some cef function that causes another such callback to be triggered. There is a high probability of a deadlock, or just blocking all the cef threads in turn, because the current cef callback thread is waiting on the original callback to return.

Okay, that would explain it.

With the way I've set things up, all of the CEF calls should be happening on the Main process, since the message loop processes for both CEF and SDL are using postToMessageQueue on blocks that run the actual work. So I don't think the calls could be coming from other Dolphin processes. More likely is that another CEF callback comes in before on_schedule_message_pump_work returns.

The right solution here would probably be to write a little C++ DLL to receive the multi-threaded calls and translate them into either APCs posted to the Dolphin thread, or (simpler) post windows messages

This sounds like a smaller scale version of the third solution you proposed in this comment? Where the thread routing is limited to a single callback handler rather than the entire interface.

It might be something I revisit at the Make It Fast stage but I think my current architecture is good enough. I'm still getting invalid memory access errors every now and then. However, they are almost certainly issues that will need to be addressed on a case by case basis and not an issue with the overall architecture of the Dolphin + CEF integration. I'll close the ticket with this comment to reflect that.

Again, I appreciate the support. I almost certainly will have further questions, but as I said, I think the biggest issues are resolved and hopefully the rest will just be details and optimizations.