dolphinsmalltalk / Dolphin

Dolphin Smalltalk Core Image
MIT License
294 stars 58 forks source link

"Invalid access to memory location" closing windows #1283

Closed daniels220 closed 2 weeks ago

daniels220 commented 5 months ago

Describe the bug When I close a window, sometimes I get an "Invalid access to memory location" walkback right after the window closes. I can terminate it and nothing seems broken afterward. Looking at the stack, the problem seems to be that a view is responding to a WM_PAINT after it is destroyed, and ends up calling defaultWindowProcessing: with a NULL handle.

As far as I can tell, this only—or at least mostly—occurs when:

It happens about 5-10% of the time when these conditions are met—most often closing an unsaved workspace document, but any browser with unsaved method changes can probably be affected too. The View Composer is also particularly bad—that may be a separate I vaguely thought it happened more often with windows that have been open for a while, but that may be a red herring.

I am using a version of Dolphin based on circa 7.1.16 with a bunch of patches applied, so it's possible this is my fault—I wanted to see if you've experienced it as well, or know that this is fixed in the latest.

To Reproduce Steps to reproduce the behavior:

  1. Open a workspace and type something
  2. Close it and answer "No" to the save-changes prompt
  3. Repeat until the error occurs (it can be pretty erratic, but you also will have likely seen it in normal usage if stock Dolphin is affected)

Please complete the following information):

Additional context The most common stacktrace looks like:

Unhandled exception - a GPFault('Invalid access to memory location.  Reading 0x8, IP 0x6D400415 (C:\WINDOWS\SYSTEM32\D2D1.DLL)')

ProcessorScheduler>>gpFault:
[] in ProcessorScheduler>>vmi:list:no:with:
BlockClosure>>ifCurtailed:
ProcessorScheduler>>vmi:list:no:with:
UserLibrary>>callWindowProc:hWnd:msg:wParam:lParam:
ScintillaView(ControlView)>>defaultWindowProcessing:wParam:lParam:
ScintillaView(ControlView)>>wmPaint:wParam:lParam:
ScintillaView(View)>>dispatchMessage:wParam:lParam:
[] in InputState>>wndProc:message:wParam:lParam:cookie:
BlockClosure>>ifCurtailed:
GuiInputState(InputState)>>wndProc:message:wParam:lParam:cookie:
GuiInputState(InputState)>>pumpMessage:
GuiInputState(InputState)>>loopWhile:
GuiInputState(InputState)>>mainLoop
[] in GuiInputState(InputState)>>forkMain
ExceptionHandler(ExceptionHandlerAbstract)>>markAndTry
[] in ExceptionHandler(ExceptionHandlerAbstract)>>try:
BlockClosure>>ifCurtailed:
BlockClosure>>ensure:
ExceptionHandler(ExceptionHandlerAbstract)>>try:
BlockClosure>>on:do:
[] in BlockClosure>>newProcess:

Sometimes it will originate from Reading 0x524D4FFB, IP 0x6E79779E (C:\Program Files (x86)\Dolphin Smalltalk 7\SciLexer.DLL) or Writing 0x83AE35C2, IP 0x83AE35C2 (C:\Program Files (x86)\Dolphin Smalltalk 7\Dolphin7.exe) instead. The one time with an owner-drawn button was:

Unhandled exception - a GPFault('Invalid access to memory location.  Reading 0x0, IP 0x7658607E (C:\WINDOWS\System32\USER32.dll)')

ProcessorScheduler>>gpFault:
[] in ProcessorScheduler>>vmi:list:no:with:
BlockClosure>>ifCurtailed:
ProcessorScheduler>>vmi:list:no:with:
UserLibrary>>defWindowProc:msg:wParam:lParam:
ContainerView(View)>>defaultWindowProcessing:wParam:lParam:
ContainerView(View)>>dispatchMessage:wParam:lParam:
[] in InputState>>wndProc:message:wParam:lParam:cookie:
BlockClosure>>ifCurtailed:
GuiInputState(InputState)>>wndProc:message:wParam:lParam:cookie:
DrawnPushButton(ControlView)>>defaultWindowProcessing:wParam:lParam:
DrawnPushButton(ControlView)>>wmPaint:wParam:lParam:
DrawnPushButton(View)>>dispatchMessage:wParam:lParam:
...as above...
blairmcg commented 5 months ago

This is a duplicate of #928. You are missing c7b65b8bf70677f798561b3a1d7f0a89f6bcee33, which was in 7.1.21. Your base version is more than 3 years out of date.

daniels220 commented 5 months ago

Yes, I wondered if it might have been fixed but wasn't able to check. I did try searching commit messages and didn't see anything—I see now I would have needed to search gpFault: rather than the raw Invalid access to memory location error. Thanks for pointing me in the right direction, and sorry to bother you!

Although, I don't remember this being a problem in Dolphin 6, or even 7.0 (though that I'm less sure of). Do you have any idea what changed? I would wonder about an updated Scintilla, except that I have experienced it without Scintilla involved at all. The thing is, I can't say I'm entirely comfortable with the fix...it makes sense to keep message processing synchronized, but it also means that aWindow close; isOpen might answer true, which is surprising behavior. Would it make sense to at least do it postToMessageQueue rather than postToInputQueue? It seemed like you migrated most existing core users of postToInputQueue to use the new message-based mechanism when it first appeared—is there a reason to use the old mechanism here?

blairmcg commented 5 months ago

... I don't remember this being a problem in Dolphin 6, or even 7.0 (though that I'm less sure of). Do you have any idea what changed? I would wonder about an updated Scintilla, except that I have experienced it without Scintilla involved at all.

It was caused by the change to invoke Scintilla directly by calling its library function instead of through SendMessage. That Windows API prevented the issue being detected because it won't invoke the window procedure for closed windows.

Any other issues you are linking with it may or may not be related. Certainly the race condition that exists has the potential to be problematic, even if the AVs are unlikely due to the guards in the Windows APIs for calling window procedures through SendMessage, CallWindowProc, etc.

... The thing is, I can't say I'm entirely comfortable with the fix...

I can understand that because it is not ideal. This is a case of Bower's Principle of Equal And Opposite Frigs (and to be clear, he meant this in the 3rd sense listed in the Collins dictionary, distinctly not the others which actually I don't recognise as British English uses at all). The use of an overlapped call to invoke the message box is the original frig to workaround the fact that it takes control of the message loop. Deferring the close after such an overlapped message box is then a consequential frig.

As to why we put in place the original frig, architecturally Windows intends that all the windows associated with the UI thread that opens the message box are associated with that message box. Where there are multiple independent top-level windows, there really ought to be a UI thread for each of those, either that or you accept that a message box is really task modal and when one is open you have to respond to it before doing anything else. Since Dolphin is not truly multi-threaded (don't forget it was designed in a different era for Windows 95), multiple UI threads is not an option. We found blocking all the IDE windows to display a message to be a poor experience for developers, so we put in place the frig of invoking the MessageBox on an overlapped call. This blocks the Dolphin UI process so another one is started. This has a number of unwanted consequences, but is a trade off that is worthwhile in the IDE. In an application I would recommend disabling this trade off by opening all message boxes in task modal mode.

it makes sense to keep message processing synchronized, but it also means that aWindow close; isOpen might answer true, which is surprising behavior.

In theory, yes, but in practice it doesn't really matter. Closing a top-level window (as opposed to destroying it) is a request, not a deterministic action. You don't know that sending it close it will actually close the window ever It follows that you should also have no expectation that if it does close it will do so immediately before the call returns - to have that guarantee would impose that windows that are very slow to close have to block everything, whereas you might want to just initiate the process, e.g. to close some remote communication.

Also bear in mind that when your close actually returns in this case you are no longer the main UI process. If you take further actions conditionally based on whether the window closed or not, you are in danger of creating further race conditions. That is a possibility anyway as an unfortunate consequence of the original frig, but suggesting further guarantees of deterministic behaviour doesn't help with that. If you need to know that the window has closed in order to take further action, hooking that one of the close/destroy events is the right thing to do. This will ensure that the action happens in the new UI process.

Would it make sense to at least do it postToMessageQueue rather than postToInputQueue? It seemed like you migrated most existing core users of postToInputQueue to use the new message-based mechanism when it first appeared—is there a reason to use the old mechanism here?

Not really. postToMessageQueue is for situations where you require to happen immediately any currently queued messages have been processed, and before anything else that might be queued subsequently. It will also happen before any regular repainting of windows. WM_PAINT is a faked up message delivered only when the queue has been drained of other messages. postToMessageQueue here would prevent the underlying window from repainting after the message box closes before it is itself closed. On fast modern machines this may not matter anymore, but it used to be aesthetically unpleasing as you would get a flash of an unpainted box. In this case a "do this as soon as you can, and before you do anything else you haven't already started" request is not really necessary, and has negative effects. postToInputQueue seems a better conceptual fit in this case.

daniels220 commented 4 months ago

It was caused by the change to invoke Scintilla directly by calling its library function instead of through SendMessage. That Windows API prevented the issue being detected because it won't invoke the window procedure for closed windows.

Makes sense. I guess that safeguard is why this doesn't happen more often. I guess there must be something about the owner-drawn button code that slips a message through despite it.

I can understand that because it is not ideal. This is a case of Bower's Principle of Equal And Opposite Frigs (and to be clear, he meant this in the 3rd sense listed in the Collins dictionary, distinctly not the others which actually I don't recognise as British English uses at all). The use of an overlapped call to invoke the message box is the original frig to workaround the fact that it takes control of the message loop. Deferring the close after such an overlapped message box is then a consequential frig.

Hah! Yeah, I've thought similar things before when looking at the MessageBox code. Honestly it's amazing how well-contained such unintended consequences are in Dolphin, considering the age and cruftiness of Win32 and the impedance mismatches inherent in implementing a highly-dynamic language like Smalltalk on top of it. I probably don't say this enough—good work! :)

I've had the thought that it might make sense to replace the IDE usage of the actual MessageBoxEx function with a custom dialog—perhaps something similar to WalkbackDialog, i.e. using built-in templating and with very minimal dependencies—that emulates the look of a true MessageBox, but without hijacking the message loop. What do you think?

As to why we put in place the original frig, architecturally Windows intends that all the windows associated with the UI thread that opens the message box are associated with that message box. Where there are multiple independent top-level windows, there really ought to be a UI thread for each of those

Oooh, this makes sense as a design decision.

Since Dolphin is not truly multi-threaded (don't forget it was designed in a different era for Windows 95), multiple UI threads is not an option.

It's not just Dolphin...Pharo (the whole Squeak lineage) and AFAIK VisualWorks are both single-threaded as well. In fact Pharo only recently gained the threaded FFI functionality that Dolphin has had since at least D6. So I'd say you were ahead of your time, if anything! Seems like a great many high-level languages are either strictly single-threaded, have contention issues that make threading less useful than it could be (Ruby last I knew), or have severely limited thread semantics (e.g. web workers in JavaScript—though this sort of approach may be a good compromise favoring performance and safety over flexibility).

In an application I would recommend disabling this trade off by opening all message boxes in task modal mode.

Interesting...as it happens the application I work on favors many independent windows about as heavily as Dolphin does, so we're faced with the same tradeoffs. Fortunately it doesn't involve Scintilla, so we've not had issues even without the defer-close workaround.

Closing a top-level window (as opposed to destroying it) is a request, not a deterministic action.

I should definitely have specified "and nothing intervenes to abort the close process, i.e. nobody listening for #closeRequested: sets the value holder to false". But your broader point is fair enough, all the same. I suspect if it ever trips me up, it'll be in sloppily-written development utility code, not application code...so, not a big deal.

Not really. postToMessageQueue is for situations where you require to happen immediately any currently queued messages have been processed, and before anything else that might be queued subsequently. [...]

Nice, that helps me get a better grasp of the difference between the two—thanks!