dolphinsmalltalk / Dolphin

Dolphin Smalltalk Core Image
MIT License
301 stars 58 forks source link

Wait cursor often survives long-running operation #1154

Open jgfoster opened 2 years ago

jgfoster commented 2 years ago

I've noticed this frequently and finally took the time to dig into it further. I can reproduce it in my application and now can (often) reproduce it in the Dolphin IDE. I open an IdeaSpace, add two system browsers on Object>>??, then close the second one. At least half the time the busy cursor shows up and never goes away until I move the mouse over something that has a non-default cursor (such as a text field with an I-beam cursor).

blairmcg commented 2 years ago

I think this is 99% certain to be an external bug, presumably in VMWare Fusion. For further details see my comments in #1155.

jgfoster commented 2 years ago

The problem has been reproduced on a bare-metal Windows install so it is not due to VMware Fusion. In order for the external designation to remain, could you suggest where to look next?

jgfoster commented 2 years ago

See the following for the problem:

https://user-images.githubusercontent.com/1577872/143784699-bb2171b0-1ce3-4509-9e45-2270e5ead880.mp4

blairmcg commented 2 years ago

The problem has been reproduced on a bare-metal Windows install so it is not due to VMware Fusion. In order for the external designation to remain, could you suggest where to look next?

Well my comments about the circumstances under which this can occur still stand, i.e. it can only really be the case that a cursor will persist if: 1) You have a Window of a Window class that does not have a cursor handle associated with it. This should not be the case for normal Dolphin windows as they are all of the same window class. The Windows controls are different classes of course. The Windows behaviour where it will reset the class cursor is well documented, and also easily demonstrated by evaluating the following. When the shell window appears, be sure that the mouse cursor is over it, moving it if necessary, then wait. The wait cursor will appear. Now move the mouse. It will immediately disappear and revert to the arrow even though it has not been explicitly reset.

[Shell show. 5 seconds wait. [Cursor wait setCursor] postToInputQueue] fork.

The cursor is never reset (i.e. the windows API is not called again), but that doesn't matter. Windows will just change it back unless the WM_SETCURSOR query is setting something different. So essentially this is not a likely explanation, but brings us on to (2), a much more likely cause. 2) Cursor.Current is not actually nil. It may appear to be when you inspect it or show its value, but any command you run will change and then reset it to its prior value at the end of Cursor>>showWhile:, so be careful it is what you think it is.

A possible scenario by which the global Current.Cursor state could end up left as the wait cursor is as follows:

  1. UI process runs a command, standard command invocation runs the operation in a Cursor>>showWhile:
  2. Invoked command starts another process and this performs UI operations (which should never be done other than from the UI process) including an invocation of Cursor>>showWhile: that does not exit until the command from (1) has finished. When this 2nd Cursor>>showWhile: was started, the value of Cursor.Current would have been the wait cursor (from 1), so when it exits that is what it will restore to Cursor.Current.

This can be simulated by evaluating:

[Cursor wait makeCurrent] postToInputQueue

If I do this, however, what I see still doesn't look like your video as the Cursor doesn't change back over any of the Dolphin windows, except the scintilla controls where it flickers between the ibeam and wait, and the splitters.

To restore sanity:

[Cursor.Current := nil] postToInputQueue

Maybe the framework does something like this, but it seems unlikely given that this probably hasn't been changed in more than 20 years. It is possible of course, but seems very unlikely that such a bug could exist for such a long time without being reported before. I can't reproduce it by following your steps either. Without a repro, there is not much more I can suggest other than inserting some diagnostic tracing into the onGetCursor handling so see what cursor is being set as the mouse moves. You can try inspecting the messages using Spy++ from the VS2019 tools menu too. If dumping to the Transcript is too slow, try using running under the VS debugger and dumping to a DebugTraceStream.

In your video are you running with a debug VM? The UI seems very slow. Also what exactly do you mean by a bare metal Windows install (what version of Windows, what hardware)?

blairmcg commented 2 years ago

As background, this old Raymond Chen blog post is useful in understanding how the cursor setting process works in Windows and why the class cursor will override any previously set cursor if default processing happens: What is the process by which the cursor gets set?

It is fairly easy to relate this to the code in the MVP framework. View>>wmSetCursor:wParam:lParam: is the main entry point. This decodes the hit test argument to a symbol, and then forwards to the interactor->presenter->view as onGetCursor. Any of these can return a Cursor object. If they do, the Cursor is set by calling the SetCursor API and true is returned to suppress invocation of the default window procedure. Presenter>>onGetCursor: is responsible for offering the global cursor (Cursor.Current) if over the client area and a global cursor is set. If there is no cursor returned from onGetCursor:, then nil is returned from View>>wmSetCursor:wParam:lParam:. As this does not convert to an integer value when sent asUIntPtr the default window procedure is invoked (see View>>dispatchMessage:wParam:lParam:). Raymond explains the flow that then happens and that results in the class cursor being set.

jgfoster commented 2 years ago

The video is me running non-debug Dolphin (but in VMware Fusion on a 2012 machine with a screen recorder and some other software running), so it might not be very snappy. By "bare metal" I mean without virtualization (so no VMware Fusion to suspect of distorting things). I observed the problem on a student's machine and I've asked for hardware/software details. In my case I'm running Windows 10 21H2 (but the problem is long-standing).

To further add to the mystery, I've added logging to everything that changes the cursor and at the end of a #'showWhile:' change after it calls the library to restore the original cursor I follow it with a 1-second delay and then query the OS library for the current cursor. It claims to be showing the arrow when it is actually the busy cursor that is visible. As far as I can tell (with lots of debugging and logging), all the showWhile: calls are properly nested and Cursor.Current is properly nil at the end. Later attempts to reset the cursor fail, perhaps because the system thinks it is already correct. Since the added delays seem to have avoided the problem for the Dolphin Idea Space, it seems like some sort of timing issue, but I can't figure out what it would be.

I wonder if the failure to display the package tree is related. It feels a bit like we lost a redraw after an invalidate in each case (though the two are not linked one-to-one). Do you ever experience the failure to redraw a tree (then redraws line-at-a-time as the mouse moves over the item)?

blairmcg commented 2 years ago

To further add to the mystery, I've added logging to everything that changes the cursor and at the end of a #'showWhile:' change after it calls the library to restore the original cursor I follow it with a 1-second delay and then query the OS library for the current cursor. It claims to be showing the arrow when it is actually the busy cursor that is visible.

That's interesting I assume you are saying that when you evaluate UserLibrary default getCursor (carefully arranging so not inside a Cursor>>showWhile:) you get back the same handle as is associated with Cursor arrow handle? What do you see if you effectively do (Cursor fromHandle: UserLibrary default getCursor) inspect, i.e. which graphic? Are the handles actually mixed up? What are the numeric values of the handles you see, and are they the same on different machines/installations? Do they remain constant across occurrences of the bug? What about across image saves?

Coming back to what you see, assuming that the handling of the WM_SETCURSOR in Dolphin is behaving itself, then the words on the SetCursor documentation page should apply: If the class cursor is not NULL, the system restores the class cursor each time the mouse is moved. So, if the class cursor is being set (and it appears so, since you report that GetCursor returns the handle of the arrow cursor), the most logical explanation is that the handles are mixed up. I'm not sure how that corruption could occur though.

I did have an idea of something you could try. Looking at Cursor>>showWhile: it actually saves and restores whatever the current cursor handle was, i.e.

showWhile: aBlock
    "Maintain the wait cursor while aBlock is executed, answering the result of the block.
    The actual cursor which was current is restored (rather than the one we think is Current)."

    | previous actual |
    previous := Current.
    Current := self.
    actual := self setCursor.
    ^aBlock ensure: 
            [Current := previous.
            UserLibrary default setCursor: actual]

I think the code has always been like that in every released version of Dolphin, but it did set me wondering about the lifetime of and ownership of the returned cursor handle. The GetCursor documentation says nothing on this, and obviously the code above is depending on it remaining valid, but also assuming ownership is not returned. The system cursors are owned by the system and remain valid throughout the application's life, so it's difficult to see how this would be causing problems but you could try simplifying it to this for elimination purposes:

showWhile: aBlock
    | previous |
    previous := Current.
    Current := self.
    self setCursor.
    ^aBlock ensure: 
            [Current := previous.
            previous ifNotNil: [previous setCursor]]

Later attempts to reset the cursor fail, perhaps because the system thinks it is already correct.

Can you explain what you mean by fail here? There isn't really a notion of a correct or persistent cursor as such. It's continually updated as the mouse moves, and although User32 setCursor: User32 getCursor has no effect, it doesn't fail?

I wonder if the failure to display the package tree is related. It feels a bit like we lost a redraw after an invalidate in each case (though the two are not linked one-to-one). Do you ever experience the failure to redraw a tree (then redraws line-at-a-time as the mouse moves over the item)?

Yes, I am able to repro that easily, at least inside the IdeaSpace. It doesn't happen to me otherwise, or not that I have noticed so far. I don't think this is related though, but is rather a separate issue altogether. I don't know its vintage, but it could have started happening due to a change in Windows, or a framework change. We'd have to try going back to older Dolphin versions (starting with 7.x) to see if it repro's there, and then try and narrow down the change if it doesn't. It is probably caused by the sliding card tray (aka SlideyInneyOuteyThing), in combination with the rather complicated way the IdeaSpace nests ShellViews inside itself. I've always been surprised that worked as ShellView's are designed to be top-level windows, not child windows. It seems to occur in this and other tree views (and only tree views as far as I can tell) that are in pinned sliding card trays inside an IdeaSpace. It looks like some kind of z-order bug, as if you look closely the tree does get drawn, and then immediately overpainted with the background colour of one of the parent views (could be from the idea space itself, since it has a Color face3d background). I can't see how this could be related because Cursor management is separate to window painting. It's not affected by z-order, but rather by view hierarchy as explained in Raymond's article where he gives an example of the default window procedure going up and back down the view hierarchy.

jgfoster commented 2 years ago

Blair, I very much appreciate your responses on this. Tonight I spent some time trying to follow-up on your suggested areas of investigation. I'm particularly intrigued by the idea of the handles getting mixed up. It does seem unlikely, but there are few remaining plausible explanations, so any idea is worth investigating. Interestingly, I was unable to reproduce the problem tonight, but I'll try to answer your questions:

I assume you are saying that when you evaluate UserLibrary default getCursor (carefully arranging so not inside a Cursor>>showWhile:) you get back the same handle as is associated with Cursor arrow handle?

Yes, the handles I've had consistently are as follows:

Cursor arrow handle "16r10003"
Cursor iBeam handle "16r10005"
Cursor wait handle "16r10007"

My observation is that the GUI shows cursor #7 when it claims to be showing #3. I hadn't thought to inspect the result of a getCursor, but I'll try that when I next reproduce the problem.

Later attempts to reset the cursor fail, perhaps because the system thinks it is already correct.

Can you explain what you mean by fail here? There isn't really a notion of a correct or persistent cursor as such. It's continually updated as the mouse moves, and although User32 setCursor: User32 getCursor has no effect, it doesn't fail?

Sorry for the ambiguity and thanks again for your patience and questions. This refers to code I've added at the end of #'showWhile:' to make a second attempt to get the arrow cursor. I've also added background tasks to check if Cursor current is nil and if so attempt to set the cursor back to the arrow. But you clarified that there isn't really a "persistent" cursor, just the one assigned to the current window. Does the OS reset the cursor to the one associated with the window class directly or does it provide an onGetCursor event and let the application set the cursor itself.

I'll check to see if the setCursor: method fails (I don't think I looked at the result; I just looked for a GUI change).

And finally (for now!), thanks for the discussion of the package tree drawing problem. I agree that it seems unrelated, but I'm grasping here!

I'll follow-up when I'm able to reproduce the problem and add more debugging.

blairmcg commented 2 years ago

Yes, the handles I've had consistently are as follows:...

The ones I get are different, so obviously these are not fixed, even if they are always in the same range.

Cursor arrow "16r10005"
Cursor iBeam handle "16r10007"
Cursor wait handle  "16r10009"

This suggests that these may be allocated dynamically into some internal array as the cursors are required, but then you'd expect maybe that there ought to be a 16r10000, 16r10001, etc, and if I try (Cursor fromHandle: 16r10003) inspect on my system, I get an invalid cursor handle.

Does the OS reset the cursor to the one associated with the window class directly or does it provide an onGetCursor event and let the application set the cursor itself.

It's kind of a combination of the two. There is an onGetCursor event, WM_SETCURSOR, that allows a window to set a specific cursor. The window under the mouse will get these events repeatedly as the mouse moves. And then there is default behaviour that will set a class cursor if no Window responds to WM_SETCURSOR by returning TRUE. The default behaviour bubbles that message up through the parent window chain until either some window sets the cursor, and indicates as much by returning TRUE (1), or the top of the hierarchy is reached. If the top is reached and no window has returned TRUE, the event propagation is unwound and at each level Windows sets the class cursor so that the effect is that the cursor is set to the class cursor of the nearest parent of a class with a cursor. I haven't described this entirely precisely, but Raymond Chen does in his article, so I'd recommend reading that.