bitwiseworks / mozilla-os2

Mozilla for OS/2 and OS/2-based systems
Other
34 stars 9 forks source link

Make plugin painting work in OOP mode on OS/2 #229

Closed dmik closed 7 years ago

dmik commented 7 years ago

With closing #106 and #225 where we fixed a lot of IPC issues, binary plugins generally work in OOP (out-of-process) mode now. However, those that are supposed to paint to the browser window (e.g. the Flash plugin) fail to do so by leaving a black rectangle.

This is most likely a regression of the ESR 45 update where they completely ripped off code responsible for in-process paint events for the plugins (see https://bugzilla.mozilla.org/show_bug.cgi?id=1218688). Note that OOP painting has never worked on OS/2 so far so it is expected that painting in plugins doesn't work now. This issue to specifically track the plugin part of it. See #200 and #228 which may be also related.

dmik commented 7 years ago

I was not entirely correct above saying that "OOP painting has never worked on OS/2" — it did work back then but we had problems with Flash so OOP was disabled for it (see #43). The reason why it broke in ESR 45 is this Mozilla bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1165903. In short, they moved reparenting of the plugin HWND in the child process (needed for its contents to appear in the parent's HWND) from the child process to the parent process and this part was missing on OS/2. As a result, the plugin HWND would remain with no parent so it contents was just not visible.

Bringing the missing reparenting piece makes Flash work again (a commit will follow later after some more tests). And surprisingly, I don't see any crashes in FILT.DLL (#43) so far. However, there are some problems when unloading the OOP plugin after it's no longer in use (on when the child process serving it crashes). The main FF window stops handling messages in such a case. Perhaps, we should somehow detach the plugin HWND from the FF window or such. Needs investigation.

dmik commented 7 years ago

JFYI, all plugins seem to work now including NPDJVU and NPMP (the latter paints well but fails to get the HTTP address of the stream properly, the same happens under ESR 38 in in-process mode, so I guess NPMP is broken somehow; it's beyond this issue though).

dmik commented 7 years ago

Tests show that simply disabling a WinSetParent call from the above commit and then killing/crashing plugin-container.exe does not make the main FF process message queue hang. So, it's definitely this inter-process relation that screws it. I need to find a way on how to remove this relation when the child terminates abnormally (during the normal termination the child window most likely gets properly closed so the hang does not occur).

dmik commented 7 years ago

Some more observations: with inter-process reparenting enabled, the hang of the main process' GUI happens not only when the child process crashes but also when it terminates normally (which implies WinDestroyWindow on the child's side). So, obviously, destroying the window is not enough to break the relation and save the parent from hanging. Perhaps, WinSetParent(HWND_OBJECT) is needed when the child passes away.

dmik commented 7 years ago

No, WinSetParent(HWND_OBJECT) doesn't help.

dmik commented 7 years ago

I've been trying hard to catch the main window hang over the last days and still no luck. 100% reproducible, traced many code paths but not yet able to find an entry point in the FF code that calls something in PMMERGE.DLL which never returns. Attaching to a hung process with IDBUG is of little use as it only shows some PMMERGE assembly and no stack trace which led to it. I have to recall on how to use PMDF to get a process dump it seems.

dmik commented 7 years ago

Given that the hang completely disappears if the child process does not reparent its window to the main process' HWND, it has something to do with IPC, definitely. Most likely, the main FF process' GUI thread wants to send some PM message to the child process' window but since the child process is completely gone already (normally terminated by a request or crashed — no matter), nobody is replying it and it's hanging forever in a wait loop. Somehow the PM message queue isn't processed during this loop, the main FF GUI hangs. It may be the FF code which screws it up somehow but my worst fear is that it's PM which does some IPC behind FF and we're in a big trouble in such a case.

lerdmann commented 7 years ago

There are some restrictions regarding parent and owner relationship. Do I understand correctly that there is a parent relationship between a child window created in a secondary process and the parent window created in the main process ? I'd try and do a WinSetParent(NULLHANDLE) on the child window before it is destroyed. That should remove any parent relationship. I understood that this is what you want in order to destroy the child window. Yes, if a parent/child relationship exists, then a WM_DESTROY against the parent will be addressed to all child windows.

As to ownership, this requires that owner window and owned window were created in the same thread (not only process).

These are PM restrictions.

lerdmann commented 7 years ago

WinSetOwner(NULLHANDLE) will remove all ownership. Maybe that's also required.

dmik commented 7 years ago

@lerdmann thanks for your input but you perhaps didn't read carefully enough. There is a parent-child relationship between the HWNDs indeed (installed by WinSetParent()). And I already stated that I tried WinSetParent(HWND_OBJECT) (note that WinSetParent(NULLHANDLE) is wrong as the window must have a parent) — and this didn't help. You are right about WM_DESTROY though, but since the child window gets destroyed before the parent, it's not an issue at all. By the time when the main FF window hangs, there is neither child nor parent HWND any more — both are successfully destroyed when the page with the Flash object is closed.

Regarding the ownership, there is simply none between the windows in question, the owner of the child window is always NULLHANDLE (i.e. it's unowned). So ownership is definitely not an issue.

dmik commented 7 years ago

More over, I created a simple test case that does just the same: the parent passes its own HWND to a child process, this child process creates a new HWND of its own as a HWND_OBJECT child with no owner and then and sets the parent HWND as its parent. The good news is that it works perfectly well: WM_PAINT messages for the child HWND end up in the child process, for the parent HWND — in the parent process. Everything paints well, everything clips out well, and no hangs when the child terminates (no matter if it destroys its HWND or any other resource at termination or not, and no need to do WinSetParent(HWND_OBJECT) either). The news is good because it means that PM supports such an IPC approach per se.

The bad part of it is that it means that the problem is most likely somewhere in the FF code. Something else is done by FF somewhere which screws things up. The question is what and where.

dmik commented 7 years ago

Did more tests. The strange thing is that the main GUI event loop in FF is in fact still running when the hang happens. It only looks like paint and input messages get stuck somewhere. Interesting...

dmik commented 7 years ago

I have a strong suspicion now that it's the FF own message queue that gets stuck somehow, not the PM message queue (that appears to run).

lerdmann commented 7 years ago

Is there a possibility that the main process tries to post or (even worse) send a message to a no longer existing window in the secondary process or some such ? LIke a HWND reference that is no longer valid or so ...

Maybe you can check the QMSG hwnd variable that WinGetMsg / WinPeekMsg returns and see if that is either NULLHANDLE or has a an invalid/no longer valid handle value. Maybe in case of error you need to drop the message instead of handing it over to WinDispatchMsg.

By the way: in what source file do I find the main FF message loop handling ?

dmik commented 7 years ago

Funny enough, if I reparent the child window to HWND_DESKTOP, then all works well (except that the plugin draws right on the desktop rather than within the FF window, of course). So this still proves that PM is somehow involved here. Interesting enough that only repainting (which is done using some custom, non-native event handling) doesn't appear to work. All regular PM messages seem to work; I see periodic message delivery and the Firefox window can actually be resized. The toolbar buttons (and the respective keyboard shortcuts from the frame window's system menu) also work; I can do everything except close the window. My latest guess is that FF gets stuck in some temporary nested event loop which calls the native PM processing routines to handle native messages but doesn't process FF own events respective for painting and window close signal handling. For some reason this nested event loop does never exit so panting does not appear to happen. A question is still — how this all is related to the native parent-child relationship...

dmik commented 7 years ago

I know the exact reason why it "hangs" now. For some reason, right after the child process is finally terminated, WM_PAINT messages stop being delivered to the main FF window. As a result, nsWindow::OnPaint isn't called any more so no further drawing into the FF window occurs. Given that the PM message queue works per se, I really wonder where the main window's WM_PAINT messages are stuck.

dmik commented 7 years ago

Some weird thing happens. The nsWindow object that represents the main FF content window gets deleted when the child process terminates. And it seems that this deletion happens not through PM but rather through the object's destructor. But this destructor somehow skips calling WinDestroyWindow so the real PM window continues to exist and receive paint and all events but since the associated nsWindow is not there, nothing happens. This is really strange. Some internal FF window relationship gets broken somewhere. I only need to find where.

dmik commented 7 years ago

No, the destructor of that nsWindow is never called. What happens is that user data associated with the respective HWND (QWL_USER+4) which contains a pointer to the associated nsWindow object gets zeroed. Of course, this breaks the association so when PM delivers the next message to the given HWND, the window procedure cannot find a nsWindow object and skips message processing. Hence no redraws, no reaction no the keyboard events etc.

There are just a very few uses of user-level HWND data (WInSetWindow(Ptr|ULong) etc) in Firefox. And none of them seem to be responsive for this cleanup (at least not the ones I could find). OTOH, I cannot reproduce anything like that in my isolated test case and this doesn't let me blame PM for this. It may be so that FF code somehow corrupts PM memory somewhere... But I have no idea how to find this out. I will play around a bit more. I'm really close now.

dmik commented 7 years ago

Using a different user data index (e.g. QWL_USER in place of QWL_USER+4) solves the problem. All starts working then. I'm pretty confused. The only idea that comes to my mind is that it's either Odin or Flash (most likely, Odin) who fiddles around with the window hierarchy and sets user data here and there. I will have a quick look at it on Monday but I tend not to dive into Odin too deep now. If I can't find anything I will then just hack it around with using a different index in FF.

What makes me think it's Flash/Odin rather than PM is that because other plugins (e.g. NPDJVU/NPMP) are free from this problem although both paint to FF from a child process just like Flash does. A possible explanation to the fact that parenting the Flash window to a different parent rather than the needed one also fixes the problem is that in this case Flash/Odin fiddles with some other HWND not associated to nsWindow so clearing that data index doesn't break Firefox.

StevenLevine commented 7 years ago

This might be a window class issue. When you register a window class, you tell PM how many extra words you need. Standard PM classes, such as the Container class allocate a number of extra words for their use. However, the rules are that QWL_USER is always available for application use.

My questions would be what is the problematic window class and why does the application think that it owns QWL_USER+4?

dmik commented 7 years ago

@StevenLevine this is a private window class registered by Firefox for its own needs (to back nsWindow in PM). This class reserves 8 bytes (two DWORDs) but FF itself only uses the second one (QWL_USER+4) for reasons I don't know. And since it's a private class, no-one besides FF should use this data anyway. However, it's clear enough that someone is clearing QWL_USER+4 upon Flash wrapper DLL termination. And QWL_USER+8 is actually gets cleared as well (I specifically checked). I bet it's either the Flash wrapper itself or Odin (most likely, Odin, it does a lot of fancy stuff regarding HWND processing and inheritance).

I quickly checked Flash and Odin, there is no obvious solution and Anyway, I'm not going to sort out this Odin mess for now, it's too complicated and given we have an easy workaround is IMHO not worth it. I will simply use QWL_USER instead (which works) and add a comment referring to this ticket.

dmik commented 7 years ago

Btw, I've just accidentally found OFFSET_WIN32WNDPTR and OFFSET_WIN32PM_MAGIC in Odin that appear to be QWL_USER+4 and QWL_USER+8, respectively. And these are reset set to 0 upon Win32 window handle destruction (which may happen at Flash DLL unload time). The question is why a WIn32 window gets associated with an existing OS/2 PM window. My bet is that Win32 Flash tries to access the top-level HWND of the plugin window it gets for some silly reason. IIRC Odin creates Win32 wrappers for existing OS/2 windows to satisfy such requests and maybe messes up with these offsets while doing so (it must not change user-level data of foreign OS/2 windows it didn't create of course).

This may actually be quickly checked so I will do it while I'm at it.

dmik commented 7 years ago

Congrats, I found who's guilty, exactly. This is the Flash plugin, it's OS/2 wrapper part (NPFLOS2.DLL). The wrapper tries to create fake Win32 windows for the complete ancestor hierarchy of the browser's plugin window up to the browser top-level window. The code doing this is just plain wrong. It assumes that all of these ancestor windows classes have an unused data slot at QWL_USER which it uses to attach a fake WIn32 handle to the OS/2 handle. Of course that cannot be generally true since the plugin doesn't own these windows (and even Firefox may not own all of them). So it may simply end up overwriting other programs' data. According to the comments in the code, this was done because

         * Now Acrobat shows us that we must make fake wrappers for every f**king
         * window till we find the browser frame window. (ie. stop at destop)

I.e. some Win32 Flash version required this it seems.

I don't know why letting the plugin do so causes an overwrite of QWL_USER+4 and QWL_USER+8 slots but simply disabling this code eliminates the problem and Flash still works. Apparently some corruption somewhere (data or logic) and seems that the latest Flash doesn't need this anyway.

While it's clear why a hierarchy could be needed for the Flash plugin, it's unclear why the wrapper code wants to store the Win32 handle in QWL_USER. This is never used afterwards. Perhaps it was done back then to track if a fake window for the given OS/2 window already exists, but it's apparently not necessary any more as Odin already tracks this. I guess just removing usage of QWL_USER but leaving hierarchy creation should fix the problem too.

dmik commented 7 years ago

Ok, it turned out to be more complex than that. Not only Flash fucked up but Odin as well. The fake Win32 window is implemented as a subclass of a normal Win32 window in Odin. And the normal window uses OFFSET_WIN32WNDPTR and OFFSET_WIN32PM_MAGIC data slots for its machinery. These slots are set to some values at window creation and reset to 0 at destruction. The destructor of the normal window gets called when destroying the fake window because of inheritance but this destructor does not check if the window is fake before setting OFFSET_WIN32WNDPTR and OFFSET_WIN32PM_MAGIC to 0. As a result, data slots f unrelated OS/2 windows (Firefox nsWindow in this case) get cleared. Bad bad bad. Adding such a check fixes the problem. Commits to Odin/Flash will follow. After that I will do a bit of final testing and we may close this.

dmik commented 7 years ago

The Odin fix is here: http://trac.netlabs.org/odin32/changeset/22134.

dmik commented 7 years ago

The Flash fix is here: https://github.com/bitwiseworks/flashwrapper/commit/87f365269ce171eaf690f8c4dc0d4539952545b2

dmik commented 7 years ago

As painting works now, this may be closed.