youtube.com takes really long time to load

dmik commented 6 years ago

It appears that opening https://youtube.com sometimes takes minutes before anything appears on the page. In the meantime you only see the progress ring spinning and messages like Read www.youtube.com, Connecting to s.ytimg.com... and the like in the status tooltip at the bottom left corner of the page. You may make it work faster by reloading the page several times with Ctrl+R (in this respect it's similar to #242).

All 45.9.0 builds as well as 45.5.0 from May 2017 are affected, while 38.x and earlier builds do not seem to be.

My current guess is still that it has something to do with the network connection. However, it looks like it's on the Firefox side, not on the TCP/IP stack side. It might be some security issues, missing certificates or such and delays in the connection caused by them. At least I sometimes see a lot of errors in JS regarding certificates. I need to study it closer. The problem is, as usual, that the failure is irregular and once it starts working, it's quite hard to make it fail again.

dmik commented 6 years ago

Note that builds from Dave Yeo (e.g. https://bitbucket.org/dryeo/dry-comm-esr31/downloads/firefox-45.9.0.en-US.os2.zip) don't make any difference here. I couldn't make it hang so far, but I can't make my own test builds (or our official RPM builds) hang any more either. I doubt that the compiler optimization options (the only thing that differs in Dave's builds, apart from his use of the unofficial gcc 5.1.0 build) play any significant role here. It looks more like a timing issue which gets triggered at different moments in different builds depending on the optimization options. Which means that Dave's builds are subject to these hangs too, sooner or later.

One possible workaround for the issue is to use a clean Mozilla profile (by renaming %HOME%/Mozilla to %HOME%/Mozilla.old to let Firefox create a new one). But if I'm right about timings, this is also only a workaround.

dspiatkowski commented 6 years ago

@dmik After several days of running the GA1.1 drop I noticed that the video performance did actually get worse as compared to the previous test builds. Specifically, the native YouTube video codec (VP9) would simply hang at start (both GA1.0 and GA1.1), exhibiting the symptoms you described above. However, utilizing the h264ify add-on allowed me to play back videos smoothly. The looping sound was addressed with the upgraded UniAudio drivers, so prior to the GA1.1 build I finally had a well working YouTube playback. Following the upgrade the VP9 continued to be a problem, just as it was before, but to make matters worse the h264ify add-on no longer produces smooth playback; it now stutters.

A couple of days ago Dave released his build. I installed it and for the first time ever can successfully play back the YouTube VP9 stuff, no more need to run the h264ify add-on, the videos simply play. Smooth, no stuttering, etc, etc. His build also does not show the larger CPU consumption, although compared to the previous releases, such as GA1.0, the CPU usage is still higher. I previously commented on that in #265.

I have a Fibre-120 (125 Mbit/s) network connection with sustained high data rates to YouTube; therefore, I believe it is not a network speed issue. Whatever changes Dave implemented, be it just the GCC 5.1.0 or the other optimizations, they seem to have been a step in the right direction from my perspective. Dave's release is also an i686 build; I would like to test a pentium4 build for a closer 'apples to apples' comparison. Not sure that has anything to do with it, but it's another question mark that we can address.

dmik commented 6 years ago

@dspiatkowski still, what you say doesn't prove it's not a timing issue. I'm pretty sure you will see similar problems with Dave's build one day. I will do a test build with the same optimizer options Dave used but even that won't be a proof. It simply must not depend on those options this way. And if it does, something is wrong somewhere else (kernel task scheduling, network stack, bad Firefox code design, whatever).

dspiatkowski commented 6 years ago

@dmik I agree, not a hard proof at all, but it is the only working configuration I seem to have at the moment and so I share my experience in hopes of providing additional data points for analysis. For what it's worth, given that I see a consistent outcome (and a different video playback result) between these two builds I would be more than happy to do whatever additional data capture/debug you need to try to narrow this one down. In #264 I included screenshots of the YouTube video 'Nerd Info' for that very reason, to show the VP9 vs h264ify outcomes side by side. As others have pointed out in the OS2World thread, they are also seeing different results.

There is something else I wanted to point out that pertains to the behaviour of Dave's build and is consistent with what I have seen in the past, including with the BWW FF builds. When using an i686 build of FF I will sometimes get a solid 100% CPU spike that persists for about 5-10 sec (usually) and then all of a sudden goes away. Your GA1.0 and 1.1 would never show this, but again, these were pentium4 builds. Dave's test is an i686 build and it shows these very same spikes once again. This is part of the reason why I asked Dave to produce a pentium4 build. Given the various results people are reporting I wonder if there is a specific correlation to the RPM/YUM platform settings and how this affects FF. After all, something as simple as having ffmpeg libraries installed for video playback will get you either pentium4 or i686 versions. Maybe this is all that is required to cause the sort of timing issues that you are concerned about?

dmik commented 6 years ago

@dspiatkowski thanks for the feedback but so far I have no idea which info besides what you already said you could provide. We need at least some hint to understand what's behind all of it. But I doubt it's just machine type. Here I test both pentium4 as well as i686 and haven't noticed any consistent difference.

Maybe finishing the profiler code (#264, your reference was to #242 I believe) and actually enabling profiling could shed some light on it.

dmik commented 6 years ago

I've uploaded a test build with the same optimization as Dave's as http://rpm.netlabs.org/test/firefox-45.9.0-3.t1.i686.7z (and it's also i686, yes). Please test. And compare with the official GA1.1.

dryeo commented 6 years ago

This build plays YouTube videos fine. I never could get the previous build to play one: YouTube would suggest restarting my device, and waiting or pressing F5 to reload didn't help. I've found straight -O3 optimization with SeaMonkey results in the least CPU usage but -Os helps a bit with memory. You may want to test both. The -O2 optimization in configure.in has been there since forever, probably originally for EMX builds where -O3 was likely unstable. I actually have a different problem when compiling with 4.9.2, where VP9 videos display static and then a message comes up about an error occurring.

dryeo commented 6 years ago

This build also shuts down cleanly, unlike the previous build which would leave the icon hatched until I used TOP or such to kill it.

dspiatkowski commented 6 years ago

Confirming Dave's findings here as well. There appears to be a substantial difference between GA1.1 and this T1 build. Not only does FF now play the YouTube VP9 videos, but they are pretty smooth (short of some stutter when the system is being heavily tasked by other processes; I actually stress tested it by running a full-screen FF playback while opening an OpenOffice document), and even running the h264ify add-on also gives smooth playback. Basically VP9 quality now matches (if not actually exceeds) the h264ify playback quality here.

Also, what I noticed is that this test build appears to be less CPU cycle hungry. I've only got a few hours of runtime at the moment so this could be just the result of limited browsing when FF is generally well behaved. However, even compared to a similar runtime on GA1.1 the CPU appears to be less tasked. Further update to follow, so this last comment may be a bit premature...

dspiatkowski commented 6 years ago

...some YouTube playback stats screenshots using VP9 and the h264ify add-on. [Screenshots: t1-youtube_vp9, t1-youtube_h264ify]

dmik commented 6 years ago

@dryeo are you sure about GA1.1? Can you just wait for several minutes? Can you let FF create a new profile? As I said, both work here but I had to wait and Ctrl-R my GA1.1 build to make it work. And since then it works like a charm. And I still don’t see how compiler optimization could break logic here. So it still looks like a random (=timing) issue to me which is just less frequent with different optimization options for some reason. At least we proved that GCC 5.1.0 is not involved here.

I will also try full -O3 here and pentium4 to see if it makes any difference. Regarding the default of -O2, it’s not an OS/2 only thing - this is default on many other platforms too. I don’t know reasons behind that though. I guess it’s just a balance between the size and the speed. Again, this should not make a difference function wise. If it does, something is seriously wrong somewhere else.

dmik commented 6 years ago

And I will just state it once again that the only difference between ga1.1 and t1 is forcing -O3 for JS and -Os for the rest and -march=i686 instead of pentium4. The rest is fully identical.

dmik commented 6 years ago

Note also that it’s not -march=pentium4 either as 45.5.0 released in May 2017 is also i686 and it was broken here too when I found 45.9.0 to be broken. Now both work.

NeilWaldhauer commented 6 years ago

With t1, I get good YouTube playback on a Lenovo ThinkCentre Tiny M92p. With ga1.1 I did not see it play; perhaps I didn't wait long enough. With Dave Yeo's build YouTube playback is good.

dryeo commented 6 years ago

Using GA1.1 with a new profile, waiting for 10 minutes just shows different ads trying to play; CTRL-R, F5, and CTRL-F5 all result in the same hang. National Film Board of Canada (NFB) videos do play, but they're likely MP4. This is an LTE connection if that matters. I see the hang on exit again as well. I suspect it is JS that needs the -O3, though that needs testing, and then more testing to see which option is required, which at 2.5 hours a build would be slow :( We could be seeing a compiler bug.

dmik commented 6 years ago

Well, different ads? Hmm, are we talking about the same thing? Here I saw a blank page in all 45.* builds when it broke and some constant loading progress of the background content (like scripts and so on). Can you give me a screen shot of GA1.1 when it’s trying to open YouTube.com? Maybe several if it changes while you wait.

dryeo commented 6 years ago

Yes, Youtube wants me to watch ads before the video I clicked, it would paint the first frame and then do the circle thing and eventually suggest restarting my device. I have to go to work but will try to post a screenshot later.

dryeo commented 6 years ago

Here are a couple of screen shots, one of a commercial and one of a hung video. I misspoke earlier, I only get a new ad when reloading. After a while, the ad turns into a black screen. While taking a screenshot, an ad finally played, but afterwards no more videos would play. [Screenshots: yt_test1, yt_test2]

dryeo commented 6 years ago

Today I built Firefox with GCC 5.1.0 and -O2 -march=pentium4 and it consistently plays videos here.

lerdmann commented 6 years ago

I now tried dmik's test build:
1) Playability of YouTube videos is entirely tied to using the h264ify add-on. If I have it installed I can play YouTube videos, if I don't have it installed I cannot. It does not matter what build I use.
2) On my system, the test build works worse than the latest release build. I am pretty sure that dmik is right that it is some sort of timing issue (and not a network issue). The test build blocks the browser for extended periods of time but funnily enough I can use the file menu and close it.
3) Both the test build and the latest release do not properly delete the temporary files they create, for example the "parent.lock" file in the profile directory. I think that is a good indicator that FF does not properly clean up after itself.

I have tried this with the 14.106 SMP kernel as well as with the latest OS4 kernel. I have an 8-core AMD system. What I could try is to enable only one core per package (where a package contains 2 cores) to avoid any HT issues, but the past has shown that that helps only marginally.

dspiatkowski commented 6 years ago

On Wed, 25 Apr 2018 23:08:24 -0700 lerdmann wrote:

I now tried dmiks test build: 1) playability of Youtube videos is entirely tied to using h264ify
add-on. If I have it installed I can play Youtube videos, if I don't have it installed I cannot. It does
not matter what build I use.

By saying you have it "installed", do you mean installed AND Enabled?

My testing was always done with the add-on installed but in Enabled and
Disabled states. I would expect the Disabled setting to entirely move that
code out of execution, but maybe that is not the case after all?

dmik commented 6 years ago

@dryeo well, ok, your situation is different from what this ticket originates from. Your case looks more like #242. Anyway, I'm sure the reason behind both problems is the same. However, -O2 and -march=pentium4 is exactly what the GA1.1 build in 7z is (except GCC but my t1 is a proof that it doesn't matter). Which, in turn, proves that compiler options are just a side effect here and the main problem is timing. My current guess is that the new JVM "task scheduler" (that was heavily changed after Firefox 38) is very unstable on OS/2 so that some JS scripts start starving like hell, especially when several are executed in parallel. Looks like we miss some platform-specific implementation detail and the generic code is just too dumb (it's also possible that the new JVM scheduler design is just not good). All this requires quite complex research.

Anyway, I will try another build: -O3 and -march=pentium4 to see if it makes any difference.

@lerdmann re parent.lock, are you sure it should go away at shutdown? This might be a usual unix code path that deletes an open file (which doesn't work on OS/2 or Windows of course). Maybe some fix is needed in the JS routines responsible for that. It's worth a separate ticket, please create one if you are sure it should be gone.
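To make the Unix idiom mentioned above concrete: on POSIX systems, calling `unlink` on a file that is still open removes its name while the open descriptor keeps working, so a lock file can be deleted before it is closed. The sketch below shows only that idiom. It is not taken from the Firefox sources, and `delete_while_open` is a hypothetical helper.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: creates `path`, deletes it while it is still open,
   then writes to the open descriptor.
   Returns 0 if every step succeeded, -1 otherwise. */
int delete_while_open(const char *path)
{
    assert(path != 0);

    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* POSIX: the name disappears now; the data is released when the
       last open descriptor is closed. */
    int rc = unlink(path);

    /* The descriptor stays usable after the name is gone. */
    if (rc == 0 && write(fd, "x", 1) != 1)
        rc = -1;

    close(fd);
    return rc;
}
```

On a platform where an open file cannot be deleted, the `unlink` step fails and the file is left behind, which would match a leftover parent.lock if such a code path is indeed involved.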

dryeo commented 6 years ago

Interesting that this problem only seems to exist in the latest release. I just tried the firefox 45.9.0-2 rpm and that plays YouTube videos fine. I'm also pretty sure that the parent.lock file has always stayed in the profile.

dmik commented 6 years ago

BTW, I'm looking at optimization options of other platforms now. In fact, current platforms like Linux and Darwin use -O3 by default. I'm also seeing they play around with -freorder-functions and -freorder-blocks which reorder code in the executable to make it more local and reduce the number of branches. And -freorder-functions actually requires support from the linker which we might not have.

The GCC documentation is a bit contradictory about these options (as usual): one place says that both are enabled in -O2, -O3 and -Os while another place says that -Os disables -freorder-blocks as well as some code alignment optimizations. This might be the reason why timings change that much when using different options. I might also try to disable -freorder-functions on OS/2 given there is no support in the linker. I see that Android builds use -Os for XUL, -O3 for JS and in both cases disable function reordering but enable block reordering. BTW, Linux also uses -Os for XUL and -O3 for JS. However, it doesn't disable function reordering.

dmik commented 6 years ago

Dave, I finally proved that it's not compiler options or such. I got it broken again and your build dated 23.04.2018 also doesn't work, this is what I get with it and it's hanging there for many minutes already:

[Screenshot: default]

I made a couple of other test builds though, will upload them shortly.

dryeo commented 6 years ago

Reminds me of trying to load Youtube on dial-up, which sometimes sorta worked and sometimes similarly hung.

lerdmann commented 6 years ago

It must be a timing problem or such. I now went back to GA 1.1 with h264ify still being activated. All of a sudden YouTube again ceased to work. If I had to guess, I would say that some thread tries to preload some data in order to start playing the video, gets interrupted by the OS scheduler switching to another thread, and later cannot properly pick up where it left off. Or, especially on multi core, 2 FF threads are not properly synchronized and the problem starts to show if these 2 threads are executed simultaneously on 2 cores.

As to "parent.lock": it's fairly obvious that this serves some sort of notification (the file always has zero content). And in the past people stated that having that file in your profile will eventually lead to Firefox asking you to refresh your profile (killing all your preferences and such). So yes I am fairly sure it has to go away on FF shutdown. But it's well possible that this has been broken for a long time and therefore we all got used to finding that file in our profile ... On the other hand: Thunderbird also creates that file and does not delete it ...

dmik commented 6 years ago

BTW, an attempt to build XUL with -O3 failed, XPCSHELL.EXE fails with this:

______________________________________________________________________

 Exception C0000005 - Access Violation
______________________________________________________________________

 Process:  D:\USERS\DMIK\RPMBUILD\BUILD\MOZILLA-OS2-FIREFOX_45_9_0ESR_RELEASE_OS2_GA1_1\OBJDIR\DIST\BIN\XPCSHELL.EXE (04/27/2018 04:12:28 274,602)
 PID:      8BAF (35759)
 TID:      01 (1)
 Priority: 200

 Filename: D:\USERS\DMIK\RPMBUILD\BUILD\MOZILLA-OS2-FIREFOX_45_9_0ESR_RELEASE_OS2_GA1_1\OBJDIR\DIST\BIN\XUL.DLL (04/27/2018 04:12:22 439,005,274)
 Address:  005B:10910AC2 (0001:01160AC2)
 Cause:    Unknown access fault

______________________________________________________________________

 Failing Instruction
______________________________________________________________________

 10910AAC  MOV    BYTE [EAX+0x7], 0xa        (c640 07 0a)
 10910AB0  MOV    EAX, 0x8                   (b8 08000000)
 10910AB5  JMP    0x10910983                 (e9 c9feffff)
 10910ABA  MOVDQA XMM0, DQWORD [0x1093ecc0]  (660f6f05 c0ec9310)
 10910AC2 >MOVDQA DQWORD [0x17a01588], XMM0  (660f7f05 8815a017)
 10910ACA  MOVDQA XMM0, DQWORD [0x1093ecd0]  (660f6f05 d0ec9310)
 10910AD2  MOVDQA DQWORD [0x17a01598], XMM0  (660f7f05 9815a017)
 10910ADA  MOVDQA XMM0, DQWORD [0x1093ece0]  (660f6f05 e0ec9310)

______________________________________________________________________

 Registers
______________________________________________________________________

 EAX : 0000003A   EBX  : 00000008   ECX : 0000003A   EDX  : 20202DE0
 ESI : 0000003A   EDI  : 2006E2F8
 ESP : 0013F7F0   EBP  : 2006E2F8   EIP : 10910AC2   EFLG : 00010246
 CS  : 005B       CSLIM: FFFFFFFF   SS  : 0053       SSLIM: FFFFFFFF

 EAX : not a valid address
 EBX : not a valid address
 ECX : not a valid address
 EDX : read/write memory allocated by LIBC066
 ESI : not a valid address
 EDI : read/write memory allocated by LIBC066

______________________________________________________________________

 Stack Info for Thread 01
______________________________________________________________________

   Size       Base        ESP         Max         Top
 00100000   00140000 -> 0013F7F0 -> 0013C000 -> 00040000

______________________________________________________________________

 Call Stack
______________________________________________________________________

   EBP     Address    Module     Obj:Offset    Nearest Public Symbol
 --------  ---------  --------  -------------  -----------------------
 Trap  ->  10910AC2   XUL       0001:01160AC2  nsTextFragment.cpp#60 __ZN14nsTextFragment4InitEv + 180 0001:01160942 (D:\Users\dmik\rpmbuild\BUILD\mozilla-os2-FIREFOX_45_9_0esr_RELEASE_OS2_GA1_1\objdir\dom\base\Unified_cpp_dom_base8.cpp)

 2006E2F8  200712A0   *Unknown*

 Lost Stack chain - new EBP below previous

Given that there is no significant benefit from -O3 (and Linux builds specifically use -Os for some reason), I'm not going to debug this.

So, here is the other test build: http://rpm.netlabs.org/test/firefox-45.9.0-3.t2.pentium4.7z. It has JS compiled with -O3 -fno-reorder-functions -freorder-blocks -march=pentium4 and the rest (XUL etc) is the same but -Os is used instead of -O3. Given that I don't see any effect from -fno-reorder-functions -freorder-blocks I'm going to leave them out (and -freorder-blocks should be on anyway both for -O3 and -Os according to GCC docs). The only thing I think these reorder options may affect is crashes in LIBC at exit. So those who experience them, please test to see if t2 is any different from t1 and GA1.1 in this regard.

dspiatkowski commented 6 years ago

@dmik T2 test result: FF traps, the window frame shows up and it's "game over" after that. I've attached the trap dump.

 Exception C0000005 - Access Violation

 Process:  G:\APPS\TCPIP\FIREFOX\FIREFOX.EXE (04/27/2018 08:18:16 50,900)
 PID:      60 (96)
 TID:      01 (1)
 Slot:     CF (207)
 Priority: 200

 Module:   XUL
 Filename: G:\APPS\TCPIP\FIREFOX\XUL.DLL (04/27/2018 08:18:16 29,284,524)
 Address:  005B:11484AE5 (0001:02564AE5)
 Cause:    Unknown access fault

 Failing Instruction

 11484AD6  MOV    EAX, [ESI+0x8]             (8b46 08)
 11484AD9  MOV    [ESP+0x4], EAX             (894424 04)
 11484ADD  MOV    [ESP], EBP                 (892c24)
 11484AE0  CALL   0x113fbab4                 (e8 cf6ff7ff)
 11484AE5 >MOVDQA XMM0, DQWORD [ESP+0x50]    (660f6f4424 50)
 11484AEB  MOV    DWORD [EBX], 0x0           (c703 00000000)
 11484AF1  MOV    DWORD [EBX+0x4], 0x0       (c743 04 00000000)
 11484AF8  MOV    DWORD [EBX+0x8], 0x0       (c743 08 00000000)

FF_T2_0060_01_TRP.zip

dmik commented 6 years ago

@dspiatkowski Hmm, that's kinda similar to what I get with my full -O3 build. Strange. Does GA1.1 work at the same time? Please also install .dbg symbols so that we could see where it fails.

dspiatkowski commented 6 years ago

@dmik Yes, GA1.1 and T1 both work fine. I am back on T1 right now. The browser trap occurred when FF was attempting to recover a session (since I killed the previous session to try T2). I thought maybe something about the recovery process was causing the issue so I tried the Profile Manager first and wanted to create a NEW profile to rule out anything else (add-ons, my settings, etc) but even that caused a trap here as well.

dryeo commented 6 years ago

On 04/27/18 04:18 AM, Dmitriy Kuminov wrote:

 10910AC2 >MOVDQA DQWORD [0x17a01588], XMM0  (660f7f05 8815a017)

Isn't that an unaligned SSE or SSE2 instruction? Was this a Pentium 4 build? Probably just needs -mstackrealign. I built a Pentium M optimized build some time back and it needed the alignment for, IIRC, this problem. BTW, after 52ESR they target the Pentium M to take advantage of SSE2 and it is the minimum requirement now. If there weren't still OS/2 users with older CPUs, I'd suggest that as a target. I've experimented more with SeaMonkey, using i686, and while -Os does make smaller binaries, -O3 seemed to use less CPU. I'll do another test build after work.

dmik commented 6 years ago

Yes, it was a pentium4 build. And yes this is SSE2. But I wonder why SSE2 is used somewhere w/o a proper compiler option (-msse2 etc), because everywhere where it's used, -mstackrealign is also currently used. I found one case where -msse is used w/o -mstackrealign though. If @dspiatkowski provides me with the proper trap report with symbols, I might be able to tell if that is it. But I don't think pure SSE needs 16-byte alignment. And from the report we see it's SSE2, not SSE.

However, in my case it's not the stack where it crashes, it's direct memory access. And why GCC generates SSE2 for unaligned addresses, I have no idea. Perhaps it's a bug in our compiler (most likely). So I guess we should suppress automatic SSE2 usage (caused by -O3 or whatever). But I have no idea how. According to gcc -Q --help=target, no SSE is enabled by default, neither for i686 nor for pentium4.

dryeo commented 6 years ago

Mingw also needs -mstackrealign at times, probably due to the same compiler failing. I think that the Pentium 4 target will automatically use SSE[2] instructions and that memory also needs to be aligned. Previously I used this to avoid the trap, --enable-optimize="-mtune=generic -march=pentium-m -O3 -mstackrealign"

dmik commented 6 years ago

Yes, pentium4 does have SSE and SSE2 but gcc (at least our 4.9.2 build) says it does not enable any SSE by default. Maybe it's just bullshit and it still enables it of course. Or maybe it only enables it with -O3. I wonder why it's the first time we're seeing this given that all .7z builds are pentium4 (but with -O2). Anyway, I will try a global -mstackrealign overnight to see where it gets us. But I tend to use -Os just like Linux does. I don't think the difference in speed is noticeable.

lerdmann commented 6 years ago

MOVDQA (move double qword ALIGNED) definitely needs 16-byte alignment for access. I even think that there are enough other cases where data accessed by SSE needs 16-byte alignment. I'd consider that mandatory. The only instruction that is non-critical is MOVDQU (move double qword UNALIGNED) because it is for the corner cases where you are forced to load from an unaligned address. But the latter has horrible performance. As we can see, Dariusz's trap is suffering from the very same problem.
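As a quick arithmetic check of that requirement, the sketch below tests the two operand addresses from the failing code in the XPCSHELL.EXE trap report earlier in this thread. The helper names `is_aligned` and `misalignment` are made up for illustration; only the addresses come from the report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* MOVDQA requires its memory operand to sit on a 16-byte boundary. */
#define SSE_ALIGNMENT 16u

/* Hypothetical helper: true if addr is a multiple of `alignment`,
   which must be a power of two. */
static bool is_aligned(uintptr_t addr, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr & (alignment - 1)) == 0;
}

/* Hypothetical helper: number of bytes by which addr lies past the
   previous 16-byte boundary (0 means aligned). */
static unsigned misalignment(uintptr_t addr)
{
    return (unsigned)(addr & (SSE_ALIGNMENT - 1));
}
```

With these helpers, `0x1093ecc0` (the source of the preceding load) passes the check, while `0x17a01588` (the destination of the trapping store at 10910AC2) lies 8 bytes past a 16-byte boundary.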

dspiatkowski commented 6 years ago

@dmik "...If @dspiatkowski provides me with the proper trap report with symbols...", happy to oblige, do I need any specific debug symbol files for this T2 build, or can I use the GA1.1 ( firefox-debuginfo-45.9.0-3.oc00.pentium4.7z)? I will try this official GA1.1 one right now...

dmik commented 6 years ago

@dspiatkowski no, you will need t2 symbols. They are next to the main archive.

dspiatkowski commented 6 years ago

@dmik OK, got it. Installed, re-ran, I've attached the output trap file. FF_T2_DEBUGINFO_0079_01_TRP.zip

dmik commented 6 years ago

@dspiatkowski thanks! This is much more informative.

Ok, I tried a global -mstackrealign but, as expected, it didn't help in my case (still crashes in nsTextFragment.cpp), as it trapped not on a stack address but on a static variable address. Regarding @dspiatkowski's crash, it's pretty obvious that -mstackrealign will help here. But still, we need a better solution for pentium4 builds that also covers global and static variables (and the heap!).

dryeo commented 6 years ago

There are the 'no-sse' and 'no-sse2' attributes, see https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html. Not sure if that'll help.
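For illustration, a function-level use of that attribute might look like the sketch below. The `NO_SSE2` macro and `copy_bytes` are hypothetical names; the attribute spelling `target("no-sse2")` follows the GCC page linked above. Whether this would avoid the traps seen here is untested.

```c
#include <assert.h>
#include <stddef.h>

/* Apply the attribute only where it is meaningful: GCC-compatible
   compilers targeting x86. Elsewhere the macro expands to nothing. */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define NO_SSE2 __attribute__((target("no-sse2")))
#else
#define NO_SSE2
#endif

/* Hypothetical example function: with the attribute in effect, the
   compiler may not use SSE2 instructions (such as MOVDQA) in its body,
   whatever the global -march/-O settings are. */
NO_SSE2
void copy_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

The attribute works per function, so it only helps once the affected functions are known.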

lerdmann commented 6 years ago

You will be forced to align everything to 16 bytes that is accessed by the MOVDQA instruction, no matter if data segment or stack. I guess that is a hopeless case and then David Yeo's suggestion is the only one that will help.

dmik commented 6 years ago

@dryeo yes, and you can also force 16-byte (or any other) alignment with variable attributes on a per-variable basis (https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html#Common-Variable-Attributes). However, doing it by hand looks like a nightmare, so it's not a global solution.

@lerdmann sure, the question is how to tell GCC to align all involved memory locations automatically. Analyzing each function/variable if it's subject to SSE is a no-go. These attributes, I'm sure, are done for cases where the code uses inline assembly or such, i.e. something out of compiler control. The code generated by the compiler must be aligned as necessary and this part is clearly broken in the OS/2 port. And -mstackrealign is a dirty workaround as its purpose is for code that for some reason is generated with a different alignment in mind but should still work with SSE.

Perhaps we should drop the pentium4 target altogether, as there is not much benefit in it w/o SSE if I get it right. But first we should find out the exact reason why SSE ends up in GCC-generated code with no direct SSE options.

lerdmann commented 6 years ago

That's why I said it's hopeless. But we do not need to align code. We need to align data. Unless we find a gcc switch that forces alignment of data items to some minimum value, which in our case is 16. That would likely noticeably increase the data segment ... What you have is a data attribute "aligned" (or something like that) to force alignment of data. But that needs to be done in the code, for example: char SSEDataField[4000] __attribute__ ((aligned (16))); "-mstackrealign" is not a hack. It is to align the stack address on function entry. I think that if you define a stack alignment of 16, the compiler will then take care that all local variables will be aligned on a 16 byte boundary as a minimum. Keep in mind that for a 32-bit compiler the stack alignment is 4 bytes as a default (for example, you push data as 4 byte entities no matter if you push a byte, a word or a dword). If you need to align to "higher" alignments (say 8 or 16 bytes) you will need to tell the compiler to take care of that.
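A minimal sketch of the per-variable approach, assuming GCC's `aligned` variable attribute. The buffer name is taken from the example above; `sse_data_field_is_aligned` and `clear_sse_data_field` are hypothetical helpers added only to make the effect observable.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Static buffer forced onto a 16-byte boundary, so aligned SSE moves
   (MOVDQA) on its start address are legal. */
char SSEDataField[4000] __attribute__((aligned(16)));

/* Hypothetical helper: non-zero if the buffer starts on a 16-byte boundary. */
int sse_data_field_is_aligned(void)
{
    return ((uintptr_t)SSEDataField & 15u) == 0;
}

/* Hypothetical helper: clears the buffer after checking the alignment
   that vectorized code would rely on. */
void clear_sse_data_field(void)
{
    assert(sse_data_field_is_aligned());
    memset(SSEDataField, 0, sizeof SSEDataField);
}
```

The attribute only covers the one object it is attached to; every other global or static object touched by aligned SSE instructions would need the same annotation.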

dmik commented 6 years ago

@lerdmann are you sure you carefully read my last comment? :) I already acknowledged that there is a manual way of aligning variables but it's not useful in our case as we don't know when exactly GCC emits SSE/SSE2 in pentium4/O3 mode.

Re -mstackrealign, I meant it's a hack not per se but in the way how we use it. Because we abuse it.

But I tend to agree that it's hopeless in the sense that we should either:

- fix GCC to automatically align memory properly when it emits SSE/SSE2, just like on other platforms
- not use -march=pentium4
- not use -O3 in XUL (JS seems not to be affected by this problem somehow)

As I already said, I tend to just use -O3 for JS and -Os for the rest plus -march=pentium4 as fixing gcc is a separate task of its own that may take too long. This should be SSE crash free.

StevenLevine commented 6 years ago

Since the code is a reference to a static buffer, I have to guess that the buffer is not properly declared. The buffer aligns correctly on other platforms because the default compiler options differ compared to our platform.

dmik commented 6 years ago

@dspiatkowski I've uploaded yet another build as http://rpm.netlabs.org/test/ff45_9_0_t3.7z — this is the same as t2 but I've added -mno-sse2 to -O3 in JS — I wonder if it will fix your crash issue, please check. If yes, I will also try a full -O3 build (also with -mno-sse2 and -march=pentium4) which previously crashed for me in xpcshell during install. Note that the size of XUL.DLL is bigger as it already contains Debug info embedded (which is normally separate in a .DBG file).

dspiatkowski commented 6 years ago

@dmik OK, good news and "umm...so so" news! Good news is that the application start-up crash is gone. The browser window comes up fine, I can start multiple separate windows, all appear to work fine with normal browsing. The "so so" news is about YouTube. This build reverts back to not being able to play the native VP9 video codec. In fact, enabling the h264ify add-on actually results in the display being very choppy; eventually it stops, at which point attempting to close the browser window causes an application trap. I've attached a sample of such a trap given that the debug info is already in the XUL. Not sure if it helps or not, but I would not consider that to be a key issue at this point in time. Instead, the regression in YouTube playback is a problem. FF_T3_0093_01_TRP.zip

dmik commented 6 years ago

@dspiatkowski I'm pretty sure that YouTube problems have nothing to do with the build — it's just you got some timing issue back again. I'm pretty sure it will work after some time.

dmik commented 6 years ago

@dspiatkowski According to the trap report though, it's still SSE2. Are you sure the right XUL.DLL is used? How do you launch it? Please show me the about:buildconfig output.

bitwiseworks / mozilla-os2

youtube.com takes really long time to load #266