bitwiseworks / libc

LIBC Next (kLIBC fork)
9 stars 3 forks source link

wl: out of memory #84

Open dmik opened 4 years ago

dmik commented 4 years ago

After fixing too long operation of emxomfld (see #83), I faced another problem with linking Chromium: WL.EXE (Watcom Linker) runs out of memory killing the system, see https://github.com/bitwiseworks/qtwebengine-os2/issues/1#issuecomment-646584092.

Watcom Linker is not part of kLIBC or LIBCn but I will track this problem here as we don't have a controlled repository of it so far (perhaps, we should create one to not depend on Watcom maintainers and poor mercury VCS support on OS/2).

dmik commented 4 years ago

I’m summoning @stevenlevine here to help me out with building and debugging wlink.

dmik commented 4 years ago

I've tried to run WL.EXE with a response file EMXOMFLD generates for it on a freshly booted system (and NOT from EMXOMFLD) — and it worked! At least I got the list of missing externals. Unfortunately, I didn't catch the output so doing it again.

So it's definitely OOM and w/o EMXOMFLD in memory there seems to be enough. I guess I should change it to make all allocations dynamic (all those four 200 KB hash maps and whatever else I will find) to make them work together.

dmik commented 4 years ago

OK, the logging run also succeeded. I was watching memory usage by WL.EXE in Theseus and here's the maximum usage level: 2240 MB "shared" (which actually means high mem) + 324 MB "private" (low mem), i.e. 2564 MB in total (with free memory dropping down to 434 MB). Not bad but this is close to the maximum limit a process can have at all (don't forget DLLs). And it took 1 hour 7 minutes. This is way too slow of course and we should look at it's hash table usage, perhaps there is also an easy solution to speed it up.

Anyway, now I got the full list of unresolved externals. Will report it to the Chromium ticket.

dmik commented 4 years ago

Hmm, actually WL.EXE ended up with an error: it run out of LIBC heap already. Just didn't crash the kernel for some reason. The last message in the log is this one:

Error! E3009: dynamic memory exhausted

So it seems that besides resolving externals we also have to fix memory issues in WL anyway... And this may bring more undefined symbols actually.

StevenLevine commented 4 years ago

Is the OOM for a build with or without debugging data?

dmik commented 4 years ago

With debugging data as I need it (but some objects are compiled with my hacked EMXOMF).

StevenLevine commented 4 years ago

See my comment on #82 for some testing suggestions and the log ouput I need to see.

dmik commented 4 years ago

Yes I'm trying what you suggested. BTW, do you have any idea on why the WL part is so slow (given a response file with all Chromium files to link, it takes more than an hour before it spits out dynamic memory exhausted. Do you have any idea what could make it so slow? May be there are some hash tables with ridiculously small sizes, like it was in EMXOMFLD?

StevenLevine commented 4 years ago

When you run with VERBOSE, what stage is running out of memory?

How many files are mentioned in the response file?

Why not attach the response file to the issue so I don't have to ask so many questions?

If wlink needs to use spill files and virtmem, there will be a time cost.

The wlink algorithms are usually pretty efficient. Wlink has the reputation of being on of the faster linkers available. That said, as you found with emxomfld, almost anything can be improved once a bottleneck is identified.

How does the link time compare to your xul.dll link times? How xul.dll is about 30MB and xul.dbg is about 350MB. How does this compare the size of the DLL and debug data of the Chromium DLL you are building?

dmik commented 4 years ago

I've added OPTION VERBOSE and OPTION NOCACE and this brought me nowhere after 17 hours of waiting!!! The only output I got from WL is this:

** EXPERIMENTAL (HLL) ** Open Watcom Linker Version 2.0beta1 Limited Availability
Portions Copyright (c) 1985-2002 Sybase, Inc. All Rights Reserved.
Source code is available under the Sybase Open Watcom Public License.
See http://www.openwatcom.org/ for details.
loading object files
searching libraries
Error! E3013: break key detected

W/o these options it would finish in 1 hour or so.

Regarding the response file, here is its beginning and end:

FORMAT OS2 LX DLL INITINSTANCE TERMINSTANCE
NAME 'Qt5WebCd.dll'
OPTION MAP='Qt5WebCd.map'
DEBUG HLL
OPTION VERBOSE
OPTION NOCACHE
OPTION OSNAME='OS/2 EMX'
OPTION CASEEXACT
OPTION STUB='D:\CODING\LIBC\MASTER-BUILD\OPT\OMF\os2stub.bin'
DISABLE 1121
DISABLE 1121
...
OPTION MODNAME='Qt5WebCd'

The rest is FILE and LIBRARY statements (~5 000 and ~500 correspondingly, with around ~10 000 of object files in libraries, which gives around 15 000 object files in total). The total response file size is 450 KB, so I'm not going to attach it here.

I can create a test ZIP for you with all the libraries and object files but I'm afraid it will be a couple of GB. Ready for that?

Re Mozilla, I don't recall exactly, but the link stage did never take more than several dozen min there. My explanation is that there is a lot of C code in Mozilla while Chromium uses mostly C++ everywhere (which increases the amount of symbols by some orders of magnitude).

dmik commented 4 years ago

Dropping OPTION NOCACE made it finish (in 1 h 50 min) with the OOM error again. It doesn't look that OPTION VERBOSE changed anything. I got basically the same output.

StevenLevine commented 4 years ago

I had a feeling that NOCACHE might slow things down.

Based on the VERBOSE output, do you have a feel at which stage wlink is spending the most time? I suspect it gets through "loading object files" in a reasonable time and then spends the most of the rest of the time "searching libraries." Is this true?

I suspect I am going to have to get familiar with wsample, which is what OpenWatcom provides for performance monitoring. I will use the testcase we have for #72 for initial analysis. It looks to be big enough to provide some useful data. If I need something bigger, I will ask.

BTW, you can use psfiles as a poor-man's file access monitor.

psfiles | find "LDCONV_"

will tell you which relevant files, if any, wlink has open for processing.

psfiles | grep -Fe ".OBJ" -e ".LIB"

might give better coverage. This might also provide some insight on the rate of progress with NOCACHE. Running these in a loop with a 5 or 10 second delay can often provide useful information.

Given your numbers, it appears that wlink is taking something on the order of .25 seconds per file. This is not terrible considering the complexity of the code, but we probably can do better once we identify the hotspots.

A bit of searching on the web shows chromium builds can be time consuming on other platforms. However, I have not found anything that reports link times separately from source compile times.

Another thing I noticed is that there is a trend for larger projects is to use unity builds which can cut the build times down massively.

dmik commented 4 years ago

Thanks for the psfiles hint. Will have a look. And yes, most time was spent in searching libraries according to my observations.

0.25 s per file — well, it's a bit too much to my taste. It's just linking where you want to simply enumerate the symbols and put them all together performing some fixups (not compiling where you have to actually generate the code taking all its complexity into account).

I regularly build Qt and Chromium within it on my macOS (comparable hardware) and it's taking much much less. Compiling + linking, together — around 2h IIRC (the Chromium part, I mean).

dmik commented 4 years ago

BTW, none of the p4web links at www.openwatcom.org work ATM (may be due to the Tropical Storm Notice that is now there). So I can't access anything there. But I see that its github mirror (https://github.com/open-watcom/owp4v1copy/) gets regular updates, the latest commit is from June, 8, 2020. Steven, can you confirm that all OS/2 patches are there (except your last one from #82)?

At least, I see there all commits you and Gregg mentioned in the mails from Jan, 08, 2020.

StevenLevine commented 4 years ago

Marty, the OpenWatcom admin, is having a COVID-19 related issue. Perforce provides a free license to OpenWatcom because OpenWatcom is an open source project. It appears that things are backed up at Perforce and Marty is still waiting for the license key to be issued. I'm hoping for this to happen in the next few days. If the problem persists, we will have to rehost the sources somewhere.

So far this year, the weather has not been a problem.

Jiri mirrors the Perforce repository and owp4v1copy is up to date in the sense that it contains everything the Perforce repo contained when the license expired.

StevenLevine commented 4 years ago

I think I am ready for a larger testcase. A bit of experimenting with wsample and wprof indicates they should help track down the hotspots. Using your genmatch.exe testcase with an unbroken hex.obj, provides results that make sense. wlink uses about 75% of it's time in DosRead, but that's to be expected given the testcase. wlink spends a lot of time in the lsymtab and mem modules.

This should change significantly with a testcase that more closely matches what you are doing.

Something that runs for about 15 minutes or so should provide a decent sample size although I would have no problem handling what you are currently linking. I've got plenty of disk space. I recommend you use something other than zip to archive the files. A 7z or tar.bz2 archive will be significantly smaller.

dmik commented 4 years ago

Ok, I will prepare a 7z archive for you now. In the mean time, here is the WL popup log I got just now:

06-30-2020  22:57:07  SYS3175  PID 0c1c  TID 0001  Slot 009c
C:\USR\BIN\WL.EXE
c0000005
0003f509
P1=00000001  P2=00000010  P3=XXXXXXXX  P4=XXXXXXXX
EAX=0008738c  EBX=0008734c  ECX=00000000  EDX=0008735c
ESI=00000000  EDI=00087344
DS=0053  DSACC=d0f3  DSLIM=bfffffff
ES=0053  ESACC=d0f3  ESLIM=bfffffff
FS=150b  FSACC=00f3  FSLIM=00000030
GS=0000  GSACC=****  GSLIM=********
CS:EIP=005b:0003f509  CSACC=d0df  CSLIM=bfffffff
SS:ESP=0053:0008732c  SSACC=d0f3  SSLIM=bfffffff
EBP=00000056  FLG=00010297

WL.EXE 0002:0001f509

It's not your last build though, it's the last RPM build. Dunno if it tells you anything. Here it looked like an OOM.

dmik commented 4 years ago

Ok, the test 7z is being uploaded as http://rpm.netlabs.org/test/wloom.7z. Please wait before it's fully there (about 20 minutes). It is about 390 MB packed and 3.7 GB unpacked. Inside you will find all objects and static/import libraries except a few standard libraries we distribute in .lib format as RPM (those names start with C:\usr\lib\... in the WL response file which is also in the archive). Once you make sure all library RPMs are installed, you may run t.cmd to get an OOM message (here it took 1 hour 30 minutes). Pleasea let me know if you succeed with this test case.

I got the list of the remaining missing symbols and working on them in the mean time.

StevenLevine commented 4 years ago

It tells me a lot. It tells me you are not using the wl.exe build that resolves this exception. It also tells me you are still using a broken emxomf that generatedsbroken .obj files. It is also not an OOM.

The trap is at:

obj2supp.c:256 case FIX_FRAME_SEG: if( !frame->u.sdata ) { LnkMsg( LOC+ERR+MSG_FIXUP_MISSING_THREAD, "dA", frame->type, tgt_addr ); // 2020-06-21 SHL break; }

As you noticed, in the old version of wl.exe, the code attempts to dereference the null pointer and traps.

StevenLevine commented 4 years ago

Thanks. The testcase arrived without problems and should do what I need it too.

The initial results confirm what I suspected already. The code is spending most of the measured time in GlobalSearchSym so I will be looking at what can be done to optimize this part of the code.

Almost all of the rest of the time is spend reading object files. The profiler shows this time as spent in DosRead.

dmik commented 4 years ago

Yes, I have to use a "broken" EMXOMF because I get OOM for some Chromium C++ sources otherwise. I guess I will give up attempts to build Chromium with debug info (until EMXOMF/WL OOM problems are resolved) and try a release build instead.

Thank you very much for helping out with WL!

StevenLevine commented 4 years ago

Please try

wl_exe-20200702.zip

at www.warpcave.com/openwatcom. The hash table has been bumped to 65521 from the original 1789, which reduces the effects of collisions noticbly in my testing. There are still some easy easy optimizations that I plan to do. memcmp is used to compare symbols. As the code is written memcmp is not inlined.

I was thinking of replacing memcmp with a custom version that compared the characters right to left, but if the hash function is good enough, there will be no real benefit to this.

Have you given any thought to how big Qt5WebCd.dll is going to be once built? I did a quick back of the napkin analysis using accessibility_jumbo_1.o, just because. It is about 73K. dmpobj tells me there is about 7K of LEDATA destined for code and data segments. the rest is relocation tables, symbol defintions and debug data. Assuming that this ration is typical, this is going to generate one rather large DLL.

dmik commented 4 years ago

Thanks, Steve! With the release build and -s (no debug info at all) it worked really fast indeed! See https://github.com/bitwiseworks/qtwebengine-os2/issues/1 for more info. I'm not sure what participated more here though -- the absence of the debug info or your fix. We will see.

Regarding the estimated DLL size -- only a wild guess with something like a 100 MB DLL.

StevenLevine commented 4 years ago

The lack of debug info helps somewhat, but debug data processing is mostly disk IO and such.

The wlink symbol table tweak was very noticable here. I was only testing pass 1, so the debug data mostly affect disked read time. Without the tweak, the hash table load factor was so high that in effect the search was linear time. The load factor is still pretty high, but it is 1/36th as bad as it was before.

What really should be done is rework the code to dynamically resize the hash table without using more memory, but it will take some care the do this without breaking anything.

It will be interesting to see how building a DLL of this size works out. I dimly recalled running into a 32MB limit problem at one time, but I tracked down my notes and this should not affect us. Back in 2012, Dave managed to generate an xul.dll with a page shift offset of 9, which the DLL to 32MB per object and the DLL refused to load. We resolved the issue by forcing Alignment=4096, which theoretically allows 256MB per object. The firefox builds you released all seem to have aligment 4096 and page shift offset 0, so they do not run into Dave's issue.

dmik commented 4 years ago

Note that XUL.DLL with the debug info is 428 MB in size and it loads fine. But I guess debug info is kept separately and not loaded into memory.

dryeo commented 4 years ago

With GCC 9.2.0, it blows up to almost 700MB, links and loads fine.

StevenLevine commented 4 years ago

Please try

www.warpcave.com/openwatcom/wl_exe-20200703.zip

The build ensures that the memcmp is inlined in GlobalSearchSym(). This should provide a noticable, but not massive, speedup.

The DLL file size is not what matters. There is a 65K page limit although I have lost track of where this is documented.

dmik commented 4 years ago

Did the old test with the debug info (of which you have a ZIP) with your last WL build. The difference is not that substantial. — 0 h 53 m vs 1 h 16 m with the RPM build of WL.

StevenLevine commented 4 years ago

30% is not insubstantial. There's a lot of going on beyond symbol table lookups. To visualize where the time is going, I recommend you run wlink under wsample and then use wprof to view the distribution. Try:

wsample -rate=10 wl @wl.rsp

After you have captured enough data kill the link and then it's

wprof wl.smp

dmik commented 4 years ago

Okay, may be my wording is not correct, but it still takes quite a bit of time is what I meant. I will try that, thanks!

StevenLevine commented 4 years ago

Unfortunately, some of the time usage is to be expected. There's a lot of disk IO and data shuffling, although there are still opportunities for optimization.

It's too bad that wprof only understands dwarf debug data. It might be useful for finding hot spots in your qt5 code. The other option is the VAC profiler, but it's a bit more difficult to use because we are compiling with gcc.

Steven

--

"Steven Levine" steve53@earthlink.net Warp/DIY/BlueLion etc. www.scoug.com www.arcanoae.com www.warpcave.com

dmik commented 4 years ago

I've tried to compile everything in debug mode with -s forced as well as we do in release builds (i.e. no HLL in object files) but in this scenario the latest WL is still taking way too long — didn't finish in 30 min and hung the system because of OOM (linking the release DLL finishes in 2-3 min which is tolerable).

I have no idea which could cause these delays given that all HLL stuff is gone. My guess is that it's due to -O0 which increases number of real symbols (which are optimized out in release builds due to -O3 and such). Another idea is that linking is done w/o -s -- to preserve public symbol names so at least we get them in .TRP stack traces. With the release build it works perfectly but for the debug it might still be unbearable.

dmik commented 4 years ago

No, giving -s to the linker doesn't change a thing. It must be -O0 then. Ok, I will have to come up with something else to have debug code enabled in Chromium...

StevenLevine commented 4 years ago

In bitwiseworks/libc/issues/84/662984639@github.com, on 07/23/20 at 05:43 AM, Dmitriy Kuminov notifications@github.com said:

I've tried to compile everything in debug mode with -s forced as well as we do in release builds (i.e. no HLL in object files) but in this scenario the latest WL is still taking way too long didn't finish in 30 min and hung the system because of OOM (linking the release DLL finishes in 2-3 min which is tolerable).

I have no idea which could cause these delays given that all HLL stuff is gone. My guess is that it's due to -O0 which increases number of real symbols (which are optimized out in release builds due to -O3 and such). Another idea is that linking is done w/o -s -- to preserve public symbol names so at least we get them in .TRP stack traces. With the release build it works perfectly but for the debug it might still be unbearable.

Regards,

Steven

--

"Steven Levine" steve53@earthlink.net Warp/DIY/BlueLion etc. www.scoug.com www.arcanoae.com www.warpcave.com

StevenLevine commented 4 years ago

I've tried to compile everything in debug mode with -s forced as well as we do in release builds (i.e. no HLL in object files) but in this scenario the latest WL is still taking way too long didn't finish in 30

I find this hard to understand. What I recommend you do is use OpenWatcom's dmpobj to look at the generated OMF files and figure out what is different.

My plan is to optimize the wlink symbol table handling a bit more to speed it up. However, whatever I do must not increase memory usage significantly or there will be no overall benefit.

One random thought I had is you could upload a diff for your hacked emxomf. I might be able to come up with a solution that does not break any object files.

Have you tried any of the -feliminate... options to see if they reduce the size of the debug info significantly.

I have no idea which could cause these delays given that all HLL stuff is gone. My guess is that it's due to -O0 which increases number of real symbols

When I use dmpobj to compare a simple testcase of -O0 to -O2 the only noticable difference is the size of the generated code. The debug data seems unchanged

symbol names so at least we get them in .TRP stack traces. With the release build it works perfectly but for the debug it might still be unbearable.

I suggest you revisit my suggestion to enhance emxomf to take a list of files which need debug data retained. I contend that you will never need to visit the vast majority of the code with the debugger. This code just going to work because it's been regression tested on multiple platforms.

dmik commented 4 years ago

It might be due to the mix of the old and new build output. I've started a fresh debug build with -s, -O2 and emxomf with no hacks but it hasn't finished yet so I can't tell if it's buildable. If it is, I will try it with -O0 again then. Then I will provide you with more info regarding -O0 vs -O2.

No, I haven't tried -feliminate yet but I will keep that in mind.

Regarding selective usage of HLL (via -g). It's not practical to do so in case of Chromium. The first reason is that build files that carry source file lists and compile/link options for them are not intended for frequent changing. The GN build system assumes that they are rather constant (it generates binary trees out of them and so forth). So It takes quite a while to regenerate everything (and then rebuild affected files!) once you change some build file. It slows down things considerably here.

Second, it's C++. Various pieces of code are frequently mutually reused. And if I enable HLL for some piece which uses another piece that doesn't have HLL, it will be rather pointless as I won't have complete debug info for an execution stack that involves those pieces.

Actually, what I really need from HLL is line numbers (and static/nonexport symbol names). I usually don't need local variables & stuff. You can take this into account when looking at EMXOMF. I guess we can introduce a special debug info level switch that will limit EMXOMF to only emit line number info (and static symbol names). This will both save us time and memory.

More over, GCC actually already has this notation. Look for -glevel here: https://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html#Debugging-Options. Level 1 there (i.e. -g1) is exactly what I'm talking above: symbol names and line numbers.

BTW what I also really need is complete stack traces in optimized GCC builds in EXCEPTQ trap reports. Seems that EXCEPTQ doesn't support stack frames which are not based on EBP (which is frequent in case of optimized code, -O2 and above). See also here: https://github.com/bitwiseworks/qtwebengine-chromium-os2/issues/13#issue-665438575. If you could sort out this problem, then together with line numbers (see above) it would be already more than enough for debugging quite complex cases.

Regarding the hacked EMXOMF, the diff is really simple, here it is:

diff --git a/src/emx/src/emxomf/stabshll.c b/src/emx/src/emxomf/stabshll.c
index 5bfdc38..6a4e470 100644
--- a/src/emx/src/emxomf/stabshll.c
+++ b/src/emx/src/emxomf/stabshll.c
@@ -3938,6 +3938,7 @@ static void write_changseg (void)

 void convert_debug (void)
 {
+return;
   int i;
   struct type *t1, *t2;
 #if defined (HLL_DEBUG)
dmik commented 4 years ago

I've tried a fresh debug build with -O2 and -O0. The one with -O2 eventually works (the DLL size is 350 MB compared to 250 MB of the same -O2 build in release mode). However, an attempt to link stuff compiled with -O0 fails just as I described above. WL hangs forever until eats up all memory and crashes the system (which may take ages). So I bet it's the large number of symbols it can't cope with.

dmik commented 4 years ago

Although -O2 debug links, it doesn't load. Any attempt to start an exe with this DLL results in:

SYS0008: There is not enough memory available to process this command.
All available memory is in use.

I guess I will give up with attempting to use debug facilities for now.

Note that HIGHMEM does not work on this DLL, it hangs forever (I guess it's OOM as well). Newer EXEHDR I can't try as Steven's WarpCave is not available ATM (and I lost my previous download).

dmik commented 4 years ago

BTW, I found what's wrong with HIGHMEM. It hangs forever in a call to DosRead which reads all 350 MB as a single file into a buffer allocated on a high memory LIBCn heap. For some reason, this screws up the kernel completely. Perhaps highmem support in DosRead is buggy. Really sad.

dmik commented 4 years ago

In the above commit, I fixed HIGHMEM so that it does not read the whole file. In fact, only a few hundred bytes are necessary to change the object table bits (less than 300 bytes usually, depending on the number of objects). This immediately fixed my hangs when trying to set the code segment load high in the debug DLL with HIGHMEM.

And — and this is good news — doing so finally made the debug DLL (built with -O2) load! What a relief. Not a surprise it couldn't fit into the low shared memory given that the code object of this DLL is 205 MB in size (the release DLL's code object size is just 138 MB).

Now I can see a lot of debug messages from Chromium authors at least, but that's off topic here.

dmik commented 4 years ago

Re DosRead problems, I was given a hint that it might be a JFS driver, not the Kernel. But I have no other FS to experiment. Here it hangs only when loading a DLL file. If I give it another kind of file (as big as 500 or even 600 MB) it loads it fine. Which means that DosRead somehow detects that the file is a DLL, and does some special processing for it. I don't see other explanation at least.

dmik commented 4 years ago

Ok, I found what causes DosRead to bring down the kernel. It's the JFS cache size. I had no /CACHE switch on the JFS.IFS statement in CONFIG.SYS. Readme says it would allocate 12.5% of "real memory" w/o mentioning what is real memory. Setting the cache size explicitly to 256 MB with /CACHE:262144 made HIGHMEM work w/o hangs! And free system memory, according to Theseus, dropped down from 226 MB to just 24 MB. Which tells me that the default JFS cache size was something around 50 MB (and this screwed DosRead up on big files). Perhaps, this is worth an ArcaNoae ticket. Clearly, a JFS bug.

BTW, it might also explain why only DLL files are affected if caching logic differs between file types.

dmik commented 4 years ago

Ok, using /CACHE:262144 has a fatal side effect. With it, any attempt to start the QtWebEngie test case, either a release version, or a debug version, ends with SYS0008. Regardless of whether the DLLs are set to load high or low. All this is really, really strange. I simply lack (A LOT OF) information here.

dryeo commented 4 years ago

I've never been able to get a stable system with /CACHE:262144 and even with half that, it can be hit and miss. You could try, PROCESSES=128 THREADS=511 in config.sys to gain some kernel memory Dave

OS4User commented 4 years ago

"using /CACHE:262144 has a fatal side effect. With it, any attempt to start the QtWebEngie test case, either a release version, or a debug version, ends with SYS0008"

Cannot confirm this on OS/4 - everything works ok without any SYS0008.

StevenLevine commented 2 years ago

What is our status with this issue?

SilvanScherrer commented 7 months ago

hmm did one follow up on that?