Huge performance regression in 2.0.17

miwelc commented 3 years ago

We've experienced a drastic performance regression (between 3x - 8x) in our app by updating from 2.0.16 to 2.017. On the other hand, size has decreased 14%. I suspect it's due to LLVM's new pass manager. Is there any flag we can use to select the legacy LLVM pass manager? This new behavior is preventing us from updating as performance is mission critical for our use case and we don't really care about size. Thanks.

sbc100 commented 3 years ago

I think you should be able to just pass -flegacy-pass-manager to the compiler. We should probably have included that in the ChangeLog.

miwelc commented 3 years ago

Would this flag affect system libraries compilation (libc, libc++...)?

miwelc commented 3 years ago

Passing -flegacy-pass-manager at compilation and --lto-legacy-pass-manager at link time doesn't change much.

sbc100 commented 3 years ago

Ah yes, the system libraries won't effected by -flegacy-pass-manager on the command line. To fully revert you would need to essentially revert #13818 (at least the emcc.py and building.py parts) then run ./embuilder --clear-cache, then rebuild your project.

miwelc commented 3 years ago

I should have said I was testing on Safari 14.0.3. On Chrome 90 performance doesn't seem to have any regression, regardless of flags.

After further testing on Safari:

No flags: very bad performance, as in 8x worse.
-flegacy-pass-manager and --lto-legacy-pass-manager: same very bad performance
-flegacy-pass-manager, --lto-legacy-pass-manager and -s BINARYEN_EXTRA_PASSES=--one-caller-inline-max-function-size=1: Much better performance but not as good as with 2.0.16
Only -s BINARYEN_EXTRA_PASSES=--one-caller-inline-max-function-size=1: Same performance as with 2.0.16 😄.

Regardless of flags: ~14% size reduction. Mostly from system libraries?

Seems like the new LLVM pass manager combined with the old Binaryen inlining behavior gives the better performance for our use case on both browsers.

Anyway, I think it would be nice when there are changes to defaults if changelog included the flags needed to revert to old behavior 😅.

sbc100 commented 3 years ago

Thats somewhat counter intuitive.. it sounds like the only way to match the old performance is to use the new pass manager.

I guess we may want to revisit the binaryen inlining change and/or file a bug with apple about that.

miwelc commented 3 years ago

I know... I don't get it but I've run each (flag combination x version) multiple times and those are the results I get... Anyway, I'm going to close this now. Thanks!

kripken commented 3 years ago

Good point about the changelog, thanks, I'll open a PR to update it for both those changes.

kripken commented 3 years ago

As for the perf issue, perhaps more inlining ends up hitting a worse case with Safari register allocation or something like that? If you can provide a testcase that could be interesting to look at. Otherwise let's see if there are more reports of issues.

mithrendal commented 3 years ago

I see performance is dropping when going from 2.0.16 to 2.0.17. it drops from 50fps to 8fps. Strange thing is that it drops only when using Firefox. Chrome and Safari are doing fine on 2.0.17 builds.

code is at https://github.com/dirkwhoffmann/virtualc64web

sbc100 commented 3 years ago

Did you try with -s BINARYEN_EXTRA_PASSES=--one-caller-inline-max-function-size=1.. if that solution doesn't work then likely you are no hitting the same issue.

mithrendal commented 3 years ago

thank you @sbc100 that setting brings back the performance of vc64web build with 2.0.17 on firefox to 50fps. Strangely in opposite to @miwelc 's project it does affect only firefox here. Safari runs fine without the setting whereas his project showed the slowdown only on Safari 🤔...

sbc100 commented 3 years ago

@kripken I wonder if we can find a reasonable a value to set --one-caller-inline-max-function-size to get most of the benefits without getting these negative side effects on certain engines?

mithrendal commented 3 years ago

Just tell me if you would like me to retest with another value than 1. 🤓

kripken commented 3 years ago

@mithrendal @miwelc

It would be good if both of you can test with more values of --one-caller-inline-max-function-size, yes. Perhaps you can bisect to find the specific value where change happens, then we can consider what to do with the default based on that.

Also, these are specific browser limitations or bugs that we need to investigate and file. Can you please create testcases showing the issue, that is, a build with the 'bad' and the 'good' settings? Then we can file bugs for them to look into. (I can help with the filing part.) Without filing bugs for them things may not improve.

mithrendal commented 3 years ago

ok first some testing for vc64web ...

still fast values=1,1024,2048,4096,16384,18k,19k,19250,19280,19296,19304,19306 bad values = 1048576, 65535, 32768,20480,20k,19500,19375,19312,19308,19307

mithrendal commented 3 years ago

machine is a mbp 2007 with macos10.11 and firefox 78.10.0esr (64-Bit)

logoutput for --one-caller-inline-max-function-size=19307

time[ms]=1520, audio_samples=65536, frames [executed=12, rendered=3] avg_fps=3
lost sync target=39, total_executed=6
lost sync target=38, total_executed=6
time[ms]=1565, audio_samples=45056, frames [executed=12, rendered=4] avg_fps=3
lost sync target=37, total_executed=6
lost sync target=37, total_executed=6
time[ms]=1509, audio_samples=90112, frames [executed=12, rendered=4] avg_fps=3

logoutput for --one-caller-inline-max-function-size=19306

time[ms]=1010, audio_samples=45056, frames [executed=51, rendered=51] avg_fps=45
time[ms]=1015, audio_samples=45056, frames [executed=50, rendered=51] avg_fps=45
time[ms]=1002, audio_samples=43008, frames [executed=51, rendered=46] avg_fps=45

any thoughts?

is it serious? or is it high enough to ignore it ? BTW what is the meaning of the number ? Is the unit of the number bytes or instructions ?

should I still provide the two wasm files?

kripken commented 3 years ago

@mithrendal

Thanks!

Yes, the wasm files would be very helpful. I'd like to see the change between them. And we can file a bug on the relevant browser with them.

The meaning of the number is the number of IR nodes in binaryen, which is almost exactly the number of instructions in wasm.

That the number is 19306/7 is interesting - this is not a small function that is being inlined. The next question is into what does it get inlined (I can see that with the wasm files).

mithrendal commented 3 years ago

@kripken I made the two builds and put them into the zip file ... is that maybe already enough and you are looking only for those .. or do you need the other js files which are needed to run the whole thing too?

wasm_builds.zip

see the whole thing running (currently with a 2.0.16 build) here https://dirkwhoffmann.github.io/virtualc64web/

the source files (and also the other js files which are needed to run it) are here https://github.com/dirkwhoffmann/virtualc64web

kripken commented 3 years ago

Thanks @mithrendal !

Yes, the other JS files will be needed for filing a browser bug. What are the steps to plug in the js/wasm combos from the archive, to get it running with that repo?

Meanwhile, inspecting the wasm files, it looks like the difference is func $215 is inlined into func $327. The inlined function is indeed of size 19307 as expected. Raw data before:

func: 215
 [binary-bytes] : 44778   
 [total]        : 19307   
 [vars]         : 6       
 Binary         : 4441    
 Block          : 265     
 Break          : 415     
 Call           : 144     
 Const          : 3493    
 Drop:          : 0
 GlobalGet      : 1       
 GlobalSet      : 2       
 If             : 36      
 Load           : 2601    
 LocalGet       : 4599    
 LocalSet       : 1075    
 Loop           : 0
 Select         : 288     
 Store          : 1280    
 Switch         : 1       
 Unary          : 665     
 Unreachable    : 1       

func: 327
 [binary-bytes] : 1306    
 [total]        : 558     
 [vars]         : 13      
 Binary         : 101     
 Block          : 29      
 Break          : 25      
 Call           : 14      
 Const          : 100     
 Drop           : 4       
 If             : 11      
 Load           : 56      
 LocalGet       : 137     
 LocalSet       : 36      
 Loop           : 1       
 Select         : 1       
 Store          : 30      
 Switch         : 1       
 Unary          : 10      
 Unreachable    : 2

and after:

func: 326
 [binary-bytes] : 46075   
 [total]        : 19863   
 [vars]         : 13      
 Binary         : 4542    
 Block          : 293     
 Break          : 440     
 Call           : 157     
 Const          : 3593    
 Drop           : 4       
 GlobalGet      : 1       
 GlobalSet      : 2       
 If             : 47      
 Load           : 2657    
 LocalGet       : 4735    
 LocalSet       : 1112    
 Loop           : 1       
 Select         : 289     
 Store          : 1310    
 Switch         : 2       
 Unary          : 675     
 Unreachable    : 3

(the combined new function is index 326 since 215 was removed, and then all subsequent indexes decreased by 1.

I don't think there is much we can do on the tools side since this looks like something we should inline:

Inlining reduces total code size.
The inlined call was in a loop - those are often good to inline for speed.
The inlining does not increase the total vars in the target (still 13; so there is no obvious signal of register pressure).

kripken commented 3 years ago

@lars-t-hansen

To summarize the last few comments in this issue, we have a testcase that dropped from 50fps to 8fps on Firefox, after binaryen started to do more inlining. There is no change on Chrome or Safari, so this is Firefox specific.

Investigating the wasm file, I don't see a reason not to perform the inlining (see last comment) - it looks like something the toolchain should be doing as best I can tell (that is, the new inlining heuristics look broadly better, and there is no obvious tweak to the heuristics to avoid this particular testcase).

Perhaps this is hitting a bad case in the Firefox register allocator, or some internal limit on the optimizing compiler, or tiering? We can file a bug for further investigation if that would be useful (still need some details from @mithrendal for how to run the code).

lars-t-hansen commented 3 years ago

@kripken, thanks, I'll investigate next week. I assume this is x64?

mithrendal commented 3 years ago

as vc64web will soon move to its new home address vc64web.github.io (we wait first to get the CORS related http-response headers set at related cross sites from which vc64web currently can fetch its retro content from the demo scene) I took this opportunity to publish the latest copy of it into it, (effectively use the new address for this issue) that is the build emsdk_2.0.17 with the problematic parameter --one-caller-inline-max-function-size=19307 the same that you analysed

so you will currently get that 2.0.17 version up and running here at its future address http://vc64web.github.io

(the current official address with the 2.0.16 build is still https://dirkwhoffmann.github.io/virtualc64web/)

when you start it, it asks you for ROM-chips. You may just hit install open-roms or press ESC now. In both case it will install three system roms from the mega65-open-roms project. But if you only! do this the slowdown will not yet! occur. This is because the open-roms do not include the romchip of the Commodore 1541 floppy drive and I found out that on the C++ side the slowdown occurs only! when emulating the 1541 floppy drive.

so the long function with 19307 WASM instructions seems to be this one in

/Emulator/C64.cpp line 1092 if (drive8.isActive()) drive8.execute(durationOfOneCycle);

see link to code here https://github.com/dirkwhoffmann/virtualc64web/blob/e913978fb06534843e48b9d05a8e06b805326764/Emulator/C64.cpp#L1092

@dirkwhoffmann can you confirm ? @lars-t-hansen had the question whether the code is x64 ? Can you say something on that matter too?

to get complete build (including the js-files) just head to https://github.com/vc64web/vc64web.github.io and click green button 'code' -> Download ZIP

to get the floppy drive8 activated you will have to click the missing 1541-rom-chip icon here.

Ohne Titel

try this file https://sourceforge.net/p/vice-emu/code/HEAD/tree/trunk/vice/data/DRIVES/d1541II?format=raw

as soon you loaded the missing floppy drive rom, the emulator will activate the emulation code for the 1541 floppy drive and begin to slow down dramatically when using firefox and when being compiled with --one-caller-inline-max-function-size=19307 or when the compiler option `--one-caller-inline-max-function-size' is not set at all

you can spot the perfstats in the web console they will be outputted every second. (there is also a live-debug-console which you can enable in the settings of vc64web)

if there is still need to build and compile then you will find the source code currently here https://github.com/dirkwhoffmann/virtualc64web and the how-to-make instruction here https://github.com/dirkwhoffmann/virtualc64web/wiki/build_and_run

kripken commented 3 years ago

lars-t-hansen had the question whether the code is x64 ?

I think the question was where you saw the regression - is your machine x64 (as opposed to 32-bit, or ARM)? Info about the browser, CPU, and OS can be helpful in figuring out why it became slower on a particular combination.

dirkwhoffmann commented 3 years ago

I think the question was where you saw the regression - is your machine x64

Mithrendal was using Firefox 78.10.0esr (64-Bit) on a MacBook Pro from 2007 (Intel core2duo 2.2 GHz) with macOS 10.11 El Capitan.

mithrendal commented 3 years ago

today I changed the current app at https://dirkwhoffmann.github.io/virtualc64web/

to a 2.0.17 --one-caller-inline-max-function-size=19306 build

the firefox problematic 2.0.17 / 19307 build is still at https://vc64web.github.io

for building I used an intel macmini on bigsur latest patch level with emsdk 2.0.17.

test running was at a slow 2007 mbp machine as @dirkwhoffmann already descibed

lars-t-hansen commented 3 years ago

I have hardware like that but the latest Mac OS I have for it is 10.6.8, on which esr78 is not supported. Updating to 10.11 is a little bit of a project, as it's not exclusively my computer.

@mithrendal, I'm wondering, can you rerun the experiment but first go into about:config and set javascript.options.wasm_baselinejit (it may be called something slightly different than that but it should be obvious) to false? The most obvious explanation for the problem is that you're being bitten by tiering / being stuck in slow code, and it would be good to rule that out. (Load times may increase.)

mithrendal commented 3 years ago

@lars-t-hansen I did set the baslinejit to false and it changed not anything, I spotted between 2 and 4 FPS on the old machine with that setting...

then I did a test on a newer machine a intel mac mini 2014 2.6 Ghz Big Sur 11.2.3 Firefox 88 64Bit: with that newer machine

I got full 60 FPS when running the emsdk 2.0.17 build with --one-caller-inline-max-function-size=19306 and below https://dirkwhoffmann.github.io/virtualc64web/

and around 12 FPS when running with--one-caller-inline-max-function-size=19307and above ... https://vc64web.github.io/

lars-t-hansen commented 3 years ago

Thanks. It'll be easier to test on the newer hardware with newer Firefox, I'll try to repro.

lars-t-hansen commented 3 years ago

OK, I can repro on my current Linux desktop system, life is good. https://bugzilla.mozilla.org/show_bug.cgi?id=1708381.

mithrendal commented 3 years ago

@lars-t-hansen vc64web now finally moved to its new home address http://vc64web.github.io , which you used as a test case for the aggressive inline problem on firefox bug 1708381

that means the address serves no longer the build with the problematic inlining for firefox. As far as I understood from the bugzilla ticket you and your colleges have already isolated a small testcase which shows the same problem right? Anyway, if you still need that problematic build I can make one for you on another address. Good luck !

lars-t-hansen commented 3 years ago

@mithrendal, we do indeed have a separate test case for this, thanks for letting us know. The bug referenced above should have all the details.

julian-seward1 commented 3 years ago

@mithrendal I've been working on a fix for this (https://bugzilla.mozilla.org/show_bug.cgi?id=1712078). I'd like to verify that it really fixes the performance problem on the problematic build. So .. could you please make it available on another URL, per your comment above?

mithrendal commented 3 years ago

Hi @julian-seward1

sure 🤓

build with maximal inlining (parameter --one-caller-inline-max-function-size is omitted)

published to https://vc64web.github.io/devbuild/

build with limited inlining (--one-caller-inline-max-function-size=19307)

published to https://vc64web.github.io

julian-seward1 commented 3 years ago

@mithrendal Thanks. With the patch at https://bugzilla.mozilla.org/show_bug.cgi?id=1712078#c9 I got the following results:

Limited inlining, with patch 56 fps Max inlining, with patch 56 fps Max inlining, without patch 12 fps

In other words .. the patch fixes this problem.

andrewevstyukhin commented 2 years ago

Hi, after I upgraded to emscripten 2.0.17, browser started to crash with long running application by out of memory condition (7GB for Chrome). -s BINARYEN_EXTRA_PASSES=--one-caller-inline-max-function-size=1 helped to return stability but increased wasm size. I'm confused why inlining tends to OOM.

kripken commented 2 years ago

It sounds like the OOM is of the browser itself? Inlining leads to larger function sizes, which can require more work by the VM. For example, if the VM does work that is O(N^2) in the number of locals (to do register allocation, for example), then inlining that adds more locals (usually the case) can raise memory usage by a lot.

andrewevstyukhin commented 2 years ago

Yes, browser eats memory till crash. Problem stably happens in -s ASYNCIFY=1 builds which have no precise settings. Probably I have to test the well-tuned asyncify project to see the global impact.

imho Amount of locals can be reduced by splitting big functions into subfunctions (each outer loop for example). It's like a research to measure all browsers and derive good enough ranges.

Maybe isUnderSizeLimit should be configurable to further reduce long compilation time under Safari.

kripken commented 2 years ago

You can try -Os (instead of -O3) as well as the various wasm-opt inlining tuning flags, like --flexible-inline-max-function-size (in addition to --one-caller-inline-max-function-size that is already mentioned above). That should reduce the amount of large functions in the output.

andrewevstyukhin commented 2 years ago

I already use -Os -flto. Anyway --flexible-inline-max-function-size seems a handy parameter for upcoming tuning. A nice hint, thanks!

emscripten-core / emscripten

Huge performance regression in 2.0.17 #13899