Open miwelc opened 3 years ago
I think you should be able to just pass -flegacy-pass-manager
to the compiler. We should probably have included that in the ChangeLog.
Would this flag affect system libraries compilation (libc, libc++...)?
Passing -flegacy-pass-manager
at compilation and --lto-legacy-pass-manager
at link time doesn't change much.
Ah yes, the system libraries won't effected by -flegacy-pass-manager
on the command line. To fully revert you would need to essentially revert #13818 (at least the emcc.py and building.py parts) then run ./embuilder --clear-cache
, then rebuild your project.
I should have said I was testing on Safari 14.0.3. On Chrome 90 performance doesn't seem to have any regression, regardless of flags.
After further testing on Safari:
-flegacy-pass-manager
and --lto-legacy-pass-manager
: same very bad performance-flegacy-pass-manager
, --lto-legacy-pass-manager
and -s BINARYEN_EXTRA_PASSES=--one-caller-inline-max-function-size=1
: Much better performance but not as good as with 2.0.16-s BINARYEN_EXTRA_PASSES=--one-caller-inline-max-function-size=1
: Same performance as with 2.0.16 😄.Regardless of flags: ~14% size reduction. Mostly from system libraries?
Seems like the new LLVM pass manager combined with the old Binaryen inlining behavior gives the better performance for our use case on both browsers.
Anyway, I think it would be nice when there are changes to defaults if changelog included the flags needed to revert to old behavior 😅.
Thats somewhat counter intuitive.. it sounds like the only way to match the old performance is to use the new pass manager.
I guess we may want to revisit the binaryen inlining change and/or file a bug with apple about that.
I know... I don't get it but I've run each (flag combination x version) multiple times and those are the results I get... Anyway, I'm going to close this now. Thanks!
Good point about the changelog, thanks, I'll open a PR to update it for both those changes.
As for the perf issue, perhaps more inlining ends up hitting a worse case with Safari register allocation or something like that? If you can provide a testcase that could be interesting to look at. Otherwise let's see if there are more reports of issues.
I see performance is dropping when going from 2.0.16 to 2.0.17. it drops from 50fps to 8fps. Strange thing is that it drops only when using Firefox. Chrome and Safari are doing fine on 2.0.17 builds.
Did you try with -s BINARYEN_EXTRA_PASSES=--one-caller-inline-max-function-size=1
.. if that solution doesn't work then likely you are no hitting the same issue.
thank you @sbc100 that setting brings back the performance of vc64web build with 2.0.17 on firefox to 50fps. Strangely in opposite to @miwelc 's project it does affect only firefox here. Safari runs fine without the setting whereas his project showed the slowdown only on Safari 🤔...
@kripken I wonder if we can find a reasonable a value to set --one-caller-inline-max-function-size
to get most of the benefits without getting these negative side effects on certain engines?
Just tell me if you would like me to retest with another value than 1. 🤓
@mithrendal @miwelc
It would be good if both of you can test with more values of --one-caller-inline-max-function-size
, yes. Perhaps you can bisect to find the specific value where change happens, then we can consider what to do with the default based on that.
Also, these are specific browser limitations or bugs that we need to investigate and file. Can you please create testcases showing the issue, that is, a build with the 'bad' and the 'good' settings? Then we can file bugs for them to look into. (I can help with the filing part.) Without filing bugs for them things may not improve.
ok first some testing for vc64web ...
still fast values=1,1024,2048,4096,16384,18k,19k,19250,19280,19296,19304,19306 bad values = 1048576, 65535, 32768,20480,20k,19500,19375,19312,19308,19307
machine is a mbp 2007 with macos10.11 and firefox 78.10.0esr (64-Bit)
logoutput for --one-caller-inline-max-function-size=19307
time[ms]=1520, audio_samples=65536, frames [executed=12, rendered=3] avg_fps=3
lost sync target=39, total_executed=6
lost sync target=38, total_executed=6
time[ms]=1565, audio_samples=45056, frames [executed=12, rendered=4] avg_fps=3
lost sync target=37, total_executed=6
lost sync target=37, total_executed=6
time[ms]=1509, audio_samples=90112, frames [executed=12, rendered=4] avg_fps=3
logoutput for --one-caller-inline-max-function-size=19306
time[ms]=1010, audio_samples=45056, frames [executed=51, rendered=51] avg_fps=45
time[ms]=1015, audio_samples=45056, frames [executed=50, rendered=51] avg_fps=45
time[ms]=1002, audio_samples=43008, frames [executed=51, rendered=46] avg_fps=45
any thoughts?
is it serious? or is it high enough to ignore it ? BTW what is the meaning of the number ? Is the unit of the number bytes or instructions ?
should I still provide the two wasm files?
@mithrendal
Thanks!
Yes, the wasm files would be very helpful. I'd like to see the change between them. And we can file a bug on the relevant browser with them.
The meaning of the number is the number of IR nodes in binaryen, which is almost exactly the number of instructions in wasm.
That the number is 19306/7 is interesting - this is not a small function that is being inlined. The next question is into what does it get inlined (I can see that with the wasm files).
@kripken I made the two builds and put them into the zip file ... is that maybe already enough and you are looking only for those .. or do you need the other js files which are needed to run the whole thing too?
see the whole thing running (currently with a 2.0.16 build) here https://dirkwhoffmann.github.io/virtualc64web/
the source files (and also the other js files which are needed to run it) are here https://github.com/dirkwhoffmann/virtualc64web
Thanks @mithrendal !
Yes, the other JS files will be needed for filing a browser bug. What are the steps to plug in the js/wasm combos from the archive, to get it running with that repo?
Meanwhile, inspecting the wasm files, it looks like the difference is func $215
is inlined into func $327
. The inlined function is indeed of size 19307 as expected. Raw data before:
func: 215
[binary-bytes] : 44778
[total] : 19307
[vars] : 6
Binary : 4441
Block : 265
Break : 415
Call : 144
Const : 3493
Drop: : 0
GlobalGet : 1
GlobalSet : 2
If : 36
Load : 2601
LocalGet : 4599
LocalSet : 1075
Loop : 0
Select : 288
Store : 1280
Switch : 1
Unary : 665
Unreachable : 1
func: 327
[binary-bytes] : 1306
[total] : 558
[vars] : 13
Binary : 101
Block : 29
Break : 25
Call : 14
Const : 100
Drop : 4
If : 11
Load : 56
LocalGet : 137
LocalSet : 36
Loop : 1
Select : 1
Store : 30
Switch : 1
Unary : 10
Unreachable : 2
and after:
func: 326
[binary-bytes] : 46075
[total] : 19863
[vars] : 13
Binary : 4542
Block : 293
Break : 440
Call : 157
Const : 3593
Drop : 4
GlobalGet : 1
GlobalSet : 2
If : 47
Load : 2657
LocalGet : 4735
LocalSet : 1112
Loop : 1
Select : 289
Store : 1310
Switch : 2
Unary : 675
Unreachable : 3
(the combined new function is index 326 since 215 was removed, and then all subsequent indexes decreased by 1.
I don't think there is much we can do on the tools side since this looks like something we should inline:
@lars-t-hansen
To summarize the last few comments in this issue, we have a testcase that dropped from 50fps to 8fps on Firefox, after binaryen started to do more inlining. There is no change on Chrome or Safari, so this is Firefox specific.
Investigating the wasm file, I don't see a reason not to perform the inlining (see last comment) - it looks like something the toolchain should be doing as best I can tell (that is, the new inlining heuristics look broadly better, and there is no obvious tweak to the heuristics to avoid this particular testcase).
Perhaps this is hitting a bad case in the Firefox register allocator, or some internal limit on the optimizing compiler, or tiering? We can file a bug for further investigation if that would be useful (still need some details from @mithrendal for how to run the code).
@kripken, thanks, I'll investigate next week. I assume this is x64?
as vc64web will soon move to its new home address vc64web.github.io (we wait first to get the CORS related http-response headers set at related cross sites from which vc64web currently can fetch its retro content from the demo scene) I took this opportunity to publish the latest copy of it into it, (effectively use the new address for this issue) that is the build emsdk_2.0.17 with the problematic parameter --one-caller-inline-max-function-size=19307
the same that you analysed
so you will currently get that 2.0.17 version up and running here at its future address http://vc64web.github.io
(the current official address with the 2.0.16 build is still https://dirkwhoffmann.github.io/virtualc64web/)
when you start it, it asks you for ROM-chips. You may just hit install open-roms or press ESC now. In both case it will install three system roms from the mega65-open-roms project. But if you only! do this the slowdown will not yet! occur. This is because the open-roms do not include the romchip of the Commodore 1541 floppy drive and I found out that on the C++ side the slowdown occurs only! when emulating the 1541 floppy drive.
so the long function with 19307 WASM instructions seems to be this one in
/Emulator/C64.cpp
line 1092
if (drive8.isActive()) drive8.execute(durationOfOneCycle);
see link to code here https://github.com/dirkwhoffmann/virtualc64web/blob/e913978fb06534843e48b9d05a8e06b805326764/Emulator/C64.cpp#L1092
@dirkwhoffmann can you confirm ? @lars-t-hansen had the question whether the code is x64 ? Can you say something on that matter too?
to get complete build (including the js-files) just head to https://github.com/vc64web/vc64web.github.io and click green button 'code' -> Download ZIP
to get the floppy drive8 activated you will have to click the missing 1541-rom-chip icon here.
try this file https://sourceforge.net/p/vice-emu/code/HEAD/tree/trunk/vice/data/DRIVES/d1541II?format=raw
as soon you loaded the missing floppy drive rom, the emulator will activate the emulation code for the 1541 floppy drive and begin to slow down dramatically when using firefox and when being compiled with --one-caller-inline-max-function-size=19307
or when the compiler option `--one-caller-inline-max-function-size' is not set at all
you can spot the perfstats in the web console they will be outputted every second. (there is also a live-debug-console which you can enable in the settings of vc64web)
if there is still need to build and compile then you will find the source code currently here https://github.com/dirkwhoffmann/virtualc64web and the how-to-make instruction here https://github.com/dirkwhoffmann/virtualc64web/wiki/build_and_run
lars-t-hansen had the question whether the code is x64 ?
I think the question was where you saw the regression - is your machine x64 (as opposed to 32-bit, or ARM)? Info about the browser, CPU, and OS can be helpful in figuring out why it became slower on a particular combination.
I think the question was where you saw the regression - is your machine x64
Mithrendal was using Firefox 78.10.0esr (64-Bit) on a MacBook Pro from 2007 (Intel core2duo 2.2 GHz) with macOS 10.11 El Capitan.
today I changed the current app at https://dirkwhoffmann.github.io/virtualc64web/
to a 2.0.17 --one-caller-inline-max-function-size=19306
build
the firefox problematic 2.0.17 / 19307 build is still at https://vc64web.github.io
for building I used an intel macmini on bigsur latest patch level with emsdk 2.0.17.
test running was at a slow 2007 mbp machine as @dirkwhoffmann already descibed
I have hardware like that but the latest Mac OS I have for it is 10.6.8, on which esr78 is not supported. Updating to 10.11 is a little bit of a project, as it's not exclusively my computer.
@mithrendal, I'm wondering, can you rerun the experiment but first go into about:config and set javascript.options.wasm_baselinejit (it may be called something slightly different than that but it should be obvious) to false? The most obvious explanation for the problem is that you're being bitten by tiering / being stuck in slow code, and it would be good to rule that out. (Load times may increase.)
@lars-t-hansen I did set the baslinejit to false and it changed not anything, I spotted between 2 and 4 FPS on the old machine with that setting...
then I did a test on a newer machine a intel mac mini 2014 2.6 Ghz Big Sur 11.2.3 Firefox 88 64Bit: with that newer machine
I got full 60 FPS
when running the emsdk 2.0.17 build with --one-caller-inline-max-function-size=19306
and below
https://dirkwhoffmann.github.io/virtualc64web/
and around 12 FPS
when running with--one-caller-inline-max-function-size=19307
and above ...
https://vc64web.github.io/
Thanks. It'll be easier to test on the newer hardware with newer Firefox, I'll try to repro.
OK, I can repro on my current Linux desktop system, life is good. https://bugzilla.mozilla.org/show_bug.cgi?id=1708381.
@lars-t-hansen vc64web now finally moved to its new home address http://vc64web.github.io , which you used as a test case for the aggressive inline problem on firefox bug 1708381
that means the address serves no longer the build with the problematic inlining for firefox. As far as I understood from the bugzilla ticket you and your colleges have already isolated a small testcase which shows the same problem right? Anyway, if you still need that problematic build I can make one for you on another address. Good luck !
@mithrendal, we do indeed have a separate test case for this, thanks for letting us know. The bug referenced above should have all the details.
@mithrendal I've been working on a fix for this (https://bugzilla.mozilla.org/show_bug.cgi?id=1712078). I'd like to verify that it really fixes the performance problem on the problematic build. So .. could you please make it available on another URL, per your comment above?
Hi @julian-seward1
sure 🤓
build with maximal inlining (parameter --one-caller-inline-max-function-size
is omitted)
published to https://vc64web.github.io/devbuild/
build with limited inlining (--one-caller-inline-max-function-size=19307
)
published to https://vc64web.github.io
@mithrendal Thanks. With the patch at https://bugzilla.mozilla.org/show_bug.cgi?id=1712078#c9 I got the following results:
Limited inlining, with patch 56 fps Max inlining, with patch 56 fps Max inlining, without patch 12 fps
In other words .. the patch fixes this problem.
Hi,
after I upgraded to emscripten 2.0.17, browser started to crash with long running application by out of memory condition (7GB for Chrome).
-s BINARYEN_EXTRA_PASSES=--one-caller-inline-max-function-size=1
helped to return stability but increased wasm size.
I'm confused why inlining tends to OOM.
It sounds like the OOM is of the browser itself? Inlining leads to larger function sizes, which can require more work by the VM. For example, if the VM does work that is O(N^2)
in the number of locals (to do register allocation, for example), then inlining that adds more locals (usually the case) can raise memory usage by a lot.
Yes, browser eats memory till crash. Problem stably happens in -s ASYNCIFY=1
builds which have no precise settings.
Probably I have to test the well-tuned asyncify project to see the global impact.
imho Amount of locals can be reduced by splitting big functions into subfunctions (each outer loop for example). It's like a research to measure all browsers and derive good enough ranges.
Maybe isUnderSizeLimit
should be configurable to further reduce long compilation time under Safari.
You can try -Os
(instead of -O3
) as well as the various wasm-opt inlining tuning flags, like --flexible-inline-max-function-size
(in addition to --one-caller-inline-max-function-size
that is already mentioned above). That should reduce the amount of large functions in the output.
I already use -Os -flto
. Anyway --flexible-inline-max-function-size
seems a handy parameter for upcoming tuning.
A nice hint, thanks!
We've experienced a drastic performance regression (between 3x - 8x) in our app by updating from 2.0.16 to 2.017. On the other hand, size has decreased 14%. I suspect it's due to LLVM's new pass manager. Is there any flag we can use to select the legacy LLVM pass manager? This new behavior is preventing us from updating as performance is mission critical for our use case and we don't really care about size. Thanks.