elm-lang / elm-make

A build tool for Elm projects
BSD 3-Clause "New" or "Revised" License
175 stars 45 forks source link

Performance reduced when running on multi-core ( which is by default ) #159

Closed AntouanK closed 6 years ago

AntouanK commented 7 years ago

Hi.

Noticed this issue when I was working in parallel on the same project, on my PC and my MacBook Pro. For some weird reason, a 3 year-old MBP, was compiling faster than a brand new 16-core PC. For example:

MBP:

screen shot 2017-04-12 at 18 08 58

PC: screenshot_2017-04-12_18-07-42

After the tip from @eeue56 on the elm slack channel, to use "sysconfcpus", I saw a huge boost. On the Ryzen PC with linux, with one core ( so with "sysconfcpus -n 1" ) I can run "make build" on ~10.4 seconds! ( on the mac "sysconfcpus -n 1" makes no difference )

So, how come the same process is ~50% slower when running on 16 cores, than running on one? Is there anything I can do to make the compiler take advantage on the multiple cores?

Thanks.

process-bot commented 7 years ago

Thanks for the issue! Make sure it satisfies this checklist. My human colleagues will appreciate it!

Here is what to expect next, and if anyone wants to comment, keep these things in mind.

evancz commented 7 years ago

The compiler already should be taking advantage of the cores you have. Here are some things that may be relevant:

That said, I need more information to improve things in a directed way. CPU may not actually be the core issue. Maybe a bunch more RAM is used in one case? And maybe GC is causing the CPU usage as a result? Maybe transferring data between all the different cores is costly, so maybe there's some way to get information on that?

So here's the plan. When 0.19 alpha is out, please test this again and see if the issue persists. In the meantime, if you find any additional information, let me know about it in an organized way. (I.e. work through it on slack, and prefer one coherent and concise comment over ten scattered comments as you learn more.)

Again, the compiler is designed to use as many cores as possible already, so something weird must be going on. Ultimately, I want the compiler to be super fast, so thanks for reporting this and helping make sense of this!

AntouanK commented 7 years ago

@evancz Thanks for the response.

Unless you built Elm in some crazy way, you should have this.

The elm-make I use comes from the npm install. So that's, I guess, the "normal" binary.

reports what is available to the system, which may not be available to the actual process

When I run elm-make normally, I see all 16 cores go to 99%. So it means that the process can see them all, right? And still, the total compile time is slower. So there's definitely something going wrong.

I'm not a Haskell programmer, so I don't know where to look at.

Let me know what tests you'd like me to run, and I'm happy to help on that. I'll definitely try 0.19 alpha when it comes out.

evancz commented 7 years ago

Cool, yeah, the npm one should be good.

If you want to look into it more now, please ask @eeue56 for help on slack. We can proceed without Haskell stuff. For example, knowing about memory usage during compilation can help. Maybe it uses 10mb on your laptop and 100mb on your PC. That would be interesting and helpful to know. Knowing about cache misses may help as well. That kind of thing. I suspect I'll need to get on a machine like yours to test things out, but exploring these things may be helpful or personally interesting nonetheless!

AntouanK commented 7 years ago

As you can see in the screenshots, there's plenty of memory available. I'll ping @eeue56 to see if we can maybe get other details out.

In theory, how much would every extra core speed up the compile time? Since the new Ryzen CPUs are cheap and can run lots of threads ( 16 the 7 series ) it would be a huge gain for devs that will buy them.

AntouanK commented 7 years ago

Regarding cache misses

sysconfcpus -n 1 elm-make src/App.elm --yes --output dist/elm.js Success! Compiled 106 modules.
Successfully generated dist/elm.js

Performance counter stats for 'make build':

25,389,023,513      L1-dcache-loads:u                                           
    24,876,267      L1-dcache-load-misses:u   #    0.10% of all L1-dcache hits  

  10.811931503 seconds time elapsed

- without

$ perf stat -e L1-dcache-loads -e L1-dcache-load-misses elm-make src/App.elm --output dist/elm.js Success! Compiled 106 modules.
Successfully generated dist/elm.js

Performance counter stats for 'elm-make src/App.elm --output dist/elm.js':

46,763,012,114      L1-dcache-loads:u                                           
   148,834,742      L1-dcache-load-misses:u   #    0.32% of all L1-dcache hits  

  15.228556020 seconds time elapsed
AntouanK commented 7 years ago

and you can see it uses all cores when I run it without sysconfcpus -n 1

screenshot_2017-04-12_23-31-38

eeue56 commented 7 years ago

We took a look through this on Slack. tl;dr:

Discussed with @AntouanK, it's currently at a "liveable" state for them (decent elm-make times). So we will take a look again after 0.19 is released and try to dig in a bit better then.

Some numbers from my chromebook:

With two cores enabled, getNumProcessors = 2

noah@noah-Swanky:~/dev/elm-css$ perf stat elm-make
Success! Compiled 35 modules.                                       

 Performance counter stats for 'elm-make':

      21781.649106      task-clock (msec)         #    1.262 CPUs utilized          
            17,875      context-switches          #    0.821 K/sec                  
                89      cpu-migrations            #    0.004 K/sec                  
            20,060      page-faults               #    0.921 K/sec                  
    23,485,501,602      cycles                    #    1.078 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    19,333,470,309      instructions              #    0.82  insns per cycle        
     3,577,546,896      branches                  #  164.246 M/sec                  
       191,048,855      branch-misses             #    5.34% of all branches        

      17.262226728 seconds time elapsed

With a single core enabled, getNumProcessors = 1

noah@noah-Swanky:~/dev/elm-css$ perf stat sysconfcpus -n 1 elm-make
Success! Compiled 35 modules.                                       

 Performance counter stats for 'sysconfcpus -n 1 elm-make':

      15567.244782      task-clock (msec)         #    1.002 CPUs utilized          
             1,823      context-switches          #    0.117 K/sec                  
                32      cpu-migrations            #    0.002 K/sec                  
            19,005      page-faults               #    0.001 M/sec                  
    16,821,342,129      cycles                    #    1.081 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    15,000,183,543      instructions              #    0.89  insns per cycle        
     2,714,811,983      branches                  #  174.393 M/sec                  
       168,684,092      branch-misses             #    6.21% of all branches        

      15.537166525 seconds time elapsed
AntouanK commented 7 years ago

Just for comparison on a beefier CPU ( again on the elm-css project )

With 16 cores enabled ( default )

Performance counter stats for 'elm-make':

      50921.822910      task-clock:u (msec)       #    8.289 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            16,039      page-faults:u             #    0.315 K/sec                  
   101,591,919,919      cycles:u                  #    1.995 GHz                      (83.31%)
     5,486,939,866      stalled-cycles-frontend:u #    5.40% frontend cycles idle     (83.46%)
     8,788,128,333      stalled-cycles-backend:u  #    8.65% backend cycles idle      (83.32%)
    43,475,312,872      instructions:u            #    0.43  insn per cycle         
                                                  #    0.20  stalled cycles per insn  (83.41%)
    10,120,790,403      branches:u                #  198.752 M/sec                    (83.17%)
       164,522,472      branch-misses:u           #    1.63% of all branches          (83.36%)

       6.143030403 seconds time elapsed

With a single core enabled, sysconfcpus -n 1

Performance counter stats for 'sysconfcpus -n 1 elm-make':

       2858.615621      task-clock:u (msec)       #    0.975 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            18,096      page-faults:u             #    0.006 M/sec                  
     8,755,915,430      cycles:u                  #    3.063 GHz                      (83.06%)
     2,272,539,762      stalled-cycles-frontend:u #   25.95% frontend cycles idle     (83.65%)
     2,043,003,105      stalled-cycles-backend:u  #   23.33% backend cycles idle      (83.42%)
    14,712,522,371      instructions:u            #    1.68  insn per cycle         
                                                  #    0.15  stalled cycles per insn  (83.47%)
     2,783,266,581      branches:u                #  973.641 M/sec                    (83.11%)
        67,444,081      branch-misses:u           #    2.42% of all branches          (83.34%)

       2.932051381 seconds time elapsed
evancz commented 7 years ago

Okay, after thinking about it more, I think it makes sense to have a meta issue to try to track the problem. It looks like it's related to some general problem that we haven't been able to pin down yet.

I'm not sure where the meta issue should live, so I'm just going to leave it for now.

jcberentsen commented 7 years ago

Could it be related to this many-core Haskell runtime issue? It might be worth trying running elm-make +RTS -qg to turn parallell GC off? Possibly also with -A50M?

Augustin82 commented 7 years ago

I'm encountering the same problem.

I'm using Ubuntu 16.04.2 LTS 64-bit on a Dell XPS15 (16GB RAM, i7-7700HQ CPU @ 2.80GHz × 8).

Here's the output of elm-make --yes against the elm-css repo:

$ sudo rm -rf elm-stuff && sudo perf stat elm-make --yes
Starting downloads...

  ● rtfeldman/elm-css-util 1.0.2
  ● rtfeldman/hex 1.0.0
  ● elm-lang/core 5.1.1

Packages configured successfully!
Success! Compiled 35 modules.                                       

 Performance counter stats for 'elm-make --yes':

      18956,377203      task-clock (msec)         #    3,297 CPUs utilized          
           899 431      context-switches          #    0,047 M/sec                  
               961      cpu-migrations            #    0,051 K/sec                  
            24 012      page-faults               #    0,001 M/sec                  
    60 360 847 025      cycles                    #    3,184 GHz                    
    42 589 205 040      instructions              #    0,71  insn per cycle         
     8 429 877 689      branches                  #  444,699 M/sec                  
        59 846 870      branch-misses             #    0,71% of all branches        

       5,749273986 seconds time elapsed
$ sudo rm -rf elm-stuff && sudo sysconfcpus -n 1 perf stat elm-make --yes
Starting downloads...

  ● rtfeldman/hex 1.0.0
  ● rtfeldman/elm-css-util 1.0.2
  ● elm-lang/core 5.1.1

Packages configured successfully!
Success! Compiled 35 modules.                                       

 Performance counter stats for 'elm-make --yes':

       2523,628536      task-clock (msec)         #    0,624 CPUs utilized          
               765      context-switches          #    0,303 K/sec                  
                45      cpu-migrations            #    0,018 K/sec                  
            25 436      page-faults               #    0,010 M/sec                  
     8 034 175 352      cycles                    #    3,184 GHz                    
    16 722 884 588      instructions              #    2,08  insn per cycle         
     3 040 339 979      branches                  # 1204,749 M/sec                  
        53 767 368      branch-misses             #    1,77% of all branches        

       4,047055038 seconds time elapsed
rtfeldman commented 7 years ago

I dove deeper into this. It turns out GHC has some, ahem, conservative default garbage collection settings for parallelized workloads. (Example: the nursery gets a whole megabyte!)

Simon Marlow and others have looked into this and have tuned the default settings for the next release of GHC 8. Fortunately, we can tune things right now using flags in elm-compiler.cabal.

tl;dr

Try adding -with-rtsopts="-A16M -qg" to the end of this ghc-options: line in elm-compiler.cabal.

What this does

Bonus perf boost for Linux

For Linux builds (not sure if this is a no-op or causes problems on non-Linux systems), Simon Marlow suggests adding the --numa flag (e.g. -with-rtsopts="-A16M -qg --numa") for another significant perf boost (according to the docs, a 10% speedup is "typical" but varies greatly by hardware and program).

Further Reading

I have read the documentation for the -H flag about 12 times. I still do not know what it does. It sounds like it could lead to reduced memory consumption. Maybe it's awesome. I'm also not sure if GHC 7.x supports it.

If we remove -qg we should also enable -qb0 (for -A32M and up, and honestly possibly also -A16M and up) which turns on load balancing in the nursery for parallel GC. Simon Marlow recommended this. We should also definitely enable -n4m; the current GHC now defaults to -n4m for -A16M or higher.

Longer Version

I put the long version of what I learned into a gist.

AntouanK commented 7 years ago

thank you @rtfeldman for looking into this.

If I remember correctly, compiling elm-make, is not a one line thing. If anyone does try it with those flags, can you upload the binaries somewhere so we can try them as well?

eeue56 commented 7 years ago

So, I ran almost all the combinations suggested with different values. I've put the most useful raw numbers into this gist, benchmarking against elm-css on a Thinkpad T470s.

Conclusion: the best option is "-with-rtsops=-qg". No amount of -An could beat just disabling the parallel GC. While there is some benefit from using -An, it only starts to kick in when you have a large value, e.g -A64M. This seems like a large amount of memory to reserve for a single elm-make process, and it still did not perform as well as -qg. Other things such as n4m likewise saw little or no performance change.

Make sure to use " -with-rtsopts=-A16M -qg" for checking, and not -with-rtsopts="-A16M -qg" -- or -qg will be parsed as a non-rts option.

tl;dr

"-with-rtsopts=-qg" is the best option for performance on my machine.

AntouanK commented 7 years ago

@eeue56 How many cores does that machine have? Since it's an issue that worsens with the more cores you have available, it would be good to test it against a CPU with many. I can run it on a Ryzen 1700, it has 16 cores. Let me know if you can make a gist with instructions on what combinations to try, and I'll do it this weekend.

eeue56 commented 7 years ago

It is this CPU: https://ark.intel.com/products/97466/Intel-Core-i7-7600U-Processor-4M-Cache-up-to-3_90-GHz

So, in terms of this, 4 cores (2 real cores, 2 more from HT).

If you look at the gist I made, it has all the combinations that had any form of noticeable impact - in order to build:

If you have other questions, this discussion is better asked on #elm-dev on Slack

rtfeldman commented 7 years ago

@AntouanK if you do build those, can you save the different binaries you make so we can post them somewhere for others to try? A higher sample size will give us more confidence!

AntouanK commented 7 years ago

Thanks @eeue56 Here are my results: https://gist.github.com/AntouanK/1ecec02e08b90be54463d4ad6f0efcab Still, it confuses me how it's possible to have 1 core outperforming 16 in parallel. Even with the flag "-with-rtsopts=-qg", if I switch to just one core, it runs faster. Seems to me that the best solution is to restrict it to one core and not let it multi-thread. Don't know how easy that would be.

AntouanK commented 7 years ago

@rtfeldman Seems like there should be an easy script for this. I made a docker container, with haskell 7.10, mounted a local directory of elm-css, and then did 3 things repeatedly.

so should we just make a script instead? Or I can just upload the binaries in a repo?

rtfeldman commented 7 years ago

@AntouanK uploading the binaries would be fantastic! I think that would definitely be the easiest for others. 😄

Since sysconfcpus still improved things, I'm curious what happens if you try these RTS options:

Among these, I think it'd be most useful o try -N1 -qg, but if you have time, trying all 3 would be great. Even better would be trying 6 combinations - using the different -N flags with -qb0 -n4m instead of -qg. 😄

Saving binaries for anything you do there would be great as well!

AntouanK commented 7 years ago

@rtfeldman With my poor scripting skills, I got a loop going. Binaries and perf stat outputs are here : https://github.com/AntouanK/elm-make_perf

So far, seems like the flag doesn't make a big difference. Using one core does.

elm-css

Perf name Time
multicore 5.502 sec
onecore 3.074 sec
-A16M_-qb0.multicore 3.136 sec
-A16M_-qb0.onecore 2.776 sec
-A16M_-qg.multicore 3.230 sec
-A16M_-qg.onecore 2.744 sec
-A32M_-qb0.multicore 3.131 sec
-A32M_-qb0.onecore 2.751 sec
-A32M_-qg.multicore 3.030 sec
-A32M_-qg.onecore 2.744 sec
-A64M_-qb0.multicore 3.327 sec
-A64M_-qb0.onecore 2.754 sec
-A64M_-qg.multicore 3.201 sec
-A64M_-qg.onecore 2.790 sec
-N1-qb0-n4m.multicore 3.123 sec
-N1-qb0-n4m.onecore 2.748 sec
-N1_-qg.multicore 3.201 sec
-N1_-qg.onecore 2.749 sec
-N12-qb0-n4m.multicore 3.192 sec
-N12-qb0-n4m.onecore 2.744 sec
-N12_-qg.multicore 3.135 sec
-N12_-qg.onecore 2.767 sec
-N16-qb0-n4m.multicore 3.191 sec
-N16-qb0-n4m.onecore 2.777 sec
-N16_-qg.multicore 3.214 sec
-N16_-qg.onecore 2.734 sec
-qg.multicore 3.006 sec
-qg.onecore 2.761 sec
AntouanK commented 7 years ago

Added results for elm-spa-example as well

Perf name Time % from multicore
multicore 9.234 sec
onecore 6.385 sec 69.15%
-A16M_-qb0.multicore 7.451 sec 80.69%
-A16M_-qb0.onecore 6.033 sec 65.33%
-A16M_-qg.multicore 7.431 sec 80.47%
-A16M_-qg.onecore 6.151 sec 66.61%
-A32M_-qb0.multicore 7.169 sec 77.64%
-A32M_-qb0.onecore 6.106 sec 66.13%
-A32M_-qg.multicore 7.068 sec 76.54%
-A32M_-qg.onecore 6.266 sec 67.86%
-A64M_-qb0.multicore 7.559 sec 81.86%
-A64M_-qb0.onecore 6.219 sec 67.35%
-A64M_-qg.multicore 7.153 sec 77.46%
-A64M_-qg.onecore 6.015 sec 65.14%
-N12-qb0-n4m.multicore 7.301 sec 79.07%
-N12-qb0-n4m.onecore 6.123 sec 66.31%
-N12_-qg.multicore 7.373 sec 79.85%
-N12_-qg.onecore 6.297 sec 68.19%
-N16-qb0-n4m.multicore 7.586 sec 82.15%
-N16-qb0-n4m.onecore 6.340 sec 68.66%
-N16_-qg.multicore 7.248 sec 78.49%
-N16_-qg.onecore 6.139 sec 66.48%
-N1-qb0-n4m.multicore 7.543 sec 81.69%
-N1-qb0-n4m.onecore 6.131 sec 66.40%
-N1_-qg.multicore 7.561 sec 81.88%
-N1_-qg.onecore 6.097 sec 66.03%
-qg.multicore 7.330 sec 79.38%
-qg.onecore 6.386 sec 69.16%
rtfeldman commented 7 years ago

Hm, this is very strange. As I recall @eeue56 was seeing like 2-3 second build times on elm-spa-example without sysconfcpus, building on less powerful hardware.

I wonder what the reason is for the discrepancy. 🤔

AntouanK commented 7 years ago

That's with elm-make src/Main.elm?

I tried it again. I get 8.6 on normal elm-make, and 6.2 with one core.

rtfeldman commented 7 years ago

I did some more digging around in the docs, and found some more potential answers as to why single-core is outperforming multicore.

-s for Benchmarking Statistics

Before getting into those, it's worth noting that the -s flag (and related flags) print out a wealth of runtime performance statistics after the run. This would be really helpful for these benchmarks!

Flags that possibly should always be enabled for elm-make

Flags that deserve experimentation

Quick Aside

I also learned a bit about GHC's default pre-emption settings.

GHC implements pre-emptive multitasking: the execution of threads are interleaved in a random fashion. More specifically, a thread may be pre-empted whenever it allocates some memory [...] The rescheduling timer runs on a 20ms granularity by default, but this may be altered using the -i RTS option. [...]

This sounds like a potential contributing factor, but unfortunately, these docs may be out of date. The RTS docs don't list an -i flag. However, there are plenty of other knobs to try!

@eeue56 and @AntouanK - if you could try the above flags out on your benchmarks, it would be so amazing!!! I feel some combination of these may be the answer we've been looking for! 😄

Note: I'm not actually sure how to be sure these are enabled. I linked to the docs, but as Noah learned, the exact way you specify them can be tricky, and GHC has a habit of silently ignoring flags it doesn't understand.

rtfeldman commented 7 years ago

As an aside, I'm curious if this is why macOS programs are doing better: maybe GHC thinks they all have 1 core.

I ran this on my MacBook Pro (2 physical cores + 2 HT):

module Main where

import Control.Concurrent

main :: IO ()
main = do
  capabilities <- getNumCapabilities
  putStrLn $ show capabilities

It printed 1. 😕

Curious what others see when running this.

rtfeldman commented 7 years ago

Nope, that's not it. I also ran it on nixOS with 8 physical cores (Ryzen) and it also printed 1. :confounded:

jmitchell commented 7 years ago

@rtfeldman, I get the same on my 8-core AMD under nixOS. Perhaps it's set to 1 by default everywhere, unless overridden with setNumCapabilities or the +RTS -N runtime flag mentioned in the setNumCapabilities doc string:

Set the number of Haskell threads that can run truly simultaneously (on separate physical processors) at any given time. The number passed to forkOn is interpreted modulo this value. The initial value is given by the +RTS -N runtime flag.

rtfeldman commented 7 years ago

@jmitchell yep, I just confirmed that - it prints out whatever -N flag I get it, but defaults to 1.

rtfeldman commented 7 years ago

This is even more confusing then. Why would sysconfcpus have any affect at all if GHC always thinks there is only 1 core available unless you tell it otherwise? :thinking:

jmitchell commented 7 years ago

Ah, because setNumCapabilities is applied according to GHC.Conc.getNumProcessors, which on my machine returns 8.

rtfeldman commented 7 years ago

Ahh, that's why!

I learned from Brian McKenna on Twitter that if you pass -N by itself (no number specified), GHC will infer that for you on startup, making that getNumProcessors line unnecessary. I verified locally that it works; on my MacBook Pro it reported 4, and on my nixOS Ryzen machine it reported 8.

Then I set up a test on Travis to see if maybe -N would lead getNumCapabilities to report a better number than getNumProcessors (which reports 32 cores available, when actually only 2 are available to the virtualized environment on Travis), but sadly the -N approach also reports 32 on Travis. :disappointed:

andys8 commented 7 years ago

There is also the chance that the cause for the stackoverflow issue #164 is related, if the configured values are relative to the total amount of available memory on a machine, too.

zwilias commented 7 years ago

For future posterity: building elm-make with -rtsopts in the GHC options allows setting the RTS options on execution, rather than at compile time.

https://downloads.haskell.org/~ghc/7.8.4/docs/html/users_guide/runtime-control.html

andys8 commented 7 years ago

@zwilias Any chance to get this flag in by default? (Any implications?) Otherwise a workaround would have to rely on an elm-make fork.

andys8 commented 7 years ago

https://github.com/elm-lang/elm-make/pull/179

evancz commented 6 years ago

I want to make https://github.com/elm-lang/elm-compiler/issues/1473 the canonical issue for this. It probably makes sense to summarize the things that need to happen into a meta issue though. I will coordinate with @eeue56 and @zwilias to get a list of TODO items that should be in that.