Closed AntouanK closed 6 years ago
Here is what to expect next, and if anyone wants to comment, keep these things in mind.
The compiler should already be taking advantage of the cores you have. Here are some things that may be relevant:

- Haskell introduced a new IO scheduler with GHC 7.8.1 which was targeted at weird performance when you have lots of cores. Unless you built Elm in some crazy way, you should have this.
- https://github.com/elm-lang/elm-compiler/issues/1473 reported slow builds, and the root issue seems to be that Haskell's `getNumProcessors` reports what is available to the system, which may not be available to the actual process. This appears to slow down Haskell's scheduler, which makes sense. If you can, it'd be great to know what `getNumProcessors` produces on your Linux machine!
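For reference, here is a quick sketch of how to see the system-vs-process distinction on Linux. These are shell utilities that approximate the two views, not what GHC calls internally:

```shell
# Processors the OS reports as online (the system-wide view):
getconf _NPROCESSORS_ONLN

# Processors available to *this* process (respects CPU affinity masks),
# which can be smaller under taskset/cgroups/containers:
nproc
```

If the two numbers differ, the runtime may be scheduling for more cores than it can actually use.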
That said, I need more information to improve things in a directed way. CPU may not actually be the core issue. Maybe a bunch more RAM is used in one case? And maybe GC is causing the CPU usage as a result? Maybe transferring data between all the different cores is costly, so maybe there's some way to get information on that?
So here's the plan. When 0.19 alpha is out, please test this again and see if the issue persists. In the meantime, if you find any additional information, let me know about it in an organized way. (I.e. work through it on slack, and prefer one coherent and concise comment over ten scattered comments as you learn more.)
Again, the compiler is designed to use as many cores as possible already, so something weird must be going on. Ultimately, I want the compiler to be super fast, so thanks for reporting this and helping make sense of this!
@evancz Thanks for the response.
> Unless you built Elm in some crazy way, you should have this.

The `elm-make` I use comes from the npm install. So that's, I guess, the "normal" binary.
> reports what is available to the system, which may not be available to the actual process

When I run `elm-make` normally, I see all 16 cores go to 99%. So it means that the process can see them all, right?
And still, the total compile time is slower. So there's definitely something going wrong.
I'm not a Haskell programmer, so I don't know where to look.
Let me know what tests you'd like me to run, and I'm happy to help on that. I'll definitely try 0.19 alpha when it comes out.
Cool, yeah, the npm one should be good.
If you want to look into it more now, please ask @eeue56 for help on slack. We can proceed without Haskell stuff. For example, knowing about memory usage during compilation can help. Maybe it uses 10mb on your laptop and 100mb on your PC. That would be interesting and helpful to know. Knowing about cache misses may help as well. That kind of thing. I suspect I'll need to get on a machine like yours to test things out, but exploring these things may be helpful or personally interesting nonetheless!
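One sketch of how to capture that memory number on Linux: the kernel tracks each process's peak resident memory in `/proc/<pid>/status`, and GNU `/usr/bin/time -v elm-make ...` would report it as "Maximum resident set size". Shown here against the current shell rather than a real compile, since `elm-make` may not be installed:

```shell
# VmHWM ("high water mark") is the peak resident memory of a process.
# Substitute the elm-make PID (or wrap the build in /usr/bin/time -v)
# to measure a compile instead of this shell.
grep VmHWM /proc/self/status
```

Comparing this number between the laptop and the 16-core PC would tell us whether memory use scales with core count.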
As you can see in the screenshots, there's plenty of memory available. I'll ping @eeue56 to see if we can maybe get other details out.
In theory, how much would every extra core speed up the compile time? Since the new Ryzen CPUs are cheap and can run lots of threads (16 on the 7-series), it would be a huge gain for devs that will buy them.
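The theoretical answer is bounded by Amdahl's law: if only a fraction p of the compile parallelizes, n cores can never do better than 1 / ((1 - p) + p/n). The p = 0.8 below is an arbitrary illustration, not a measured value for elm-make:

```shell
# speedup(n) = 1 / ((1 - p) + p / n); with p = 0.8, 16 cores cap out at 4x.
for n in 1 2 4 8 16; do
  awk -v n="$n" -v p=0.8 \
    'BEGIN { printf "cores=%2d  max speedup=%.2fx\n", n, 1 / ((1 - p) + p / n) }'
done
```

So extra cores give diminishing returns even in the best case - which makes a multicore *slowdown*, as reported here, all the more suspicious.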
Regarding cache misses, with `sysconfcpus -n 1`:

```
$ perf stat -e L1-dcache-loads -e L1-dcache-load-misses make build
sysconfcpus -n 1 elm-make src/App.elm --yes --output dist/elm.js
Success! Compiled 106 modules.
Successfully generated dist/elm.js

 Performance counter stats for 'make build':

    25,389,023,513      L1-dcache-loads:u
        24,876,267      L1-dcache-load-misses:u   #    0.10% of all L1-dcache hits

      10.811931503 seconds time elapsed
```
And without `sysconfcpus`:

```
$ perf stat -e L1-dcache-loads -e L1-dcache-load-misses elm-make src/App.elm --output dist/elm.js
Success! Compiled 106 modules.
Successfully generated dist/elm.js

 Performance counter stats for 'elm-make src/App.elm --output dist/elm.js':

    46,763,012,114      L1-dcache-loads:u
       148,834,742      L1-dcache-load-misses:u   #    0.32% of all L1-dcache hits

      15.228556020 seconds time elapsed
```

And you can see it uses all cores when I run it without `sysconfcpus -n 1`.
We took a look through this on Slack. tl;dr: `getNumProcessors` is reported correctly. If you use `sysconfcpus -n <n>`, it will report `<n>`. Discussed with @AntouanK; it's currently at a "liveable" state for them (decent `elm-make` times). So we will take a look again after 0.19 is released and try to dig in a bit better then.
With two cores enabled, `getNumProcessors = 2`:

```
noah@noah-Swanky:~/dev/elm-css$ perf stat elm-make
Success! Compiled 35 modules.

 Performance counter stats for 'elm-make':

      21781.649106      task-clock (msec)         #    1.262 CPUs utilized
            17,875      context-switches          #    0.821 K/sec
                89      cpu-migrations            #    0.004 K/sec
            20,060      page-faults               #    0.921 K/sec
    23,485,501,602      cycles                    #    1.078 GHz
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
    19,333,470,309      instructions              #    0.82  insns per cycle
     3,577,546,896      branches                  #  164.246 M/sec
       191,048,855      branch-misses             #    5.34% of all branches

      17.262226728 seconds time elapsed
```
With a single core enabled, `getNumProcessors = 1`:

```
noah@noah-Swanky:~/dev/elm-css$ perf stat sysconfcpus -n 1 elm-make
Success! Compiled 35 modules.

 Performance counter stats for 'sysconfcpus -n 1 elm-make':

      15567.244782      task-clock (msec)         #    1.002 CPUs utilized
             1,823      context-switches          #    0.117 K/sec
                32      cpu-migrations            #    0.002 K/sec
            19,005      page-faults               #    0.001 M/sec
    16,821,342,129      cycles                    #    1.081 GHz
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
    15,000,183,543      instructions              #    0.89  insns per cycle
     2,714,811,983      branches                  #  174.393 M/sec
       168,684,092      branch-misses             #    6.21% of all branches

      15.537166525 seconds time elapsed
```
Just for comparison, on a beefier CPU (again on the `elm-css` project).

With 16 cores enabled (default):

```
 Performance counter stats for 'elm-make':

      50921.822910      task-clock:u (msec)       #    8.289 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
            16,039      page-faults:u             #    0.315 K/sec
   101,591,919,919      cycles:u                  #    1.995 GHz                      (83.31%)
     5,486,939,866      stalled-cycles-frontend:u #    5.40% frontend cycles idle     (83.46%)
     8,788,128,333      stalled-cycles-backend:u  #    8.65% backend cycles idle      (83.32%)
    43,475,312,872      instructions:u            #    0.43  insn per cycle
                                                  #    0.20  stalled cycles per insn  (83.41%)
    10,120,790,403      branches:u                #  198.752 M/sec                    (83.17%)
       164,522,472      branch-misses:u           #    1.63% of all branches          (83.36%)

       6.143030403 seconds time elapsed
```
With a single core enabled, `sysconfcpus -n 1`:

```
 Performance counter stats for 'sysconfcpus -n 1 elm-make':

       2858.615621      task-clock:u (msec)       #    0.975 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
            18,096      page-faults:u             #    0.006 M/sec
     8,755,915,430      cycles:u                  #    3.063 GHz                      (83.06%)
     2,272,539,762      stalled-cycles-frontend:u #   25.95% frontend cycles idle     (83.65%)
     2,043,003,105      stalled-cycles-backend:u  #   23.33% backend cycles idle      (83.42%)
    14,712,522,371      instructions:u            #    1.68  insn per cycle
                                                  #    0.15  stalled cycles per insn  (83.47%)
     2,783,266,581      branches:u                #  973.641 M/sec                    (83.11%)
        67,444,081      branch-misses:u           #    2.42% of all branches          (83.34%)

       2.932051381 seconds time elapsed
```
Okay, after thinking about it more, I think it makes sense to have a meta issue to try to track the problem. It looks like it's related to some general problem that we haven't been able to pin down yet.
I'm not sure where the meta issue should live, so I'm just going to leave it for now.
Could it be related to this many-core Haskell runtime issue? It might be worth trying to run `elm-make +RTS -qg` to turn parallel GC off? Possibly also with `-A50M`?
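For anyone trying this, a sketch of what the invocation would look like, using the project paths from earlier in the thread. Note that `+RTS ... -RTS` is only honored if the binary was compiled with `-rtsopts`:

```shell
# Flags between +RTS and -RTS go to the GHC runtime, not to elm-make itself.
CMD='elm-make src/App.elm --output dist/elm.js +RTS -qg -A50M -RTS'
echo "would run: $CMD"

# Only attempt the real build if elm-make is actually installed:
if command -v elm-make >/dev/null 2>&1; then
  eval "$CMD"
fi
```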
I'm encountering the same problem.
I'm using Ubuntu 16.04.2 LTS 64-bit on a Dell XPS15 (16GB RAM, i7-7700HQ CPU @ 2.80GHz × 8).
Here's the output of `elm-make --yes` against the `elm-css` repo:
```
$ sudo rm -rf elm-stuff && sudo perf stat elm-make --yes
Starting downloads...

  ● rtfeldman/elm-css-util 1.0.2
  ● rtfeldman/hex 1.0.0
  ● elm-lang/core 5.1.1

Packages configured successfully!
Success! Compiled 35 modules.

 Performance counter stats for 'elm-make --yes':

      18956,377203      task-clock (msec)         #    3,297 CPUs utilized
           899 431      context-switches          #    0,047 M/sec
               961      cpu-migrations            #    0,051 K/sec
            24 012      page-faults               #    0,001 M/sec
    60 360 847 025      cycles                    #    3,184 GHz
    42 589 205 040      instructions              #    0,71  insn per cycle
     8 429 877 689      branches                  #  444,699 M/sec
        59 846 870      branch-misses             #    0,71% of all branches

       5,749273986 seconds time elapsed
```
```
$ sudo rm -rf elm-stuff && sudo sysconfcpus -n 1 perf stat elm-make --yes
Starting downloads...

  ● rtfeldman/hex 1.0.0
  ● rtfeldman/elm-css-util 1.0.2
  ● elm-lang/core 5.1.1

Packages configured successfully!
Success! Compiled 35 modules.

 Performance counter stats for 'elm-make --yes':

       2523,628536      task-clock (msec)         #    0,624 CPUs utilized
               765      context-switches          #    0,303 K/sec
                45      cpu-migrations            #    0,018 K/sec
            25 436      page-faults               #    0,010 M/sec
     8 034 175 352      cycles                    #    3,184 GHz
    16 722 884 588      instructions              #    2,08  insn per cycle
     3 040 339 979      branches                  # 1204,749 M/sec
        53 767 368      branch-misses             #    1,77% of all branches

       4,047055038 seconds time elapsed
```
I dove deeper into this. It turns out GHC has some, ahem, conservative default garbage collection settings for parallelized workloads. (Example: the nursery gets a whole megabyte!)
Simon Marlow and others have looked into this and have tuned the default settings for the next release of GHC 8. Fortunately, we can tune things right now using flags in `elm-compiler.cabal`.

Try adding `-with-rtsopts="-A16M -qg"` to the end of this `ghc-options:` line in `elm-compiler.cabal`.
- `-A16M` increases the nursery size from the default of 1MB to 16MB. Why 16MB? The highest I saw was Simon Marlow mentioning using `-A256M`. Someone reported faster and faster speeds as they dialed this up incrementally to `128M`, but with a big increase in memory usage along the way. I chose 16MB because the person who tried that reported it seemed like the "sweet spot" - memory usage increased, but wasn't crazy, and build time decreased a lot. The GHC folks seem to be changing GHC defaults to special-case `-A` values of 16M and up, as well as 32M and up, but they are staying conservative and keeping the default `-A` at 1M. It is probably worth people trying out different values here, but I wanted to pick one to suggest, so I picked 16M.
- `-qg` disables parallel GC. People have reported speed gains from this. As an alternative to `-qg`, we could try using `-n4m -qb0` instead (more on this below), which is a combination of flags that should improve parallel GC performance. However, I suspect it may be better to turn it off altogether; Simon Marlow suggested there may be an as-yet-unidentified bug in the parallel GC's multithreading logic. Since using `sysconfcpus` to force using 1 core speeds things up overall, if a parallel GC bug is the root of the problem here, it stands to reason that disabling it would result in better perf than the `sysconfcpus` trick, because at least the compiler logic itself could still use multiple cores. It's probably worth benchmarking both ways.
- For Linux builds (not sure if this is a no-op or causes problems on non-Linux systems), Simon Marlow suggests adding the `--numa` flag (e.g. `-with-rtsopts="-A16M -qg --numa"`) for another significant perf boost (according to the docs, a 10% speedup is "typical" but varies greatly by hardware and program).
I have read the documentation for the `-H` flag about 12 times. I still do not know what it does. It sounds like it could lead to reduced memory consumption. Maybe it's awesome. I'm also not sure if GHC 7.x supports it.
If we remove `-qg`, we should also enable `-qb0` (for `-A32M` and up, and honestly possibly also `-A16M` and up), which turns on load balancing in the nursery for parallel GC. Simon Marlow recommended this. We should also definitely enable `-n4m`; the current GHC now defaults to `-n4m` for `-A16M` or higher.
I put the long version of what I learned into a gist.
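Putting the suggestions above together, the relevant line in `elm-compiler.cabal` might end up looking something like this (illustrative sketch; keep whatever other flags are already on the line, and note the quoting discussed below):

```
  ghc-options: -threaded -O2 -W "-with-rtsopts=-A16M -qg"
```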
thank you @rtfeldman for looking into this.
If I remember correctly, compiling elm-make is not a one-line thing. If anyone does try it with those flags, can you upload the binaries somewhere so we can try them as well?
So, I ran almost all the combinations suggested with different values. I've put the most useful raw numbers into this gist, benchmarking against elm-css on a Thinkpad T470s.
Conclusion: the best option is `-with-rtsopts=-qg`. No amount of `-An` could beat just disabling the parallel GC. While there is some benefit from using `-An`, it only starts to kick in when you have a large value, e.g. `-A64M`. This seems like a large amount of memory to reserve for a single `elm-make` process, and it still did not perform as well as `-qg`. Other things such as `-n4m` likewise saw little or no performance change.
Make sure to use `"-with-rtsopts=-A16M -qg"` for checking, and not `-with-rtsopts="-A16M -qg"` -- or `-qg` will be parsed as a non-RTS option.

`"-with-rtsopts=-qg"` is the best option for performance on my machine.
@eeue56 How many cores does that machine have? Since it's an issue that worsens the more cores you have available, it would be good to test it against a CPU with many. I can run it on a Ryzen 1700; it has 16 threads. Let me know if you can make a gist with instructions on what combinations to try, and I'll do it this weekend.
It is this CPU: https://ark.intel.com/products/97466/Intel-Core-i7-7600U-Processor-4M-Cache-up-to-3_90-GHz
So, in terms of this, 4 cores (2 real cores, 2 more from HT).
If you look at the gist I made, it has all the combinations that had any form of noticeable impact. In order to build:

1. `runhaskell BuildFromSource.hs 0.18`
2. In `Elm-Platform/0.18`, edit `elm-package/elm-package.cabal`, `elm-make/elm-make.cabal`, and `elm-make/elm-compiler` to have the options. Your ghc-options should look something like `-threaded -O2 -W "-with-rtsopts=-A16M -qg"`. Note where the quotes are.
3. `cabal install elm-make`
4. Run `perf stat <dir-with-elm-platform>/Elm-Platform/0.18/.cabal_sandbox/bin/elm-make` in the elm-css dir. Make sure to wipe `elm-stuff/build-artifacts`, but not `elm-stuff/packages`, before each build.

If you have other questions, this discussion is better asked in #elm-dev on Slack.
@AntouanK if you do build those, can you save the different binaries you make so we can post them somewhere for others to try? A higher sample size will give us more confidence!
Thanks @eeue56
Here are my results: https://gist.github.com/AntouanK/1ecec02e08b90be54463d4ad6f0efcab
Still, it confuses me how it's possible to have 1 core outperforming 16 in parallel.
Even with the flag `"-with-rtsopts=-qg"`, if I switch to just one core, it runs faster.
Seems to me that the best solution is to restrict it to one core and not let it multi-thread. Don't know how easy that would be.
@rtfeldman Seems like there should be an easy script for this. I made a Docker container with Haskell 7.10, mounted a local directory of elm-css, and then did the same steps repeatedly: `runhaskell ...` again, then `perf stat .../elm-make`. So should we just make a script instead? Or I can just upload the binaries in a repo?
@AntouanK uploading the binaries would be fantastic! I think that would definitely be the easiest for others. 😄
Since `sysconfcpus` still improved things, I'm curious what happens if you try these RTS options:

- `-N1` (tells GHC to only use one core for the program as well as the GC; should be equivalent to `sysconfcpus -n 1`)
- `-N16` (tells GHC to use 16 cores, instead of the 32 it might be trying to use due to hyperthreading; this comment suggests that might be a thing)
- `-N12` (to see if telling it to use many cores, but fewer than 16, is better than telling it to use 1)

Among these, I think it'd be most useful to try `-N1 -qg`, but if you have time, trying all 3 would be great. Even better would be trying 6 combinations - using the different `-N` flags with `-qb0 -n4m` instead of `-qg`. 😄
Saving binaries for anything you do there would be great as well!
@rtfeldman With my poor scripting skills, I got a loop going.
Binaries and `perf stat` outputs are here: https://github.com/AntouanK/elm-make_perf
So far, seems like the flag doesn't make a big difference. Using one core does.
elm-css:

| Perf name | Time |
| --- | --- |
| multicore | 5.502 sec |
| onecore | 3.074 sec |
| -A16M_-qb0.multicore | 3.136 sec |
| -A16M_-qb0.onecore | 2.776 sec |
| -A16M_-qg.multicore | 3.230 sec |
| -A16M_-qg.onecore | 2.744 sec |
| -A32M_-qb0.multicore | 3.131 sec |
| -A32M_-qb0.onecore | 2.751 sec |
| -A32M_-qg.multicore | 3.030 sec |
| -A32M_-qg.onecore | 2.744 sec |
| -A64M_-qb0.multicore | 3.327 sec |
| -A64M_-qb0.onecore | 2.754 sec |
| -A64M_-qg.multicore | 3.201 sec |
| -A64M_-qg.onecore | 2.790 sec |
| -N1-qb0-n4m.multicore | 3.123 sec |
| -N1-qb0-n4m.onecore | 2.748 sec |
| -N1_-qg.multicore | 3.201 sec |
| -N1_-qg.onecore | 2.749 sec |
| -N12-qb0-n4m.multicore | 3.192 sec |
| -N12-qb0-n4m.onecore | 2.744 sec |
| -N12_-qg.multicore | 3.135 sec |
| -N12_-qg.onecore | 2.767 sec |
| -N16-qb0-n4m.multicore | 3.191 sec |
| -N16-qb0-n4m.onecore | 2.777 sec |
| -N16_-qg.multicore | 3.214 sec |
| -N16_-qg.onecore | 2.734 sec |
| -qg.multicore | 3.006 sec |
| -qg.onecore | 2.761 sec |
Added results for `elm-spa-example` as well:

| Perf name | Time | % from multicore |
| --- | --- | --- |
| multicore | 9.234 sec | |
| onecore | 6.385 sec | 69.15% |
| -A16M_-qb0.multicore | 7.451 sec | 80.69% |
| -A16M_-qb0.onecore | 6.033 sec | 65.33% |
| -A16M_-qg.multicore | 7.431 sec | 80.47% |
| -A16M_-qg.onecore | 6.151 sec | 66.61% |
| -A32M_-qb0.multicore | 7.169 sec | 77.64% |
| -A32M_-qb0.onecore | 6.106 sec | 66.13% |
| -A32M_-qg.multicore | 7.068 sec | 76.54% |
| -A32M_-qg.onecore | 6.266 sec | 67.86% |
| -A64M_-qb0.multicore | 7.559 sec | 81.86% |
| -A64M_-qb0.onecore | 6.219 sec | 67.35% |
| -A64M_-qg.multicore | 7.153 sec | 77.46% |
| -A64M_-qg.onecore | 6.015 sec | 65.14% |
| -N12-qb0-n4m.multicore | 7.301 sec | 79.07% |
| -N12-qb0-n4m.onecore | 6.123 sec | 66.31% |
| -N12_-qg.multicore | 7.373 sec | 79.85% |
| -N12_-qg.onecore | 6.297 sec | 68.19% |
| -N16-qb0-n4m.multicore | 7.586 sec | 82.15% |
| -N16-qb0-n4m.onecore | 6.340 sec | 68.66% |
| -N16_-qg.multicore | 7.248 sec | 78.49% |
| -N16_-qg.onecore | 6.139 sec | 66.48% |
| -N1-qb0-n4m.multicore | 7.543 sec | 81.69% |
| -N1-qb0-n4m.onecore | 6.131 sec | 66.40% |
| -N1_-qg.multicore | 7.561 sec | 81.88% |
| -N1_-qg.onecore | 6.097 sec | 66.03% |
| -qg.multicore | 7.330 sec | 79.38% |
| -qg.onecore | 6.386 sec | 69.16% |
Hm, this is very strange. As I recall, @eeue56 was seeing like 2-3 second build times on `elm-spa-example` without `sysconfcpus`, building on less powerful hardware.
I wonder what the reason is for the discrepancy. 🤔
That's with `elm-make src/Main.elm`?
I tried it again. I get 8.6 seconds with normal elm-make, and 6.2 with one core.
I did some more digging around in the docs, and found some more potential answers as to why single-core is outperforming multicore.
**`-s` for benchmarking statistics**

Before getting into those, it's worth noting that the `-s` flag (and related flags) prints out a wealth of runtime performance statistics after the run. This would be really helpful for these benchmarks!

**Flags for `elm-make`**

- The `--numa` flag could have major benefits for Linux in particular. (It seems like it shouldn't be enabled on non-Linux builds though.) As far as I can tell, it's just a straight-up upgrade for Linux builds, and could very well explain the discrepancy between Linux and macOS performance characteristics.
- The `-feager-blackholing` flag apparently should always be turned on, according to the docs. ("We recommend compiling any code that is intended to be run in parallel with the `-feager-blackholing` flag.") Note that it only has any effect if `-threaded` is enabled, so we'll need to keep that enabled in order to see what it does. (Doesn't sound like it should be much - around the neighborhood of 1-2%, according to the docs, but who knows?)
- The `-fllvm` flag seems worth enabling in general, and may separately impact this issue. It enables using LLVM to get additional binary performance optimizations. ("Compile via LLVM instead of using the native code generator. This will generally take slightly longer than the native code generator to compile. Produced code is generally the same speed or faster than the other two code generators.") The one caveat is that compiling via LLVM requires LLVM's `opt` and `llc` executables to be in `PATH`, so this may not be worth the effort in comparison to the other options.
- The `-threaded` flag enables a parallel version of the runtime. ("The threaded runtime system is so-called because it manages multiple OS threads, as opposed to the default runtime system which is purely single-threaded.") It turns out that if this is disabled, then `-N` is silently ignored. The fact that it is disabled by default explains why `-N` never had any effect on the previous benchmarks. I don't know why this isn't enabled by default, considering some docs say "In order to make use of multiple CPUs, your program must be linked with the `-threaded` option" (which empirically seems to be false, since we've seen more CPUs engage when `sysconfcpus` is not used to limit them), but this very much seems worth trying at any rate. Update: `elm-make` is already using `-threaded`. So it looks like the reason `sysconfcpus` has an effect is because `elm-make` is setting the capability count manually. This line of code would become unnecessary if we used `-N` (no number), which would make GHC infer it on startup, but it appears that would do nothing more than saving a line of code; GHC infers the same number as what we get by calling this manually.
- The `-ki` flag controls default stack size for new threads. Thread stacks live on the heap, apparently, and the default is 1K (GHC is once again conservative). They allocate more memory as they grow (in 32K chunks by default, which is also configurable via the `-kc` flag, if we care), and then GC it when they're no longer using it. This could explain why throwing more cores at the workload degrades performance; if each of these threads is so resource-constrained that it ends up doing a ton of allocation and GC to overcome its 1K default stack size, they might be so much slower than the main thread (which doesn't have this problem) that it leads to a net slowdown.
- The `-H` flag sets the default heap size for the garbage collector. The default is 0, and I suspect we can experimentally find a better choice than that, but then again `-H` and `-A` are related, and tweaking `-A` didn't seem to do anything. Still, could be worth a shot since the default seems so bad.
- The `-I` flag could be relevant; it controls how often GHC collects garbage when idle. By default it runs every 0.3 seconds, and we can disable it with `-I0`.

I also learned a bit about GHC's default pre-emption settings:

> GHC implements pre-emptive multitasking: the execution of threads are interleaved in a random fashion. More specifically, a thread may be pre-empted whenever it allocates some memory [...] The rescheduling timer runs on a 20ms granularity by default, but this may be altered using the `-i` RTS option. [...]

This sounds like a potential contributing factor, but unfortunately, these docs may be out of date. The RTS docs don't list an `-i` flag. However, there are plenty of other knobs to try!
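To give a feel for what `-s` prints, here's a sketch using a throwaway Haskell program (the same would apply to an `elm-make` binary built with `-rtsopts`). This assumes `ghc` is on `PATH` and skips otherwise:

```shell
cat > /tmp/busy.hs <<'EOF'
main :: IO ()
main = print (sum [1 .. 1000000 :: Int])
EOF

if command -v ghc >/dev/null 2>&1; then
  ghc -rtsopts -o /tmp/busy /tmp/busy.hs >/dev/null
  # +RTS -s dumps allocation, GC, and elapsed-time statistics to stderr:
  /tmp/busy +RTS -s -RTS
else
  echo "ghc not on PATH; skipping the demo"
fi
```

The "bytes allocated in the heap", "%GC time", and "Parallel GC work balance" lines in that report are exactly the numbers this thread has been trying to infer indirectly from `perf stat`.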
@eeue56 and @AntouanK - if you could try the above flags out on your benchmarks, it would be so amazing!!! I feel some combination of these may be the answer we've been looking for! 😄
Note: I'm not actually sure how to be sure these are enabled. I linked to the docs, but as Noah learned, the exact way you specify them can be tricky, and GHC has a habit of silently ignoring flags it doesn't understand.
As an aside, I'm curious if this is why macOS programs are doing better: maybe GHC thinks they all have 1 core.
I ran this on my MacBook Pro (2 physical cores + 2 HT):
```haskell
module Main where

import Control.Concurrent

main :: IO ()
main = do
  capabilities <- getNumCapabilities
  putStrLn $ show capabilities
```

It printed `1`. 😕
Curious what others see when running this.
Nope, that's not it. I also ran it on nixOS with 8 physical cores (Ryzen) and it also printed `1`. :confounded:
@rtfeldman, I get the same on my 8-core AMD under nixOS. Perhaps it's set to 1 by default everywhere, unless overridden with `setNumCapabilities` or the `+RTS -N` runtime flag mentioned in the `setNumCapabilities` doc string:

> Set the number of Haskell threads that can run truly simultaneously (on separate physical processors) at any given time. The number passed to forkOn is interpreted modulo this value. The initial value is given by the +RTS -N runtime flag.
@jmitchell yep, I just confirmed that - it prints out whatever `-N` flag I give it, but defaults to `1`.
This is even more confusing then. Why would `sysconfcpus` have any effect at all if GHC always thinks there is only 1 core available unless you tell it otherwise? :thinking:
Ah, because `setNumCapabilities` is applied according to `GHC.Conc.getNumProcessors`, which on my machine returns 8.
Ahh, that's why!
I learned from Brian McKenna on Twitter that if you pass `-N` by itself (no number specified), GHC will infer that for you on startup, making that `getNumProcessors` line unnecessary. I verified locally that it works; on my MacBook Pro it reported 4, and on my nixOS Ryzen machine it reported 8.
Then I set up a test on Travis to see if maybe `-N` would lead `getNumCapabilities` to report a better number than `getNumProcessors` (which reports 32 cores available, when actually only 2 are available to the virtualized environment on Travis), but sadly the `-N` approach also reports 32 on Travis. :disappointed:
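A sketch of how to check this locally (assuming `ghc` is on `PATH`; the program just reports what the RTS decided):

```shell
cat > /tmp/caps.hs <<'EOF'
import GHC.Conc (getNumCapabilities, getNumProcessors)

main :: IO ()
main = do
  caps  <- getNumCapabilities
  procs <- getNumProcessors
  putStrLn ("capabilities: " ++ show caps ++ ", processors: " ++ show procs)
EOF

if command -v ghc >/dev/null 2>&1; then
  ghc -threaded -rtsopts -o /tmp/caps /tmp/caps.hs >/dev/null
  /tmp/caps                # default: capabilities = 1
  /tmp/caps +RTS -N -RTS   # bare -N: capabilities inferred from the machine
else
  echo "ghc not on PATH; skipping"
fi
```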
There is also a chance that the cause of stack overflow issue #164 is related, if the configured values are relative to the total amount of available memory on a machine, too.
For future posterity: building `elm-make` with `-rtsopts` in the GHC options allows setting the RTS options on execution, rather than at compile time.

https://downloads.haskell.org/~ghc/7.8.4/docs/html/users_guide/runtime-control.html
@zwilias Any chance to get this flag in by default? (Any implications?) Otherwise a workaround would have to rely on an elm-make fork.
I want to make https://github.com/elm-lang/elm-compiler/issues/1473 the canonical issue for this. It probably makes sense to summarize the things that need to happen into a meta issue though. I will coordinate with @eeue56 and @zwilias to get a list of TODO items that should be in that.
Hi.
Noticed this issue when I was working in parallel on the same project, on my PC and my MacBook Pro. For some weird reason, a 3-year-old MBP was compiling faster than a brand new 16-core PC. For example:
MBP:
PC:
After the tip from @eeue56 on the Elm Slack channel to use `sysconfcpus`, I saw a huge boost. On the Ryzen PC with Linux, with one core (so with `sysconfcpus -n 1`) I can run `make build` in ~10.4 seconds! (On the Mac, `sysconfcpus -n 1` makes no difference.)

So, how come the same process is ~50% slower when running on 16 cores than on one? Is there anything I can do to make the compiler take advantage of the multiple cores?
Thanks.