This PR introduces parallelism in the compiler pipeline to improve performance. I've included benchmarks at the end.
Closes #2750.
Flags:
There are two new global flags:
-N / --threads. Sets the number of capabilities, which, according to the GHC documentation, is "the number of Haskell threads that can run truly simultaneously (on separate physical processors) at any given time". When compiling in parallel, we create this many worker threads. The default value is -N auto, which sets -N to half the number of logical cores, capped at 8 (see the sketch after this list).
--dev-show-thread-ids. When given, the thread id is printed in the compilation progress log.
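For illustration, here is a minimal sketch of how the -N auto default described above could be computed. getNumProcessors and setNumCapabilities are real functions from GHC.Conc; the halving-and-capping policy is the one this PR describes, but the code itself is an assumption, not the actual implementation:

```haskell
import GHC.Conc (getNumProcessors, setNumCapabilities)

-- Half the logical cores, capped at 8 (and at least 1).
defaultNumThreads :: IO Int
defaultNumThreads = do
  logicalCores <- getNumProcessors
  pure (max 1 (min 8 (logicalCores `div` 2)))

main :: IO ()
main = defaultNumThreads >>= setNumCapabilities
```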
Parallel compilation
I've added src/Parallel/ParallelTemplate.hs, which contains all the concurrency-related code. I think it is good to keep this code separate from the actual compiler code.
I've added a progress log (only for the parallel driver) that reports compilation progress, similar to what stack/cabal do.
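To convey the scheduling idea, here is a rough sketch (illustrative only, not the actual contents of src/Parallel/ParallelTemplate.hs; compileModule and the Map-based import graph are hypothetical stand-ins): N workers repeatedly pick a module whose imports have all been compiled, compile it, and mark it done.

```haskell
{-# LANGUAGE LambdaCase #-}

import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (replicateM_)
import qualified Data.Map as Map
import qualified Data.Set as Set

type Name = String

-- Compile every module in the import graph, spawning `numWorkers`
-- threads; a module is picked up only once all its imports are done.
runParallel :: Int -> Map.Map Name [Name] -> (Name -> IO ()) -> IO ()
runParallel numWorkers deps compileModule = do
  pending <- newTVarIO (Map.keysSet deps)
  done <- newTVarIO Set.empty
  exited <- newTVarIO (0 :: Int)
  let total = Map.size deps
      -- Pick any pending module whose imports are all compiled;
      -- block (retry) if none is ready yet.
      takeReady = atomically $ do
        d <- readTVar done
        if Set.size d == total
          then pure Nothing -- everything is compiled
          else do
            p <- readTVar pending
            case [m | m <- Set.toList p, all (`Set.member` d) (deps Map.! m)] of
              [] -> retry -- wait for a dependency to finish
              m : _ -> Just m <$ writeTVar pending (Set.delete m p)
      worker =
        takeReady >>= \case
          Nothing -> atomically (modifyTVar' exited (+ 1))
          Just m -> do
            compileModule m
            atomically (modifyTVar' done (Set.insert m))
            worker
  replicateM_ numWorkers (forkIO worker)
  -- Block until every worker has exited.
  atomically (readTVar exited >>= check . (== numWorkers))
```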
Code changes:
I've removed the setup stage where we were registering dependencies. Instead, the dependencies are registered the first time the pathResolver is run, which is safer.
The ImportTree is now needed to run the pipeline. Cycles are detected during the construction of this tree, so I've removed Reader ImportParents from the pipeline.
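For intuition, here is a tiny sketch of cycle detection during a depth-first construction of an import tree: the chain of parents travels down the traversal (which is why a separate Reader ImportParents becomes unnecessary), and revisiting a module already on the chain means there is a cycle. The names and types are assumptions, not the actual Juvix code, and memoization of already-visited modules is omitted for brevity.

```haskell
import qualified Data.Map as Map

type Name = String

-- Walk the imports depth-first from the root, carrying the chain of
-- parent modules; a module that appears in its own parent chain
-- closes an import cycle, which we report.
detectCycle :: Map.Map Name [Name] -> Name -> Either [Name] ()
detectCycle imports = go []
  where
    go parents m
      | m `elem` parents = Left (reverse (m : parents))
      | otherwise =
          mapM_ (go (m : parents)) (Map.findWithDefault [] m imports)
```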
For the package pathResolver, we do not support parallelism yet (we could add support for it in the future, but the gains will be small).
With -N1, the pipeline remains unchanged, so performance should be the same as on the main branch (except for a small degradation due to adding the -threaded flag).
I've introduced PipelineOptions, which are used to pass options to the effects in the pipeline.
The PathResolver constraint has been removed from the upTo* functions in the pipeline because it was redundant.
I've added many NFData instances. They are needed to force the full evaluation of Stored.ModuleInfo in each of the threads.
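The pattern, roughly (ModuleInfo below is a hypothetical stand-in for Stored.ModuleInfo; deriving via Generic is one common way to provide such instances):

```haskell
{-# LANGUAGE DeriveGeneric #-}

import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)
import GHC.Generics (Generic)

data ModuleInfo = ModuleInfo
  { miName :: String
  , miTable :: [(String, Int)]
  }
  deriving (Generic)

instance NFData ModuleInfo -- default implementation derived via Generic

-- Force the result to normal form inside the worker thread, so no
-- unevaluated thunks (or latent crashes) escape to other threads.
publish :: ModuleInfo -> IO ModuleInfo
publish = evaluate . force
```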
The Cache effect uses SharedState as opposed to LocalState. Perhaps we should provide different versions.
I've added a Cache handler that accepts a setup function, which is triggered when a miss is detected. It is used to lazily compile the modules in parallel.
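A minimal sketch of the idea, assuming an IORef-based store (the real handler is an effect handler using SharedState, so the actual code differs):

```haskell
import Data.IORef
import qualified Data.Map as Map

-- A cache whose miss branch triggers a user-supplied setup action;
-- in this PR, the setup lazily kicks off compilation of the module.
data Cache k v = Cache
  { cacheRef :: IORef (Map.Map k v)
  , cacheSetup :: k -> IO v -- invoked on a miss
  }

cacheLookup :: Ord k => Cache k v -> k -> IO v
cacheLookup c k = do
  m <- readIORef (cacheRef c)
  case Map.lookup k m of
    Just v -> pure v -- hit: reuse the stored value
    Nothing -> do
      v <- cacheSetup c k -- miss: run the setup (e.g. compile)
      modifyIORef' (cacheRef c) (Map.insert k v)
      pure v
```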
Tests
I've adapted the smoke test suite to ignore the progress log in the stderr.
I've had to adapt tests/positive/Internal/Lambda.juvix. Due to laziness, a crash in this file was not being caught. The problem is that the file contains a lambda function whose clauses have a different number of patterns, which we currently do not support (https://github.com/anoma/juvix/issues/1706).
I've also removed … from the test as it was causing a crash (https://github.com/anoma/juvix/issues/2247).
Future Work
We should investigate how much performance we lose by fully evaluating the Stored.ModuleInfo, since some of the information in it will be discarded. It may be possible to be more fine-grained when forcing evaluation.
The scanning of imports to build the import tree is sequential. Currently, we build the import tree from the entry point module, and only the modules that are imported from it are in the tree. However, we have discussed that at some point we should make a distinction between juvix the compiler and juvix the build tool. When using juvix as a build tool it makes sense to typecheck/compile (to stored core) all modules in the project. When/if we do this, scanning imports in all modules in parallel becomes trivial.
The implementation of the ParallelTemplate uses low-level primitives such as forkIO. At some point it should be refactored to use safer functions from the Effectful.Concurrent.Async module.
The number of cores and worker threads that we spawn is determined by the command line. Ideally, we could use the import tree to compute an upper bound on the ideal number of cores to use.
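For example, one rough heuristic (a sketch under assumed types, not existing code): group modules by their height in the import DAG. Two modules of equal height can never depend on each other, so the largest height class is a set of modules that could all compile at once, giving an estimate of how many threads can ever be busy simultaneously.

```haskell
import qualified Data.Map as Map

type Name = String

-- Estimate useful parallelism from the import graph: the largest
-- class of modules sharing the same height could compile in one wave.
parallelismEstimate :: Map.Map Name [Name] -> Int
parallelismEstimate imports
  | Map.null imports = 1
  | otherwise = maximum (Map.elems levelSizes)
  where
    -- Lazily memoised heights: a module's height is one more than the
    -- tallest of its imports; modules with no imports have height 0.
    -- This assumes the graph is acyclic (cycles are rejected earlier).
    heights = Map.mapWithKey (\m _ -> heightOf m) imports
    heightOf m = case Map.findWithDefault [] m imports of
      [] -> 0
      ds -> 1 + maximum [Map.findWithDefault 0 d heights | d <- ds]
    levelSizes = Map.fromListWith (+) [(h, 1 :: Int) | h <- Map.elems heights]
```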
We could add an animation that displays which modules are being compiled in parallel and which have finished being compiled.
Benchmarks
On some benchmarks I include the GHC runtime option -A, which sometimes has a significant impact on performance. Thanks to @paulcadman for pointing this out. I've figured out a good combination of -N and -A through trial and error (but this obviously depends on the CPU and the Juvix project being compiled).
Typecheck the standard library
Clean run (88% faster than main):
hyperfine --warmup 1 --prepare 'juvix clean' 'juvix -N 4 typecheck Stdlib/Prelude.juvix +RTS -A33554432' 'juvix -N 4 typecheck Stdlib/Prelude.juvix' 'juvix-main typecheck Stdlib/Prelude.juvix'
Benchmark 1: juvix -N 4 typecheck Stdlib/Prelude.juvix +RTS -A33554432
Time (mean ± σ): 444.1 ms ± 6.5 ms [User: 1018.0 ms, System: 77.7 ms]
Range (min … max): 432.6 ms … 455.9 ms 10 runs
Benchmark 2: juvix -N 4 typecheck Stdlib/Prelude.juvix
Time (mean ± σ): 628.3 ms ± 23.9 ms [User: 1227.6 ms, System: 69.5 ms]
Range (min … max): 584.7 ms … 670.6 ms 10 runs
Benchmark 3: juvix-main typecheck Stdlib/Prelude.juvix
Time (mean ± σ): 835.9 ms ± 12.3 ms [User: 788.5 ms, System: 31.9 ms]
Range (min … max): 816.0 ms … 853.6 ms 10 runs
Summary
juvix -N 4 typecheck Stdlib/Prelude.juvix +RTS -A33554432 ran
1.41 ± 0.06 times faster than juvix -N 4 typecheck Stdlib/Prelude.juvix
1.88 ± 0.04 times faster than juvix-main typecheck Stdlib/Prelude.juvix
Cached run (43% faster than main):
hyperfine --warmup 1 'juvix -N 4 typecheck Stdlib/Prelude.juvix +RTS -A33554432' 'juvix -N 4 typecheck Stdlib/Prelude.juvix' 'juvix-main typecheck Stdlib/Prelude.juvix'
Benchmark 1: juvix -N 4 typecheck Stdlib/Prelude.juvix +RTS -A33554432
Time (mean ± σ): 241.3 ms ± 7.3 ms [User: 538.6 ms, System: 101.3 ms]
Range (min … max): 231.5 ms … 251.3 ms 11 runs
Benchmark 2: juvix -N 4 typecheck Stdlib/Prelude.juvix
Time (mean ± σ): 235.1 ms ± 12.0 ms [User: 405.3 ms, System: 87.7 ms]
Range (min … max): 216.1 ms … 253.1 ms 12 runs
Benchmark 3: juvix-main typecheck Stdlib/Prelude.juvix
Time (mean ± σ): 336.7 ms ± 13.3 ms [User: 269.5 ms, System: 67.1 ms]
Range (min … max): 316.9 ms … 351.8 ms 10 runs
Summary
juvix -N 4 typecheck Stdlib/Prelude.juvix ran
1.03 ± 0.06 times faster than juvix -N 4 typecheck Stdlib/Prelude.juvix +RTS -A33554432
1.43 ± 0.09 times faster than juvix-main typecheck Stdlib/Prelude.juvix
Typecheck the test suite of the containers library
At the moment this is the biggest juvix project that we have.
Clean run (105% faster than main)
hyperfine --warmup 1 --prepare 'juvix clean' 'juvix -N 6 typecheck Main.juvix +RTS -A67108864' 'juvix -N 4 typecheck Main.juvix' 'juvix-main typecheck Main.juvix'
Benchmark 1: juvix -N 6 typecheck Main.juvix +RTS -A67108864
Time (mean ± σ): 1.006 s ± 0.011 s [User: 2.171 s, System: 0.162 s]
Range (min … max): 0.991 s … 1.023 s 10 runs
Benchmark 2: juvix -N 4 typecheck Main.juvix
Time (mean ± σ): 1.584 s ± 0.046 s [User: 2.934 s, System: 0.149 s]
Range (min … max): 1.535 s … 1.660 s 10 runs
Benchmark 3: juvix-main typecheck Main.juvix
Time (mean ± σ): 2.066 s ± 0.010 s [User: 1.939 s, System: 0.089 s]
Range (min … max): 2.048 s … 2.077 s 10 runs
Summary
juvix -N 6 typecheck Main.juvix +RTS -A67108864 ran
1.57 ± 0.05 times faster than juvix -N 4 typecheck Main.juvix
2.05 ± 0.03 times faster than juvix-main typecheck Main.juvix
Cached run (54% faster than main)
hyperfine --warmup 1 'juvix -N 6 typecheck Main.juvix +RTS -A33554432' 'juvix -N 4 typecheck Main.juvix' 'juvix-main typecheck Main.juvix'
Benchmark 1: juvix -N 6 typecheck Main.juvix +RTS -A33554432
Time (mean ± σ): 551.8 ms ± 13.2 ms [User: 1419.8 ms, System: 199.4 ms]
Range (min … max): 535.2 ms … 570.6 ms 10 runs
Benchmark 2: juvix -N 4 typecheck Main.juvix
Time (mean ± σ): 636.7 ms ± 17.3 ms [User: 1006.3 ms, System: 196.3 ms]
Range (min … max): 601.6 ms … 655.3 ms 10 runs
Benchmark 3: juvix-main typecheck Main.juvix
Time (mean ± σ): 847.2 ms ± 58.9 ms [User: 710.1 ms, System: 126.5 ms]
Range (min … max): 731.1 ms … 890.0 ms 10 runs
Summary
juvix -N 6 typecheck Main.juvix +RTS -A33554432 ran
1.15 ± 0.04 times faster than juvix -N 4 typecheck Main.juvix
1.54 ± 0.11 times faster than juvix-main typecheck Main.juvix