composewell / streamly

High performance, concurrent functional programming abstractions
https://streamly.composewell.com
Other
865 stars 66 forks source link

Investigate performance degradation due to TypeFamilies extension #567

Open harendra-kumar opened 4 years ago

harendra-kumar commented 4 years ago

PR #566 proposes enabling most extensions by default. During the investigation we found that enabling TypeFamilies extension causes several performance regressions as well as improvements. We would like to keep the improvements but avoid the regressions. We need to investigate why the degradation and improvements occur? And if there are already regressions due to the cases where this extension is currently being used?

harendra-kumar commented 4 years ago

These are some significant degradation and improvements in run time perf of the Prelude.Serial benchmark suite when TypeFamilies extension is enabled by default for everything. Several other benchmarks suites have similar impact. +x in the second column means x percent degradation and -x means x percent improvement (on the lower base value):

Prelude.Serial(cpuTime)
Benchmark                                                                default(0)(μs) default(1) - default(0)(%)
------------------------------------------------------------------------ -------------- --------------------------
Prelude.Serial/o-1-space/pipesX4/tee                                            1993.13                   +1209.05
Prelude.Serial/o-1-space/concat-foldable/foldMapWith (<>) (Stream)              1543.25                    +140.35
Prelude.Serial/o-1-space/elimination/product                                      42.85                    +120.41
Prelude.Serial/o-1-space/mixed/sum-product-scan                                   49.43                     +90.77
Prelude.Serial/o-1-space/elimination/reduce/IO/foldlM'                            39.84                     +58.49
Prelude.Serial/o-n-space/foldl/maximumBy                                       13247.20                     +31.22
Prelude.Serial/o-1-space/filtering/drop-one                                       43.14                     +31.19
Prelude.Serial/o-n-space/foldl/minimumBy                                       13175.00                     +30.41
Prelude.Serial/o-1-space/elimination/reduce/IO/foldl'                             40.70                     +22.61
Prelude.Serial/o-1-space/generation/IsString.fromString                          482.02                     +21.07
Prelude.Serial/o-1-space/elimination/length . generally                           41.53                     +20.92
Prelude.Serial/o-1-space/generation/integerFromStep                              644.48                     +20.70
Prelude.Serial/o-1-space/elimination/length                                       42.17                     +19.04
Prelude.Serial/o-1-space/foldable/maximum                                         71.68                     +16.91
Prelude.Serial/o-1-space/mixed/foldl-map                                          41.20                     +16.09
Prelude.Serial/o-1-space/elimination/sum                                          43.31                     +14.92
Prelude.Serial/o-1-space/concat/concatMapWith (n of 1)                         14155.00                     +14.19

Prelude.Serial/o-1-space/concat/concatMapRepl (sqrt n of sqrt n)                 652.58                      -9.24
Prelude.Serial/o-1-space/concat-foldable/foldMapM (List)                        5984.40                      -9.43
Prelude.Serial/o-1-space/Monad/(>>) (sqrt n x sqrt n)                           2979.64                      -9.91
Prelude.Serial/o-1-space/concat/concatMap (sqrt n of sqrt n)                     642.06                     -10.10
Prelude.Serial/o-1-space/concat/concatMap (1 of n)                              1268.91                     -10.97
Prelude.Serial/o-1-space/concat/concatMapId (n of 1) (fromFoldable)             5139.50                     -11.22
Prelude.Serial/o-1-space/Applicative/(<*) (sqrt n x sqrt n)                     2989.50                     -12.07
Prelude.Serial/o-1-space/foldable/sum                                             83.35                     -13.79
Prelude.Serial/o-1-space/filtering/dropWhile-true                                 50.27                     -18.96
Prelude.Serial/o-1-space/foldable/sequence_                                       56.13                     -28.92
Prelude.Serial/o-1-space/filtering/intersperse                                    56.55                     -35.41
harendra-kumar commented 4 years ago

Raised a GHC ticket here: https://gitlab.haskell.org/ghc/ghc/-/issues/18414 .

Comparison with MonoLocalBinds disabled and enabled:

Prelude.Serial(time)
Benchmark                                                                default(0)(μs) default(1) - default(0)(%)
------------------------------------------------------------------------ -------------- --------------------------
Prelude.Serial/o-1-space/pipesX4/tee                                            1909.86                   +1295.31
Prelude.Serial/o-1-space/concat-foldable/foldMapWith (<>) (Stream)              1492.10                    +141.99
Prelude.Serial/o-1-space/elimination/product                                      39.22                    +137.14
Prelude.Serial/o-1-space/mixed/sum-product-scan                                   46.27                     +97.93
Prelude.Serial/o-1-space/elimination/reduce/IO/foldlM'                            38.73                     +57.75
Prelude.Serial/o-n-space/foldl/minimumBy                                       12585.61                     +35.11
Prelude.Serial/o-n-space/foldl/maximumBy                                       12850.76                     +33.43
Prelude.Serial/o-1-space/elimination/length . generally                           38.87                     +22.80
Prelude.Serial/o-1-space/elimination/length                                       39.67                     +22.17
Prelude.Serial/o-1-space/elimination/reduce/IO/foldl'                             39.22                     +21.06
Prelude.Serial/o-1-space/filtering/filter-all-in                                  41.25                     +19.96
Prelude.Serial/o-1-space/mixed/foldl-map                                          38.94                     +19.72
Prelude.Serial/o-n-stack/iterated/nullHeadTail                                  3416.58                     +17.99
Prelude.Serial/o-n-heap/buffered/reverse'                                        130.26                     +14.01
Prelude.Serial/o-1-space/elimination/sum                                          41.77                     +12.35
Prelude.Serial/o-1-space/Applicative/liftA2 (sqrt n x sqrt n)                   2708.38                     +11.21
Prelude.Serial/o-1-space/filteringX4/mapMaybe                                     39.05                     +10.14

Results are pretty similar, which indicates that the issue is due to MonoLocalBinds enabled by TypeFamilies.

harendra-kumar commented 4 years ago

Ok, with TypeFamilies and NoMonoLocalBinds the regression is fully recovered:

Benchmark                                                                default(0)(μs) default(1) - default(0)(%) default(2) - default(0)(%)
------------------------------------------------------------------------ -------------- -------------------------- --------------------------
Prelude.Serial/o-1-space/pipesX4/tee                                            1910.64                   +1293.17                      -1.72
Prelude.Serial/o-1-space/concat-foldable/foldMapWith (<>) (Stream)              1490.11                    +142.23                      -5.27
Prelude.Serial/o-1-space/elimination/product                                      39.20                    +137.39                      +0.13
Prelude.Serial/o-1-space/mixed/sum-product-scan                                   46.28                     +97.74                      -1.99
Prelude.Serial/o-1-space/elimination/reduce/IO/foldlM'                            38.80                     +57.45                      -2.95
Prelude.Serial/o-n-space/foldl/minimumBy                                       12575.83                     +35.24                      +0.16
Prelude.Serial/o-n-space/foldl/maximumBy                                       12837.00                     +33.47                      -3.87
Prelude.Serial/o-1-space/elimination/length . generally                           38.93                     +22.63                      -1.32
Prelude.Serial/o-1-space/elimination/length                                       39.80                     +21.75                      -0.64
Prelude.Serial/o-1-space/elimination/reduce/IO/foldl'                             39.25                     +21.02                      -3.86
Prelude.Serial/o-1-space/filtering/filter-all-in                                  41.26                     +19.96                      -1.13
Prelude.Serial/o-1-space/mixed/foldl-map                                          38.94                     +19.92                      -0.04
Prelude.Serial/o-n-stack/iterated/nullHeadTail                                  3416.00                     +17.97                      -1.87
Prelude.Serial/o-n-heap/buffered/reverse'                                        130.00                     +14.13                      +2.55
Prelude.Serial/o-1-space/elimination/sum                                          41.82                     +12.13                      -1.69
Prelude.Serial/o-1-space/Applicative/liftA2 (sqrt n x sqrt n)                   2704.15                     +11.32                      -1.20
Prelude.Serial/o-1-space/filteringX4/mapMaybe                                     39.01                     +10.13                      -2.87