Closed klapaucius closed 6 years ago
cc: @nikita-volkov
Woops. I'll take a look
Alright! I've applied some optimizations there, increasing the performance several times. I've also added more benchmarks including the subject case. GC is still hard, but at least the performance is now still better than of the vector-growing fold. That is, even when building the vector of size 10,000,000 (as in the subject).
However, it takes forever to run even larger benchmarks, so I am not certain now that the algorithm will perform better then, which is why I think it'll make sense to restore the older vector fold in this library and have both alternatives coexist. We'll just need to mention the performance implications in the docs.
Here are the benchmark results of "vector-builder" of version 0.3.3:
benchmarking 1000/vector-builder
time 91.11 μs (90.75 μs .. 91.51 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 91.54 μs (91.06 μs .. 92.28 μs)
std dev 1.882 μs (1.321 μs .. 3.169 μs)
variance introduced by outliers: 16% (moderately inflated)
benchmarking 1000/default
time 272.8 μs (271.1 μs .. 275.2 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 273.6 μs (271.8 μs .. 275.4 μs)
std dev 6.106 μs (5.212 μs .. 7.302 μs)
variance introduced by outliers: 16% (moderately inflated)
benchmarking 10000/vector-builder
time 2.008 ms (1.989 ms .. 2.027 ms)
0.999 R² (0.999 R² .. 1.000 R²)
mean 2.008 ms (1.997 ms .. 2.019 ms)
std dev 38.06 μs (29.94 μs .. 49.11 μs)
benchmarking 10000/default
time 3.029 ms (3.008 ms .. 3.045 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 3.012 ms (2.993 ms .. 3.028 ms)
std dev 60.85 μs (46.65 μs .. 84.29 μs)
benchmarking 100000/vector-builder
time 28.51 ms (27.91 ms .. 28.96 ms)
0.998 R² (0.997 R² .. 0.999 R²)
mean 28.72 ms (28.35 ms .. 29.29 ms)
std dev 1.007 ms (642.3 μs .. 1.607 ms)
variance introduced by outliers: 10% (moderately inflated)
benchmarking 100000/default
time 32.81 ms (32.63 ms .. 32.98 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 32.76 ms (32.67 ms .. 32.87 ms)
std dev 198.4 μs (153.3 μs .. 257.1 μs)
benchmarking 1000000/vector-builder
time 299.6 ms (250.5 ms .. 348.9 ms)
0.985 R² (0.928 R² .. 1.000 R²)
mean 308.4 ms (283.7 ms .. 328.2 ms)
std dev 25.19 ms (12.99 ms .. 34.22 ms)
variance introduced by outliers: 18% (moderately inflated)
benchmarking 1000000/default
time 332.1 ms (312.7 ms .. 355.4 ms)
0.997 R² (0.982 R² .. 1.000 R²)
mean 328.5 ms (318.7 ms .. 335.9 ms)
std dev 9.820 ms (5.357 ms .. 13.32 ms)
variance introduced by outliers: 16% (moderately inflated)
benchmarking 10000000/vector-builder
time 3.005 s (2.391 s .. 3.785 s)
0.992 R² (0.975 R² .. NaN R²)
mean 3.247 s (3.109 s .. 3.378 s)
std dev 220.4 ms (543.9 as .. 226.5 ms)
variance introduced by outliers: 20% (moderately inflated)
benchmarking 10000000/default
time 3.699 s (3.462 s .. 3.897 s)
1.000 R² (0.998 R² .. 1.000 R²)
mean 3.594 s (3.503 s .. 3.641 s)
std dev 78.70 ms (543.9 as .. 81.27 ms)
variance introduced by outliers: 19% (moderately inflated)
I've run the benchmarks on 100,000,000 and it's about the point when the vector-growing algo starts beating "vector-builder":
benchmarking 1000/vector-builder
time 91.29 μs (90.78 μs .. 91.93 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 91.34 μs (90.78 μs .. 92.11 μs)
std dev 2.223 μs (1.440 μs .. 3.850 μs)
variance introduced by outliers: 21% (moderately inflated)
benchmarking 1000/default
time 284.5 μs (281.2 μs .. 289.4 μs)
0.999 R² (0.998 R² .. 1.000 R²)
mean 285.2 μs (283.8 μs .. 287.3 μs)
std dev 5.460 μs (4.009 μs .. 8.053 μs)
variance introduced by outliers: 12% (moderately inflated)
benchmarking 10000/vector-builder
time 2.056 ms (2.025 ms .. 2.084 ms)
0.999 R² (0.998 R² .. 0.999 R²)
mean 2.032 ms (2.019 ms .. 2.045 ms)
std dev 42.98 μs (35.34 μs .. 56.33 μs)
benchmarking 10000/default
time 3.068 ms (3.044 ms .. 3.092 ms)
0.999 R² (0.998 R² .. 0.999 R²)
mean 3.131 ms (3.104 ms .. 3.162 ms)
std dev 91.35 μs (73.99 μs .. 111.9 μs)
variance introduced by outliers: 14% (moderately inflated)
benchmarking 100000/vector-builder
time 28.88 ms (28.26 ms .. 29.37 ms)
0.998 R² (0.997 R² .. 0.999 R²)
mean 28.90 ms (28.43 ms .. 29.41 ms)
std dev 1.063 ms (705.3 μs .. 1.658 ms)
variance introduced by outliers: 10% (moderately inflated)
benchmarking 100000/default
time 33.27 ms (32.78 ms .. 33.76 ms)
0.999 R² (0.998 R² .. 1.000 R²)
mean 33.52 ms (33.28 ms .. 33.81 ms)
std dev 523.8 μs (360.0 μs .. 840.4 μs)
benchmarking 1000000/vector-builder
time 302.8 ms (257.0 ms .. 346.7 ms)
0.988 R² (0.940 R² .. 1.000 R²)
mean 309.0 ms (286.7 ms .. 326.0 ms)
std dev 22.40 ms (10.72 ms .. 30.25 ms)
variance introduced by outliers: 18% (moderately inflated)
benchmarking 1000000/default
time 336.0 ms (314.4 ms .. 358.4 ms)
0.996 R² (0.983 R² .. 1.000 R²)
mean 330.8 ms (319.9 ms .. 337.3 ms)
std dev 10.77 ms (3.708 ms .. 14.49 ms)
variance introduced by outliers: 16% (moderately inflated)
benchmarking 10000000/vector-builder
time 3.097 s (2.271 s .. 4.085 s)
0.988 R² (0.961 R² .. 1.000 R²)
mean 3.343 s (3.188 s .. 3.647 s)
std dev 263.2 ms (543.9 as .. 264.7 ms)
variance introduced by outliers: 21% (moderately inflated)
benchmarking 10000000/default
time 3.784 s (3.544 s .. 3.952 s)
1.000 R² (0.999 R² .. 1.000 R²)
mean 3.675 s (3.586 s .. 3.724 s)
std dev 77.91 ms (0.0 s .. 83.70 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking 100000000/vector-builder
time 192.0 s (144.6 s .. 218.6 s)
0.992 R² (0.984 R² .. NaN R²)
mean 188.9 s (182.5 s .. 193.1 s)
std dev 6.319 s (0.0 s .. 7.296 s)
variance introduced by outliers: 19% (moderately inflated)
benchmarking 100000000/default
time 163.0 s (105.0 s .. NaN s)
0.976 R² (0.930 R² .. 1.000 R²)
mean 160.0 s (151.1 s .. 174.2 s)
std dev 12.44 s (0.0 s .. 13.31 s)
variance introduced by outliers: 21% (moderately inflated)
is it boxed Vector
?
ok, every vector is boxed. Is it vector of boxed values?
i'm more interested in the Unboxed/Storable case
The last set of results are a little surprising because up until then the default
implementation's performance scaled linearly with the number of elements, but then a 10x increase in elements ran 40x slower
IMO pretty normal for something with 5GB RES.
What's weird is that the performance difference that @klapaucius is seeing is significantly larger than the one that @nikita-volkov is measuring
@klapaucius: Also, do you get the same results when using Control.Foldl.vector
from foldl-1.3.1
?
i's difference between Data.Vector
and Data.Vector.Unboxed
, so performance difference is actually understandable.
Data.Vector
10000000
411,952,312 bytes allocated in the heap
490,201,792 bytes copied during GC
169,415,272 bytes maximum residency (9 sample(s))
2,056,576 bytes maximum slop
369 MB total memory in use (50 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 298 colls, 0 par 0.578s 0.572s 0.0019s 0.0668s
Gen 1 9 colls, 0 par 0.219s 0.275s 0.0306s 0.1178s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.094s ( 0.171s elapsed)
GC time 0.797s ( 0.847s elapsed)
EXIT time 0.000s ( 0.033s elapsed)
Total time 0.906s ( 1.051s elapsed)
%GC time 87.9% (80.6% elapsed)
Alloc rate 4,394,157,994 bytes per MUT second
Productivity 12.1% of total user, 19.4% of total elapsed
Data.Vector
10000000
1,938,887,592 bytes allocated in the heap
2,076,921,448 bytes copied during GC
630,191,328 bytes maximum residency (11 sample(s))
8,271,064 bytes maximum slop
1088 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 3541 colls, 0 par 1.406s 1.482s 0.0004s 0.0468s
Gen 1 11 colls, 0 par 1.172s 1.565s 0.1423s 0.5815s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.594s ( 0.411s elapsed)
GC time 2.578s ( 3.047s elapsed)
EXIT time 0.000s ( 0.131s elapsed)
Total time 3.172s ( 3.590s elapsed)
%GC time 81.3% (84.9% elapsed)
Alloc rate 3,265,494,891 bytes per MUT second
Productivity 18.7% of total user, 15.1% of total elapsed
Data.Vector.Unboxed
10000000
971,706,320 bytes allocated in the heap
13,496 bytes copied during GC
83,887,216 bytes maximum residency (9 sample(s))
2,071,424 bytes maximum slop
168 MB total memory in use (5 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1388 colls, 0 par 0.000s 0.003s 0.0000s 0.0001s
Gen 1 9 colls, 0 par 0.000s 0.017s 0.0019s 0.0157s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.156s ( 0.198s elapsed)
GC time 0.000s ( 0.020s elapsed)
EXIT time 0.000s ( 0.016s elapsed)
Total time 0.156s ( 0.234s elapsed)
%GC time 0.0% (8.4% elapsed)
Alloc rate 6,218,920,448 bytes per MUT second
Productivity 100.0% of total user, 91.5% of total elapsed
Data.Vector.Unboxed
10000000
1,938,809,440 bytes allocated in the heap
2,076,900,352 bytes copied during GC
630,113,176 bytes maximum residency (11 sample(s))
8,349,216 bytes maximum slop
1088 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 3541 colls, 0 par 1.172s 1.193s 0.0003s 0.0124s
Gen 1 11 colls, 0 par 1.172s 1.504s 0.1367s 0.5151s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.438s ( 0.408s elapsed)
GC time 2.344s ( 2.697s elapsed)
EXIT time 0.016s ( 0.114s elapsed)
Total time 2.812s ( 3.220s elapsed)
%GC time 83.3% (83.8% elapsed)
Alloc rate 4,431,564,434 bytes per MUT second
Productivity 16.7% of total user, 16.2% of total elapsed
I've updated my benchmarks to include unboxed vectors. For some reason this time the boxed version starts losing at 100,000. The same applies to the unboxed one as well.
Here are the results:
benchmarking boxed/1000/builder
time 59.84 μs (59.29 μs .. 60.50 μs)
0.998 R² (0.997 R² .. 1.000 R²)
mean 60.16 μs (59.62 μs .. 60.95 μs)
std dev 2.558 μs (1.487 μs .. 3.415 μs)
variance introduced by outliers: 46% (moderately inflated)
benchmarking boxed/1000/growing
time 209.8 μs (208.9 μs .. 210.6 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 209.6 μs (209.1 μs .. 210.1 μs)
std dev 1.944 μs (1.526 μs .. 2.459 μs)
benchmarking boxed/10000/builder
time 1.907 ms (1.895 ms .. 1.921 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 1.895 ms (1.884 ms .. 1.907 ms)
std dev 41.68 μs (33.66 μs .. 49.92 μs)
benchmarking boxed/10000/growing
time 2.700 ms (2.597 ms .. 2.792 ms)
0.992 R² (0.989 R² .. 0.996 R²)
mean 2.606 ms (2.568 ms .. 2.652 ms)
std dev 138.9 μs (109.3 μs .. 165.8 μs)
variance introduced by outliers: 35% (moderately inflated)
benchmarking boxed/100000/builder
time 32.05 ms (31.25 ms .. 32.85 ms)
0.998 R² (0.996 R² .. 0.999 R²)
mean 32.78 ms (32.09 ms .. 33.95 ms)
std dev 2.099 ms (726.2 μs .. 3.406 ms)
variance introduced by outliers: 22% (moderately inflated)
benchmarking boxed/100000/growing
time 30.39 ms (28.95 ms .. 31.87 ms)
0.995 R² (0.991 R² .. 0.998 R²)
mean 29.82 ms (29.16 ms .. 30.65 ms)
std dev 1.470 ms (852.1 μs .. 1.832 ms)
variance introduced by outliers: 16% (moderately inflated)
benchmarking boxed/1000000/builder
time 359.6 ms (261.1 ms .. 426.5 ms)
0.992 R² (0.972 R² .. 1.000 R²)
mean 356.3 ms (335.9 ms .. 366.6 ms)
std dev 17.63 ms (67.99 as .. 17.83 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking boxed/1000000/growing
time 304.4 ms (291.2 ms .. 319.0 ms)
0.999 R² (0.997 R² .. 1.000 R²)
mean 293.8 ms (288.8 ms .. 297.6 ms)
std dev 5.068 ms (2.172 ms .. 6.338 ms)
variance introduced by outliers: 16% (moderately inflated)
benchmarking boxed/10000000/builder
time 3.693 s (3.108 s .. 4.281 s)
0.997 R² (0.988 R² .. 1.000 R²)
mean 3.945 s (3.791 s .. 4.059 s)
std dev 173.0 ms (0.0 s .. 197.2 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking boxed/10000000/growing
time 3.515 s (3.139 s .. 3.812 s)
0.999 R² (0.995 R² .. 1.000 R²)
mean 3.437 s (3.370 s .. 3.476 s)
std dev 60.32 ms (0.0 s .. 66.73 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking unboxed/1000/builder
time 59.20 μs (57.66 μs .. 60.87 μs)
0.996 R² (0.995 R² .. 0.999 R²)
mean 59.64 μs (58.84 μs .. 60.88 μs)
std dev 2.944 μs (2.280 μs .. 3.745 μs)
variance introduced by outliers: 54% (severely inflated)
benchmarking unboxed/1000/growing
time 197.5 μs (195.9 μs .. 199.6 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 197.8 μs (196.2 μs .. 199.8 μs)
std dev 5.576 μs (4.635 μs .. 6.943 μs)
variance introduced by outliers: 23% (moderately inflated)
benchmarking unboxed/10000/builder
time 1.697 ms (1.672 ms .. 1.722 ms)
0.999 R² (0.998 R² .. 0.999 R²)
mean 1.709 ms (1.694 ms .. 1.720 ms)
std dev 42.34 μs (33.00 μs .. 53.54 μs)
variance introduced by outliers: 13% (moderately inflated)
benchmarking unboxed/10000/growing
time 2.244 ms (2.200 ms .. 2.297 ms)
0.998 R² (0.997 R² .. 1.000 R²)
mean 2.217 ms (2.207 ms .. 2.231 ms)
std dev 43.72 μs (33.02 μs .. 64.60 μs)
benchmarking unboxed/100000/builder
time 30.00 ms (29.28 ms .. 30.65 ms)
0.998 R² (0.995 R² .. 0.999 R²)
mean 30.34 ms (29.81 ms .. 31.07 ms)
std dev 1.568 ms (787.1 μs .. 2.360 ms)
variance introduced by outliers: 17% (moderately inflated)
benchmarking unboxed/100000/growing
time 23.12 ms (22.74 ms .. 23.88 ms)
0.996 R² (0.990 R² .. 1.000 R²)
mean 23.44 ms (23.07 ms .. 23.86 ms)
std dev 859.0 μs (597.7 μs .. 1.101 ms)
benchmarking unboxed/1000000/builder
time 280.6 ms (271.3 ms .. 287.9 ms)
1.000 R² (0.998 R² .. 1.000 R²)
mean 316.7 ms (306.6 ms .. 334.9 ms)
std dev 17.44 ms (2.560 ms .. 22.38 ms)
variance introduced by outliers: 16% (moderately inflated)
benchmarking unboxed/1000000/growing
time 222.9 ms (215.0 ms .. 228.4 ms)
0.999 R² (0.995 R² .. 1.000 R²)
mean 223.1 ms (219.5 ms .. 225.3 ms)
std dev 3.460 ms (1.354 ms .. 4.725 ms)
variance introduced by outliers: 14% (moderately inflated)
benchmarking unboxed/10000000/builder
time 2.975 s (2.485 s .. 3.566 s)
0.995 R² (0.984 R² .. 1.000 R²)
mean 3.251 s (3.109 s .. 3.355 s)
std dev 157.7 ms (0.0 s .. 180.0 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking unboxed/10000000/growing
time 2.280 s (2.047 s .. 2.483 s)
0.999 R² (0.995 R² .. 1.000 R²)
mean 2.210 s (2.164 s .. 2.237 s)
std dev 41.37 ms (0.0 s .. 46.09 ms)
variance introduced by outliers: 19% (moderately inflated)
(Source)
More importantly, criterion tries very hard to escape any GC influence on results: performs major collections in between measurements, writes off pauses as outliers, etc. So it tells very little about real performance on large object trees with zillions of refs. After 100K it's literally GC benchmark:
1M
MUT time 0.031s ( 0.021s elapsed)
GC time 0.078s ( 0.100s elapsed)
10M
MUT time 0.125s ( 0.157s elapsed)
GC time 0.766s ( 0.836s elapsed)
100M
MUT time 2.953s ( 2.521s elapsed)
GC time 19.266s ( 21.591s elapsed)
So it's critical to have vectorM
or something with less memory pressure because, if you can afford many gigs of wasted RAM, 0.5 sec pauses and abysmal 15% MUT productivity why bother to use foldl
at all? Isn't it library for efficient, low memory footprint consumption of large datasets?
Either way I'll add vectorM
. I was just wanted to make sure that I understood the difference in results
This is now up on Hackage as foldl-1.3.3
foldl-1.2.5
vector-builder-0.3.1
foldl-1.3.1