ekmett / machines

Networks of composable stream transducers
Other
340 stars 46 forks source link

Speed up fold ~3x #71

Closed HuwCampbell closed 8 years ago

HuwCampbell commented 8 years ago

Sorry for the bikeshedding, I wasn't sure if I should put this forth as fold was already rather elegant using scan, but I got a pretty decent improvement in benchmarks: Before:

benchmarking fold/machines
time                 121.8 ms   (120.0 ms .. 125.2 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 123.0 ms   (122.1 ms .. 124.3 ms)
std dev              1.543 ms   (973.9 μs .. 2.259 ms)
variance introduced by outliers: 11% (moderately inflated)

benchmarking fold/pipes
time                 7.856 ms   (7.474 ms .. 8.207 ms)
                     0.977 R²   (0.956 R² .. 0.991 R²)
mean                 7.742 ms   (7.437 ms .. 7.969 ms)
std dev              751.7 μs   (538.1 μs .. 1.025 ms)
variance introduced by outliers: 54% (severely inflated)

benchmarking fold/conduit
time                 6.836 ns   (6.678 ns .. 6.994 ns)
                     0.997 R²   (0.995 R² .. 0.998 R²)
mean                 6.887 ns   (6.763 ns .. 7.035 ns)
std dev              438.3 ps   (339.2 ps .. 590.1 ps)
variance introduced by outliers: 83% (severely inflated)

After:

benchmarking fold/machines
time                 46.81 ms   (43.16 ms .. 50.11 ms)
                     0.988 R²   (0.978 R² .. 0.997 R²)
mean                 43.00 ms   (41.79 ms .. 44.48 ms)
std dev              2.626 ms   (2.039 ms .. 3.835 ms)
variance introduced by outliers: 20% (moderately inflated)

benchmarking fold/pipes
time                 8.399 ms   (8.174 ms .. 8.708 ms)
                     0.991 R²   (0.986 R² .. 0.995 R²)
mean                 8.445 ms   (8.273 ms .. 8.653 ms)
std dev              534.4 μs   (445.1 μs .. 651.5 μs)
variance introduced by outliers: 35% (moderately inflated)

benchmarking fold/conduit
time                 7.079 ns   (6.957 ns .. 7.218 ns)
                     0.997 R²   (0.996 R² .. 0.998 R²)
mean                 7.105 ns   (6.984 ns .. 7.213 ns)
std dev              400.9 ps   (334.9 ps .. 525.1 ps)
variance introduced by outliers: 79% (severely inflated)
acowley commented 8 years ago

That's a pretty substantial speedup. Quick question: is it just the composition with final that's slowing it down, or is scan not getting inlined?

YoEight commented 8 years ago

@acowley IMHO, (~>) is at fault. That function isn't as fast as its pipes counterpart for instance.

HuwCampbell commented 8 years ago

I think it's that it's yielding a lot of intermediate values and discarding them, it's faster to just not yield them.

acowley commented 8 years ago

Yeah, that's the kind of thing that gets eaten up by RULES most of the time. We could probably do something specific for rewriting uses of final to avoid the machine composition. IIUC, this is the relevant pipes correspondence. Is it actually doing anything terribly more efficient? (I honestly don't know)

The specific benchmark that @HuwCampbell is looking at is also kind of interesting as there's just not a lot to it. We should perhaps sprinkle INLINABLE pragmas more generously to crunch down on concrete types.

EDIT: I'd also vote for @HuwCampbell's patch, btw, as it's not worth paying any penalty on something as central as fold.

treeowl commented 8 years ago

I don't see any reason not to accept this, although it seems to point to deeper efficiency concerns.