composewell / streamly

High performance, concurrent functional programming abstractions
https://streamly.composewell.com
Other
864 stars 66 forks source link

Unfold vs flattenArray performance investigation #263

Open harendra-kumar opened 5 years ago

harendra-kumar commented 5 years ago

The array concat operation can be simply implemented as follows:

concat :: (IsStream t, MonadIO m, Storable a) => t m (Array a) -> t m a
concat = S.concatMap A.read

However, concatMap cannot fuse read so we implemented a custom flattenArrays to directly implement array concat instead of using concatMap. This performs pretty well but we lose composability. To solve the composability problem we introduced an Unfold to abstract the stream generation (A.read) in such a way that can fuse well because the state is now visible to the compiler and can be used in optimizations.

readU :: forall m a. (Monad m, Storable a) => Unfold m (Array a) a

concat = S.concatMapU A.readU

This fuses completely, as well as the custom implementation and performance is the same except in one case i.e. linecount benchmark where it is 50% slower. Though the core seems to shorter in the unfold case. One possible reason may be that in the unfold case we are passing one more local variable in the loop which may cause less optimal code due to register spill? Or is it something else? We can try if llvm can reduce this difference. For example, in the flattenArrays case we have:

                                  jump $s$wgo3_s8Fe
                                    s'1_X7fY
                                    ()
                                    (ForeignPtr sc7_s8Fv sc8_s8Fw)
                                    (plusAddr# sc9_s8Fx 1#)
                                    (Ptr sc10_s8Fy)
                                    (+# sc5_s8EY 1#)

In the Unfold case we have the corresponding jump as:

                                  jump $s$wgo3_s8Ks
                                    sc7_s8KN
                                    sc8_s8KJ
                                    sc9_s8KK
                                    (plusAddr# sc10_s8KL 1#)
                                    sc11_s8KM
                                    ()
                                    (+# sc6_s8Kb 1#)

These are the core files for both the cases:

concatMap.dump-simpl.linecount.flattenArrays.txt

concatMap.dump-simpl.linecount.unfold.txt

harendra-kumar commented 5 years ago

See also #254 for another issue with similar problem. Maybe we cannot just avoid, but maybe there is an opportunity for GHC to be able to improve these cases in future, or maybe we can tweak the implementation in way so that we can get better perf.