Unfold vs flattenArray performance investigation

The array concat operation can be simply implemented as follows:

concat :: (IsStream t, MonadIO m, Storable a) => t m (Array a) -> t m a
concat = S.concatMap A.read

However, concatMap cannot fuse read so we implemented a custom flattenArrays to directly implement array concat instead of using concatMap. This performs pretty well but we lose composability. To solve the composability problem we introduced an Unfold to abstract the stream generation (A.read) in such a way that can fuse well because the state is now visible to the compiler and can be used in optimizations.

readU :: forall m a. (Monad m, Storable a) => Unfold m (Array a) a

concat = S.concatMapU A.readU

This fuses completely, as well as the custom implementation and performance is the same except in one case i.e. linecount benchmark where it is 50% slower. Though the core seems to shorter in the unfold case. One possible reason may be that in the unfold case we are passing one more local variable in the loop which may cause less optimal code due to register spill? Or is it something else? We can try if llvm can reduce this difference. For example, in the flattenArrays case we have:

                                  jump $s$wgo3_s8Fe
                                    s'1_X7fY
                                    ()
                                    (ForeignPtr sc7_s8Fv sc8_s8Fw)
                                    (plusAddr# sc9_s8Fx 1#)
                                    (Ptr sc10_s8Fy)
                                    (+# sc5_s8EY 1#)

In the Unfold case we have the corresponding jump as:

                                  jump $s$wgo3_s8Ks
                                    sc7_s8KN
                                    sc8_s8KJ
                                    sc9_s8KK
                                    (plusAddr# sc10_s8KL 1#)
                                    sc11_s8KM
                                    ()
                                    (+# sc6_s8Kb 1#)

These are the core files for both the cases:

concatMap.dump-simpl.linecount.flattenArrays.txt

concatMap.dump-simpl.linecount.unfold.txt

composewell / streamly

Unfold vs flattenArray performance investigation #263