harendra-kumar opened this issue 5 years ago
See also #254 for another issue with a similar problem. Maybe we cannot avoid this entirely, but perhaps there is an opportunity for GHC to improve these cases in the future, or perhaps we can tweak the implementation so that we get better performance.
The array concat operation can be simply implemented as follows:
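Something along these lines; a minimal sketch assuming streamly's `Streamly.Prelude` and `Streamly.Memory.Array` modules (the array module has been moved and renamed across streamly versions):

```haskell
import Foreign.Storable (Storable)
import Streamly (SerialT)
import qualified Streamly.Prelude as S
import qualified Streamly.Memory.Array as A

-- Concatenate a stream of arrays into a stream of their elements by
-- turning each array into a stream (A.read) and flattening with concatMap.
concatArrays :: (Monad m, Storable a) => SerialT m (A.Array a) -> SerialT m a
concatArrays = S.concatMap A.read
```

The appeal of this version is that it is just a composition of existing combinators; the problem described next is that it does not fuse.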
However, `concatMap` cannot fuse `A.read`, so we implemented a custom `flattenArrays` that implements array concat directly instead of going through `concatMap`. This performs pretty well, but we lose composability. To solve the composability problem we introduced an `Unfold` to abstract the stream generation (`A.read`) in a way that fuses well, because the stream state is now visible to the compiler and can be used in optimizations. A sketch of the idea is given below.
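For reference, here is a minimal, self-contained sketch of the idea. The real `Unfold` lives in streamly's internals (with an extra `Skip` constructor and different combinator names), so the types and the `unfoldEach` combinator below are simplified illustrations, not streamly's actual API:

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- One step of a stream generator: yield an element and a new state, or stop.
data Step s a = Yield a s | Stop

-- An Unfold packages a state injection function and a step function.
-- Because the state type is existential and both functions are plain
-- non-recursive values, GHC can inline them into the consumer's loop.
data Unfold m a b = forall s. Unfold (s -> m (Step s b)) (a -> m s)

-- A direct-style stream with explicit state (in the spirit of streamly's
-- internal StreamD representation).
data Stream m a = forall s. Stream (s -> m (Step s a)) s

-- Hypothetical combinator in the spirit of flattenArrays: for every
-- element of the outer stream, seed the unfold and drain it, splicing
-- all inner elements into one output stream. Only the outer and inner
-- states are carried around, and both are visible to the compiler.
unfoldEach :: Monad m => Unfold m a b -> Stream m a -> Stream m b
unfoldEach (Unfold istep inject) (Stream ostep ost0) =
    Stream step (Left ost0)
  where
    -- Left o: between inner streams, o is the outer state.
    -- Right (s, o): inside an inner stream, s is the inner state.
    step (Left o) = do
        r <- ostep o
        case r of
            Stop       -> return Stop
            Yield a o' -> do
                s <- inject a
                step (Right (s, o'))
    step (Right (s, o)) = do
        r <- istep s
        case r of
            Stop       -> step (Left o)
            Yield b s' -> return (Yield b (Right (s', o)))
```

The point is that, unlike `concatMap`, whose inner stream is an opaque value, the inner state here is an ordinary constructor field that GHC can case-analyse and unbox inside the consumer's loop.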
This fuses completely, just like the custom implementation, and performance is the same except in one case, the `linecount` benchmark, where the `Unfold` version is 50% slower, even though its core seems to be shorter. One possible reason may be that in the unfold case we are passing one more local variable around the loop, which may produce less optimal code due to register spills. Or is it something else? We can check whether LLVM can reduce this difference. For example, the loop jump in the `flattenArrays` case can be compared with the corresponding jump in the `Unfold` case; the full core for both is attached below.
These are the core files for both cases:
concatMap.dump-simpl.linecount.flattenArrays.txt
concatMap.dump-simpl.linecount.unfold.txt