Open adamConnerSax opened 6 months ago
I did a bit of benchmarking. The test is to load a frame with N rows, make a list with M copies, combine that list into one frame either via `mconcat` or via `toFrame . concat . fmap toList`, and then do a fold to sum over one column (ensuring the final frame is actually used). I ran this for various N and M.
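For concreteness, here is a minimal sketch of the kind of benchmark I mean. The column name `X`, the record-building with `&:`, and the exact helpers are illustrative assumptions rather than my actual harness:

```haskell
{-# LANGUAGE DataKinds, TypeApplications #-}
import qualified Control.Foldl as FL
import Data.Foldable (toList)
import Data.Vinyl (rgetField)
import Frames

-- A hypothetical single numeric column for the test frame.
type X = "x" :-> Double

-- Build a frame with n rows.
testFrame :: Int -> FrameRec '[X]
testFrame n = toFrame [ fromIntegral i &: RNil | i <- [1 .. n] ]

-- Sum column X, forcing the combined frame to be used.
sumX :: FrameRec '[X] -> Double
sumX = FL.fold (FL.premap (rgetField @X) FL.sum)

-- The two combining strategies under test, for m copies of an n-row frame.
viaMonoid, viaToFrame :: Int -> Int -> Double
viaMonoid  n m = sumX . mconcat $ replicate m (testFrame n)
viaToFrame n m = sumX . toFrame . concat $ toList <$> replicate m (testFrame n)
```

Timing `viaMonoid n m` against `viaToFrame n m` over a grid of N and M (with a benchmarking tool such as criterion) is the experiment described above.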
TL;DR: Things scale linearly in frame size, but `mconcat` scales almost quadratically in the number of frame copies, whereas the `toFrame` route remains linear. The monoid version is quite a bit faster for small numbers of copies (I'd guess because there is no copying of data?), and the crossover point in my experiment was at combining about 500 frames. So the ideal function might be something like:
```haskell
frameConcat :: (RecVec rs, Foldable f, Functor f) => f (FrameRec rs) -> FrameRec rs
frameConcat x =
  if length x < 500
    then mconcat $ toList x
    else Frames.toFrame . concat $ fmap toList x
```
Combining so many frames might not seem like a common use case, but for me it has been: it arises from map/reduce-style operations where the input is a frame, I process each of many subsets (census tracts, say) in some way that produces a new frame, and I then want to recombine those into one large frame. It's possible that the right thing is to accumulate the intermediate results as lists of records rather than frames. But I still think Frames should support the large-scale concat.
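To illustrate the map/reduce shape I mean, here is a sketch; `process` stands in for whatever per-subset transformation is being run, and the names are hypothetical:

```haskell
import Data.Foldable (toList)
import Frames (FrameRec, toFrame)
import Frames.InCore (RecVec)

-- Hypothetical: run a per-subset transformation, accumulate the results as
-- lists of records, and build a single frame only once at the end, rather
-- than mconcat-ing hundreds of intermediate frames.
processAll :: (RecVec rs, Functor f, Foldable f)
           => (FrameRec as -> FrameRec rs)  -- per-subset processing step
           -> f (FrameRec as)               -- the subsets (e.g. census tracts)
           -> FrameRec rs
processAll process subsets = toFrame . concat $ toList . process <$> subsets
```

The point is that only one frame allocation happens at the end, which is why this stays linear in the number of subsets.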
I'd be happy to submit a PR (that function plus some instructive documentation!) if it would be accepted. If so, what module would you like it to go in? Maybe InCore?
I had a performance issue that I tracked down to the monoid instance. I had many frames that I wanted to merge vertically. It turned out to be substantially faster to turn each frame into a list of records, `concat` those lists, and then do `toFrame`. I wonder if it's worth testing in which scenarios each of these ways of vertical concatenation is more performant, and then either providing functions for both or just switching entirely to something like what I described above?
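Concretely, the faster route is just this one-liner (a sketch; `f` is any Foldable/Functor container of frames):

```haskell
import Data.Foldable (toList)
import Frames (FrameRec, toFrame)
import Frames.InCore (RecVec)

-- Flatten every frame to a list of records, concatenate once,
-- and rebuild a single frame at the end.
concatViaLists :: (RecVec rs, Foldable f, Functor f) => f (FrameRec rs) -> FrameRec rs
concatViaLists = toFrame . concat . fmap toList
```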