acowley / Frames

Data frames for tabular data.
Other
298 stars 41 forks source link

Performance of the Monoid instance #187

Open adamConnerSax opened 6 months ago

adamConnerSax commented 6 months ago

I had a performance issue that I tracked down to the monoid instance. I had many frames that I wanted to merge vertically. Turned out to be substantially faster to take each frame, make it a list of records, concat those lists and then do toFrame.

I wonder if it's worth testing in what scenarios each of these ways of vertical concatenation is more performant and then providing functions for both or just switching entirely to something like what I described above?

adamConnerSax commented 5 months ago

I did a bit of benchmarking. The test is to load a frame with N rows, make a list with M copies, make that into one frame via mconcat or toFrame . fmap toList and then do a fold to sum over one column (insuring the final frame is actually used), then test for various N and M. TLDR: Things scale linearly in frame size but mconcat scales almost quadratically in number of frame copies whereas toFrame remains linear. The monoid version is quite a bit faster for small numbers of copies--I'd guess because there is no copying of data?--and the crossover point in my experiment was combining about 500 frames. so the ideal function might be something like

frameConcat :: (RecVec rs, Foldable f, Functor f) => f (FrameRec rs) -> FrameRec rs
frameConcat x = if length x < 500
                then mconcat $ toList x
                else Frames.toFrame . concat $ fmap toList x

Combining so many frames might not seem like a common use case but for me it has been, resulting from map/reduce type operations where the input is a frame and I then process each of many subsets (census tracts, say) in some way which results in a new frame which I then want to recombine into one large frame. It's possible that the right thing is to accumulate the intermediate results as lists of records rather than frames. But I still think Frames should support the large-scale concat.

I'd be happy to submit a PR (that function plus some instructive documentation!) if it would be accepted. If you would accept it, what module would you like that to go in? Maybe InCore?