fslaborg / Deedle

Easy to use .NET library for data and time series manipulation and for scientific programming
http://fslab.org/Deedle/
BSD 2-Clause "Simplified" License

Frame.mapColValues is weirdly slow compared to mapping columns as series and joining with Frame.ofColumns #539

Open vijoc opened 2 years ago

vijoc commented 2 years ago

I ran into an issue where applying Frame.mapColValues to a frame with tens of thousands of rows and a handful (<10) of columns is very slow. By contrast, mapping the series "manually" first (see below) and then joining them with Frame.ofColumns is orders of magnitude faster.

What I'm looking to do is a naive hourly averaging of time series data. I implemented it with essentially the following:

open System
open Deedle

// A simple series of 20 000 observations at one-minute intervals
let startFrom = DateTimeOffset.Parse "2021-10-27T00:00:00Z"
let series =
    Seq.init 20000 (fun idx -> startFrom.AddMinutes (float idx), float idx)
    |> Series.ofObservations

// Two columns with the same series from above
let columns = seq { "one", series; "two", series }
let frame = Frame.ofColumns columns

// Comparison of two timestamps to check if the hour is the same
let isSameHour (d1: DateTimeOffset) (d2: DateTimeOffset) =
    d1.Hour = d2.Hour && d1.Day = d2.Day && d1.Month = d2.Month && d1.Year = d2.Year

// Three methods to convert to a new frame with hourly averages
// 1. Using Frame.mapColValues, takes over a second
frame |> Frame.mapColValues (Series.chunkWhileInto isSameHour Stats.mean)

// 2. An approximation of the internals of Frame.mapColValues, takes the same time (over a second):
frame.Columns
    |> Series.mapValues (Series.chunkWhileInto isSameHour Stats.mean)
    |> Frame.ofColumns

// 3. Sidestepping the initial frame, this takes 10-20 *milli*seconds:
columns
    |> Seq.map (fun (k, s) -> k, s |> Series.chunkWhileInto isSameHour Stats.mean)
    |> Frame.ofColumns

It may well be that I'm overlooking something here; I'm not very confident with either the Deedle codebase or performance diagnosis in F#. I do have a BenchmarkDotNet setup, which I could extract and share if that would be helpful.

Is this kind of performance expected? I believe I can avoid the issue in my use case by using method 3 above, but I'm struggling to understand what could cause such a large performance difference.
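
For anyone who wants to reproduce the rough timings above without a full BenchmarkDotNet setup, a plain stopwatch helper along the following lines may be enough. This is only a sketch: `timeMs` is a hypothetical helper name, not part of Deedle, and a single warm-up call plus a simple average is far less rigorous than a real benchmark.

```fsharp
open System.Diagnostics

/// Hypothetical helper (not part of Deedle): calls `f` once as a warm-up,
/// then `runs` more times, and returns the average elapsed milliseconds.
let timeMs (runs: int) (f: unit -> 'a) : float =
    // The warm-up call keeps JIT compilation out of the measurement
    f () |> ignore
    let sw = Stopwatch.StartNew()
    for _i in 1 .. runs do
        f () |> ignore
    sw.Stop()
    sw.Elapsed.TotalMilliseconds / float runs

// Usage with the definitions from the post (requires Deedle, so left commented out):
// let viaMapColValues =
//     timeMs 5 (fun () ->
//         frame |> Frame.mapColValues (Series.chunkWhileInto isSameHour Stats.mean))
// let viaColumns =
//     timeMs 5 (fun () ->
//         columns
//         |> Seq.map (fun (k, s) -> k, s |> Series.chunkWhileInto isSameHour Stats.mean)
//         |> Frame.ofColumns)
// printfn "mapColValues: %.1f ms, columns: %.1f ms" viaMapColValues viaColumns
```

Wrapping each approach in a `unit -> 'a` function keeps the helper independent of Deedle, so the same call can time all three methods.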

vijoc commented 2 years ago

For what it's worth, I did some more testing and found that the bad performance can also be avoided by using Frame.getNumericCols or even simply Frame.getCols.

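// Note: inlineComparison is not defined in this post; presumably it is a comparison like isSameHour above
// About the same performance as approach number 3 from above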
frame
    |> Frame.getNumericCols
    |> Series.mapValues (Series.chunkWhileInto inlineComparison Stats.mean)
    |> Frame.ofColumns

// Slightly slower, but still around 50 milliseconds, versus ~10+ milliseconds for the above
// or ~1+ seconds for Frame.mapColValues
frame
    |> Frame.getCols
    |> Series.mapValues (Series.chunkWhileInto inlineComparison Stats.mean)
    |> Frame.ofColumns