beacon-biosignals / DataFrameIntervals.jl

Utilities for working with DataFrames of `Intervals.jl` or `TimeSpans.jl` objects.
MIT License
6 stars 1 forks source link

poor performance relative to FlexiJoins in integer case #24

Open ericphanson opened 1 year ago

ericphanson commented 1 year ago

I have the task of matching individual sample points (integers) into AlignedSpan sample regions. I have found that if I map to raw Ints and UnitRange{Int}s, I get much better performance with FlexiJoins than using Interval{Int, Closed, Closed} with DataFrameIntervals. This problem might be out-of-scope for DataFrameIntervals, but maybe it also indicates a place where perf could be improved.

For my actual task, I am having trouble using FlexiJoins due to compat issues (Flux v0.12 needs ArrayInterface 5, while FlexiJoins 0.1.23 is the latest release, and needs Static 0.7-something, which needs ArrayInterface v6... it's a whole mess). So if DataFrameIntervals could solve this more performantly without needing extra dependencies that introduce compat issues, that would be wonderful. For now I am using DataFrameIntervals despite the perf shortfall to avoid the compat problem.

Here is my MWE.

Versions:

FlexiJoins v0.1.23
Intervals v1.8.0
DataFrameIntervals v0.1.0

on Julia 1.8.2 (ubuntu).

Problem:

using DataFrames, Intervals, StableRNGs
using DataFrameIntervals, FlexiJoins

N_INTERVALS = 1000
N_POINTS = 10^5

function make_example(n_intervals, n_points)
    rng = StableRNG(123)
    points = [rand(rng, 1:10_000) for _ in 1:n_points]
    L = DataFrame(:point => points)
    transform!(L, :point => ByRow(p -> Interval{Int,Closed,Closed}(p, p)) => :interval)

    indices = map(1:n_intervals) do _
        i = rand(rng, 1:10_000)
        return i:i+100
    end
    R = DataFrame(:indices => indices)
    transform!(R, :indices => ByRow(I -> Interval{Int,Closed,Closed}(first(I), last(I))) => :interval)
    return L, R
end

L, R = make_example(N_INTERVALS, N_POINTS)
@time FlexiJoins.innerjoin((L, R), by_pred(:point, ∈, :indices))
@time interval_join(L, R; on=:interval)

The FlexiJoin join takes 0.14 seconds, while the interval_join takes 22s, 157x slower.

I originally had N_INTERVALS = 3000 and N_POINTS = 10^6, for which FlexiJoins took 4-5s, and DataFrameIntervals was still running after 15 mins when I cancelled and reduced those values to make the MWE more minimal.

aplavin commented 1 year ago

I've cleaned up FlexiJoins dependencies last month, and now came to registering the updated version in General. In particular, it no longer depends on Static - should help with enviroments like yours that require an older version of it.

haberdashPI commented 1 year ago

Just as an update as to the status of this issue: @ericphanson has a PR in Intervals.jl that improves the performance of find_intersections (which determines performance here) that will support the AlignedSpans case (or an Interval{Closed,Closed} type). I'll then port that to support all interval types, which should improve performance, in general, for DataFrameIntervals.

haberdashPI commented 1 year ago

This is addressed by https://github.com/JuliaRegistries/General/pull/85622

aplavin commented 1 year ago

Just tried the same benchmark with the updated Intervals version - it did become somewhat faster. Results as-is: FlexiJoins 0.140287 seconds (23.20 k allocations: 229.607 MiB, 7.97% gc time) DataFrameIntervals 1.339070 seconds (4.07 M allocations: 7.914 GiB, 4.81% gc time)

Also note that FlexiJoins is used suboptimally here:

Using Interval (from IntervalSets) instead of range, and StructArray instead of DataFrame, I get FlexiJoins (native) 0.056713 seconds (80 allocations: 21.091 MiB, 47.42% gc time)

For larger N_INTERVALS = 3000, N_POINTS = 10^6: FlexiJoins (asis) 10.779393 seconds (93.14 k allocations: 6.174 GiB, 1.67% gc time) FlexiJoins (native) 1.068480 seconds (92 allocations: 658.505 MiB, 0.42% gc time) DataFrameIntervals 79.301766 seconds (120.87 M allocations: 180.618 GiB, 8.43% gc time)

Status `/tmp/jl_WWEqxB/Project.toml`
  [33b79e07] DataFrameIntervals v0.2.0 `https://github.com/beacon-biosignals/DataFrameIntervals.jl#main`
  [a93c6f00] DataFrames v1.5.0
  [e37f2e79] FlexiJoins v0.1.30
  [8197267c] IntervalSets v0.7.4
  [d8418881] Intervals v1.10.0
  [860ef19b] StableRNGs v1.0.0
  [44cfe95a] Pkg v1.9.0
ericphanson commented 1 year ago

I've realized FlexiJoins doesn't have CI setup (it has github actions script on gitlab which doesn't do anything), so it doesn't fit Beacon's dependency requirements. It also doesn't support the Tables.jl interface which is kind of a non-starter for a tabular join, and its DataFrames support doesn't participate in semver. So I don't think it's a viable alternative to DataFrameIntervals, however it does show these operations could be a lot faster still.

aplavin commented 1 year ago

FlexiJoins doesn't have CI setup

It does: the primary repo is (private) on github, gitlab is just the public mirror (motivation 1 2). It's no coincidence that there's a github action script in the repo :)

its DataFrames support doesn't participate in semver

DF support is pretty "naive" for now, definitely suboptimal as seen even from the benchmarks above. Also see https://gitlab.com/aplavin/FlexiJoins.jl/-/issues/2. These are the major reasons why this functionality is considered experimental.

I'm not really familiar with DFs myself, so any help improving the integration is welcome. Then this part can also become semver-stable functionality.

It also doesn't support the Tables.jl interface

Indeed, the priority of FlexiJoins is to support the Base Julia interface - collections. Many (most) in-memory tables also follow this interface. Any help with Tables.jl support would also be appreciated, but I don't have such plans myself.

haberdashPI commented 1 year ago

Thanks @aplavin for running the benchmark. Super helpful to know!

I perhaps should have clarified when I closed the issue, that I simply meant the more optimized find_intersections method had been implemented. Perhaps I shouldn't have closed.