Open ericphanson opened 1 year ago
I've cleaned up FlexiJoins dependencies last month, and now came to registering the updated version in General
. In particular, it no longer depends on Static
- should help with enviroments like yours that require an older version of it.
Just as an update as to the status of this issue: @ericphanson has a PR in Intervals.jl
that improves the performance of find_intersections
(which determines performance here) that will support the AlignedSpans case (or an Interval{Closed,Closed}
type). I'll then port that to support all interval types, which should improve performance, in general, for DataFrameIntervals
.
This is addressed by https://github.com/JuliaRegistries/General/pull/85622
Just tried the same benchmark with the updated Intervals version - it did become somewhat faster. Results as-is: FlexiJoins 0.140287 seconds (23.20 k allocations: 229.607 MiB, 7.97% gc time) DataFrameIntervals 1.339070 seconds (4.07 M allocations: 7.914 GiB, 4.81% gc time)
Also note that FlexiJoins is used suboptimally here:
Using Interval
(from IntervalSets
) instead of range
, and StructArray
instead of DataFrame
, I get
FlexiJoins (native) 0.056713 seconds (80 allocations: 21.091 MiB, 47.42% gc time)
For larger N_INTERVALS = 3000, N_POINTS = 10^6: FlexiJoins (asis) 10.779393 seconds (93.14 k allocations: 6.174 GiB, 1.67% gc time) FlexiJoins (native) 1.068480 seconds (92 allocations: 658.505 MiB, 0.42% gc time) DataFrameIntervals 79.301766 seconds (120.87 M allocations: 180.618 GiB, 8.43% gc time)
Status `/tmp/jl_WWEqxB/Project.toml`
[33b79e07] DataFrameIntervals v0.2.0 `https://github.com/beacon-biosignals/DataFrameIntervals.jl#main`
[a93c6f00] DataFrames v1.5.0
[e37f2e79] FlexiJoins v0.1.30
[8197267c] IntervalSets v0.7.4
[d8418881] Intervals v1.10.0
[860ef19b] StableRNGs v1.0.0
[44cfe95a] Pkg v1.9.0
I've realized FlexiJoins doesn't have CI setup (it has github actions script on gitlab which doesn't do anything), so it doesn't fit Beacon's dependency requirements. It also doesn't support the Tables.jl interface which is kind of a non-starter for a tabular join, and its DataFrames support doesn't participate in semver. So I don't think it's a viable alternative to DataFrameIntervals, however it does show these operations could be a lot faster still.
FlexiJoins doesn't have CI setup
It does: the primary repo is (private) on github, gitlab is just the public mirror (motivation 1 2). It's no coincidence that there's a github action script in the repo :)
its DataFrames support doesn't participate in semver
DF support is pretty "naive" for now, definitely suboptimal as seen even from the benchmarks above. Also see https://gitlab.com/aplavin/FlexiJoins.jl/-/issues/2. These are the major reasons why this functionality is considered experimental.
I'm not really familiar with DFs myself, so any help improving the integration is welcome. Then this part can also become semver-stable functionality.
It also doesn't support the Tables.jl interface
Indeed, the priority of FlexiJoins is to support the Base Julia interface - collections. Many (most) in-memory tables also follow this interface. Any help with Tables.jl support would also be appreciated, but I don't have such plans myself.
Thanks @aplavin for running the benchmark. Super helpful to know!
I perhaps should have clarified when I closed the issue, that I simply meant the more optimized find_intersections
method had been implemented. Perhaps I shouldn't have closed.
I have the task of matching individual sample points (integers) into AlignedSpan sample regions. I have found that if I map to raw
Int
s andUnitRange{Int}
s, I get much better performance with FlexiJoins than usingInterval{Int, Closed, Closed}
with DataFrameIntervals. This problem might be out-of-scope for DataFrameIntervals, but maybe it also indicates a place where perf could be improved.For my actual task, I am having trouble using FlexiJoins due to compat issues (Flux v0.12 needs ArrayInterface 5, while FlexiJoins 0.1.23 is the latest release, and needs Static 0.7-something, which needs ArrayInterface v6... it's a whole mess). So if DataFrameIntervals could solve this more performantly without needing extra dependencies that introduce compat issues, that would be wonderful. For now I am using DataFrameIntervals despite the perf shortfall to avoid the compat problem.
Here is my MWE.
Versions:
on Julia 1.8.2 (ubuntu).
Problem:
The FlexiJoin join takes 0.14 seconds, while the
interval_join
takes 22s, 157x slower.I originally had
N_INTERVALS = 3000
andN_POINTS = 10^6
, for which FlexiJoins took 4-5s, and DataFrameIntervals was still running after 15 mins when I cancelled and reduced those values to make the MWE more minimal.