JuliaCollections / DataStructures.jl

Julia implementation of Data structures
https://juliacollections.github.io/DataStructures.jl/latest/
MIT License
692 stars 246 forks source link

DisjointSets: template for extension? #911

Closed jacob-roth closed 1 month ago

jacob-roth commented 1 month ago

I have an application where I need to extend the functionality of IntDisjointSets. In one use case, I need to handle the following:

  1. track the min and max element of each set
  2. track the average of elements in each set

My first thought was to allocate additional arrays for min, max, count, and avg and then modify root_union! in the following way, based on 298:

function root_union!(s::IntDisjointSet{T}, x::T, y::T) where {T<:Integer}
    parents = s.parents
    rks = s.ranks
    # start modifications
    count = s.count
    mn = s.min
    mx = s.max
    avg = s.avg
    # end modifications
    @inbounds xrank = rks[x]
    @inbounds yrank = rks[y]

    if xrank < yrank
        x, y = y, x
    elseif xrank == yrank
        rks[x] += one(T)
    end
    @inbounds parents[y] = x

    # start modifications
    cx = count[x]
    cy = count[y]
    cxy = cx + cy
    @inbounds mn[x] = min(mn[x], mn[y])
    @inbounds mx[x] = max(mx[x], mx[y])
    @inbounds avg[x] = cx/cxy * avg[x] + cy/cxy * avg[y]
    @inbounds count[x] = cxy
    # end modifications

    s.ngroups -= one(T)
    return x
end

I was wondering (1) if there might be any optimizations to this code and (2) if it could be useful for having a template indicating how to best modify the functionality of disjoint sets? My actual use case is slightly more complicated than this, but if this is a "good" template for min/max/avg, then I will follow it for my use case.

eulerkochy commented 1 month ago
  1. One minor optimisation I can think of, is keeping the sum and count, and calculate average out of it. Will help to reduce precision errors.
  2. Not sure, if I fully understand what you mean by "template" in this context, but we can sort of have a merge / union function of two DisjointSubset, which can be overloaded. The base merge/union function can do the re-ranking, while nuanced functions can handle (commutative?) operations.
jacob-roth commented 1 month ago

Ah good call about sum/count for precision, thanks! In terms of template, what I meant to ask was: should I write a new root_union! function to overload? Or what do you mean about the nuanced functions?

I guess if I'm tracking size, I can do union-by-size and eliminate rank allocations and update the root_union! accordingly. Do you know if there was a reason rank was chosen over size initially (besides being based on Boost per 297)?

Also, another question I have is related to an earlier discourse post of mine: would implementing path-compression + splitting/halving be something worth considering?

Maybe I'll close this issue and put the path-compression+splitting question in another issue.

eulerkochy commented 1 month ago

Hi @jacob-roth, sorry missed your reply.

I answered your question on Discourse. Yes, go ahead for a PR implementing path-compression. Make sure to benchmark it.