JuliaCollections / DataStructures.jl

Julia implementation of Data structures
https://juliacollections.github.io/DataStructures.jl/latest/
MIT License
694 stars 247 forks source link

API of SortedContainer #918

Open Wu-Chenyang opened 6 days ago

Wu-Chenyang commented 6 days ago

In my application, I find it's useful to define the following APIs for SortedContainer.

Base.checkbounds(::Type{Bool}, container::SortedContainer, token::DataStructures.IntSemiToken) =
    token.address >= 3 && (token.address in container.bt.useddatacells)
Base.checkbounds(container::SortedContainer, token::DataStructures.IntSemiToken) = (container, token) |> DataStructures.has_data
Base.nextind(container::SortedContainer, token::DataStructures.IntSemiToken) = (container, token) |> advance
Base.prevind(container::SortedContainer, token::DataStructures.IntSemiToken) = (container, token) |> regress

Base.@propagate_inbounds function Base.getindex(m::SortedDict,
                              i::DataStructures.IntSemiToken)
           @boundscheck DataStructures.has_data((m,i))
           return (m, i) |> deref
end
Base.@propagate_inbounds function Base.getindex(m::SortedMultiDict,
                              i::DataStructures.IntSemiToken)
           @boundscheck DataStructures.has_data((m,i))
           return (m, i) |> deref
end
Base.@propagate_inbounds function Base.getindex(m::SortedSet,
                              i::DataStructures.IntSemiToken)
           @boundscheck DataStructures.has_data((m,i))
           return (m, i) |> deref
end
# Base.@propagate_inbounds function Base.getindex(m::SortedContainer, token::DataStructures.IntSemiToken)
#     @boundscheck DataStructures.has_data((m,token))
#     return (container, token) |> deref
# end

The first three functions checkbounds, nextind, prevind are straightforward and align well with existing APIs. The getindex functions for IntSemiToken have been implemented for SortedDict and SortedMultiDict, which return only the values. I modified them to return key-value pairs, because it is natural to do so in my application. Besides, given the existence of another getindex function that retrieves the value using the key, I think retrieving key-value pairs using IntSemiToken seems a natural choice. Of course, it might not be a good choice in general. How do you think about these changes?

Here is a piece of my application code for motivating the changes. It uses the general APIs for sorted containers to simultaneously handle sorted AbstractVector and SortedDict.


struct Observation{I, O}
    index::I
    obs::O
    Observation{I,O}(index::I, obs::O) where {I, O} = new{I,O}(index, obs)
end
Observation(index::I, obs::O) where {I, O} = Observation{I,O}(index, obs)
Base.lt(::Base.Order.ForwardOrdering, a::Observation, b::Observation) = a.index < b.index
Base.lt(::Base.Order.ForwardOrdering, a::Real, b::Observation) = a < b.index
Base.lt(::Base.Order.ForwardOrdering, a::Observation, b::Real) = a.index < b
flatten(obs::Observation) = (obs.index, obs.obs)
flatten(pair::Pair) = (pair.first, pair.second)

const ObservationContainer{I,O} = Union{AbstractVector{<:Observation{I,O}}, SortedDict{I,O}}

function (w::AbstractWienerProcess)(t::Real, condition::ObservationContainer)
    w₀, Σ = params(w)

    # `IntSemiToken` for `SortedDict`, `Int` for `AbstractVector`
    idx = searchsortedlast(condition, t)
    if checkbounds(Bool, condition, idx)
        t₁, a = flatten(condition[idx])
    else
        t₁ = zero(t)
        a = w₀
    end

    next_idx = nextind(condition, idx)
    if checkbounds(Bool, condition, next_idx)
        t₂, b = flatten(condition[next_idx])
        μ = a + (t - t₁) / (t₂ - t₁) * (b - a)
        cov_t = ((t₂ - t) * (t - t₁) / (t₂ - t₁))
        construct_normal(μ, Σ, cov_t)
    else
        construct_normal(a, Σ, t - t₁)
    end
end

I also tried to use the SortedSet as the container in this code block but encountered a problem that searchsortedlast for SortedSet tries to convert the t::Real to Observation before comparing it to other observations in the container. I think this compulsory conversion might not be necessary. In some cases, it is natural to directly compare objects of different types. The above code is an example.

Wu-Chenyang commented 6 days ago

If you are interested in these changes, I can help with submitting a pull request.

StephenVavasis commented 5 days ago

I would recommend against changing the API for getindex. Although I see the logic of the request, this API has been in place for 10 years, and changing it now could break many existing codes.

As for loosening the restriction in the API of searchsortedfirst and its relatives: I can see that it is sensible for your use-case. I also see the pitfall that some future user could be misled into thinking that the sorted containers allow for two incompatible orderings simultaneously. So I don't have a strong opinion either way.

Wu-Chenyang commented 5 days ago

I see. Indeed, the changes in getindex and searchsortedfirst are not necessary even in my use case. I think your prudent decision is correct. What about the addition of nextind, prevind, and checkbounds?

StephenVavasis commented 4 days ago

Yes, it would make sense to implement nextind and prevind and checkbounds. Furthermore, it would probably make sense to deprecate advance and regress in favor of nextind and prevind from Base, since a near-term goal for the package is to make the API more consistent with Base. We wouldn't want to do this overnight since it would break existing codes, but it could be phased in over a few release cycles.