JuliaDynamics / ComplexityMeasures.jl

Estimators for probabilities, entropies, and other complexity measures derived from data in the context of nonlinear dynamics and complex systems
MIT License
53 stars 12 forks source link

Complete re-write of rectangular binnings to use ranges #246

Closed Datseris closed 1 year ago

Datseris commented 1 year ago

Our binning code was really bad when it came down to real world usage. When preparing the workshop, showing the ouputs of value histogram was alwas unintuitive. This thing we do with n_eps and nextfloat always leads to completely random and unintuitive numbers for the histogram edges. it also is very hard to get "the expected histogram" for different distribtions, hence computing the KL divergence. Furthermore, it is fundamentally inaccurate.

A much better approach is to give up trying to "hack up" some accuracy ourselves, and instead take advantage of Julia's base range system that operates using TwicePrecision to always keep range step sizes what the user expects without dealing with floating point precision. So that range 0:0.1:1 has exactly 0.1 step and exactly length of 11.

I have fully re-written the internals of rectangular binnings to utilize ranges. This has lead to many, many benefits:

  1. RectangularBinning is an intermediate struct that gets cast into a FixedRectangularBinning. This reduces a lot the code.
  2. All the hacky stuff we do with n_eps have been completely removed. They were never accurate to begin with; they just changed the histogram sizes but they were just as inaccurate. To be accurate you need double precision.
  3. FixedRectangularBinning now takes in standard julia ranges as input. One range for each dimension, with convenience constructors. This allows us to utilize Julia's internal double precision system without any hacky stuff. This also means taht the outcome space has nice, simple edges and bin widths, which is what a user would like..
  4. Binnings have a precise option: if true, they use searchsortedlast, which uses internally the double precision, to map data to correct bin according to ranges. If false they use our standard division with the bin width.

To give an example of how much of achange this does, here we go:

    x = Dataset(rand(Random.MersenneTwister(1234), 100_000, 2))
    push!(x, SVector(0, 0)) # ensure both 0 and 1 have values in, exactly.
    push!(x, SVector(1, 1))

    bin = FixedRectangularBinning(0:0.1:1, 2)
    est = ValueHistogram(bin)
    out = outcome_space(est, x)
# equivalent with 
outcome_space(bin)
julia> out
10×10 Matrix{SVector{2, Float64}}:
 [0.0, 0.0]  [0.0, 0.1]  …  [0.0, 0.9]
 [0.1, 0.0]  [0.1, 0.1]     [0.1, 0.9]
 [0.2, 0.0]  [0.2, 0.1]     [0.2, 0.9]
 [0.3, 0.0]  [0.3, 0.1]     [0.3, 0.9]
 [0.4, 0.0]  [0.4, 0.1]     [0.4, 0.9]
 [0.5, 0.0]  [0.5, 0.1]  …  [0.5, 0.9]
 [0.6, 0.0]  [0.6, 0.1]     [0.6, 0.9]
 [0.7, 0.0]  [0.7, 0.1]     [0.7, 0.9]
 [0.8, 0.0]  [0.8, 0.1]     [0.8, 0.9]
 [0.9, 0.0]  [0.9, 0.1]     [0.9, 0.9]
kahaaga commented 1 year ago

@Datseris This seems reasonable, but I think it breaks upstream code at the moment.

To get the joint histograms for multi-argument functions I simply do (with the old code)

function encode_as_tuple(e::RectangularBinEncoding, point)
    (; mini, edgelengths) = e
    # Map a data point to its bin edge (plus one because indexing starts from 1)
    bin = floor.(Int, (point .- mini) ./ edgelengths) .+ 1
    return bin # returns
end

For a D-dimensional point, this returns a D-dimensional tuple of integers (one for each dimension, indicating which bin along that dimension the coordinate falls in). How would I do that with your proposal?

Datseris commented 1 year ago

from source code of encode

    if e.precise
        # Don't know how to make this faster unfurtunately...
        cartidx = CartesianIndex(map(searchsortedlast, ranges, Tuple(point)))
    else
        bin = floor.(Int, (point .- e.mini) ./ e.widths) .+ 1
        cartidx = CartesianIndex(Tuple(bin))
    end
Datseris commented 1 year ago

I'll extract this into a function cartesian_bin_index that is called by encode so that you can use that function downstream, ok?

kahaaga commented 1 year ago

I'll extract this into a function cartesian_bin_index that is called by encode so that you can use that function downstream, ok?

Excellent.

Datseris commented 1 year ago

Fixing the tests of the Transfer Operator is very hard. I am getting

 ArgumentError: Cannot decode integer -1: out of bounds of underlying binning.
Datseris commented 1 year ago

There is just so much in this source code that isn't used, makes it so hard to read the source code. In this block


            # Count how many points jump from the i-th bin to each of
            # the unique target bins, and use that to calculate the transition
            # probability from bᵢ to bⱼ.
            for (j, bᵤ) in enumerate(unique(target_bins))
                n_transitions_i_to_j = sum(target_bins .== bᵤ)

                push!(I, i)
                push!(J, bᵤ)
                push!(P, n_transitions_i_to_j / n_visitsᵢ)
            end

j is not used anywhere. Interestingly, some variables have j in their name, and the usage of capital J also confuses.

kahaaga commented 1 year ago

Fixing the tests of the Transfer Operator is very hard.

I'll have a look. Tag me when you're done changing things, so we don't do overlapping work

kahaaga commented 1 year ago

There is just so much in this source code that isn't used, makes it so hard to read the source code. In this block

Yes, I know. This code is ancient and is a direct rewrite of some messy matlab code from back in the days. As we talked about before, it will be fixed as part of #55.

But the issue shouldn't be in the loop. If the bins are computed correctly and has the expected format before the loops, then the transfer operator approximation should be correct.

Datseris commented 1 year ago

I found the issue. Something is fishy is going on with the encodings

@testset "All points covered" begin
    # Ensure that given a `RectangularBinning` no point is in invalid bin

    x = Dataset(rand(100, 2))

    binnings = [
        RectangularBinning(3),
        RectangularBinning(0.2),
        RectangularBinning([2, 3]),
        RectangularBinning([0.2, 0.3]),
    ]

    for bin in binnings
        rbe = RectangularBinEncoding(bin, x)
        visited_bins = map(pᵢ -> encode(rbe, pᵢ), x)
        @test -1 ∉ visited_bins
    end
end

This errors. I'll fix this now. Or at least I'll try.

Datseris commented 1 year ago

Well, to be precise, this is also a problem in the Tranfer Operator code. If you allow for FixedRectangularBinning to be given, you must be able to deal with points given the encoding -1, because that's something the fixed binnings support.

kahaaga commented 1 year ago

Well, to be precise, this is also a problem in the Tranfer Operator code. If you allow for FixedRectangularBinning to be given, you must be able to deal with points given the encoding -1, because that's something the fixed binnings support.

The transfer operator is approximated by how a locally linear map transforms points. An implicit assumption here is that the points are supported on the grid on which the approximation is made. It should be fine to just drop any point where one or more components are encoded as -1. We can just add a filter to the line visited_bins = map(pᵢ -> encode(encoder, pᵢ), pts), were any pᵢ that has a -1 as a component is simply dropped.

I've always made sure that the binning used covers all the point a priori, so this hadn't crossed my mind before. My mistake.

Datseris commented 1 year ago

this should be done in a different pr.

for now I found the obvious problem. When makign the range range(min, max; step = x) it is not guaranteed that max is within the range, something that RectangularBinning promises. I fix it now.