Closed Datseris closed 1 year ago
@Datseris This seems reasonable, but I think it breaks upstream code at the moment.
To get the joint histograms for multi-argument functions, I simply do this (with the old code):
function encode_as_tuple(e::RectangularBinEncoding, point)
    (; mini, edgelengths) = e
    # Map a data point to its bin index along each dimension
    # (plus one because indexing starts from 1)
    bin = floor.(Int, (point .- mini) ./ edgelengths) .+ 1
    return bin
end
For a `D`-dimensional point, this returns a `D`-dimensional tuple of integers (one for each dimension, indicating which bin along that dimension the coordinate falls in). How would I do that with your proposal?
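For concreteness, the old behavior can be sketched standalone; here a plain NamedTuple stands in for the `RectangularBinEncoding` struct, since the function only reads its `mini` and `edgelengths` fields:

```julia
# Standalone sketch of the old tuple-returning encoding.
# A NamedTuple mimics RectangularBinEncoding's `mini`/`edgelengths` fields.
encode_as_tuple(e, point) = floor.(Int, (point .- e.mini) ./ e.edgelengths) .+ 1

e = (mini = (0.0, 0.0), edgelengths = (0.25, 0.5))
encode_as_tuple(e, (0.3, 0.7))  # → (2, 2): bin 2 along each dimension
```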
From the source code of `encode`:
if e.precise
    # Don't know how to make this faster unfortunately...
    cartidx = CartesianIndex(map(searchsortedlast, ranges, Tuple(point)))
else
    bin = floor.(Int, (point .- e.mini) ./ e.widths) .+ 1
    cartidx = CartesianIndex(Tuple(bin))
end
I'll extract this into a function `cartesian_bin_index` that is called by `encode`, so that you can use that function downstream, ok?
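A hypothetical sketch of what such an extracted helper might look like (the actual name, signature, and struct fields in the package may differ; here a NamedTuple with `precise`, `ranges`, `mini`, and `widths` fields stands in for the encoding):

```julia
# Hypothetical extraction of the branch from `encode` quoted above.
function cartesian_bin_index(e, point)
    if e.precise
        # Double-precision-aware path: locate the bin via searchsortedlast
        return CartesianIndex(map(searchsortedlast, e.ranges, Tuple(point)))
    else
        # Fast path: plain division by the bin widths
        bin = floor.(Int, (point .- e.mini) ./ e.widths) .+ 1
        return CartesianIndex(Tuple(bin))
    end
end

e = (precise = true, ranges = (0.0:0.25:1.0, 0.0:0.5:1.0),
     mini = (0.0, 0.0), widths = (0.25, 0.5))
cartesian_bin_index(e, (0.3, 0.7))  # → CartesianIndex(2, 2)
```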
I'll extract this into a function cartesian_bin_index that is called by encode so that you can use that function downstream, ok?
Excellent.
Fixing the tests of the Transfer Operator is very hard. I am getting:
ArgumentError: Cannot decode integer -1: out of bounds of underlying binning.
There is just so much in this source code that isn't used, which makes the source code very hard to read. In this block:
# Count how many points jump from the i-th bin to each of
# the unique target bins, and use that to calculate the transition
# probability from bᵢ to bⱼ.
for (j, bᵤ) in enumerate(unique(target_bins))
    n_transitions_i_to_j = sum(target_bins .== bᵤ)
    push!(I, i)
    push!(J, bᵤ)
    push!(P, n_transitions_i_to_j / n_visitsᵢ)
end
`j` is not used anywhere. Interestingly, some variables have `j` in their name, and the use of the capital `J` also confuses matters.
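As a side note, since `j` is unused, the counting idiom could be written without `enumerate` at all. A minimal self-contained sketch (with made-up bin data) of what the loop computes:

```julia
# Made-up example data: target bins of the points that leave bin i = 1
target_bins = [2, 3, 2, 2]
n_visitsᵢ = length(target_bins)
i = 1
I, J, P = Int[], Int[], Float64[]
for bᵤ in unique(target_bins)   # no enumerate needed; the index was unused
    push!(I, i)
    push!(J, bᵤ)
    # Fraction of points from bin i that land in bin bᵤ
    push!(P, count(==(bᵤ), target_bins) / n_visitsᵢ)
end
P  # → [0.75, 0.25]; the triplets (I, J, P) can later feed sparse(I, J, P)
```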
> Fixing the tests of the Transfer Operator is very hard.
I'll have a look. Tag me when you're done changing things, so we don't do overlapping work.
> There is just so much in this source code that isn't used, which makes the source code very hard to read.
Yes, I know. This code is ancient and is a direct rewrite of some messy MATLAB code from back in the day. As we talked about before, it will be fixed as part of #55.
But the issue shouldn't be in the loop. If the bins are computed correctly and have the expected format before the loops, then the transfer operator approximation should be correct.
I found the issue. Something fishy is going on with the encodings:
@testset "All points covered" begin
    # Ensure that given a `RectangularBinning` no point is in an invalid bin
    x = Dataset(rand(100, 2))
    binnings = [
        RectangularBinning(3),
        RectangularBinning(0.2),
        RectangularBinning([2, 3]),
        RectangularBinning([0.2, 0.3]),
    ]
    for bin in binnings
        rbe = RectangularBinEncoding(bin, x)
        visited_bins = map(pᵢ -> encode(rbe, pᵢ), x)
        @test -1 ∉ visited_bins
    end
end
This errors. I'll fix this now. Or at least I'll try.
Well, to be precise, this is also a problem in the Transfer Operator code. If you allow a `FixedRectangularBinning` to be given, you must be able to deal with points that are given the encoding -1, because that's something the fixed binnings support.
> Well, to be precise, this is also a problem in the Transfer Operator code. If you allow a `FixedRectangularBinning` to be given, you must be able to deal with points that are given the encoding -1, because that's something the fixed binnings support.
The transfer operator is approximated by how a locally linear map transforms points. An implicit assumption here is that the points are supported on the grid on which the approximation is made. It should be fine to just drop any point where one or more components are encoded as -1. We can just add a filter to the line `visited_bins = map(pᵢ -> encode(encoder, pᵢ), pts)`, where any `pᵢ` that has a -1 as a component is simply dropped.
I've always made sure that the binning used covers all the points a priori, so this hadn't crossed my mind before. My mistake.
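A minimal sketch of that filtering step, using a stub `encode` that returns -1 for out-of-grid points (the real `encode`, `encoder`, and `pts` come from the package and the discussion above, and this simplifies to scalar encodings):

```julia
# Stub encode: bin index within the range, -1 if the point falls outside.
encode(r::AbstractRange, x) = first(r) <= x <= last(r) ? searchsortedlast(r, x) : -1

encoder = 0.0:0.25:1.0
pts = [0.3, 1.7, 0.9]            # 1.7 lies outside the grid
visited_bins = filter(!=(-1), map(pᵢ -> encode(encoder, pᵢ), pts))
# → [2, 4]: the out-of-grid point is simply dropped
```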
This should be done in a different PR.
For now, I found the obvious problem. When making a range with `range(min, max; step = x)`, it is not guaranteed that `max` is within the range, something that `RectangularBinning` promises. I'll fix it now.
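The problem is easy to reproduce in plain Julia, nothing package-specific is needed:

```julia
# With a `step` that doesn't evenly divide the interval, the requested
# maximum is silently excluded from the range:
r = range(0.0, 1.0; step = 0.3)
collect(r)      # [0.0, 0.3, 0.6, 0.9] — the maximum 1.0 is not included
last(r) == 1.0  # false, even though 1.0 was passed as the maximum
```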
Our binning code was really bad when it came down to real-world usage. When preparing the workshop, showing the outputs of the value histogram was always unintuitive. This thing we do with `n_eps` and `nextfloat` always leads to completely random and unintuitive numbers for the histogram edges. It also makes it very hard to get "the expected histogram" for different distributions, and hence to compute the KL divergence. Furthermore, it is fundamentally inaccurate. A much better approach is to give up trying to "hack up" some accuracy ourselves, and instead take advantage of Julia's base `range` system, which operates using `TwicePrecision` to always keep the range step sizes what the user expects, without dealing with floating-point precision. So the range `0:0.1:1` has exactly a 0.1 step and exactly a length of 11. I have fully re-written the internals of rectangular binnings to utilize ranges. This has led to many, many benefits:

- `RectangularBinning` is an intermediate struct that gets cast into a `FixedRectangularBinning`. This greatly reduces the code.
- `n_eps` have been completely removed. They were never accurate to begin with; they just changed the histogram sizes, but the histograms were just as inaccurate. To be accurate you need double precision.
- `FixedRectangularBinning` now takes standard Julia `range`s as input, one `range` for each dimension, with convenience constructors. This allows us to utilize Julia's internal double-precision system without any hacky stuff. It also means that the outcome space has nice, simple edges and bin widths, which is what a user would like.
- There is a `.precise` option: if `true`, encodings use `searchsortedlast`, which internally uses the double precision, to map data to the correct bin according to the ranges. If `false`, they use our standard division by the bin width.

To give an example of how much of a change this is, here we go:
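As a quick generic illustration of the `TwicePrecision` point above (this is plain Julia behavior, separate from the package comparison):

```julia
r = 0:0.1:1
length(r)        # 11 — exactly the expected number of edges
step(r)          # 0.1 — exactly the step the user asked for
last(r) == 1.0   # true — no floating-point drift at the endpoint
```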