Closed Datseris closed 4 years ago
In a similar question, I am doing a boostrapping method for calculating the confidence interval for the mutual information:
bootstrap = zeros(N)
for i in 1:N
bootstrap[i] = get_mutual_information(shuffle!(x), shuffle!(y))
end
I also think that this can be further optimized, but I am not sure exactly which parts of mi_quick
don't change when I shuffle the vectors.
- does get_bin_ids! change if I delay y?
get_bin_ids!
performs the discretization. It translates an array of data into an array (the same length) of IDs, where each ID represents the bin that the data point falls in. E.g.:
bins: 0 <= bin 1 < 1; 1 <= bin 2 < 2
data: 0.5, 0.6, 1.3, 0.9, 1.1
get_bin_ids! will return: [ 1, 1, 2, 1, 2 ]
Whether you actually need to call get_bin_ids!
more than once depends on how exactly your delay is working. E.g. if you are using a sliding window on a longer timeseries, you could call get_bin_ids!
once on the full timeseries, and use the sliding window on the bin IDs. If instead you are shifting your y values and padding the array with 0 (or some predictable value), you may be able to discretize once then pad the bin IDs arrays with whichever bin ID contains your padding value.
which parts can be safely skipped?
get_frequencies_from_bin_ids
calculates a 2-dimensional distribution for x and y, using their bin IDs. E.g.:
nbins_x = 2
nbins_y = 2
xbins_0 = [ 1, 1, 2 ]
ybins_0 = [ 1, 2, 1 ]
ybins_1 = [ 1, 1, 2 ]
get_frequencies_from_bin_ids(xbins_0, ybins_0, nbins_x, nbins_y)
# 2×2 Array{Int64,2}:
# 1 1
# 1 0
get_frequencies_from_bin_ids(xbins_0, ybins_1, nbins_x, nbins_y)
# 2×2 Array{Int64,2}:
# 2 0
# 0 1
If you are shifting y so that each y data-point pairs with a different x data-point for each delayed timeseries, then you would need to call get_frequencies_from_bin_ids
for each delayed timeseries.
As for the bootstrapping, you may be able to discretize x and y once (again, this depends on how your delay is working). You'd need to shuffle the bin IDs before you call get_frequencies_from_bin_ids
, to ensure the x and y IDs are paired randomly.
Hope that helps!
Hi thanks for the reply, sorry that I wasn't clear. "Delay" works as follows. We have x, y
with equal length. For any delay τ (here positive for simplicity) the timeseries become x[1:end-τ], y[1+τ:end]
.
If I read your reply correctly, this means that I cannot do any optimization...?
get_bin_ids!
tells me that I definitely have to call it for every τget_frequencies_from_bin_ids
everytime since the joint distribution of x, y
changes with delay. The same must be done with bootstrapping.I was kinda hopping that you would have some kind of more modular lower level interface where the single distribution of x
and of y
is calculated independelty from the joint one of x, y
. Because in both scenarios I describe in this issue the individual distributions of x, y
do not change.
It seems to me that you could discretize x and y once each, then use bin_ids_x[1:end-τ]
and bin_ids_y[1+τ:end]
as your inputs to get_frequencies_from_bin_ids
. get_bin_ids!
is the part that calculates the single distributions for x and y.
As you say, get_frequencies_from_bin_ids
does need to be calculated for each pairing.
Thank you, this is what I was looking for and it was very helpful. I'll try out to write a code implementation of what I want. I'll keep this issue and report here and if all is fine I'll close it!
Hi, I Finally got around to implementing this. All of the scenarios I discuss here boil down to the same process in the end: computing mutual information without re-computing the bins.
I have got the following version, which conceptually should do what I need:
using InformationMeasures
function mi_quick(x, y, τs;
nbins = Int(round(sqrt(length(x)))),
discretizer = "uniform_width", estimator = "maximum_likelihood", base = 2)
@assert length(x) == length(y)
N = length(x)
bin_ids_x, bin_ids_y = zeros(Int, N), zeros(Int, N)
get_bin_ids!(x, discretizer, nbins, bin_ids_x)
get_bin_ids!(y, discretizer, nbins, bin_ids_y)
mis = zeros(length(τs))
for (j, τ) in enumerate(τs)
f = get_frequencies_from_bin_ids(bin_ids_x[1:end-τ], bin_ids_y[1+τ:end], nbins, nbins)
p = get_probabilities(estimator, f)
mis[j] = apply_mutual_information_formula(p, sum(p, dims = 1), sum(p, dims = 2), base)
end
return mis
end
This seems to work for me!
I seem to be getting a 2x speedup.
@Datseris Thanks for sharing this! I'll close this now, since it seems to have been successful.
Hi, I have two timeseries x, y and I am looking to efficiently get the mutual information between x and y, where y is delayed with some time τ, e.g. τ = -50:50.
I am trying to use the "quick" version you have at the docs.
mi_quick
, but it targets a fundamentally different scenario. I "could" consider each delayed version of y as a different timeseries, but that would be very inefficient I imagine. A delayed y has the same marginal distribution as the normal y.get_bin_ids!
change if I delayy
?from the main loop of the function:
which parts can be safely skipped? I want to make a loop that loops over delay times τ, I know that I will only have 2 timeseries.