Tchanders / InformationMeasures.jl

Entropy, mutual information and higher order measures from information theory, with various estimators and discretisation methods.

Optimized method for Mutual Information with delay? #28

Closed Datseris closed 4 years ago

Datseris commented 4 years ago

Hi, I have two timeseries x, y and I am looking to efficiently get the mutual information between x and y, where y is delayed with some time τ, e.g. τ = -50:50.

I am trying to use the "quick" version you have in the docs, mi_quick, but it targets a fundamentally different scenario. I "could" consider each delayed version of y as a different timeseries, but I imagine that would be very inefficient: a delayed y has the same marginal distribution as the normal y.

from the main loop of the function:

    for i in 1 : nvars, j in i+1 : nvars
        f = get_frequencies_from_bin_ids(bin_ids[1:end, i:i], bin_ids[1:end, j:j], nbins, nbins)
        p = get_probabilities(estimator, f)
        mis[index] = apply_mutual_information_formula(p, sum(p, dims = 1), sum(p, dims = 2), mi_base)
        index += 1
    end

Which parts can be safely skipped? I want to write a loop over the delay times τ; I know that I will only ever have 2 timeseries.

Datseris commented 4 years ago

On a related question, I am doing a bootstrapping method for calculating the confidence interval for the mutual information:

    using Random: shuffle!  # shuffle! permutes its argument in place

    bootstrap = zeros(N)
    for i in 1:N
        bootstrap[i] = get_mutual_information(shuffle!(x), shuffle!(y))
    end

I also think that this can be further optimized, but I am not sure exactly which parts of mi_quick don't change when I shuffle the vectors.

Tchanders commented 4 years ago
  • does get_bin_ids! change if I delay y?

get_bin_ids! performs the discretization. It translates an array of data into an array (the same length) of IDs, where each ID represents the bin that the data point falls in. E.g.:

    bins: 0 <= bin 1 < 1; 1 <= bin 2 < 2
    data: 0.5, 0.6, 1.3, 0.9, 1.1
    get_bin_ids! will return: [ 1, 1, 2, 1, 2 ]
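To make that concrete, a minimal sketch of the call (hedged: "uniform_width" derives its bin edges from the data's own range, so the IDs for these five points need not match the hand-picked [0, 1) / [1, 2) edges above):

```julia
using InformationMeasures

data = [0.5, 0.6, 1.3, 0.9, 1.1]
nbins = 2
bin_ids = zeros(Int, length(data))

# Discretize in place: bin_ids[i] becomes the bin index of data[i].
get_bin_ids!(data, "uniform_width", nbins, bin_ids)
```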

Whether you actually need to call get_bin_ids! more than once depends on exactly how your delay works. E.g. if you are using a sliding window on a longer timeseries, you could call get_bin_ids! once on the full timeseries and apply the sliding window to the bin IDs. If instead you are shifting your y values and padding the array with 0 (or some other predictable value), you may be able to discretize once and then pad the bin ID arrays with whichever bin ID contains your padding value.
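A sketch of the sliding-window variant (the window and step sizes here are purely illustrative): discretize the full series once, then slice the ID array for each window:

```julia
using InformationMeasures

series = randn(10_000)
nbins = 10
bin_ids = zeros(Int, length(series))
get_bin_ids!(series, "uniform_width", nbins, bin_ids)

window, step = 1_000, 500
for start in 1:step:(length(series) - window + 1)
    ids = bin_ids[start:start+window-1]
    # ...pass ids (together with a second series' window of IDs)
    # to get_frequencies_from_bin_ids as usual...
end
```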

which parts can be safely skipped?

get_frequencies_from_bin_ids calculates a 2-dimensional distribution for x and y, using their bin IDs. E.g.:

nbins_x = 2
nbins_y = 2

xbins_0 = [ 1, 1, 2 ]
ybins_0 = [ 1, 2, 1 ]
ybins_1 = [ 1, 1, 2 ]

get_frequencies_from_bin_ids(xbins_0, ybins_0, nbins_x, nbins_y)
# 2×2 Array{Int64,2}:
#  1 1
#  1 0

get_frequencies_from_bin_ids(xbins_0, ybins_1, nbins_x, nbins_y)
# 2×2 Array{Int64,2}:
#  2 0
#  0 1

If you are shifting y so that each y data-point pairs with a different x data-point for each delayed timeseries, then you would need to call get_frequencies_from_bin_ids for each delayed timeseries.

As for the bootstrapping, you may be able to discretize x and y once (again, this depends on how your delay is working). You'd need to shuffle the bin IDs before you call get_frequencies_from_bin_ids, to ensure the x and y IDs are paired randomly.
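A hedged sketch of that idea (n_shuffles is an illustrative name; this assumes the delay, if any, has already been applied to the bin IDs, so only the x–y pairing needs randomizing):

```julia
using InformationMeasures
using Random: shuffle!

function shuffled_mi(bin_ids_x, bin_ids_y, nbins, n_shuffles;
        estimator = "maximum_likelihood", base = 2)
    mis = zeros(n_shuffles)
    ids_y = copy(bin_ids_y)  # keep the caller's IDs intact
    for i in 1:n_shuffles
        shuffle!(ids_y)  # randomize the pairing; the marginals are unchanged
        f = get_frequencies_from_bin_ids(bin_ids_x, ids_y, nbins, nbins)
        p = get_probabilities(estimator, f)
        mis[i] = apply_mutual_information_formula(p, sum(p, dims = 1), sum(p, dims = 2), base)
    end
    return mis
end
```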

Hope that helps!

Datseris commented 4 years ago

Hi thanks for the reply, sorry that I wasn't clear. "Delay" works as follows. We have x, y with equal length. For any delay τ (here positive for simplicity) the timeseries become x[1:end-τ], y[1+τ:end].

If I read your reply correctly, this means that I cannot do any optimization...?

I was kind of hoping that you would have some more modular, lower-level interface where the single distributions of x and of y are calculated independently from their joint distribution, because in both scenarios I describe in this issue the individual distributions of x and y do not change.

Tchanders commented 4 years ago

It seems to me that you could discretize x and y once each, then use bin_ids_x[1:end-τ] and bin_ids_y[1+τ:end] as your inputs to get_frequencies_from_bin_ids. get_bin_ids! is the part that calculates the single distributions for x and y.

As you say, get_frequencies_from_bin_ids does need to be calculated for each pairing.

Datseris commented 4 years ago

Thank you, this is what I was looking for and it was very helpful. I'll try to write an implementation of what I want. I'll keep this issue open and report back here, and if all is fine I'll close it!

Datseris commented 4 years ago

Hi, I finally got around to implementing this. All of the scenarios I discuss here boil down to the same process in the end: computing mutual information without re-computing the bins.

I have got the following version, which conceptually should do what I need:

using InformationMeasures

function mi_quick(x, y, τs;
    nbins = Int(round(sqrt(length(x)))),
    discretizer = "uniform_width", estimator = "maximum_likelihood", base = 2)

    @assert length(x) == length(y)
    N = length(x)
    bin_ids_x, bin_ids_y = zeros(Int, N), zeros(Int, N)
    # Discretize once: the marginal distributions do not depend on the delay.
    get_bin_ids!(x, discretizer, nbins, bin_ids_x)
    get_bin_ids!(y, discretizer, nbins, bin_ids_y)

    mis = zeros(length(τs))
    for (j, τ) in enumerate(τs)
        # Realize the delay by slicing the precomputed bin IDs;
        # both signs of τ are handled (e.g. τs = -50:50).
        ids_x = τ >= 0 ? bin_ids_x[1:N-τ] : bin_ids_x[1-τ:N]
        ids_y = τ >= 0 ? bin_ids_y[1+τ:N] : bin_ids_y[1:N+τ]
        f = get_frequencies_from_bin_ids(ids_x, ids_y, nbins, nbins)
        p = get_probabilities(estimator, f)
        mis[j] = apply_mutual_information_formula(p, sum(p, dims = 1), sum(p, dims = 2), base)
    end
    return mis
end

This seems to work for me!

I seem to be getting a 2x speedup.
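For completeness, a usage sketch on synthetic data (hypothetical: y is a noisy copy of x shifted by 10 samples, so the MI profile should peak near τ = 10):

```julia
x = randn(1_000)
y = circshift(x, 10) .+ 0.1 .* randn(1_000)  # y[n] ≈ x[n-10]
mis = mi_quick(x, y, 0:20)
# argmax(mis) is expected near the entry for τ = 10
```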

Tchanders commented 4 years ago

@Datseris Thanks for sharing this! I'll close this now, since it seems to have been successful.