JuliaDynamics / StateSpaceSets.jl

The `StateSpaceSet` interface for packages of JuliaDynamics
MIT License
2 stars 3 forks source link

Covariance/correlation matrix #23

Open kahaaga opened 9 months ago

kahaaga commented 9 months ago

I'm cleaning up stuff in CausalityTools, and came across some code I wrote for estimating the covariance/correlation matrix of a StateSpaceSet used for the parametric Gaussian MI/CMI estimators. The code here is significantly faster than converting to a Matrix and then calling Statistics.cov/Statistics.cor.

Would this be something that belongs here?

The source code is relatively simple.

import Statistics.cov
using Statistics: mean, std
using StateSpaceSets: AbstractStateSpaceSet
using StaticArrays: @MMatrix, @MVector, SMatrix, SVector

export fastcov

# Non-allocating and faster than writing a wrapper.
# `f(x) = Statistics.cov(Matrix(x))`.
# Also accepts SubStateSpaceSets
# These functions return StaticArrays.
fastcov(x::AbstractStateSpaceSet) = fastcov(x.data)
fastmean_and_cov(x::AbstractStateSpaceSet) = fastmean_and_cov(x.data)

function fastcov(x̄, x::Vector{SVector{D, T}}) where {D, T}
    T <: AbstractFloat || error("Need `eltype(x[i]) <: AbstractFloat` ∀ i ∈ 1:length(x). Got `eltype(x[i])=$(eltype(first(x)))`")
    N = length(x) - 1
    C = @MMatrix zeros(D, D)
    x̄ = mean(x)
    Δx = @MVector zeros(D)
    @inbounds for xᵢ in x
        Δx .= xᵢ - x̄
        C .+= Δx * transpose(Δx)
    end
    C ./= N
    return SMatrix{D, D}(C)
end
# So we don't have to compute the mean twice at every iteration.
function fastcov(x::Vector{SVector{D, T}}) where {D, T}
    T <: AbstractFloat || error("Need `eltype(x[i]) <: AbstractFloat` ∀ i ∈ 1:length(x). Got `eltype(x[i])=$(eltype(first(x)))`")

    μ = mean(x)
    fastcov(μ, x)
end
function fastmean_and_cov(x::Vector{SVector{D, T}}) where {D, T}
    μ = mean(x)
    Σ = fastcov(μ, x)
    return μ, Σ
end

# Non-allocating and faster than writing a wrapper.
export fastcor
fastcor(x::AbstractStateSpaceSet) = fastcor(x.data)
function fastcor(x::Vector{SVector{D, T}}) where {D, T}
    N = length(x)
    μ, Σ = fastmean_and_cov(x)
    σ = std(x)
    C = @MMatrix zeros(D, D)
    for j in 1:D
        for i in 1:D
            C[i, j] = Σ[i, j] / (σ[i] * σ[j])
        end
    end
    return SMatrix{D, D}(C)
end

Verifying that the implementations are actually correct:

julia> using Test

julia> using Statistics

julia> D = StateSpaceSet(rand(100, 3));

julia> @test all(fastcov(D) .≈ Statistics.cov(Matrix(D)))
Test Passed

julia> @test all(fastcor(D) .≈ Statistics.cor(Matrix(D)))
Test Passed

Timing

julia> D = StateSpaceSet(rand(10000, 5)); 

julia> mcor(x) = Statistics.cor(Matrix(x))
mcor (generic function with 1 method)

julia> fastcor(D); @btime fastcor($D);
  59.708 μs (0 allocations: 0 bytes)

julia> mcor(D); @btime mcor($D);
  89.333 μs (13 allocations: 782.31 KiB)
Datseris commented 9 months ago

What is the covariance/correlation matrix in this context?

kahaaga commented 9 months ago

What is the covariance/correlation matrix in this context?

It would be covariance/correlation between the different columns of the dataset. If D = StateSpaceSet(rand(10000, 5)) and corrs = fastcor(D), then corrs[2, 5] is the correlation between the second and fifth column of D.

kahaaga commented 9 months ago

In practice, the correlation matrix is used for computing partial correlations, which is for example a pretty standard use case for the PC graph inference algorithm.

Datseris commented 9 months ago

Hm, I guess we have standardize in this package as a statistical operation, so this functionality could be added here as well, provided that it does not introduce additional dependencies! In the docs we could have a "Basic statistics" subsection.

Datseris commented 9 months ago

@MMAtrix is for example an additional dependency that we do not need (we only depend on StaticArraysCore)

kahaaga commented 9 months ago

Hm, I guess we have standardize in this package as a statistical operation, so this functionality could be added here as well, provided that it does not introduce additional dependencies! In the docs we could have a "Basic statistics" subsection.

A section with basic statistics sounds good. I'm sure there will be other statistics that we'd want to optimise for StateSpaceSets later too.

@MMAtrix is for example an additional dependency that we do not need (we only depend on StaticArraysCore)

The macro isn't really necessary here. StaticArraysCore.jl exports MMatrix, so we can just use that.

kahaaga commented 9 months ago

For the name, fastcov/fastcor names should be just cov/cor, right? We just extend the Statistics methods.

Datseris commented 9 months ago

correct