JuliaDynamics / ComplexityMeasures.jl

Estimators for probabilities, entropies, and other complexity measures derived from data in the context of nonlinear dynamics and complex systems
MIT License
53 stars 12 forks source link

Encode-decode step is incorrect for `OrdinalPatternEncoding` #242

Closed Datseris closed 1 year ago

Datseris commented 1 year ago

MWE:

julia> using ComplexityMeasures

julia> enc = OrdinalPatternEncoding(3)
OrdinalPatternEncoding{3, typeof(ComplexityMeasures.isless_rand)}([0, 0, 0], ComplexityMeasures.isless_rand)

julia> i = encode(enc, [10, -2, 1])
4

julia> decode(enc, i)
3-element SVector{3, Int64} with indices SOneTo(3):
 2
 3
 1

The element of the outcome space we should be getting is [3, 1, 2] since 10 is he largest value and -2 the lowest. But we don't get this as the decoded element of the encoding.

or is the documentation incorrect?

kahaaga commented 1 year ago

Hm, I think something is wrong here. Let me have a look.

kahaaga commented 1 year ago

This definitely used to work. But the encoding tests have been disabled at some point during the refactoring for v2. Some changes were done to the code after the tests were deactivated, and right now I'm not sure what exactly was changed. I'll try to trace down the commit that changed the source code, and see if this can be fixed easily.

Relevant test code:

# This is not part of the public API, but this is crucial to test directly to
    # ensure its correctness. It makes no sense to test it though "end-product" code,
    # because there would be no obvious way of debugging the forward/inverse Lehmer-code
    # code from end-product code. We therefore test the internals here.
    @testset "Encoding/decoding" begin
        m = 4
        # All possible permutations for length-4 vectors.
        πs = [
            [1, 2, 3, 4], [1, 2, 4, 3], [1, 3, 2, 4], [1, 3, 4, 2], [1, 4, 2, 3],
            [1, 4, 3, 2], [2, 1, 3, 4], [2, 1, 4, 3], [2, 3, 1, 4], [2, 3, 4, 1],
            [2, 4, 1, 3], [2, 4, 3, 1], [3, 1, 2, 4], [3, 1, 4, 2], [3, 2, 1, 4],
            [3, 2, 4, 1], [3, 4, 1, 2], [3, 4, 2, 1], [4, 1, 2, 3], [4, 1, 3, 2],
            [4, 2, 1, 3], [4, 2, 3, 1], [4, 3, 1, 2], [4, 3, 2, 1]
        ]
        encoded_πs = encode_motif.(πs, m)
        @test encoded_πs == 0:(factorial(m) - 1) |> collect
        @test all(isa.(encoded_πs, Int))

        # Decoded permutations (`SVector{m, Int}`s)
        decoded_πs = decode_motif.(encoded_πs, m)
        @test all(length.(decoded_πs) .== m)
        @test all(decoded_πs .== πs)
    end
kahaaga commented 1 year ago

Okay, I think I'm narrowing it down.

I re-did the encoding from scratch and compare it to Berger et al. (2019)'s example in their table 1.

using ComplexityMeasures
m = 4
# All possible permutations for length-4 vectors.
# Table 1 in Berger et al. These permutations should, in the given order,
# map onto integers 0, 1, ..., factorial(4) - 1.
πs = [
    [1, 2, 3, 4], # 0
    [1, 2, 4, 3], # 1
    [1, 3, 2, 4], # 2
    [1, 3, 4, 2], # 3
    [1, 4, 2, 3], # 4
    [1, 4, 3, 2], # 5
    [2, 1, 3, 4], # 6
    [2, 1, 4, 3], # 7
    [2, 3, 1, 4], # 8
    [2, 3, 4, 1], # 9
    [2, 4, 1, 3], # 10
    [2, 4, 3, 1], # 11
    [3, 1, 2, 4], # 12
    [3, 1, 4, 2], # and so on...
    [3, 2, 1, 4],
    [3, 2, 4, 1],
    [3, 4, 1, 2],
    [3, 4, 2, 1],
    [4, 1, 2, 3],
    [4, 1, 3, 2],
    [4, 2, 1, 3],
    [4, 2, 3, 1],
    [4, 3, 1, 2],
    [4, 3, 2, 1],
]

# `perm` is a *permutation* (i.e. an ordering resulting from `sortperm`), not an input data vector
function encode_pattern(perm)
    m = length(perm)
    n = 0
    for i = 1:m-1
        for j = i+1:m
            n += perm[i] > perm[j] ? 1 : 0
        end
        n = (m-i)*n
    end
    return n + 1 # add 1, because we have positive integer outcomes
end

# Encoding
julia> all(encode_pattern.(πs) .== 1:factorial(4))
true

# Decoding
julia> decode.(Ref(OrdinalPatternEncoding(4)), es .- 1) == πs
true

# But our approach fails
julia> o = OrdinalPatternEncoding(4)
OrdinalPatternEncoding{4, typeof(ComplexityMeasures.isless_rand)}([0, 0, 0, 0], ComplexityMeasures.isless_rand)

julia> @show encode.(Ref(o), πs)
encode.(Ref(o), πs) = [0, 1, 2, 4, 3, 5, 6, 7, 12, 18, 13, 19, 8, 10, 14, 20, 16, 22, 9, 11, 15, 21, 17, 23]

Both encoding and decoding works as expected with the original algorithm. The issue must have something to do with how we store/use the permutation vector inside OrdinalPatternEncoding. It's probably stupidly simple why this errors, but I'm completely unable to see it at the moment. I'll revisit it after some frustration lunch.

kahaaga commented 1 year ago

Ah, of course! If the input data to encode are already permutations, then calling sortperm again will permute the permutation.

A keyword argument checking whether the input vector is already is a permutation fixes it.

function encode(encoding::OrdinalPatternEncoding{m}, χ::AbstractVector; isperm = false) where {m}
    if m != length(χ)
        throw(ArgumentError("Permutation order and length of input must match!"))
    end
    if isperm
        perm = χ
    else
        perm = sortperm!(encoding.perm, χ)
    end
    # Begin Lehmer code
    n = 0
    for i = 1:m-1
        for j = i+1:m
            n += perm[i] > perm[j] ? 1 : 0
        end
        n = (m-i)*n
    end
    # The Lehmer code actually results in 0 being an encoded symbol. Shift by 1, so that
    # encodings are the positive integers.
    return n
end

# The same example as above
πs = [[1, 2, 3, 4], [1, 2, 4, 3], [1, 3, 2, 4], [1, 3, 4, 2], [1, 4, 2, 3], [1, 4, 3, 2], [2, 1, 3, 4], [2, 1, 4, 3], [2, 3, 1, 4], [2, 3, 4, 1], [2, 4, 1, 3], [2, 4, 3, 1], [3, 1, 2, 4], [3, 1, 4, 2], [3, 2, 1, 4], [3, 2, 4, 1], [3, 4, 1, 2], [3, 4, 2, 1], [4, 1, 2, 3], [4, 1, 3, 2], [4, 2, 1, 3], [4, 2, 3, 1], [4, 3, 1, 2], [4, 3, 2, 1]]

julia> @show encode.(Ref(o), πs)
encode.(Ref(o), πs) = [0, 1, 2, 4, 3, 5, 6, 7, 12, 18, 13, 19, 8, 10, 14, 20, 16, 22, 9, 11, 15, 21, 17, 23]

julia> @show encode.(Ref(o), πs, isperm = true)
encode.(Ref(o), πs, isperm = true) = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
kahaaga commented 1 year ago

All that aside, I was overcomplicating this. Let's consider your example again, @Datseris

The permutation of [10, -2, 1] is [2, 3, 1], not [3, 1, 2]. You can verify this by computing sortperm([10, -2, 1]) == [2, 3, 1]). Therefore, the input vector [10, -2, 1] corresponds to the outcome [2, 3, 1].

sortperm simply returns the indices that would sort input vector x in ascending order.

There are two things we need to address:

Datseris commented 1 year ago

@kahaaga well,

The permutation of [10, -2, 1] is [2, 3, 1], not [3, 1, 2]. You can verify this by computing sortperm([10, -2, 1]) == [2, 3, 1]). Therefore, the input vector [10, -2, 1] corresponds to the outcome [2, 3, 1].

Sure, but that's not really what I care about. According to the documentation, the outcome space is not the permutations; it is the ordering itself. So we need to clarify: do we need to alter the documentation because the outcome space is not the orderings, but the permutations instead?

So, what is the outcome space ? is it the permutations, i.e, the sortperm output of the element, or is it the ordering of the element, i.e., as it is writrten right now in the documentation? In which case, [10, -2, 1] maps to [3, 1, 2] because 10 is the largest value in this vector.

Datseris commented 1 year ago

So, to make this even more clear: is the outcome space the thing that if I scatter plot it, i get the ordinal pattern, or is it the sorting permutation of the vector? Because the ordinal pattern of [10, -2, 1] is this thing:

image

which is what I get if I scatterplot [3, 1, 2].

Datseris commented 1 year ago

We need a way to distinguish between the case where the input vectors are already permutations and when they are raw data. Do you think a keyword argument encode(::OrdinalPatternEncoding, x; ispermutation = false) is a good idea?

No, this is not a good idea, because this is not the supported API. encode always takes in an element of the dataset. It doesn't, and shouldn't, and has no reason to, take in a permutation, unless the permutation itself is the data one has at hand for whatever reason. In which case, this keyword serves no purpose.

I am not sure what you are trying to "fix" here with this keyword argument.

Datseris commented 1 year ago

if you want to use the Lehmer code directly, then please define an intermediate function that is called by encode instead of messing/complicating the encode function itself to make it serve two purposes. A function should serve a single, unique purpose. (from Good Scinetific Code worksop)

kahaaga commented 1 year ago

So, to make this even more clear: is the outcome space the thing that if I scatter plot it, i get the ordinal pattern, or is it the sorting permutation of the vector? Because the ordinal pattern of [10, -2, 1] is this thing:

image

which is what I get if I scatterplot [3, 1, 2]. Sure, but that's not really what I care about. According to the documentation, the outcome space is not the permutations; it is the ordering itself. So we need to clarify: do we need to alter the documentation because the outcome space is not the orderings, but the permutations instead?

Sorry, my mistake. I was reading the documentation for OrdinalPatternEncoding, in which I couldn't find anything wrong with. You mean the documentation for SymbolicPermutation & friends, of course.

The permutation you plot is down-up, or "first is largest, second is least, third is in between", "the rank ordering is [3, 1, 2]" (as you say), or "the permutation pattern is (2, 3, 1)". All are equivalent. Our implementation is based on the permutation (i.e. the result of sortperm(x)). This is typically what is done is the context of permutation entropy (as the name implies). See for example section 2 in this summary paper.

So yes, the documentation string is wrong at the moment. We can either

The latter adds extra computations, which is unfortunate if you need to keep track of outcomes for many computations. Lehmer code operates directly on the permutations anyways, so it feels natural to have the outcome space be the permutations. Any preference?

I am not sure what you are trying to "fix" here with this keyword argument. if you want to use the Lehmer code directly, then please define an intermediate function that is called by encode instead of messing/complicating the encode function itself to make it serve two purposes.

We need to be able to test, on a comprehensive analytical example, that encoding/decoding works as expected. If we can't input permutations directly (i.e. a result of sortperm(x)), then that is not possible to test directly with the current code. Of course, we can test by manually constructing a dataset whose permutations correspond exactly to the Berger et al example, in that order.

I'll define an non-public intermediate function that does the permutation-to-integer encoding, which is then called internally by encode(::OrdinalPatternEncoding, x). Is that a good compromise, or shall we go for the slightly more complicated test example, but do no changes to encode?

Datseris commented 1 year ago

The permutation you plot is down-up, or "first is largest, second is least, third is in between", "the rank ordering is [3, 1, 2]" (as you say), or "the permutation pattern is (2, 3, 1)". All are equivalent. Our implementation is based on the permutation (i.e. the result of sortperm(x)). This is typically what is done is the context of permutation entropy (as the name implies). See for example section 2 in this summary paper.

Okay, great, so we change the documentation. No reason to change the code, it is indeed expensive to do the transformation to the alternative equivalent form.

I'll define an non-public intermediate function that does the permutation-to-integer encoding, which is then called internally by encode(::OrdinalPatternEncoding, x). Is that a good compromise, or shall we go for the slightly more complicated test example, but do no changes to encode?

We should go for the slightly more complicated test example and no changes to encode. The example doesn't have to be complicated either; we can check 3 things:

  1. ws = decode.(encode.(encoding, πs)); sort(ws) == sort(πs). Confirms that all 24 possibilities are covered.
  2. Explicitly test that decode(encode(enc, [10, -1, 2])) == [1, 3, 2]
  3. Same test as above with a 4-lenth vector

and also the two tests you wrote:

# Encoding
julia> all(encode_pattern.(πs) .== 1:factorial(4))
true

# Decoding
julia> decode.(Ref(OrdinalPatternEncoding(4)), es .- 1) == πs
true
Datseris commented 1 year ago

Oh, actually, regardless of tests that should only test encode anyways, it is still useful to define a function that only applies the Lehmer code. encode then calls sortperm! and then the lehmer code function.