Parameters of "empty" histograms

heliosdrm commented 5 years ago

Empty histograms may occur more frequently for vertical structures, if there are no laminar states (no vertical lines, especially if a minimum line length is defined). But they might also occur with some signals for diagonal lines and recurrence times.

In such cases the calculation of various parameters (averages, entropies, etc.) gives NaN. But is that the "right" solution? The condition is easy to check, so we could set them to zero in those cases if that is preferable. (Perhaps with a warning?)

Datseris commented 5 years ago

Shouldn't the return value be empty lists? This way you keep type stability as well.

heliosdrm commented 5 years ago

Why empty lists? Maybe you are thinking of the histograms (i.e. the return value of recurrencestructures). But I'm referring to the result of the RQA functions, which are numbers (integers or floats).

At the moment they are type stable:

julia> using RecurrenceAnalysis, SparseArrays

julia> m = sparse(Int[],Int[],Bool[],10,10)
10×10 SparseMatrixCSC{Bool,Int64} with 0 stored entries

julia> rmat = RecurrenceMatrix(m)
RecurrenceMatrix of size (10, 10) with 0 entries:

julia> det = dl_average(rmat)
NaN

julia> typeof(det)
Float64

The issue is to decide what result makes sense. Using this example: if we have no lines, should the function report that their average length cannot be calculated (e.g. implied by NaN, as happens now), or that it is zero?
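
The two options can be compared on a plain histogram vector. This is only a minimal sketch: `average_length` and its `empty_value` keyword are hypothetical names, not part of RecurrenceAnalysis.jl, and `hist[k]` is assumed to hold the number of lines of length `k`.

```julia
# Hypothetical helper, not part of RecurrenceAnalysis.jl.
# Assumption: hist[k] holds the number of lines of length k.
function average_length(hist::AbstractVector{<:Integer}; empty_value::Float64 = NaN)
    nlines = sum(hist)
    # With no lines the mean would be 0/0, so return the chosen value instead.
    nlines == 0 && return empty_value
    return sum(k * hist[k] for k in eachindex(hist)) / nlines
end

average_length([0, 2, 1])                # (2*2 + 3*1) / 3
average_length([0])                      # NaN: "cannot be calculated"
average_length([0]; empty_value = 0.0)   # 0.0: "there are no lines"
```

Both branches return a `Float64`, so either choice keeps the function type stable.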

Datseris commented 5 years ago

I see. Yes sorry I got confused.

I think that this case should have zero in it, not NaN. I also think a warning should be issued as well, for example:

@warn "We could not find vertical/diagonal lines with length greater than the given `lmin`. The returned result is zero."
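
A rough sketch of how such a warning could be attached to the empty check. The function `max_length` and the histogram layout (`hist[k]` counts the lines of length `k`) are assumptions made for illustration, not the package's code.

```julia
# Hypothetical wrapper, not part of RecurrenceAnalysis.jl.
# Assumption: hist[k] holds the number of lines of length k.
function max_length(hist::AbstractVector{<:Integer}, lmin::Integer)
    longest = findlast(!iszero, hist)   # index of the last filled bin, or nothing
    if longest === nothing
        @warn "We could not find lines reaching the minimum length lmin = $lmin. The returned result is zero."
        return 0
    end
    return longest
end

max_length([0, 2, 1], 2)   # 3
max_length([0], 2)         # 0, and a warning is logged
```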

pucicu commented 5 years ago

It should not be zero when the histogram is empty. Zero would mean that we have a very discrete distribution, i.e., the histogram would have only one filled bin (the others being empty or zero). This has a certain meaning. If the distribution were completely zero for each value x, then this would be more like a uniform distribution (which would have the highest entropy value). But because no bin in the histogram is filled at all, it is still different from the uniform distribution. Therefore, only NaN would be correct.

Datseris commented 5 years ago

TL;DR: I don't mind about either 0 or NaN, but we have to be clear about it in the docs.

Norbert's comments make sense, but there is some merit to the 0 side as well.

0 isn't so unintuitive as far as e.g. the average or maximum is concerned. Because the lengths are strictly positive, the only way for their mean/maximum to be exactly zero is if none exist. (Each individual length being zero coincides with them not existing.)

> If the distribution were completely zero for each value x, then this would be more like a uniform distribution (which would have the highest entropy value).

How can a distribution be completely zero for each x value? Can we call this thing a "distribution"? It doesn't have an integral of 1 for example, so as far as probability axioms it fails there.

I am fine either way, but I now realize we should definitely add a comment to the docstring of rqa, and maybe also to the documentation page as a Note, which explicitly states how we handle the "empty histogram" cases.

heliosdrm commented 5 years ago

With the current implementation, empty histograms give the following values:

DET (for diagonal) / LAM (for vertical) : 0.0
averages (Lmean, TT, MRT): 0.0
maximum values (Lmax, Vmax, etc.): 0
entropies (ENTR, RTE, etc.): 0.0
NMPRT: 0

The only possible ambiguity that I can think of is in the case of entropies, but not in the sense that you are describing: empty bins are not counted in the calculation of the entropy, so this result is the same as in the case where all lines or recurrence times have the same length. On the other hand, according to the formula of the Shannon entropy, zero is the correct result for an empty set, so that ambiguity is more philosophical than mathematical. For all the other parameters, those results are only possible if the histogram is empty (no lines reaching the minimum length).
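
The ambiguity can be reproduced with a few lines. This is a sketch under stated assumptions, not the package's implementation: `entropy_skipping_empty` is a hypothetical name, and it uses the natural logarithm and skips empty bins as described above.

```julia
# Hypothetical helper, not part of RecurrenceAnalysis.jl.
# Shannon entropy (natural log) of a frequency histogram, skipping empty bins.
function entropy_skipping_empty(hist::AbstractVector{<:Integer})
    total = sum(hist)
    s = 0.0
    for h in hist
        h > 0 || continue      # empty bins contribute nothing
        p = h / total
        s -= p * log(p)
    end
    return s
end

entropy_skipping_empty([0])         # 0.0: empty histogram, the loop adds nothing
entropy_skipping_empty([0, 0, 4])   # 0.0: all lines have the same length
```

Both calls give the same value, so the result alone cannot distinguish "no lines" from "one line length with probability one".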

We can coerce some values to NaN if necessary when the histogram is empty, but my gut feeling when I see NaNs is that something is ill, and moreover NaNs are propagated to all subsequent calculations, so I'd prefer to avoid them when zero may be a correct result.
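
A small illustration of that propagation, using made-up numbers rather than real RQA results:

```julia
using Statistics

# Made-up values for illustration; the NaN stands for an empty histogram.
entropies = [1.0, NaN, 2.0]

mean(entropies)                   # NaN: a single NaN propagates into the average
mean(filter(!isnan, entropies))   # 1.5: the NaN has to be filtered out explicitly
```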

Another story is how empty histograms are represented. In the current implementation of the package they are not long arrays full of zeros, but an array with only a single zero. I also thought of returning a zero-element array, but although it would work in the present state of the package, I fear that it would carry a greater risk of inadvertently breaking code if something is changed in the future.

pucicu commented 5 years ago

With “distribution” I mean “frequency distribution” (e.g. histogram) and not “probability distribution”. Sorry that this was not clear.

I think, if there is not any line then the “frequency distribution” is empty and the “probability distribution” does not exist. And for something that is not existing, there cannot be a valid entropy value. Moreover, as I said, entropy = 0 would mean already something, namely one line length appearing with probability one.

BTW, the CRP Toolbox for MATLAB gives NaN in this case. 😉

Datseris commented 5 years ago

> We can coerce some values to NaN if necessary when the histogram is empty, but my gut feeling when I see NaNs is that something is ill, and moreover NaNs are propagated to all subsequent calculations, so I'd prefer to avoid them when zero may be a correct result.

Yeah, this is a very important point. I would agree: if I ever see a NaN anywhere, I know something went terribly wrong.

edit: But this is also another very good point:

> I think, if there is not any line then the “frequency distribution” is empty and the “probability distribution” does not exist. And for something that is not existing, there cannot be a valid entropy value. Moreover, as I said, entropy = 0 would mean already something, namely one line length appearing with probability one.

Damn, tough development decisions :D Should I serve as the tie breaker?

pucicu commented 5 years ago

It should be NaN only for the entropies. Zero for the other measures is right.
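
Put together, that convention could look roughly as follows. This is a sketch only: `empty_safe_measures` and its field names are hypothetical, `hist[k]` is assumed to count the lines of length `k`, and the entropy uses the natural logarithm.

```julia
# Hypothetical summary of the convention proposed here; not the package's code.
# Assumption: hist[k] holds the number of lines of length k.
function empty_safe_measures(hist::AbstractVector{<:Integer})
    total = sum(hist)
    if total == 0
        # Empty histogram: zero for average and maximum, NaN only for the entropy.
        return (average = 0.0, longest = 0, entropy = NaN)
    end
    average = sum(k * hist[k] for k in eachindex(hist)) / total
    longest = findlast(!iszero, hist)
    entropy = 0.0
    for h in hist
        h > 0 || continue
        p = h / total
        entropy -= p * log(p)
    end
    return (average = average, longest = longest, entropy = entropy)
end

empty_safe_measures([0])         # average 0.0, longest 0, entropy NaN
empty_safe_measures([0, 2, 2])   # average 2.5, longest 3, entropy log(2)
```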

Datseris commented 5 years ago

> It should be NaN only for the entropies. Zero for the other measures is right.

Alright, this seems okay for me as well.

@heliosdrm

> Another story is how empty histograms are represented. In the current implementation of the package they are not long arrays full of zeros, but an array with only a single zero.

Wait, I am confused. Why isn't the array completely empty and instead has an entry? Shouldn't it be totally empty?

pucicu commented 5 years ago

But this is a personal opinion. For me a NaN is not a problem. It just tells me to be more careful when further investigating the data. 🤓

heliosdrm commented 5 years ago

> Wait, I am confused. Why isn't the array completely empty and instead has an entry? Shouldn't it be totally empty?

It's just a consequence of how the histogram is calculated: it starts with that single "empty" bin, and then it is extended with further bins whenever a line longer than the maximum existing bin is found.

I prefer to keep it like that to prevent the usual out-of-bound problems with zero-element arrays. But I can modify the code of recurrencestructures (so far the only exported function that shows the histograms), and return Int[] in such cases.
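
A minimal sketch of that growth scheme, assuming the line lengths are already available as a list. The function `line_histogram` is hypothetical and only mirrors the description above, not the actual code of the package.

```julia
# Hypothetical sketch of a histogram that grows on demand; not the package's code.
function line_histogram(lengths)
    hist = [0]                                  # start with a single "empty" bin
    for l in lengths
        if l > length(hist)
            # Extend with zero bins up to the new maximum length.
            append!(hist, zeros(Int, l - length(hist)))
        end
        hist[l] += 1
    end
    return hist
end

line_histogram([2, 4, 2])   # [0, 2, 0, 1]
line_histogram(Int[])       # [0]: indexing hist[1] is still valid
```

With this layout `hist[1]` can always be read, whereas indexing a zero-element array such as `Int[]` throws a `BoundsError`.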

Datseris commented 5 years ago

> It's just a consequence of how the histogram is calculated: it starts with that single "empty" bin, and then it is extended with further bins whenever a line longer than the maximum existing bin is found.

Okay, this is fine, no need to change it.

So we all agree to only change the entropy return values to NaN?

pucicu commented 5 years ago

I strongly suggest this. I had also discussed this in my group and they have the same opinion.

heliosdrm commented 5 years ago

And tell that in the documentation. :wink:

JuliaDynamics / RecurrenceAnalysis.jl

Parameters of "empty" histograms #41