Encoding for binnings - Githubissues

Datseris commented 1 year ago

This PR achieves:

Declare the encoding API. Closes #171
Implements encoding API for binnings
Refactors, and shortens, source code for histogram stuff. The binnings are nothing more than declarative structs that provide instructions on how to make a binning. The encoding is instantiated immediatelly.
ValueHistogram now requires that input x is provided if the binning is not fixed, so that outcome space is always well defined.
Moves Entropies.jl to StateSpaceSets.jl. This is the core, lowest dependency of JuliaDynamics that basically defines Dataset, dimension.

I have not written explicit tests for encodings yet. But the current tests pass. To consider also: should we expose the n_eps keyword to the API, and if so, maybe its time to give it a better name.

Datseris commented 1 year ago

cc @kahaaga in this PR I will change ValueHistogram to have the encoding directly as its field. This means that now the rectangular binnings are only used as "instructions". An instance of a bbinning is never used/stored but only to generate the RectangularBinEncoding. This will simplify source code even more. In fact, the ValueHistogram will become a three liner call to encode and decode.

kahaaga commented 1 year ago

cc @kahaaga in this PR I will change ValueHistogram to have the encoding directly as its field. This means that now the rectangular binnings are only used as "instructions". An instance of a bbinning is never used/stored but only to generate the RectangularBinEncoding. This will simplify source code even more. In fact, the ValueHistogram will become a three liner call to encode and decode.

Excellent.

Datseris commented 1 year ago

@kahaaga this PR will also remove all near-duplicate code that handles explictly input x that is a timeseries Vector{<:Real}. It will treat it as a 1-dimensional dataset. all related outcome space will be 1-element static vectors. You can re0interpret them to vector of numbers like so:

y = rand(SVector{1, Float64}, 100)
x = reinterpret(Float64, y) # behaves identically to Vector{Float64}

Datseris commented 1 year ago

This PR will also make outcome_space work for any binning. it will make the histogram size part of the bin encoding struct.

At the encoding level we have full knowledge: we know the limits of the histogram as we have already extracted them from data. I'll also put a notice to the docstirngs that it is strongly preferred to use Fixed binnings as their outcome space is not dependent on the input data.

kahaaga commented 1 year ago

@kahaaga this PR will also remove all near-duplicate code that handles explictly input x that is a timeseries Vector{<:Real}. It will treat it as a 1-dimensional dataset. all related outcome space will be 1-element static vectors. You can re0interpret them to vector of numbers like so:

That makes sense. Then, before merging this, we need to make sure that

[ ] Input dimension is checked explicitly for estimators that only work on 1D data. This is the case for some of the direct entropy estimators.
[ ] Update signatures in docstrings for all probabilities and entropy estimators that previously had Vector_or_Dataset

Datseris commented 1 year ago

no no I think you misunderstood: You can still use timeserries as input. We just don't need duplicate source code. The outcome space for timeseries input will be a vector of 1-length svectors instead of a vector of Floats. From the user input perspective nothing changes. So functions that accepted Vector_or_Dataset, still do so.

kahaaga commented 1 year ago

We just don't need duplicate source code. The outcome space for timeseries input will be a vector of 1-length svectors instead of a vector of Floats. From the user input perspective nothing changes. So functions that accepted Vector_or_Dataset, still do so.

Ah, yes, I misunderstood. Then these sound like reasonable changes.

kahaaga commented 1 year ago

This PR will also make outcome_space work for any binning. it will make the histogram size part of the bin encoding struct. At the encoding level we have full knowledge: we know the limits of the histogram as we have already extracted them from data.

Just to ensure that I understood this comment:

The outcome space for any RectangularBinning cannot be known in advance without providing input data. I.e. outcome_space(::RectangularBinning) is not defined.
The outcome space for any RectangularBinning is known when providing input data, because it is precisely the set of bins visited by the input data. I.e. outcome_space(x, ::RectangularBinning) is well-defined.

Is that what you meant?

I'll also put a notice to the docstirngs that it is strongly preferred to use Fixed binnings as their outcome space is not dependent on the input data.

I agree on recommending this.

Datseris commented 1 year ago

Now ValueHistogram() signature requires x if the bininng isn't FixedBinning so that all outcome spaces are knwon. Its similar to how we do it for e.g. the spectral entropy.

Datseris commented 1 year ago

@kahaaga I've updated the PR top level comment. This is ready for review and merge. I'll now be gonne for a fair bit of time, so go ahead and merge/modify as you see fit.

kahaaga commented 1 year ago

I've updated the PR top level comment. This is ready for review and merge. I'll now be gonne for a fair bit of time, so go ahead and merge/modify as you see fit.

Will do! Thanks for the effort.

Datseris commented 1 year ago

@kahaaga I'll slowly try to catch up in the coming days. What's the status here? This wasn't merged in, but is there any blockers?

kahaaga commented 1 year ago

@Datseris Welcome back!

I've been majorly busy too, so I just haven't had time to look at this PR properly yet. Will try to do so tonight or tomorrow morning.

The status here is:

I need to review this PR properly to make sure it doesn't interfere with anything transfer operator related. Will try to do that tonight or tomorrow morning.
- The entropy_signature branch (#168) (just switching argument orders of probabilities and entropy + friends) is ready-ish. I think we both decided on the design there, so probably it can just be merged? I'm basing everything CausalityTools v2 on that branch, so would be nice to merge it sooner rather than later.
- To complete the move to Complexity.jl, we need to finally decide on the design and and merge
- The multiscale API (#126)
- The generic setup for spatiotemporal analyses (#132)
- If you agree on the design part of the two PRs above, I will do the work of bringing them up to speed with main. There's been quite a few changes to main since I first made those PRs.

In general:

I've implemented many more literature entropy/mutual information/conditional mutual information/relative entropy/divergence estimators, including the most recent nearest-neighbor based developments.
I'm preparing educational videos/animations for several of the entropy estimators (think workshop/documentation)
I've completely redesigned the information measures for CausalityTools, so that any lower-level estimator like EntropyEstimator and ProbabilitiesEstimator can be used to estimate higher-level things like (conditional) mutual information.

However, I haven't prepared any of the two points above for review yet, because I want us to decide on the argument order, multiscale and spatiotemporal API before finalizing anything.

TLDR: 1) I will try to look at this PR tonight. 2) The outstanding PRs need to be resolved to ready the other packages.

kahaaga commented 1 year ago

should we expose the n_eps keyword to the API, and if so, maybe its time to give it a better name.

Maybe just call it tolerance or tol, and explain in the docstring what it means?

kahaaga commented 1 year ago

Okay, I've tried to have a look, but I'll need some more time to wrap my head around these changes. The tests do not pass with the current version of main. This happens because encode_as_bin and decode_as_bin can't straight-forwardly be replaced with the new encode (which gives an integer) in the source code for TransferOperator. I think.

My mind is a bit clouded at the moment. I'll have another go at it tomorrow again with a blank slate.

Datseris commented 1 year ago

@kahaaga would you be up for a video call as I still can't read many letters in one sitting and the reply is large here

kahaaga commented 1 year ago

@Datseris 13:30 CET, today?

Datseris commented 1 year ago

just saw this now; I'm free from 14:30 CET

kahaaga commented 1 year ago

just saw this now; I'm free from 14:30 CET

@Datseris I'm busy with other things from 14:30 CET and onwards today. I can either do a call after 20 CET tonight, or any time from 10-17 CET tomorrow. Do any of those alternatives work for you?

Datseris commented 1 year ago

Tomorrow is fine for me, I'll post a message here when I'm free, probably around 2-3pm!

kahaaga commented 1 year ago

Tomorrow is fine for me, I'll post a message here when I'm free, probably around 2-3pm!

Sounds good. Just tag me explicitly here when you're ready, so I'll get an e-mail notification.

kahaaga commented 1 year ago

@Datseris I'm available now. Just post the link when you're ready.

Datseris commented 1 year ago

@kahaaga me too. I don't have any pro zoom stuff. I can post a google meets link.

kahaaga commented 1 year ago

@Datseris I have the pro version. Here's a Zoom meeting:

https://uib.zoom.us/j/66488837829?pwd=c3FCUFhZdjFjN0V5SkZ4UWJSd0o5dz09

Datseris commented 1 year ago

@kahaaga shall we merge this in? Shall we merge in first #168 and then this? Transfer operator update can be done in a separate PR. I have about 3 hours of worktime per day and I want to finish as many tickboxes as possible and (maybe) get a stable Entropies release out before the end of the year. (but very low likelyhood that this will happen)

Datseris commented 1 year ago

A lol #168 is already merged. Okay. I will update this branch to master by changing the signatures for value histogram stuff and merge. Transfer Operator fixes shoujld be done in a different PR.

kahaaga commented 1 year ago

@kahaaga shall we merge this in? Shall we merge in first https://github.com/JuliaDynamics/Entropies.jl/pull/168 and then this? Transfer operator update can be done in a separate PR. I have about 3 hours of worktime per day and I want to finish as many tickboxes as possible and (maybe) get a stable Entropies release out before the end of the year. (but very low likelyhood that this will happen)

Yes, we should try to get a stable release out before the end of the year. I aim to do the same for CausalityTools (which I'm currently working on during my available work hours), so that we'll have it ready to prepare workshop material.

A lol https://github.com/JuliaDynamics/Entropies.jl/pull/168 is already merged. Okay. I will update this branch to master by changing the signatures for value histogram stuff and merge. Transfer Operator fixes shoujld be done in a different PR.

Yes, do that! I have started fixing the transfer operator in a separate branch, but it isn't pushed to remote yet.

Datseris commented 1 year ago

Test are passing here (for ValueHistogram). I am merging this in before my brain explodes due to the amount of PRs open with concurrent changes!

JuliaDynamics / ComplexityMeasures.jl

Encoding for binnings #177