gottacatchenall opened 1 year ago
Seems like it makes sense to use JS distance. The only issue I'm running into is when computing JS-div by definition as
$$JS(P,Q) = \frac{1}{2}KL(P \,\|\, M) + \frac{1}{2}KL(Q \,\|\, M)$$ where $M=\frac{1}{2}(P+Q)$
when using a `MixtureModel` from Distributions.jl for $M$. This works fine for `Normal` distributions, but for `MvNormal`s the methods within `kldivergence` fall back to sample-based expectation values (presumably because the divergence can't be computed analytically in general), so there is variance in JS measures on the same pair of distributions. With enough samples the variance goes down, of course, but there is going to be a trade-off in terms of evaluation speed. Around $10^5$ samples is relatively stable for 5 layers, but this could be variable depending on the input number of layers.
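To make the variance trade-off concrete, here is a minimal sketch of the same Monte Carlo JS estimate in Python with scipy (the thread itself uses Distributions.jl; the function name `js_divergence_mc` and the specific sample sizes are my own illustration, not the package's API). Repeated estimates at a small sample size scatter noticeably; at larger sample sizes they tighten up:

```python
import numpy as np
from scipy.stats import multivariate_normal

def js_divergence_mc(p, q, n_samples, rng):
    """Monte Carlo estimate (natural log) of JS(P, Q) between two frozen
    scipy multivariate normals, sampling from each component separately."""
    def log_m(x):
        # log density of the mixture M = (P + Q)/2, via logaddexp for stability
        return np.logaddexp(p.logpdf(x), q.logpdf(x)) - np.log(2.0)
    xs_p = p.rvs(size=n_samples, random_state=rng)
    xs_q = q.rvs(size=n_samples, random_state=rng)
    kl_pm = np.mean(p.logpdf(xs_p) - log_m(xs_p))  # E_P[log P - log M]
    kl_qm = np.mean(q.logpdf(xs_q) - log_m(xs_q))  # E_Q[log Q - log M]
    return 0.5 * (kl_pm + kl_qm)

rng = np.random.default_rng(0)
p = multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2))
q = multivariate_normal(mean=[1.0, 0.0], cov=np.eye(2))

# Spread of repeated estimates shrinks as the sample count grows
small = [js_divergence_mc(p, q, 1_000, rng) for _ in range(5)]
large = [js_divergence_mc(p, q, 100_000, rng) for _ in range(5)]
print(np.std(small), np.std(large))
```

The same pattern should hold for `kldivergence` on `MvNormal`s in Julia: the estimator is unbiased in the large-sample limit, but each evaluation pays for its stability in samples drawn.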
Also, I think it may be worth importing simulated annealing and other methods from Optim.jl instead of rewriting fit-diagnostic tools from scratch; I'm going to take a closer look at Optim in a bit.
I like the idea - this is what Fauxcurrence uses as well. One thing that may be useful is to use Jensen-Shannon instead, which is symmetrical and bounded (to 1 when using log base 2, to $\log 2$ when using the natural log). The square root of JS is also a distance (and is bounded to 1 when using log base 2, which is really nice in terms of giving a sense of the quality of the fit).
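The bounds above are easy to check for discrete distributions; scipy's `jensenshannon` returns exactly the square-root form discussed here (this is a Python illustration of the property, not the Julia code under discussion). Two distributions with disjoint supports hit the upper bound:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.5, 0.5])

# scipy returns the *square root* of the JS divergence (the JS distance).
# With base=2 the divergence is bounded by 1, so the distance is too;
# with the natural log the divergence is bounded by ln 2.
d_base2 = jensenshannon(p, q, base=2)  # disjoint supports -> maximal distance, 1.0
d_nat = jensenshannon(p, q)            # sqrt(ln 2) for disjoint supports
print(d_base2, d_nat)
```

With base 2, a fit score of 0 means identical distributions and 1 means no overlap at all, which is what makes it readable as a quality-of-fit measure.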
Thoughts?