dehann opened this issue 3 years ago
Separately, the nullhypo= answer to a question, which will be on a similar docs page. Just dropping it in here as the best ergonomic issue thread for now.

"In case of a nullhypo= being set, weird, =0.1, then 10% of the time that 'factor should not be there'...?"

nullhypo= in a stochastic sense means maximum entropy, or said differently -- the factor does not reduce entropy in the variable estimate under the null hypothesis. nullhypo= can 'increase' entropy at the sampling proposal stage, but must then reject any samples that add 'too much' entropy.

What does that mean... When a factor is used as a conditional likelihood for calculating proposal distributions and nullhypo is active, we add a large amount of entropy to the given sample (it must stay on-manifold). This gain in entropy is then 'rejected' during the product between the proposals coming from the other factors on the variable. The multiscale Gibbs operation selects the most relevant kernels from the incoming proposal densities, and it is very likely that the high-entropy kernels will be rejected.
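To make the "rejected in the product" part concrete, here is a minimal conceptual sketch (not the IncrementalInference.jl implementation; Distributions.jl is used purely for illustration): one proposal carries the extra nullhypo entropy, a second tight proposal comes from another factor, and the inflated kernels end up with almost no weight when the two are multiplied.

# conceptual sketch only -- not the IIF code path
using Distributions, Random
Random.seed!(42)
N = 200
proposalB = Normal(10.0, 0.5)                       # tight proposal from another factor
samplesA  = 10.0 .+ 0.5 .* randn(N)                 # nominal samples from the nullhypo factor
nullidx   = rand(N) .< 0.3                          # fraction of samples tagged as nullhypo
samplesA[nullidx] .+= 20.0 .* (rand(sum(nullidx)) .- 0.5)  # inject a large spread
w = pdf.(proposalB, samplesA)                       # weight A's kernels under proposal B
w ./= sum(w)
println("total weight carried by the high-entropy kernels: ", sum(w[nullidx]))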
Said another way, if a variable were not restricted (say one connected factor at nullhypo=0.5), then there needs to be enough entropy that 50% of the time that variable can be anywhere on the domain. However, that is not computationally possible either. So we do the next best thing: we adaptively add a bunch of entropy based on the current numerical values (50% of the time), which spreads the likely location of the variable out over a much larger area of the domain. If more factors pull the variable elsewhere, then the adaptive entropy will still spread around the current estimate. So the big boost in entropy is still somewhat local, but infinite support is impossible with the current understanding of sampling and Bayes tree techniques. Practical testing, and tweaking of how much noise to add (i.e. params.spreadNH), allows a pretty powerful "best alternative" solution to a pretty hard problem. It is likely that in most cases a user would not know the difference between a puritan max-entropy nullhypo= and our current rejection sampling approach.
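As a rough sketch of that "adaptively add entropy based on current numerical values" step (assumptions: 1-D values, and the added spread scaled off the current spread of the samples; the variable names are made up and are not the IIF internals):

using Random, Statistics
Random.seed!(0)
X = 10.0 .+ 0.3 .* randn(100)    # current sample values for the variable (1-D)
nullhypo = 0.5                   # fraction of samples treated as "factor not there"
spreadNH = 3.0                   # stand-in for params.spreadNH (assumed scaling, not the exact IIF formula)
# the spread is derived from the values themselves, so the entropy boost stays
# local to wherever the estimate currently lives instead of covering the whole domain
spread = spreadNH * (maximum(X) - minimum(X))
mask = rand(length(X)) .< nullhypo
X[mask] .+= spread .* (rand(sum(mask)) .- 0.5)   # uniform, zero-mean offsets

The exact scaling in IIF may differ, but the key property is that the spread tracks the current estimate rather than the full domain.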
It is possible that the current IIF code is not always adding the noise (entropy) correctly, or that the spreadNH mechanics can be improved, but I'm pretty confident about this as a good medium-term stand-in solution (barring any bugs, of course).
cc @lemauee, @GearsAD
Thanks @dehann for the writeup, that's pretty much what I had already figured out by performing convolutions using factors with the nullhypo set.
However, "add entropy" is still a somewhat spongy term for me. What distribution (uniform? broad Gaussian?) underlies this process? params.spreadNH should set some parameter of that distribution (connected to the distribution width for a uniform, or to the standard deviation for a Gaussian?). What parameter is this exactly?
Best, Leo
Hi @dehann ,
I came across having to describe the mechanics of the "adding entropy" for my thesis again (see previous post). Could you elaborate on that term a little or point me to the implementation? Grasping the mathematical model behind this would be really important for me :)
Best, Leo
Hi Leo, I can point you to the code where it is added: https://github.com/JuliaRobotics/IncrementalInference.jl/blob/4ac8ccda4f27d8834950ad6a1913851031932add/src/ApproxConv.jl#L228-L239 Sorry, I can't help you more than that. My understanding is that it only spreads what is currently there, so I don't think it is any particular distribution.
Thanks for the hint. If maniAddOps does not do anything strange, then by using rand() we should end up with a uniform distribution of width spreadDist?:
using Gadfly
# draw n offsets the same way: uniform over a window of width spreadDist, centered at zero
spreadDist = 1;
n = 1000;
x = Vector{Float64}(undef, n)
for k in 1:length(x)
    x[k] = spreadDist*(rand() - 0.5)
end
pl = plot(x=x, Geom.histogram(bincount=10))
The next question would be whether spreadDist is directly equal to the spreadNH solver parameter. From performing an example convolution it seems like this is the case:
julia> fg = initfg()
julia> getSolverParams(fg).spreadNH
3.0
using Caesar, RoMEPlotting, LinearAlgebra, Statistics, Gadfly

N = 100
fg = initfg()

# landmark that only receives information through the nullhypo factor
addVariable!(fg, :l0, Point2, N=N)

# landmark pinned down by a tight prior
trans = MvNormal([1.0,0.0], Diagonal([0.01;0.01]))
addVariable!(fg, :l1, Point2, N=N)
l1Prior = PriorPoint2(MvNormal([1,0], Diagonal([0.01;0.01])))
addFactor!(fg, [:l1], l1Prior)

# relative factor between l0 and l1 with a 50% null hypothesis
f1 = Point2Point2(trans)
addFactor!(fg, [:l0, :l1], f1, nullhypo=0.5)

# convolve the factor onto :l0 and plot the resulting belief
pts = approxConv(fg, :l0l1f1, :l0)
p = kde!(pts)
pll0 = plotKDE(p)

ensureAllInitialized!(fg)
pl = plotSLAM2DLandmarks(fg)
The particles of l1 get spread by the convolution, in a uniform way, over a circle with diameter 3.0 (plus the prior noise on l1 and the factor's noise) around the position of l0 implied by the factor.
Best, Leo
Hi @lemauee ,
So the most important aspect of nullhypo is that for some fraction of the particles calculated on a variable proposal (via samples as proxy), say on p(X|Y=Y) (i.e. evaluating the conditional p(X|Y) for given values Y), it is as if p(X|Y) never even existed. So if you say 10% of samples are "nullhypo" (and remember, given the Bayes tree approach we don't want to change the factor graph by deleting factors), we need to do the next best thing ... how do we make it as if the factor was never there?
Here in IIF we inject noise onto 10% of the existing samples in X, and then solve the remaining 90% according to the structure in p(X|Y). It is important that the 10% follows from the existing values in X, since the factor should have NO effect on those 10% of samples. However, we cannot just use the values in X as they are, because that would break the Markov assumption in sampling that each step in the sampling chain should only be correlated to one previous step, and not correlate back over multiple stages of the sampling.
The practical effect of this is that if, say, the true value x* is around 10, but the conditional p(X|Y=Y) is trying to force X toward say -100, then there needs to be some way (during this iteration of the Gibbs sampling) to relieve that "strain". By saying 10% nullhypo, it allows some chance that a few samples will be much closer to x*. As the Gibbs process continues, samples nearer x* will be selected and all the other 99% of samples predicted by p(X|Y=Y) and nullhypo will be rejected at that stage. Therefore the posterior on X might be a little biased due to our workaround approach, but that is the difference between, say, 10 vs 11, rather than 10 vs -100. The "win" in this compromise is that the structure of the Bayes tree does not have to be updated.
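A minimal sketch of that 10%/90% split, under the assumption of a 1-D toy "factor" that pulls X toward a badly conflicting value (all names here are hypothetical, not the IIF API):

# hypothetical toy example -- not the IIF API
using Random
Random.seed!(1)
N = 100
nullhypo = 0.1
spreadNH = 3.0
Y = randn(N)                        # given values for the conditioning variable
Xprev = 10.0 .+ 0.2 .* randn(N)     # pre-existing samples on X ("pre-X")
# the factor itself would propose X near Y - 100 (a wildly conflicting loop closure)
proposed = Y .- 100.0 .+ 0.2 .* randn(N)
# pick ~10% of indices to behave as if the factor never existed
nullmask = rand(N) .< nullhypo
# those samples start from the pre-X values plus a large spread, not from the factor
spread = spreadNH * (maximum(Xprev) - minimum(Xprev))
Xnew = copy(proposed)
Xnew[nullmask] .= Xprev[nullmask] .+ spread .* (rand(sum(nullmask)) .- 0.5)

During the subsequent Gibbs product those nullhypo samples are the ones that can survive near the true value, while most of the conflicting proposals get rejected.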
What spreadNH does is magnify the existing spread of the pre-X values, by adding random values that are that much bigger than the current spread (it makes us add noise relative to the scale being used). If spreadNH is too big, then the bias on the posterior will get a bit worse. If spreadNH is too small, then large discrepancies between the true value and the proposals from p(X|Y) will not be allowed to alleviate the "strain" as a nullhypo should. spreadNH=3 works well as a tradeoff, from trial and error in SLAM. The idea is to add this "noise" to the pre-X samples on-manifold, so we have to be careful how the exponential map is defined. Before AMP 41, RoME 244, we have to use maniOpsAdd as a callback -- we are busy changing all that to just use Manifolds.jl as the good standard.
On rand vs randn, it does not really matter, since this is a rejection sampling technique that is still localized up to spreadNH around the current values. rand is a little better in that there is no concentration within the spreadNH region -- but either should work fine. rand should have just a little less bias in the final posterior result.
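A quick, self-contained way to see that difference (the scale matching below, three sigma to the window half-width, is an assumption made only for this comparison): uniform offsets fill the spreadNH window evenly, while Gaussian offsets concentrate near the current value.

using Statistics
spread = 3.0
n = 100_000
u = spread .* (rand(n) .- 0.5)    # uniform offsets: flat over [-spread/2, spread/2]
g = (spread/6) .* randn(n)        # Gaussian offsets, +-3 sigma matched to the same window (assumption)
# fraction of offsets landing very close to the current value (within +-spread/20)
println("uniform: ", mean(abs.(u) .< spread/20), "  gaussian: ", mean(abs.(g) .< spread/20))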
Hope that helps, we still need to write all this up in our publications :-)
Best, Dehann
Hi @dehann ,
Thanks for the insight, getting the idea that this works on the existing values of X was really important for understanding the behavior. In my little example that could not be distinguished from my (wrong) assumption. So, to get a belief for a certain conditional before the Gibbs sampling step, in case there is a nullhypo on that conditional, it is the prior kernels from the variable that is being sampled to ("pre-X") that are spread for the nullhypo fraction, and not the kernels originating from the factor "X|Y"?
What still needs clarification for me from your post is the statement "makes us add noise relative to the scale being used". How does the relative spreadNH translate to an absolute spread? Or did I misunderstand something there?
I experienced first-hand what setting spreadNH too big does. The all-famous 150-pose example with 10% outliers added, and nullhypo set to 0.1, looks like this when spreadNH is set to the standard 3.0: With spreadNH set to 16 (the 16 coming from my wrong assumption of how the particles must be spread to reach every corner of my example), particles drift away by several orders of magnitude. The solver even crashed due to some infinite Number error:
But with spreadNH=3.0 the example looks wrong after the whole round: GTSAM with a robust cost function (that needed some manual tuning, I admit, though that maybe wasn't necessary in Caesar ;) ) solves it fine: Is there at least some advice on when to set spreadNH to lower and when to higher values, and by how much the standard 3.0 should be changed? Blind trial and error with Caesar can become a little time-consuming atm :) I suppose for my example setting spreadNH a little higher would make sense, because one can see a wrong "strain" (wrong loop closure) acting.
If it's of any help I can port the currently "read-from-matfile" example to a "1-script-julia-version". I don't want to see Caesar lose against parametric solving in such a simple scenario ;)
Best, Leo
""" I have read the data association page and I think it makes sense. One of the questions I had was on scope/approach - depending on where you want to go with it, it may make sense to first introduce the problem in general terms before describing how it is tackled within Caesar. """