Open dehann opened 5 years ago
Note that when using getSolverParams(fg).dbg=true
each of the major stages of the CSM clique inference subgraphs are stored to the getLogPath(fg)
destination -- including:
fg_build
, as copied from main fg objectfg_beforeupsolve
fg_afterupsolve
fg_beforedownsolve
fg_afterdownsolve
which shows the belief messages passed up or down in each case added as MsgPriors
to the clique subgraphs.
Is there a way to debug with a failed solve (where it errors and doesn't return)?
Hi @Affie , yeah so the best I have been able to do so far is as follows:
First thing is to find where the logs are going to land:
@show getLogPath(fg)
Then solve with all debug options:
getSolverParams(fg).dbg = true
getSolverParams(fg).async = true
tree, smt, hist = solveTree!(fg, verbose=true, recordcliqs=ls(fg));
Verbose is very useful, but only prints to the console so that you can get an idea of how far each CSM gets and in which functions the error might occur.
With this you will then get process 1 context back while the solver works in the background -- do not start a second solve while the first is still working in the background.
Once the failures occur, there are three things to do:
smt
on which CSM failed, and you should be able to see the error stack there as well (but not always),hist4 = fetch(smt[4])
.One more thing you can do is to limit the number of iterations for all CSM:
getSolverParams(fg).limititers = 13
which allows you to force early termination of CSM and thereby get the csmc object nearest the failure. It is actually possible to get even closer, one CSM at a time:
fg = generateCanonicaFG_Hexagonal()
@show getLogPath(fg)
tree = wipeBuildNewTree!(fg)
drawTree(tree, show=true)
getSolverParams(fg).dbg = true
getSolverParams(fg).async = true
getSolverParams(fg).limititers = 13
cliqtask = solveCliq!(fg, tree, :x2, verbose=true, recordcliq=true, async=true)
# wait for that task to finish
sleep(10)
getSolverParams(fg).limititers = 9
cliqtask = solveCliq!(fg, tree, :x0, verbose=true, recordcliq=true, async=true)
...
Actually there is one more you can do with hist::Dict
(automatically fetched for all CSM if async=false
), use printCliqHistorySequential
to see output like this:
using RoME
fg = generateCanonicalFG_Hexagonal()
tree, smt, hist = solveTree!(fg, recordcliqs=ls(fg));
printCliqHistorySequential(hist)
semicolon is recommended :-)
julia> printCliqHistorySequential(hist)
1.1 10:27:13.868 testCliqCanRecycle 0 | null P ---- C 2null 3null 4null |S|
1.2 10:27:14.303 testCliqCanIncremt 0 | null P ---- C 2null 3null 4null |S|
1.3 10:27:14.31 isCliqUpSolved 0 | null P ---- C 2null 3null 4null |S|
1.4 10:27:14.362 buildCliqSubgraph 0 | null P ---- C 2null 3null 4null |S|
1.5 10:27:15.617 isCliqNull 0 | null P ---- C 2null 3null 4null |S|
1.6 10:27:15.908 doesCliqNeeddownms 0 | null P ---- C 2null 3null 4null |S|
1.7 10:27:15.917 determineCliqNeedD 0 | null P ---- C 2null 3null 4null |S|
1.8 10:27:15.955 slowCliqIfChildren 0 | null P ---- C 2null 3null 4null |S|
2.1 10:27:15.987 testCliqCanRecycle 0 | null P 1null C 5null 6null |S| 3null 4null
2.2 10:27:15.991 testCliqCanIncremt 0 | null P 1null C 5null 6null |S| 3null 4null
2.3 10:27:15.993 isCliqUpSolved 0 | null P 1null C 5null 6null |S| 3null 4null
2.4 10:27:15.995 buildCliqSubgraph 0 | null P 1null C 5null 6null |S| 3null 4null
2.5 10:27:15.999 isCliqNull 0 | null P 1null C 5null 6null |S| 3null 4null
2.6 10:27:16.003 doesCliqNeeddownms 0 | null P 1null C 5null 6null |S| 3null 4null
2.7 10:27:16.007 checkIfCliqNullBlo 0 | null P 1null C 5null 6null |S| 3null 4null
2.8 10:27:16.086 blockCliqSiblingsP 0 | null P 1null C 5null 6null |S| 3null 4null
2.9 10:27:16.151 determineCliqNeedD 0 | null P 1null C 5null 6null |S| 3null 4null
2.10 10:27:16.155 slowCliqIfChildren 0 | null P 1null C 5null 6null |S| 3null 4null
3.1 10:27:16.159 testCliqCanRecycle 0 | null P 1null C ---- |S| 2null 4null
3.2 10:27:16.165 testCliqCanIncremt 0 | null P 1null C ---- |S| 2null 4null
3.3 10:27:16.168 isCliqUpSolved 0 | null P 1null C ---- |S| 2null 4null
3.4 10:27:16.17 buildCliqSubgraph 0 | null P 1null C ---- |S| 2null 4null
3.5 10:27:16.337 isCliqNull 0 | null P 1null C ---- |S| 2null 4null
3.6 10:27:16.342 doesCliqNeeddownms 0 | null P 1null C ---- |S| 2null 4null
3.7 10:27:16.347 checkIfCliqNullBlo 0 | null P 1null C ---- |S| 2null 4null
3.8 10:27:16.352 blockCliqSiblingsP 0 | null P 1null C ---- |S| 2null 4null
3.9 10:27:16.358 determineCliqNeedD 0 | null P 1null C ---- |S| 2null 4null
3.10 10:27:16.363 getBetterName7b 0 | null P 1null C ---- |S| 2null 4null
4.1 10:27:16.39 testCliqCanRecycle 0 | null P 1null C ---- |S| 2null 3null
4.2 10:27:16.392 testCliqCanIncremt 0 | null P 1null C ---- |S| 2null 3null
4.3 10:27:16.394 isCliqUpSolved 0 | null P 1null C ---- |S| 2null 3null
4.4 10:27:16.397 buildCliqSubgraph 0 | null P 1null C ---- |S| 2null 3null
4.5 10:27:16.449 isCliqNull 0 | null P 1null C ---- |S| 2null 3null
4.6 10:27:16.454 doesCliqNeeddownms 0 | null P 1null C ---- |S| 2null 3null
4.7 10:27:16.459 checkIfCliqNullBlo 0 | null P 1null C ---- |S| 2null 3null
4.8 10:27:16.465 blockCliqSiblingsP 0 | null P 1null C ---- |S| 2null 3null
4.9 10:27:16.47 determineCliqNeedD 0 | null P 1null C ---- |S| 2null 3null
4.10 10:27:16.476 getBetterName7b 0 | null P 1null C ---- |S| 2null 3null
5.1 10:27:16.481 testCliqCanRecycle 0 | null P 2null C ---- |S| 6null
5.2 10:27:16.483 testCliqCanIncremt 0 | null P 2null C ---- |S| 6null
5.3 10:27:16.486 isCliqUpSolved 0 | null P 2null C ---- |S| 6null
5.4 10:27:16.488 buildCliqSubgraph 0 | null P 2null C ---- |S| 6null
5.5 10:27:16.493 isCliqNull 0 | null P 2null C ---- |S| 6null
5.6 10:27:16.499 doesCliqNeeddownms 0 | null P 2null C ---- |S| 6null
5.7 10:27:16.504 checkIfCliqNullBlo 0 | null P 2null C ---- |S| 6null
5.8 10:27:16.51 blockCliqSiblingsP 0 | null P 2null C ---- |S| 6null
5.9 10:27:16.515 determineCliqNeedD 0 | null P 2null C ---- |S| 6null
5.10 10:27:16.52 getBetterName7b 0 | null P 2null C ---- |S| 6null
6.1 10:27:16.525 testCliqCanRecycle 0 | null P 2null C ---- |S| 5null
6.2 10:27:16.528 testCliqCanIncremt 0 | null P 2null C ---- |S| 5null
6.3 10:27:16.53 isCliqUpSolved 0 | null P 2null C ---- |S| 5null
6.4 10:27:16.532 buildCliqSubgraph 0 | null P 2null C ---- |S| 5null
6.5 10:27:16.537 isCliqNull 0 | null P 2null C ---- |S| 5null
6.6 10:27:16.543 doesCliqNeeddownms 0 | null P 2null C ---- |S| 5null
6.7 10:27:16.548 checkIfCliqNullBlo 0 | null P 2null C ---- |S| 5null
6.8 10:27:16.553 blockCliqSiblingsP 0 | null P 2null C ---- |S| 5null
6.9 10:27:16.559 determineCliqNeedD 0 | null P 2null C ---- |S| 5null
6.10 10:27:16.564 getBetterName7b 0 | null P 2null C ---- |S| 5null
3.11 10:27:16.57 attemptCliqInitUp 0 | null P 1null C ---- |S| 2null 4null
3.12 10:27:16.598 mustInitUpCliq 0 | null P 1null C ---- |S| 2null 4null
3.13 10:27:19.439 doCliqUpSolveIniti 0 | null P 1null C ---- |S| 2null 4null
4.11 10:27:23.505 attemptCliqInitUp 0 | null P 1null C ---- |S| 2null 3null
4.12 10:27:23.511 mustInitUpCliq 0 | null P 1null C ---- |S| 2null 3null
4.13 10:27:23.516 doCliqUpSolveIniti 0 | null P 1null C ---- |S| 2null 3null
5.11 10:27:29.764 attemptCliqInitUp 0 | null P 2null C ---- |S| 6null
5.12 10:27:29.769 mustInitUpCliq 0 | null P 2null C ---- |S| 6null
5.13 10:27:29.774 doCliqUpSolveIniti 0 | null P 2null C ---- |S| 6null
6.11 10:27:35.021 attemptCliqInitUp 0 | null P 2null C ---- |S| 5null
6.12 10:27:35.027 mustInitUpCliq 0 | null P 2null C ---- |S| 5null
6.13 10:27:35.032 doCliqUpSolveIniti 0 | null P 2null C ---- |S| 5null
3.14 10:27:40.464 rmUpLikeliSaveSubF 0 | upsolved P 1null C ---- |S| 2null 4null
3.15 10:27:40.523 finishCliqSolveChe 0 | upsolved P 1null C ---- |S| 2null 4null
4.14 10:27:40.841 rmUpLikeliSaveSubF 0 | upsolved P 1null C ---- |S| 2null 3upsolved
4.15 10:27:40.846 finishCliqSolveChe 0 | upsolved P 1null C ---- |S| 2null 3upsolved
5.14 10:27:40.887 rmUpLikeliSaveSubF 0 | upsolved P 2null C ---- |S| 6null
5.15 10:27:40.892 finishCliqSolveChe 0 | upsolved P 2null C ---- |S| 6null
6.14 10:27:40.899 rmUpLikeliSaveSubF 0 | upsolved P 2null C ---- |S| 5upsolved
6.15 10:27:40.904 finishCliqSolveChe 0 | upsolved P 2null C ---- |S| 5upsolved
3.16 10:27:40.955 determineCliqIfDow 0 | upsolved P 1null C ---- |S| 2null 4upsolved
2.11 10:27:41.187 isCliqNull 0 | null P 1null C 5upsolved 6upsolved |S| 3upsolved 4upsolved
2.12 10:27:41.191 doesCliqNeeddownms 0 | null P 1null C 5upsolved 6upsolved |S| 3upsolved 4upsolved
2.13 10:27:41.194 checkIfCliqNullBlo 0 | null P 1null C 5upsolved 6upsolved |S| 3upsolved 4upsolved
2.14 10:27:41.198 blockCliqSiblingsP 0 | null P 1null C 5upsolved 6upsolved |S| 3upsolved 4upsolved
2.15 10:27:41.202 determineCliqNeedD 0 | null P 1null C 5upsolved 6upsolved |S| 3upsolved 4upsolved
2.16 10:27:41.206 getBetterName7b 0 | null P 1null C 5upsolved 6upsolved |S| 3upsolved 4upsolved
4.16 10:27:41.21 determineCliqIfDow 0 | upsolved P 1null C ---- |S| 2null 3upsolved
5.16 10:27:41.215 determineCliqIfDow 0 | upsolved P 2null C ---- |S| 6upsolved
6.16 10:27:41.221 determineCliqIfDow 0 | upsolved P 2null C ---- |S| 5upsolved
2.17 10:27:41.311 attemptCliqInitUp 0 | null P 1null C 5upsolved 6upsolved |S| 3upsolved 4upsolved
2.18 10:27:41.327 mustInitUpCliq 0 | null P 1null C 5upsolved 6upsolved |S| 3upsolved 4upsolved
2.19 10:27:42.015 doCliqUpSolveIniti 0 | null P 1null C 5upsolved 6upsolved |S| 3upsolved 4upsolved
2.20 10:27:42.99 rmUpLikeliSaveSubF 0 | upsolved P 1null C 5upsolved 6upsolved |S| 3upsolved 4upsolved
2.21 10:27:43.061 finishCliqSolveChe 0 | upsolved P 1null C 5upsolved 6upsolved |S| 3upsolved 4upsolved
1.9 10:27:43.065 isCliqNull 0 | null P ---- C 2upsolved 3upsolved 4upsolved |S|
1.10 10:27:43.069 doesCliqNeeddownms 0 | null P ---- C 2upsolved 3upsolved 4upsolved |S|
1.11 10:27:43.074 determineCliqNeedD 0 | null P ---- C 2upsolved 3upsolved 4upsolved |S|
1.12 10:27:43.086 getBetterName7b 0 | null P ---- C 2upsolved 3upsolved 4upsolved |S|
2.22 10:27:43.091 determineCliqIfDow 0 | upsolved P 1null C 5upsolved 6upsolved |S| 3upsolved 4upsolved
1.13 10:27:43.193 attemptCliqInitUp 0 | null P ---- C 2upsolved 3upsolved 4upsolved |S|
1.14 10:27:43.201 mustInitUpCliq 0 | null P ---- C 2upsolved 3upsolved 4upsolved |S|
1.15 10:27:43.211 doCliqUpSolveIniti 0 | null P ---- C 2upsolved 3upsolved 4upsolved |S|
1.16 10:27:44.979 rmUpLikeliSaveSubF 0 | upsolved P ---- C 2upsolved 3upsolved 4upsolved |S|
1.17 10:27:44.983 finishCliqSolveChe 0 | upsolved P ---- C 2upsolved 3upsolved 4upsolved |S|
1.18 10:27:44.999 determineCliqIfDow 0 | upsolved P ---- C 2upsolved 3upsolved 4upsolved |S|
3.17 10:27:45.195 doCliqDownSolve 0 | upsolved P 1downsolved C ---- |S| 2upsolved 4upsolved
4.17 10:27:46.427 doCliqDownSolve 0 | upsolved P 1downsolved C ---- |S| 2upsolved 3downsolved
2.23 10:27:47.372 doCliqDownSolve 0 | upsolved P 1downsolved C 5upsolved 6upsolved |S| 3downsolved 4downsolved
1.19 10:27:48.323 exitStateMachine 0 | downsolved P ---- C 2downsolved 3downsolved 4downsolved |S|
3.18 10:27:48.35 exitStateMachine 0 | downsolved P 1downsolved C ---- |S| 2downsolved 4downsolved
5.17 10:27:48.355 doCliqDownSolve 0 | upsolved P 2downsolved C ---- |S| 6upsolved
6.17 10:27:49.298 doCliqDownSolve 0 | upsolved P 2downsolved C ---- |S| 5downsolved
4.18 10:27:50.247 exitStateMachine 0 | downsolved P 1downsolved C ---- |S| 2downsolved 3downsolved
2.24 10:27:50.251 exitStateMachine 0 | downsolved P 1downsolved C 5downsolved 6downsolved |S| 3downsolved 4downsolved
5.18 10:27:50.257 exitStateMachine 0 | downsolved P 2downsolved C ---- |S| 6downsolved
6.18 10:27:50.348 exitStateMachine 0 | downsolved P 2downsolved C ---- |S| 5downsolved
The most pressing developer tool here is to better manage how CSM terminates on error. The two options I have come up with are either (sure there are more options available):
see #758
reminder that the revamped getSolverParams(fg).useMsgLikelihoods = true
is default false at this time and will be made default true in future.
Another trick that doesn't come up too often but none the less: getSolverParams(fg).multiproc = false
-- is default true. Similarly, you can skip up or down stages by getSolverParams(fg).dodownsolve = false
. All the parameter options are:
getSolverParams(fg) |> typeof |> fieldnames
Also worth stating again here, the purpose of recordcliqs
is to allow repeat steps:
# either from hist4 = hist[4] or
hist4 = fetch(cliqtask4)
# the next function that would have been called at step 11
fnc11! = hist4[11][3]
csmc11_ = deepcopy(hist4[11][4])
# must reset the condition
getSolveCondition(csmc11_.cliq) = Condition()
# now redo that step (likely with debugger)
fnc12_ = fnc11!(csmc11_)
# note the csmc11_ object will now be changed after fnc11!
## The design intends
fnc13_ = fnc12_(csmc11_)
# ...
added a new tool limittercliqs::Vector{Pair{Symbol, Int}}
that allows early stopping of any specific CSMs per user desire, making it much easier to debug how CSMs work:
https://github.com/JuliaRobotics/IncrementalInference.jl/blob/1ad9b34579d9bb6fbbe8ea317ba8d4ff66b241e7/test/testBasicTreeInit.jl#L8
Snapshot example of hex init script: https://github.com/JuliaRobotics/Caesar.jl/blob/a486c0728139e485ded88fa7f725295add002a7a/examples/dev/devCascadeInitCSM.jl#L3
Commits 426e98f , 1dbe967 , 030d75b add a new swim lane type CSM printout:
julia> IIF.printCSMHistoryLogical(hists)
| x0 | x4 | l1 | x1 | x5 | x3
----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------
1 | 1 testCliqC null | 10 testCliqC null | 21 testCliqC null | 32 testCliqC null | 43 testCliqC null | 54 testCliqC null
2 | 2 testCliqC null | 11 testCliqC null | 22 testCliqC null | 33 testCliqC null | 44 testCliqC null | 55 testCliqC null
3 | 3 isCliqUpS null | 12 isCliqUpS null | 23 isCliqUpS null | 34 isCliqUpS null | 45 isCliqUpS null | 56 isCliqUpS null
4 | 4 buildCliq null | 13 buildCliq null | 24 buildCliq null | 35 buildCliq null | 46 buildCliq null | 57 buildCliq null
5 | 5 canCliqMa null | 14 canCliqMa null | 25 canCliqMa null | 36 canCliqMa null | 47 canCliqMa null | 58 canCliqMa null
6 | 6 blockUnti null | 15 blockUnti null | 26 blockUnti null | 37 blockUnti null | 48 blockUnti null | 59 blockUnti null
7 | 7 trafficRe null | 16 trafficRe null | 27 trafficRe null | 38 trafficRe null | 49 trafficRe null | 60 trafficRe null
8 | 8 determine null | 17 checkIfCl null | 28 checkIfCl null | 39 checkIfCl null | 50 checkIfCl null | 61 checkIfCl null
9 | 9 childCliq null | 18 blockCliq null | 29 blockCliq null | 40 blockCliq null | 51 blockCliq null | 62 blockCliq null
10 | 103 trafficRe null | 19 determine null | 30 determine null | 41 determine null | 52 determine null | 63 determine null
11 | 104 determine null | 20 childCliq null | 31 towardUpO null | 42 towardUpO null | 53 towardUpO null | 64 towardUpO null
12 | 105 towardUpO null | 85 trafficRe null | 65 attemptCl null | 68 attemptCl null | 71 attemptCl null | 74 attemptCl null
...
as of FSM v0.2.8 a new injectDelayBefore
as well as a watchdog timeout
features have been added to help with the debugging process. Note that watchdog timeout
is not really meant for mainstream use in CSM just yet, more to help development when CSMs don't terminate by themselves.
Just copying a recent example here for the record:
using Caesar, RoME
fg = generateCanonicalFG_Hexagonal(graphinit=false)
getSolverParams(fg).treeinit = true
getSolverParams(fg).graphinit = false
getSolverParams(fg).limititers = 100
getSolverParams(fg).drawtree = true
getSolverParams(fg).showtree = true
getSolverParams(fg).dbg = true
getSolverParams(fg).async = true
# limitcliqs = [:x0=>8;:x4=>12;:l1=>21;:x1=>21;:x5=>50;:x3=>50] # breaks
# limitcliqs = [:x0=>8;:x4=>13;:l1=>21;:x1=>21;:x5=>60;:x3=>60] # 50 # doesnt break, blocks
# injectDelayBefore=[2=>(canCliqMargRecycle_StateMachine=>5); ] # step 8
# injectDelayBefore=[5=>(canCliqMargRecycle_StateMachine=>5); ]
# injectDelayBefore=[6=>(towardUpOrDwnSolve_StateMachine=>10); ]
# injectDelayBefore = nothing
mkpath(getLogPath(fg))
verbosefid = open(joinLogPath(fg, "csmVerbose.log"),"w")
# verbosefid = stdout
tree, smt, hists = solveTree!(fg, recordcliqs=ls(fg), verbose=true, verbosefid=verbosefid, timeout=50 ) #, timeout=40 , injectDelayBefore=injectDelayBefore ) #, limititercliqs=limitcliqs);
flush(verbosefid)
close(verbosefid)
# for .async = true (because .drawTree=true)
smt[7] |> x->schedule(x, InterruptException(), error=true)
open(joinLogPath(fg, "csmLogicalReconstructMax.log"),"w") do io
IIF.reconstructCSMHistoryLogical(getLogPath(fg), fid=io)
end
# async case
fetchCliqHistoryAll!(smt, hists)
open(joinLogPath(fg, "csmSequ.log"),"w") do fid
printCliqHistorySequential(hists, nothing, fid)
end
open(joinLogPath(fg, "csmLogi.log"),"w") do fid
printCSMHistoryLogical(hists, fid)
end
# printCliqHistorySequential(hists)
# printCliqHistorySequential(hists, 1=>10)
# printCliqHistorySequential(hists, [1,4,6]=>11:15)
# printCliqHistorySequential(hists, [1=>9:16; 2=>20:34; 4=>29:34])
# printCliqHistorySequential(hists, [5=>12:21;6=>12:21])
printCSMHistoryLogical(hists)
# also see dbg logs at this path for more info
# @show getLogPath(fg)
using Images
csmAnimateSideBySide(tree, hists, encode=true, nvenc=true, show=true)
Enable clique history recording and provide smtasks, then retrieve with fetchCliqHistoryAll!
even if an exception is thrown.
smtasks = Task[]
solveTree!(fg; smtasks, verbose=true, timeout=10, recordcliqs=ls(fg));
hists = fetchCliqHistoryAll!(smtasks);
We can automate it further in the future.
Will the internal fetchCliqHistoryAll!
work in general or should we definitely call it after a solve?
See it is commented out:
https://github.com/JuliaRobotics/IncrementalInference.jl/blob/8bede364027b62eb1e3dba1a5b2285de583c6050/src/SolverAPI.jl#L533
Will the internal fetchCliqHistoryAll! work in general or should we definitely call it after a solve?
For now, it has to be called after the solve with smtasks
passed in by reference. I was thinking to still return it if dbg is on.
So it would not throw an error but @error
and return the normal tree, smtasks, hists
I don't know about the commented out call.
Everything logged with logCSM
as debug level can be displayed on the console by enabling debug printing on the clique, for example, enable CSM 2 with ENV["JULIA_DEBUG"] = :csm_2
Another usefull workflow for debugging CSM freezes is:
getSolverParams(fg).async = true
smtasks = Task[]
solveTree!(fg; smtasks)
Then if it hangs call:
IIF.throwIntExcToAllTasks(smtasks)
getSolverParams(fg).dbg=true
also saves the csmc
and statemachine
in task storage.
it is saved in:
smtasks[i].storage[:csmc]
smtasks[i].storage[:statemachine]
EDIT, some of these code snippets are out of date.
Console
solveTree!
will always generate content atgetLogPath(fg)
location.getSolverParams(fg).dbg
.recordcliqs=[:x1;:x2;...]
will save the CSM history to RAM.Animate CSM Video
Can also make concurrent video of CSM at work:
Also see
csmAnimate
's own documentation.EDIT
new
Caesar.writevideo
function directly engages withffmpeg
which is easier:https://github.com/JuliaRobotics/Caesar.jl/blob/b514d18cbb6cc3b5557f67617ed08918c4260a26/src/images/images.jl#L6-L50
Also see example of debug process in #754