JuliaRobotics / RoME.jl

Robot Motion Estimate: Tools, Variables, and Factors for SLAM in robotics; also see Caesar.jl.
MIT License
64 stars 13 forks source link

Running Rome in multiple julia instances freezes solving #418

Open lemauee opened 3 years ago

lemauee commented 3 years ago

Hi,

This is a issue very specific to experiments for my thesis/application:

As I want to perform a lot of individual solves for my thesis, I run multiple julia instances (each using multiple workers and threads) on a pretty beefy machine. This seems to be faster than running them sequentially, because parallelism can't be exploited everywhere. It apparently worked fine until the RoME v0.13 release (I think). Now the solver just "hangs up" in all but one instance (this happens at a different pose for every time I try).

This is the last thing that gets written to stdout (aka my logs ;)):

Solve Progress: approx max 1752, at iter 711     Time: 0:01:50[ Info: CSM-5 Clique 47 finished

Solve Progress: approx max 1752, at iter 716     Time: 0:01:51

I run the multiple julia instances each in their own screen session like

screen -S <sessionname>

And then fire up my my evaluation script that reads my dataset from a matfile and takes some other arguments (command gets auto-generated by my matlab frontend)

julia -t auto -p 8 --project=Masterarbeit/svn/julia -J ~/.julia/sysimage_RoME.so -- Masterarbeit/svn/julia/mmiSAM/evaluation/solve2DIncremental.jl --path tmp --file sigmaTrajTf-0-001-0-000-0-005_sigmaLmBearingAndRanging-0-004-0-001_wrongRatio-1-000_minDist-0-200_maxDist-2-000.mat --trajKeys tf --lmKeys bearingAndRanging --startPoseIdx 1 --endPoseIdx 150 --startPoseVal "[7.15793412168699;3.38794983135837;-1.36840501375598]" --plotSaveFinal 1 --plotSaveIter 1 --nRuns 1 --suffix variables --useMsgLikelihoods 0 --nullhypo 0.000000 --tukey 15.000000 --nKernels 100 --spreadNH 3.000000 --inflation 3.000000

I will resort to running things sequentally (as any online application would), but especcialy for tuning out parameters like spreadNH or inflation running things in parallel was very useful, as I don't have access to an infinite amount of machines ;) As multiprocess/multithread performance increased in the last releases, I suspect that something (maybe not even directly RoME-related, but in Julia general) gets in each others way, some lock does not get released.

Best, Leo

dehann commented 3 years ago

Hi @lemauee ,

Oops that is odd. This might be something upstream. We'd have to look a bit more at what's happening. In the mean time, perhaps we can help compute results on local servers. Do you perhaps want to send tar.gz files of graphs you want solved and I can run on compute power here. Will return the result in new tar.gz. Perhaps in the process we start getting the same error here. I'm quite surprised a multi-Julia issue might be happening. If not there then something in Distributed.jl perhaps. Either way, we'd likely have to fix locally in the short term.

Sure you already using as such, but just in case: You can generate with fg = initfg(); getSolverParams(fg).graphinit=false and then the generation of graphs should be pretty quick. Solving can then use either graphinit (might not follow exact sequence) or treeinit (our desired but still experimental) method at later solve.

Lastly, I just merged (what I expect) the fix for your issue #378. Will tag IIF v0.21.3 in a few hours.

cc, might be worth @jim-hill-r (hi!) to look out for something like this multiple Julia's issue too.

Best, Dehann

lemauee commented 3 years ago

Hi @dehann,

I don't know if sending you the tar.gz would help to reproduce the issue as expected, as I'm still doing a solve for every pose (I know that this is slow, but consistency to the online-approach we use in classical parametric fgs was important to us).

However, I'd happily send you the Julia-Project needed to run my stuff, the matfile and the commands to invoke the solve. Otherwise I could also set up the stuff through ssh on your server and then see if i can reproduce it there.

Best, Leo

lemauee commented 3 years ago

The same hangup just happened when running only one julia instance, so this might not even be multiinstance-related (but the test yesterday suggested so...). Also, it occured on one of my private and not the universitys pcs. I recently did a fresh ubuntu install there, so nothing strange in the environment should hinder things there. Maybe its also worth to give it a try without a precompiled sysimage as I once had problems with RoMEPlotting as seen in JuliaRobotics/RoMEPlotting.jl#140 as you know.

Best, Leo

lemauee commented 3 years ago

Whats also notable is that there is absolutely nothing executed (no computational load) present when the hangup occurs (no CPU load in top or something similar can be seen).

Affie commented 3 years ago

Just a sanity check question, did you update the precompiled sysimage with the latest tags?

Other than that, do you know what clique is stalling? One way to find out is to draw the tree (drawtree=true, showtree=true) and see at what clique it is waiting on. Then you can try turning on debug messages for that clique CSM, say for 2: ENV["JULIA_DEBUG"] = :csm_2 And post results here please.

lemauee commented 3 years ago

I just compiled the sysimage this morning using RoME v0.13.0 and IIF v0.21.2 . The problem is that the clique it stopped on was another one everytime I think, but still need to confirm that. Some examples even ran through that just differed by the amount of noise or even just the setting of the spreadNH parameter and still "made it differently far". I just started a run without the compiled sysimage to roule out that source of failure. To draw the tree I'll need a machine with x-server to start some pdf viewer, right?

Affie commented 3 years ago

To draw the tree I'll need a machine with x-server to start some pdf viewer, right?

You should be able to set drawtree=true and then open the tree over SSH, I haven't done it before, so just guessing.

Edit: The tree should be at the logger file location or /tmp/caesar

Affie commented 3 years ago

You could also try setting a timeout and then seeing if there is some more information on where it timed out, eg: solveTree!(fg, timeout=70)

lemauee commented 3 years ago

timeout is in seconds I suppose? Then I should give it half an hour (1800s) or something for my example to be safe I'd say ;) I will start an example with and without sysimage inclding drawtree, showtree and timeout now and report back.

Affie commented 3 years ago

Seconds, yes. It is per step for every clique individually, so should be fine lower.

lemauee commented 3 years ago

So lets make it 5 minutes, 300s to be save not to kill something prematurely

Affie commented 3 years ago

There are a few tools that we use to debug the CSM, you can also try this one: https://github.com/JuliaRobotics/IncrementalInference.jl/issues/443#issuecomment-699843926

lemauee commented 3 years ago

Just got an error when trying to run with drawtree, showtree and timeout:

ERROR: LoadError: MethodError: no method matching solveTree!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree; multithread=true, drawtree=true, showtree=true, timeout=300)
Closest candidates are:
  solveTree!(::AbstractDFG, ::AbstractBayesTree; timeout, storeOld, verbose, verbosefid, delaycliqs, recordcliqs, limititercliqs, injectDelayBefore, skipcliqids, eliminationOrder, variableOrder, eliminationConstraints, variableConstraints, smtasks, dotreedraw, runtaskmonitor, algorithm, multithread) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:258 got unsupported keyword arguments "drawtree", "showtree"
  solveTree!(::AbstractDFG) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:258 got unsupported keyword arguments "multithread", "drawtree", "showtree", "timeout"
Stacktrace:
 [1] kwerr(::NamedTuple{(:multithread, :drawtree, :showtree, :timeout),Tuple{Bool,Bool,Bool,Int64}}, ::Function, ::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree) at ./error.jl:157
Affie commented 3 years ago

Sorry should have been clearer:

getSolverParams(fg).drawtree = true
getSolverParams(fg).showtree = true
lemauee commented 3 years ago

There are a few tools that we use to debug the CSM, you can also try this one: JuliaRobotics/IncrementalInference.jl#443 (comment)

This will be difficult if julia "hangs up" right? I dont now how to do the fetching step after the repl is frozen (i would even need to find a way to run my script in the repl with command line arguments first).

Affie commented 3 years ago

If the timeout is triggered it should stop everything and continue with the next lines.

lemauee commented 3 years ago

How hard are the performance hits of enabling this? Is there a good way to write this debugging info to file to include it into my normal workflow? Or how to trigger this best on the timeout kicking in?

Affie commented 3 years ago

I don't think you want it in normal operation. You can remove the recordcliqs=ls(fg) or just record a few if you know where the problem is.

Affie commented 3 years ago

You can add this without a performance hit: https://github.com/JuliaRobotics/IncrementalInference.jl/issues/443#issuecomment-785895964

lemauee commented 3 years ago

Opening the bt.pdf (using evince) to see the solve progress currently fails on my machine:

Der Dateityp Graphviz-DOT-Graph (text/vnd.graphviz) wird nicht unterstützt

Seems like its no pdf but a dot file.

lemauee commented 3 years ago

The dot file using the dot viewer seems to be working though :)

lemauee commented 3 years ago

Also updates fine through sshfs

Affie commented 3 years ago

You need graphviz and xdot, https://juliarobotics.org/Caesar.jl/latest/installation_environment/#Local-Dependencies

lemauee commented 3 years ago

Its clearly reproducible when running only one julia instance, I suspect using more workers makes this appear more often. Tree: Bildschirmfoto von 2021-02-25 15-46-11

cliq4_stacktrace:

InterruptException:
Stacktrace:
 [1] try_yieldto(::typeof(Base.ensure_rescheduled)) at ./task.jl:656
 [2] wait at ./task.jl:713 [inlined]
 [3] wait(::Base.GenericCondition{Base.Threads.SpinLock}) at ./condition.jl:106
 [4] _wait(::Task) at ./task.jl:238
 [5] sync_end(::Channel{Any}) at ./task.jl:294
 [6] macro expansion at ./task.jl:333 [inlined]
 [7] solveCliqDownFrontalProducts!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::IncrementalInference.TreeClique, ::SolverParams, ::Base.CoreLogging.SimpleLogger; MCIters::Int64) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:491
 [8] solveCliqDownFrontalProducts! at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:469 [inlined]
 [9] solveDown_StateMachine(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:654
 [10] (::StateMachine{CliqStateMachineContainer})(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}, ::Int64; pollinterval::Float64, breakafter::Function, verbose::Bool, verbosefid::Base.TTY, verboseXtra::IncrementalInference.CliqStatus, iterlimit::Int64, injectDelayBefore::Nothing, recordhistory::Bool, housekeeping_cb::IncrementalInference.var"#382#384"{IncrementalInference.TreeClique}) at /home/lemau/.julia/packages/FunctionalStateMachine/2JZFG/src/StateMachine.jl:94
 [11] initStartCliqStateMachine!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::IncrementalInference.TreeClique, ::Int64; oldcliqdata::BayesTreeNodeData, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, show::Bool, incremental::Bool, limititers::Int64, upsolve::Bool, downsolve::Bool, recordhistory::Bool, delay::Bool, logger::Base.CoreLogging.SimpleLogger, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:63
 [12] tryCliqStateMachineSolve!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::Int64, ::Int64; oldtree::MetaBayesTree, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, limititers::Int64, downsolve::Bool, incremental::Bool, delaycliqs::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:110
 [13] (::IncrementalInference.var"#438#441"{MetaBayesTree,Bool,Bool,Base.TTY,Int64,Bool,Bool,Array{Symbol,1},Array{Symbol,1},Symbol,LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree,Int64,ProgressMeter.ProgressUnknown,Int64})() at ./threadingconstructs.jl:169

cliq4_csm.txt is empty

cliq14_stacktrace:

InterruptException:
Stacktrace:
 [1] try_yieldto(::typeof(Base.ensure_rescheduled)) at ./task.jl:656
 [2] wait at ./task.jl:713 [inlined]
 [3] wait(::Base.GenericCondition{ReentrantLock}) at ./condition.jl:106
 [4] take_unbuffered(::Channel{LikelihoodMessage}) at ./channels.jl:405
 [5] take!(::Channel{LikelihoodMessage}) at ./channels.jl:381
 [6] takeBeliefMessageDown!(::MetaBayesTree, ::LightGraphs.SimpleGraphs.SimpleEdge{Int64}) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/TreeMessageAccessors.jl:115
 [7] waitForDown_StateMachine(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:467
 [8] (::StateMachine{CliqStateMachineContainer})(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}, ::Int64; pollinterval::Float64, breakafter::Function, verbose::Bool, verbosefid::Base.TTY, verboseXtra::IncrementalInference.CliqStatus, iterlimit::Int64, injectDelayBefore::Nothing, recordhistory::Bool, housekeeping_cb::IncrementalInference.var"#382#384"{IncrementalInference.TreeClique}) at /home/lemau/.julia/packages/FunctionalStateMachine/2JZFG/src/StateMachine.jl:94
 [9] initStartCliqStateMachine!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::IncrementalInference.TreeClique, ::Int64; oldcliqdata::BayesTreeNodeData, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, show::Bool, incremental::Bool, limititers::Int64, upsolve::Bool, downsolve::Bool, recordhistory::Bool, delay::Bool, logger::Base.CoreLogging.SimpleLogger, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:63
 [10] tryCliqStateMachineSolve!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::Int64, ::Int64; oldtree::MetaBayesTree, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, limititers::Int64, downsolve::Bool, incremental::Bool, delaycliqs::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:110
 [11] (::IncrementalInference.var"#438#441"{MetaBayesTree,Bool,Bool,Base.TTY,Int64,Bool,Bool,Array{Symbol,1},Array{Symbol,1},Symbol,LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree,Int64,ProgressMeter.ProgressUnknown,Int64})() at ./threadingconstructs.jl:169

cliq14_csm.txt is empty

cliq29_stacktrace:

InterruptException:
Stacktrace:
 [1] try_yieldto(::typeof(Base.ensure_rescheduled)) at ./task.jl:656
 [2] wait at ./task.jl:713 [inlined]
 [3] wait(::Base.GenericCondition{ReentrantLock}) at ./condition.jl:106
 [4] take_unbuffered(::Channel{LikelihoodMessage}) at ./channels.jl:405
 [5] take!(::Channel{LikelihoodMessage}) at ./channels.jl:381
 [6] takeBeliefMessageDown!(::MetaBayesTree, ::LightGraphs.SimpleGraphs.SimpleEdge{Int64}) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/TreeMessageAccessors.jl:115
 [7] waitForDown_StateMachine(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:467
 [8] (::StateMachine{CliqStateMachineContainer})(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}, ::Int64; pollinterval::Float64, breakafter::Function, verbose::Bool, verbosefid::Base.TTY, verboseXtra::IncrementalInference.CliqStatus, iterlimit::Int64, injectDelayBefore::Nothing, recordhistory::Bool, housekeeping_cb::IncrementalInference.var"#382#384"{IncrementalInference.TreeClique}) at /home/lemau/.julia/packages/FunctionalStateMachine/2JZFG/src/StateMachine.jl:94
 [9] initStartCliqStateMachine!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::IncrementalInference.TreeClique, ::Int64; oldcliqdata::BayesTreeNodeData, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, show::Bool, incremental::Bool, limititers::Int64, upsolve::Bool, downsolve::Bool, recordhistory::Bool, delay::Bool, logger::Base.CoreLogging.SimpleLogger, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:63
 [10] tryCliqStateMachineSolve!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::Int64, ::Int64; oldtree::MetaBayesTree, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, limititers::Int64, downsolve::Bool, incremental::Bool, delaycliqs::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:110
 [11] (::IncrementalInference.var"#438#441"{MetaBayesTree,Bool,Bool,Base.TTY,Int64,Bool,Bool,Array{Symbol,1},Array{Symbol,1},Symbol,LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree,Int64,ProgressMeter.ProgressUnknown,Int64})() at ./threadingconstructs.jl:169

cliq29_csm.txt is empty

last lines of stdout:

f3b00c5b/etc/fonts/conf.d/80-delicious.conf", line 6: invalid attribute 'version'
Fontconfig warning: "/home/lemau/.julia/artifacts/69ab5e1318fa87cac480350ccc9faffff3b00c5b/etc/fonts/conf.d/90-synthetic.conf", line 4: unknown element "its:rules"
Fontconfig warning: "/home/lemau/.julia/artifacts/69ab5e1318fa87cac480350ccc9faffff3b00c5b/etc/fonts/conf.d/90-synthetic.conf", line 5: unknown element "its:translateRule"
Fontconfig error: "/home/lemau/.julia/artifacts/69ab5e1318fa87cac480350ccc9faffff3b00c5b/etc/fonts/conf.d/90-synthetic.conf", line 5: invalid attribute 'translate'
Fontconfig error: "/home/lemau/.julia/artifacts/69ab5e1318fa87cac480350ccc9faffff3b00c5b/etc/fonts/conf.d/90-synthetic.conf", line 5: invalid attribute 'selector'
Fontconfig error: "/home/lemau/.julia/artifacts/69ab5e1318fa87cac480350ccc9faffff3b00c5b/etc/fonts/conf.d/90-synthetic.conf", line 6: invalid attribute 'xmlns:its'
Fontconfig error: "/home/lemau/.julia/artifacts/69ab5e1318fa87cac480350ccc9faffff3b00c5b/etc/fonts/conf.d/90-synthetic.conf", line 6: invalid attribute 'version'
Fontconfig error: Cannot load config file from /home/lemau/.julia/artifacts/69ab5e1318fa87cac480350ccc9faffff3b00c5b/etc/fonts/fonts.conf
┌ Error: Task 14 failed, sending error to all cliques
└ @ IncrementalInference ~/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:183
[ Info: All cliques should have exited
┌ Error: Task 29 failed, sending error to all cliques
└ @ IncrementalInference ~/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:183
[ Info: All cliques should have exited

InterruptException:
Stacktrace:
 [1] try_yieldto(::typeof(Base.ensure_rescheduled)) at ./task.jl:656
 [2] wait at ./task.jl:713 [inlined]
 [3] wait(::Base.GenericCondition{Base.Threads.SpinLock}) at ./condition.jl:106
 [4] _wait(::Task) at ./task.jl:238
 [5] sync_end(::Channel{Any}) at ./task.jl:294
 [6] macro expansion at ./task.jl:333 [inlined]
 [7] solveCliqDownFrontalProducts!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::IncrementalInference.TreeClique, ::SolverParams, ::Base.CoreLogging.SimpleLogger; MCIters::Int64) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:491
 [8] solveCliqDownFrontalProducts! at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:469 [inlined]
 [9] solveDown_StateMachine(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:654
 [10] (::StateMachine{CliqStateMachineContainer})(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}, ::Int64; pollinterval::Float64, breakafter::Function, verbose::Bool, verbosefid::Base.TTY, verboseXtra::IncrementalInference.CliqStatus, iterlimit::Int64, injectDelayBefore::Nothing, recordhistory::Bool, housekeeping_cb::IncrementalInference.var"#382#384"{IncrementalInference.TreeClique}) at /home/lemau/.julia/packages/FunctionalStateMachine/2JZFG/src/StateMachine.jl:94
 [11] initStartCliqStateMachine!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::IncrementalInference.TreeClique, ::Int64; oldcliqdata::BayesTreeNodeData, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, show::Bool, incremental::Bool, limititers::Int64, upsolve::Bool, downsolve::Bool, recordhistory::Bool, delay::Bool, logger::Base.CoreLogging.SimpleLogger, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:63
 [12] tryCliqStateMachineSolve!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::Int64, ::Int64; oldtree::MetaBayesTree, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, limititers::Int64, downsolve::Bool, incremental::Bool, delaycliqs::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:110
 [13] (::IncrementalInference.var"#438#441"{MetaBayesTree,Bool,Bool,Base.TTY,Int64,Bool,Bool,Array{Symbol,1},Array{Symbol,1},Symbol,LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree,Int64,ProgressMeter.ProgressUnknown,Int64})() at ./threadingconstructs.jl:169┌ Warning: printCliqHistorySummary -- No CSM history found.
└ @ IncrementalInference ~/.julia/packages/IncrementalInference/nmcd8/src/TreeDe
bugTools.jl:211
ERROR: LoadError: TaskFailedException:
InterruptException:
Stacktrace:
 [1] try_yieldto(::typeof(Base.ensure_rescheduled)) at ./task.jl:656
 [2] wait at ./task.jl:713 [inlined]
 [3] wait(::Base.GenericCondition{Base.Threads.SpinLock}) at ./condition.jl:106
 [4] _wait(::Task) at ./task.jl:238
 [5] sync_end(::Channel{Any}) at ./task.jl:294
 [6] macro expansion at ./task.jl:333 [inlined]
 [7] solveCliqDownFrontalProducts!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::IncrementalInference.TreeClique, ::SolverParams, ::Base.CoreLogging.SimpleLogger; MCIters::Int64) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:491
 [8] solveCliqDownFrontalProducts! at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:469 [inlined]
 [9] solveDown_StateMachine(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:654
 [10] (::StateMachine{CliqStateMachineContainer})(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}, ::Int64; pollinterval::Float64, breakafter::Function, verbose::Bool, verbosefid::Base.TTY, verboseXtra::IncrementalInference.CliqStatus, iterlimit::Int64, injectDelayBefore::Nothing, recordhistory::Bool, housekeeping_cb::IncrementalInference.var"#382#384"{IncrementalInference.TreeClique}) at /home/lemau/.julia/packages/FunctionalStateMachine/2JZFG/src/StateMachine.jl:94
 [11] initStartCliqStateMachine!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::IncrementalInference.TreeClique, ::Int64; oldcliqdata::BayesTreeNodeData, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, show::Bool, incremental::Bool, limititers::Int64, upsolve::Bool, downsolve::Bool, recordhistory::Bool, delay::Bool, logger::Base.CoreLogging.SimpleLogger, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:63
 [12] tryCliqStateMachineSolve!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::Int64, ::Int64; oldtree::MetaBayesTree, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, limititers::Int64, downsolve::Bool, incremental::Bool, delaycliqs::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:110
 [13] (::IncrementalInference.var"#438#441"{MetaBayesTree,Bool,Bool,Base.TTY,Int64,Bool,Bool,Array{Symbol,1},Array{Symbol,1},Symbol,LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree,Int64,ProgressMeter.ProgressUnknown,Int64})() at ./threadingconstructs.jl:169

...and 2 more exception(s).

Stacktrace:
 [1] sync_end(::Channel{Any}) at ./task.jl:314
 [2] macro expansion at ./task.jl:333 [inlined]
 [3] taskSolveTree!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::Int64; oldtree::MetaBayesTree, drawtree::Bool, verbose::Bool, verbosefid::Base.TTY, limititers::Int64, limititercliqs::Array{Pair{Symbol,Int64},1}, downsolve::Bool, incremental::Bool, multithread::Bool, skipcliqids::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, delaycliqs::Array{Symbol,1}, smtasks::Array{Task,1}, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:49
 [4] solveTree!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree; timeout::Int64, storeOld::Bool, verbose::Bool, verbosefid::Base.TTY, delaycliqs::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, limititercliqs::Array{Pair{Symbol,Int64},1}, injectDelayBefore::Nothing, skipcliqids::Array{Symbol,1}, eliminationOrder::Nothing, variableOrder::Nothing, eliminationConstraints::Array{Symbol,1}, variableConstraints::Nothing, smtasks::Array{Task,1}, dotreedraw::Array{Int64,1}, runtaskmonitor::Bool, algorithm::Symbol, multithread::Bool) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:371
 [5] macro expansion at /home/lemau/Masterarbeit/svn/julia/mmiSAM/evaluation/solve2DIncremental.jl:163 [inlined]
 [6] macro expansion at ./timing.jl:233 [inlined]
 [7] top-level scope at /home/lemau/Masterarbeit/svn/julia/mmiSAM/evaluation/solve2DIncremental.jl:159
 [8] include(::Function, ::Module, ::String) at ./Base.jl:380
 [9] include(::Module, ::String) at ./Base.jl:368
 [10] exec_options(::Base.JLOptions) at ./client.jl:296
 [11] _start() at ./client.jl:506
in expression starting at /home/lemau/Masterarbeit/svn/julia/mmiSAM/evaluation/solve2DIncremental.jl:128
lemau@joule:~$ 

I will run it again and see if it fails at a different spot.

Affie commented 3 years ago

Can you upload the saved factor graph for the above example and I'll see if I can reproduce it.

Affie commented 3 years ago

It is clique 4 that is timing out on the down solve. We haven't seen something like this before.

lemauee commented 3 years ago

Unfortunately as I ran this as a script and not from the REPL (as I would have to look into how to supply all the command line arguments to my script in the REPL) I only have the "pose-iteration" before the failure. I can inform myself (or do you know an easy way how to do this) and do this in a later run and save it out, but I'd wait for the repetition I run now (should be finished in 1-2 hours if it breaks at the same spot) to see if it breaks at the same spot.

Affie commented 3 years ago

Another thing you could try is to turn on debug, I think it doesn't work well with multithread=true yet, so perhaps switch it off.

getSolverParams(fg).dbg=true

It's also possible that multithread=true is causing your problem.

Affie commented 3 years ago

Unfortunately as I ran this as a script and not from the REPL (as I would have to look into how to supply all the command line arguments to my script in the REPL) I only have the "pose-iteration" before the failure. I can inform myself (or do you know an easy way how to do this) and do this in a later run and save it out, but I'd wait for the repetition I run now (should be finished in 1-2 hours if it breaks at the same spot) to see if it breaks at the same spot.

Perhaps save the fg before every solve in the script (if I'm understanding you correctly?)

lemauee commented 3 years ago

The second run already got past pose 50 it reached in the first one, so my suspicion that this is not related to a particular graph seems true.

Not using my sysimage also did not help, timeout/hangup also occur there.

lemauee commented 3 years ago

Unfortunately as I ran this as a script and not from the REPL (as I would have to look into how to supply all the command line arguments to my script in the REPL) I only have the "pose-iteration" before the failure. I can inform myself (or do you know an easy way how to do this) and do this in a later run and save it out, but I'd wait for the repetition I run now (should be finished in 1-2 hours if it breaks at the same spot) to see if it breaks at the same spot.

Perhaps save the fg before every solve in the script (if I'm understanding you correctly?)

Why didnt this obvious solution occur to me ;) Saving the fg before every solve would be possible, but since the graph itself doesn't seem to be causing the problem, I'll try other things instead.

lemauee commented 3 years ago

Another thing you could try is to turn on debug, I think it doesn't work well with multithread=true yet, so perhaps switch it off.

getSolverParams(fg).dbg=true

It's also possible that multithread=true is causing your problem.

Switching multitheread off and maybe even running it on one worker are my next best ideas to track this down.

Affie commented 3 years ago

In that case perhaps getSolverParams(fg).dbg=true will work, the subgraph of every clique is saved before the solve. If it fails you can re-run the subgraph and see if it's repeatable. It will also save the last state of the CSM. I'll have to find the code to extract it (can't remember).

EDIT: With debug on, for failed task i the last state can be extracted with smtasks[i].storage[:csmc] I think you will have to do it locally or start julia in the remote to have access to the REPL for debugging and saving.

Affie commented 3 years ago

Also perhaps you can rebuild the 50 pose fg that failed and share it. We can queue a few solves and see if all of them pass.

lemauee commented 3 years ago

As the second run already got past pose 50, I also have the graph for this ;)

00050_fg.tar.gz

lemauee commented 3 years ago

In that case perhaps getSolverParams(fg).dbg=true will work, the subgraph of every clique is saved before the solve. If it fails you can re-run the subgraph and see if it's repeatable. It will also save the last state of the CSM. I'll have to find the code to extract it (can't remember).

EDIT: With debug on, for failed task i the last state can be extracted with smtasks[i].storage[:csmc] I think you will have to do it locally or start julia in the remote to have access to the REPL for debugging and saving.

Is it possible to save this to file or should I start to search how I can pass command line Arguments to a script in the REPL?

Affie commented 3 years ago

The subgraph will be stored in a file per clique. (default somewhere in /tmp/caesar)

The task storage files will have to be debugged interactively. You can try and save it with https://docs.julialang.org/en/v1/stdlib/Serialization/ but i don't know how portable it will be.

lemauee commented 3 years ago

I'll read into passing Command line arguments in the REPL then.

The 'multithread off' and 'completely on one core' examples are running now, I'll see how that goes. But we really have to look out for 'lucky shots': one solve just failed at pose 139/150 (running on 8 not 16 workers because of different hardware underneath), this easily could have finished without something going wrong out of pure luck ;) Maybe using a lot of workers makes this fail earlier?

dehann commented 3 years ago

This is likely a result of a rare bug in differential factors and msg prior selectionand therefore I suggest using instead getSolverParams(fg).useMsgLikelihoods=false for the time being. I don't think this is related to multithreading as speculated earlier, however, tests to confirm are still ongoing. See here for more details: https://github.com/JuliaRobotics/IncrementalInference.jl/issues/1136#issuecomment-786326876

this testing was done using the 00139_fg.tar.gz file


When testing with 00138_fg.tar.gz a more serious error occurred in this case .useMsgLikelihoods=true (**note this did not happen on .useMsgLikelihoods=false although singular tests not good enough to be certain this switch caused the segfault):

    nested task error: 
signal (11): Segmentation fault
in expression starting at REPL[11]:1
uv_fileno at /workspace/srcdir/libuv/src/unix/core.c:764
_fd at ./stream.jl:1124 [inlined]
show at ./stream.jl:263
unknown function (ip: 0x7fa37c971ec5)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2238 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2420
_show_default at ./show.jl:412
show_default at ./show.jl:395 [inlined]
show at ./show.jl:390
unknown function (ip: 0x7fa37c9717f5)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2238 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2420
showerror at ./errorshow.jl:272
#showerror#762 at ./errorshow.jl:88
showerror##kw at ./errorshow.jl:87
unknown function (ip: 0x7fa37c96f746)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2238 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2420
showerror at ./task.jl:25
showerror at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:57 [inlined]
#showerror#762 at ./errorshow.jl:88
showerror##kw at ./errorshow.jl:87
unknown function (ip: 0x7fa37c969b76)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2238 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2420
show_exception_stack at ./errorshow.jl:885
#show_task_exception#535 at ./task.jl:82
show_task_exception at ./task.jl:80 [inlined]
#showerror#534 at ./task.jl:76
showerror at ./task.jl:70 [inlined]
showerror at ./task.jl:70
unknown function (ip: 0x7fa37c963785)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2238 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2420
showerror at ./task.jl:49
#showerror#762 at ./errorshow.jl:88
showerror at ./errorshow.jl:87
unknown function (ip: 0x7fa37c962dac)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2238 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2420
#tryCliqStateMachineSolve!#443 at /home/dehann/.julia/dev/IncrementalInference/src/SolverAPI.jl:133
tryCliqStateMachineSolve!##kw at /home/dehann/.julia/dev/IncrementalInference/src/SolverAPI.jl:94 [inlined]
#439 at ./task.jl:406
unknown function (ip: 0x7fa384179a1c)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2238 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2420
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1703 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:839
unknown function (ip: (nil))
Allocations: 836216544 (Pool: 836037182; Big: 179362); GC: 647
Segmentation fault (core dumped)

This is only the second time ever that I have seen a segfault in Juila, so just logging here for future reference. Will obviously investigate this more. Conditions where 10 processes in default topology (Julia 1.6 rc1) and 6 threads each. Ubuntu 18.04

Affie commented 3 years ago

@dehann, I'm almost certain it's not related, he is not getting an error. We are not serializing solver parameters and we need to set them manually to the same values Leo is using. I only know he is using useMsgLiklihoods=false, the rest might also have an influence here. From his script:

--useMsgLikelihoods 0 --spreadNH 3.000000 --inflation 3.000000

@lemauee could you please write down the solver parameters that were set with the factor graphs saved.

dehann commented 3 years ago

oh okay, yes please send the solver parameters used.


Related

xref https://github.com/JuliaRobotics/IncrementalInference.jl/issues/1183#issuecomment-785305730

Affie commented 3 years ago

It does not look like we are able to reproduce this issue on our side. I ran it a few times overnight and from slack I see Dehann ran it a lot too. I was on julia 1.6-rc1 IIF 0.21.3 and RoME v0.13 The best I can think of is to help figure out how you can debug it yourself.

On the command line arguments question, I don't know if ARGS is what you mean, it looks like you are already using it. Perhaps you can use a bash script?

lemauee commented 3 years ago

oh okay, yes please send the solver parameters used.

Related

xref JuliaRobotics/IncrementalInference.jl#1183 (comment)

The solver params used are actually a little different for every example (as I was about to tune out the best spreadNH for my usecase in some of the calculations that failed): I always have useMsgLikelihoods=false though and inflation=3.0 (which should be the standard value, maybe I could even add a -1 in my commandline interface to use the standard parameter for this in the script itself). spreadNH was set to 3.0, 4.0, 5.0 and 6.0 depending on the example.

lemauee commented 3 years ago

I have not updated from IIF 0.21.2 to .3 yet, but could do this of course.

For the commandline stuff, its possible to set ARGS in the repl. I made myself a printargs script that only does

print(ARGS);

and should be able to set args in the repl using the output of this script called with my cli arguments. So I should be able to use the clique debugging stuff in my script if the last changes I did do not solve my problems.

lemauee commented 3 years ago

Even the example using only one thread on one worker with solveTree!(...,multithread=false), addFactor!(...,threadModel=SingleThreaded) and no @async call failed. I will run it in the REPL now to get the debug stuff.

lemauee commented 3 years ago

https://github.com/lemauee/thesis_calculations should reproduce the error

lemauee commented 3 years ago

Hi, I now have a failed example in the repl with

You can add this without a performance hit: JuliaRobotics/IncrementalInference.jl#443 (comment)

enabled:

tasklist

What steps could I do now to debug it / send you data to look at? Serializing the smtasks and even individual smtasks[i].storage[:csmc] fails:

csmc_serialization_issue

I could only send you the smtasks[i].storage[:statemachine] files (which at least cover the state the csm stopped in, right?)

Another problem I have is that my REPL still gets spammed by a Fontconfig error (this is why I sent you screenshots instead of code as otherwise (I'm not fast enought to copy what I see in the REPL before this gets overwritten). It seems like there is still something running, but I dont have much CPU load from the julia process. The Fontconfig error has long been present, but I didn't bother to resolve it since everything seemed to work. But It might even be that this is related to the original freezing problem now. fontconfig_error

Best, Leo

Affie commented 3 years ago

Hi Leo. I’m not in front of a computer, but can try and help. The repl messages can come from drawtree, the task is probably not stopped cleanly with the async option on (its best not to use drawtree with async for now as it still needs better api to control it) you can lookin the solverapi.jl source code and try to find the draw task, but i cant remember now.

For debugging, there is a subfg object in :csmc. You can save that. The :statemachine Has the last function called. The way to debug would be to rerun the function on csmc. Perhaps turn on debugging for that clique with ENV as mentioned above.

lemauee commented 3 years ago

Hi @Affie, according to :statemachine all cliques seem to be stuck on the same function:

julia> using RoME
[ Info: Including InteractiveUtils related functions in IncrementalInference.

julia> using Serialization

julia> deserialize("statemachine58.out")
StateMachine{CliqStateMachineContainer}(IncrementalInference.waitForDown_StateMachine, 6, Tuple{Dates.DateTime,Int64,Function,CliqStateMachineContainer}[], "cliq58")

julia> deserialize("statemachine55.out")
StateMachine{CliqStateMachineContainer}(IncrementalInference.waitForDown_StateMachine, 6, Tuple{Dates.DateTime,Int64,Function,CliqStateMachineContainer}[], "cliq55")

julia> deserialize("statemachine54.out")
StateMachine{CliqStateMachineContainer}(IncrementalInference.waitForDown_StateMachine, 6, Tuple{Dates.DateTime,Int64,Function,CliqStateMachineContainer}[], "cliq54")

julia> deserialize("statemachine53.out")
StateMachine{CliqStateMachineContainer}(IncrementalInference.waitForDown_StateMachine, 6, Tuple{Dates.DateTime,Int64,Function,CliqStateMachineContainer}[], "cliq53")

julia> deserialize("statemachine52.out")
StateMachine{CliqStateMachineContainer}(IncrementalInference.waitForDown_StateMachine, 6, Tuple{Dates.DateTime,Int64,Function,CliqStateMachineContainer}[], "cliq52")

julia> deserialize("statemachine44.out")
StateMachine{CliqStateMachineContainer}(IncrementalInference.waitForDown_StateMachine, 6, Tuple{Dates.DateTime,Int64,Function,CliqStateMachineContainer}[], "cliq44")

The tree implies the same: Bildschirmfoto von 2021-02-27 11-31-59

lemauee commented 3 years ago

I think i didn't even turn on getSolverParams(fg).async = true, is it on by default?