JuliaRobotics / RoME.jl

Robot Motion Estimate: Tools, Variables, and Factors for SLAM in robotics; also see Caesar.jl.
MIT License

Running RoME in multiple Julia instances freezes solving #418

Open lemauee opened 3 years ago

lemauee commented 3 years ago

Hi,

This is an issue very specific to experiments for my thesis/application:

As I want to perform a lot of individual solves for my thesis, I run multiple Julia instances (each using multiple workers and threads) on a pretty beefy machine. This is faster than running them sequentially, because parallelism can't be exploited everywhere. It worked fine until (I think) the RoME v0.13 release. Now the solver just "hangs" in all but one instance, at a different pose each time I try.

This is the last thing that gets written to stdout (aka my logs ;)):

Solve Progress: approx max 1752, at iter 711     Time: 0:01:50[ Info: CSM-5 Clique 47 finished

Solve Progress: approx max 1752, at iter 716     Time: 0:01:51

I run the multiple julia instances each in their own screen session like

screen -S <sessionname>

and then fire up my evaluation script, which reads my dataset from a MAT-file and takes some other arguments (the command is auto-generated by my MATLAB frontend):

```shell
julia -t auto -p 8 --project=Masterarbeit/svn/julia -J ~/.julia/sysimage_RoME.so -- Masterarbeit/svn/julia/mmiSAM/evaluation/solve2DIncremental.jl --path tmp --file sigmaTrajTf-0-001-0-000-0-005_sigmaLmBearingAndRanging-0-004-0-001_wrongRatio-1-000_minDist-0-200_maxDist-2-000.mat --trajKeys tf --lmKeys bearingAndRanging --startPoseIdx 1 --endPoseIdx 150 --startPoseVal "[7.15793412168699;3.38794983135837;-1.36840501375598]" --plotSaveFinal 1 --plotSaveIter 1 --nRuns 1 --suffix variables --useMsgLikelihoods 0 --nullhypo 0.000000 --tukey 15.000000 --nKernels 100 --spreadNH 3.000000 --inflation 3.000000
```

I will resort to running things sequentially (as any online application would), but especially for tuning parameters like spreadNH or inflation, running things in parallel was very useful, as I don't have access to an infinite number of machines ;) As multiprocess/multithread performance increased in the last releases, I suspect that something (maybe not even directly RoME-related, but in Julia in general) gets in each other's way, e.g. some lock does not get released.

Best, Leo

lemauee commented 3 years ago

In the (still messed up) REPL I can't get the subfg object with csmc58sfg = smtasks[58].storage[:csmc].subfg; does it have a different field name, or is storage[:csmc] another Dict? Could we rerun the function on this, or do we need the whole csmc object?

I also thought about quitting the REPL session and running again with async, drawtree and showtree off, to be able to work in that REPL properly ;)

Affie commented 3 years ago

Sorry, I’m going from memory. I think it’s cliqSubFg. Can you try a deepcopy and then serialize the whole csmc object? We do need it, but can try to rebuild it if it doesn’t want to work.

lemauee commented 3 years ago

Just as a note on writing the somehow protected ARGS, this is what I do now:

```julia
julia> newARGS = ["--path", "tmp2", "--file", "sigmaTrajTf-0-000-0-000-0-003_sigmaLmBearingAndRanging-0-002-0-001.mat", "--trajKeys", "tf", "--lmKeys", "bearingAndRanging", "--startPoseIdx", "1", "--endPoseIdx", "150", "--startPoseVal", "[7.15793412168699;3.38794983135837;-1.36840501375598]", "--plotSaveFinal", "1", "--plotSaveIter", "1", "--nRuns", "1", "--suffix", "variables", "--useMsgLikelihoods", "0", "--nullhypo", "0.000000", "--tukey", "15.000000", "--nKernels", "100", "--spreadNH", "3.000000", "--inflation", "3.000000"]
julia> empty!(ARGS)
julia> for arg in newARGS; push!(ARGS, arg); end
```

A new session with drawtree/showtree/async off is running now, but I still have the old one online.

EDIT: when running the new session on the same machine with the old one still on another screen, I get this error:

ERROR: LoadError: TaskFailedException:
IOError: mkdir: permission denied (EACCES)
Stacktrace:
 [1] uv_error at ./libuv.jl:97 [inlined]
 [2] mkdir(::String; mode::UInt16) at ./file.jl:177
 [3] mkpath(::String; mode::UInt16) at ./file.jl:227
 [4] mkpath(::String; mode::UInt16) at ./file.jl:225 (repeats 4 times)
 [5] mkpath at ./file.jl:222 [inlined]
 [6] drawGraph(::LightDFG{SolverParams,DFGVariable,DFGFactor}; viewerapp::String, filepath::String, engine::String, show::Bool) at /home/leopold/.julia/packages/IncrementalInference/nmcd8/src/AdditionalUtils.jl:56
 [7] _dbgCSMSaveSubFG(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}, ::String) at /home/leopold/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:75
 [8] buildCliqSubgraph_StateMachine(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}) at /home/leopold/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:122
 [9] (::StateMachine{CliqStateMachineContainer})(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}, ::Int64; pollinterval::Float64, breakafter::Function, verbose::Bool, verbosefid::Base.TTY, verboseXtra::IncrementalInference.CliqStatus, iterlimit::Int64, injectDelayBefore::Nothing, recordhistory::Bool, housekeeping_cb::IncrementalInference.var"#382#384"{IncrementalInference.TreeClique}) at /home/leopold/.julia/packages/FunctionalStateMachine/2JZFG/src/StateMachine.jl:94
 [10] initStartCliqStateMachine!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::IncrementalInference.TreeClique, ::Int64; oldcliqdata::BayesTreeNodeData, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, show::Bool, incremental::Bool, limititers::Int64, upsolve::Bool, downsolve::Bool, recordhistory::Bool, delay::Bool, logger::Base.CoreLogging.SimpleLogger, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/leopold/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:63
 [11] tryCliqStateMachineSolve!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::Int64, ::Int64; oldtree::MetaBayesTree, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, limititers::Int64, downsolve::Bool, incremental::Bool, delaycliqs::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/leopold/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:110
 [12] (::IncrementalInference.var"#439#442"{MetaBayesTree,Bool,Bool,Base.TTY,Bool,Bool,Array{Symbol,1},Array{Symbol,1},Symbol,LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree,Int64,ProgressMeter.ProgressUnknown,Int64,Int64})() at ./task.jl:356
Stacktrace:
[ Info: monitorCSMs: all tasks done
 [1] sync_end(::Channel{Any}) at ./task.jl:314
 [2] macro expansion at ./task.jl:333 [inlined]
 [3] taskSolveTree!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::Int64; oldtree::MetaBayesTree, drawtree::Bool, verbose::Bool, verbosefid::Base.TTY, limititers::Int64, limititercliqs::Array{Pair{Symbol,Int64},1}, downsolve::Bool, incremental::Bool, multithread::Bool, skipcliqids::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, delaycliqs::Array{Symbol,1}, smtasks::Array{Task,1}, algorithm::Symbol) at /home/leopold/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:49
 [4] solveTree!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree; timeout::Int64, storeOld::Bool, verbose::Bool, verbosefid::Base.TTY, delaycliqs::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, limititercliqs::Array{Pair{Symbol,Int64},1}, injectDelayBefore::Nothing, skipcliqids::Array{Symbol,1}, eliminationOrder::Nothing, variableOrder::Nothing, eliminationConstraints::Array{Symbol,1}, variableConstraints::Nothing, smtasks::Array{Task,1}, dotreedraw::Array{Int64,1}, runtaskmonitor::Bool, algorithm::Symbol, multithread::Bool) at /home/leopold/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:371
 [5] macro expansion at /home/leopold/Masterarbeit/svn/julia/mmiSAM/evaluation/solve2DIncremental.jl:156 [inlined]
 [6] macro expansion at ./timing.jl:233 [inlined]
 [7] top-level scope at /home/leopold/Masterarbeit/svn/julia/mmiSAM/evaluation/solve2DIncremental.jl:149
 [8] include(::String) at ./client.jl:457
 [9] top-level scope at REPL[4]:1
in expression starting at /home/leopold/Masterarbeit/svn/julia/mmiSAM/evaluation/solve2DIncremental.jl:131

This should be related to getSolverParams(fg).dbg = true still being set.

lemauee commented 3 years ago

> Sorry, I’m going from memory. I think it’s cliqSubFg. Can you try a deepcopy and then serialize the whole csmc object. We do need it, but can try and rebuild it if it doesn’t want to work.

serializing .cliqSubFg works :)

lemauee commented 3 years ago

Doing the deepcopy and then serializing the csmc container did not work: `csmc58dc = deepcopy(csmc58)` (see attached screenshot, Bildschirmfoto von 2021-02-27 12-40-34)

lemauee commented 3 years ago

I will have to quit the session and run it again, or run on another machine, if there's no way to stop the draw task.

Affie commented 3 years ago

With the task that doesn’t want to serialize, you can try throwing an interrupt exception to it.
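For reference, a minimal sketch of how delivering an interrupt to a task can look in plain Julia (the `stuck_task` name here is illustrative, standing in for e.g. one of the hung `smtasks`):

```julia
# Sketch only: `stuck_task` stands in for a hung solver task (e.g. smtasks[58]).
stuck_task = @async sleep(60)                    # placeholder for a task that never returns
sleep(0.1)                                       # give it a chance to start
# Deliver an InterruptException; error=true makes the exception be thrown inside the task.
schedule(stuck_task, InterruptException(), error=true)
try
    wait(stuck_task)                             # rethrows as a TaskFailedException
catch
end
istaskdone(stuck_task)                           # the task is now done and can be inspected
```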

lemauee commented 3 years ago

> Just as a note for writing the somehow protected ARGS, this is the stuff I do now: […] EDIT: when running the new session on the same machine with the old one still on another screen, I get this error: […] Should be related to the getSolverParams(fg).dbg=true still set.

Also appears on another machine.

lemauee commented 3 years ago

> With the task that doesn’t want to serialize, you can try throwing an interrupt exception to it.

If the smtasks are already done or failed, should that be necessary? Or are we talking about a different task here? How do you throw it, and at which task exactly?

Affie commented 3 years ago

I’m just going by the error message: cannot serialize a running task. Maybe the clique sub fg is enough to figure it out.

Affie commented 3 years ago

@lemauee, are you still running the solves with the timeout? From the tree, it looks like it might just be timing out from waiting too long. The previous freeze was higher up, so there the timeout makes a bit more sense. Lower down, the timeout needs to be longer to account for all the cliques above it finishing. There should actually not be a timeout on the wait state. Here is what I would suggest in summary:

* don't draw or show the tree.
* no timeout

```julia
getSolverParams(fg).async = true
getSolverParams(fg).dbg = true
smtasks = Task[]
solveTree!(fg; smtasks)
```

lemauee commented 3 years ago

I'll try tomorrow, thanks again for reading through everything and for the suggestions! I was still solving with the timeout.

Affie commented 3 years ago

> I think I didn't even turn on getSolverParams(fg).async = true, is it on by default?

No, it is not on by default.

lemauee commented 3 years ago

> Just as a note for writing the somehow protected ARGS, this is the stuff I do now: […] EDIT: when running the new session on the same machine with the old one still on another screen, I get this error: […] Should be related to the getSolverParams(fg).dbg=true still set. Also appears on another machine.

At least I got this solved; it originated from me using a relative path as the logpath for the fg (which works fine when not debugging ;)). What led me to the solution was the error message, from which one can take much more in Julia 1.6 than in 1.5.3:

IOError: mkdir("/tmp2"; mode=0o777): permission denied (EACCES)
Stacktrace:
  [1] uv_error
    @ ./libuv.jl:97 [inlined]
  [2] mkdir(path::String; mode::UInt16)
    @ Base.Filesystem ./file.jl:179
  [3] mkpath(path::String; mode::UInt16)
    @ Base.Filesystem ./file.jl:230
  [4] mkpath(path::String; mode::UInt16)
    @ Base.Filesystem ./file.jl:228
  [5] mkpath
    @ ./file.jl:225 [inlined]
  [6] drawGraph(fgl::LightDFG{SolverParams, DFGVariable, DFGFactor}; viewerapp::String, filepath::String, engine::String, show::Bool)
    @ IncrementalInference ~/.julia/packages/IncrementalInference/8DImq/src/AdditionalUtils.jl:56
  [7] _dbgCSMSaveSubFG(csmc::WARNING: both ManifoldsBase and ApproxManifoldProducts export "vee"; uses of it in module IncrementalInference must be qualified
WARNING: both ManifoldsBase and ApproxManifoldProducts export "vee!"; uses of it in module IncrementalInference must be qualified
CliqStateMachineContainer{BayesTreeNodeData, LightDFG{SolverParams, DFGVariable, DFGFactor}, LightDFG{SolverParams, DFGVariable, DFGFactor}, MetaBayesTree}, filename::String)
    @ IncrementalInference ~/.julia/packages/IncrementalInference/8DImq/src/CliqStateMachineUtils.jl:77
  [8] buildCliqSubgraph_StateMachine(csmc::CliqStateMachineContainer{BayesTreeNodeData, LightDFG{SolverParams, DFGVariable, DFGFactor}, LightDFG{SolverParams, DFGVariable, DFGFactor}, MetaBayesTree})
    @ IncrementalInference ~/.julia/packages/IncrementalInference/8DImq/src/CliqueStateMachine.jl:122
  [9] (::StateMachine{CliqStateMachineContainer})(userdata::CliqStateMachineContainer{BayesTreeNodeData, LightDFG{SolverParams, DFGVariable, DFGFactor}, LightDFG{SolverParams, DFGVariable, DFGFactor}, MetaBayesTree}, timeout::Nothing; pollinterval::Float64, breakafter::Function, verbose::Bool, verbosefid::Base.TTY, verboseXtra::IncrementalInference.CliqStatus, iterlimit::Int64, injectDelayBefore::Nothing, recordhistory::Bool, housekeeping_cb::IncrementalInference.var"#382#384"{IncrementalInference.TreeClique})
    @ FunctionalStateMachine ~/.julia/packages/FunctionalStateMachine/2JZFG/src/StateMachine.jl:82
 [10] initStartCliqStateMachine!(dfg::LightDFG{SolverParams, DFGVariable, DFGFactor}, tree::MetaBayesTree, cliq::IncrementalInference.TreeClique, timeout::Nothing; oldcliqdata::BayesTreeNodeData, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, show::Bool, incremental::Bool, limititers::Int64, upsolve::Bool, downsolve::Bool, recordhistory::Bool, delay::Bool, logger::Base.CoreLogging.SimpleLogger, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol)
    @ IncrementalInference ~/.julia/packages/IncrementalInference/8DImq/src/CliqueStateMachine.jl:63
 [11] tryCliqStateMachineSolve!(dfg::LightDFG{SolverParams, DFGVariable, DFGFactor}, treel::MetaBayesTree, cliqKey::Int64, timeout::Nothing; oldtree::MetaBayesTree, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, limititers::Int64, downsolve::Bool, incremental::Bool, delaycliqs::Vector{Symbol}, recordcliqs::Vector{Symbol}, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol)
    @ IncrementalInference ~/.julia/packages/IncrementalInference/8DImq/src/SolverAPI.jl:110
 [12] (::IncrementalInference.var"#439#442"{MetaBayesTree, Bool, Bool, Base.TTY, Bool, Bool, Vector{Symbol}, Vector{Symbol}, Symbol, LightDFG{SolverParams, DFGVariable, DFGFactor}, MetaBayesTree, Nothing, ProgressMeter.ProgressUnknown, Int64, Int64})()
    @ IncrementalInference ./task.jl:406┌ Warning: printCliqHistorySummary -- No CSM history found.
└ @ IncrementalInference ~/.julia/packages/IncrementalInference/8DImq/src/TreeDebugTools.jl:211
┌ Warning: printCliqHistorySummary -- No CSM history found.
└ @ IncrementalInference ~/.julia/packages/IncrementalInference/8DImq/src/TreeDebugTools.jl:211

Maybe we should open an issue for the debugging not working correctly with a relative logpath? Or is this expected to fail?

lemauee commented 3 years ago

> @lemauee, are you still running the solves with the timeout? From the tree, it looks like it might just be timing out waiting too long. […] Here is what I would suggest in summary:
>
> * don't draw or show the tree.
> * no timeout
>
> ```julia
> getSolverParams(fg).async = true
> getSolverParams(fg).dbg = true
> smtasks = Task[]
> solveTree!(fg; smtasks)
> ```

I know that turning on the async solve would be good to be able to stop the tasks from the REPL if they stall, but is there something I can use to know when my graph has finished solving? At the moment my code just "floods" the graph with all factors and leaves no time to solve (which is to be expected when using the async solve, I suppose), but I want to get the result of the solve at every pose (it should be easy to see what I mean from the repo I created on Friday).

I'll just run the solve in Julia 1.5.3 and 1.6 tonight with async off and see what I get tomorrow.

Affie commented 3 years ago

Not that I can think of right now. You can wait for all the tasks, but then it will just block on the wait. Let's see if the cliqueSubFg stored with dbg=true is helpful if it fails.

lemauee commented 3 years ago

I have good but also confusing news: the example ran through fine on both machines tonight. The only thing I did yesterday evening (besides updating one machine from Julia 1.5.3 to 1.6 and changing the SolverParams as suggested) was updating from IIF v0.21.2 to v0.21.3 on both of them. Has there been a change that could affect this issue in this version switch? Maybe @dehann could run the stuff from the GitHub repo I created on Friday against both versions to really see where this was coming from. I'll start other examples relevant for my thesis/getting optimal parameters to see if it continues to work with v0.21.3. Solves were done with 8 workers and 1 thread each.

dehann commented 3 years ago

Hi Leo,

> Has there been a change that could affect this issue in this version switch?

IIF v0.21.3 contained numerical improvements only, for `multihypo=`. We have not changed any aspects affecting multithreading since IIF v0.16.0. There have been some minor changes in Julia 1.6, but nothing that we think should change the behaviour of Caesar / RoME usage.

So it sounds like trying to find this error would happen somewhere inside the test grid:

| Observed Failures | Running Fine |
| --- | --- |
| IIF v0.21.2 | IIF v0.21.3 |
| JL v1.5.3 | JL v1.6.0-rc1 |
| IIF.MultiThreaded | IIF.SingleThreaded |

Will exercise the variations here to try to find the same error you are seeing.


Additional info

xref JuliaRobotics/IncrementalInference.jl#705

dehann commented 3 years ago

> but is there something I can use to know if my graph finished solving?

With the params.async=true option, the best way to check is whether all tasks are finished, as you were almost doing here: https://github.com/JuliaRobotics/RoME.jl/issues/418#issuecomment-787030828
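A minimal sketch of that check (the two placeholder tasks below stand in for the real solver tasks, which would come from `solveTree!(fg; smtasks)` with `getSolverParams(fg).async = true`):

```julia
# Illustrative stand-in for the real clique state-machine tasks.
smtasks = Task[@async(sleep(0.2)), @async(sleep(0.3))]

all(istaskdone, smtasks)   # false while the solve is still running
foreach(wait, smtasks)     # or block until every task has finished
all(istaskdone, smtasks)   # true: the whole tree has solved
```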

dehann commented 3 years ago

You don't have to use params.async=true; that is for a different use case. What is likely happening on Julia serialization (i.e. not our DFG / Caesar serialization) might be part of the problem: https://github.com/JuliaRobotics/RoME.jl/issues/418#issuecomment-787060272

It is important that during multiprocess operation each process loads exactly the same version of the code. I have seen similar serialize or deserialize errors when you activate different environments on different processes. So, for example, this might fail:

```julia
cd(myproject)
using Pkg
Pkg.activate(".")

using Distributed
addprocs(10)

@everywhere using RoME
```

The differences between processes mean the serializations could differ and therefore throw an error.

The most reliable way to test is to disable the precompile (as Johan mentioned, if not actively debugging), and also to load the multiprocess setup inside the script and force loading everything:

```julia
using Distributed
using RoME
addprocs(10)
using RoME # yes, do it again
@everywhere using RoME
```

Probably best to do something similar if you are using an environment. It is also possible that there are differences between Julia 1.5 and 1.6 regarding how environments and packages are loaded; I wonder if this is part of the problem you are seeing.
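One way to guard against mismatched environments (a sketch, not part of any package's API) is to pin the workers to the parent's active project when adding them, and then sanity-check that every process reports the same project file:

```julia
using Distributed

# Start workers with the same --project flag as the parent process,
# so all of them load identical package versions.
proj = Base.active_project()
addprocs(2; exeflags = "--project=$proj")

# Sanity check: every process should report the same active project file.
projs = [remotecall_fetch(Base.active_project, w) for w in procs()]
@assert length(unique(projs)) == 1
```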


Also, the Fontconfig warnings and errors come from Graphviz when drawing the tree visualization. This is more of a Julia issue between its own fonts and Graphviz using system fonts. They are annoying, but they do not cause the solves to fail. The folks over at Gadfly have been working with fonts quite a bit lately, and we will resolve the Fontconfig drawing warnings/errors once final decisions from the plotting libraries are clear.

lemauee commented 3 years ago

> Hi Leo, […] IIF v0.21.3 were numerical improvements only for multihypo. We have not changed any aspects affecting multithreading since IIF v0.16.0. […] So it sounds like trying to find this error would happen somewhere inside the test grid: […] Will exercise the variations here to try to find the same error you are seeing.

Additional info

xref JuliaRobotics/IncrementalInference.jl#705

This should be a good test grid, but IIF.MultiThreaded means two things to me: solveTree!(fg; smtasks, multithread=true) and addFactor!(...; threadmodel=MultiThreaded) (which should be tested separately in a "test matrix"). Also, I did not assign the Julia workers more than one thread in the running-fine case (I always used -t auto before).

lemauee commented 3 years ago

> You don't have to use params.async=true, that is for a different use case. […] It is important that during multiprocess operation, each process loads exactly the same version of code. […] Also, the Fontconfig warnings and errors come from Graphviz when drawing the tree visualization. […]

Thanks for the info on the serialization stuff; I'll load RoME on all workers the way you proposed from now on, and hope the Fontconfig error can be sorted out soon.

Affie commented 3 years ago

Cross-referencing this issue here for future users: different environments with addprocs and @everywhere. https://github.com/JuliaRobotics/RoME.jl/issues/407#issuecomment-784525682