JuliaRobotics / DistributedFactorGraphs.jl

Abstraction layer for spanning factor graphs over various technologies
Apache License 2.0
22 stars 2 forks source link

[BUG] subgraph functions not working for LightDFG #261

Closed dehann closed 4 years ago

dehann commented 4 years ago
julia> sfg = buildCliqSubgraph(fg, tree, :x1)
ERROR: getSubgraphAroundNode not implemented for LightDFG{SolverParams,DFGVariable,DFGFactor}
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] #getSubgraphAroundNode#19(::Int64, ::typeof(getSubgraphAroundNode), ::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::DFGVariable, ::Int64, ::Bool, ::GraphsDFG{SolverParams}) at /home/dehann/.julia/dev/DistributedFactorGraphs/src/services/AbstractDFG.jl:430
 [3] (::getfield(DistributedFactorGraphs, Symbol("#kw##getSubgraphAroundNode")))(::NamedTuple{(:solvable,),Tuple{Int64}}, ::typeof(getSubgraphAroundNode), ::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::DFGVariable, ::Int64, ::Bool, ::GraphsDFG{SolverParams}) at ./none:0
 [4] #buildSubgraphFromLabels!#192(::GraphsDFG{SolverParams}, ::Int64, ::Array{Symbol,1}, ::typeof(buildSubgraphFromLabels!), ::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::Array{Symbol,1}) at /home/dehann/.julia/dev/DistributedFactorGraphs/src/needsahome.jl:34
 [5] (::getfield(DistributedFactorGraphs, Symbol("#kw##buildSubgraphFromLabels!")))(::NamedTuple{(:subfg, :solvable, :allowedFactors),Tuple{GraphsDFG{SolverParams},Int64,Array{Symbol,1}}}, ::typeof(buildSubgraphFromLabels!), ::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::Array{Symbol,1}) at ./none:0
 [6] buildCliqSubgraph(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::BayesTree, ::IncrementalInference.TreeClique, ::GraphsDFG{SolverParams}) at /home/dehann/.julia/dev/IncrementalInference/src/CliqStateMachineUtils.jl:1045
 [7] buildCliqSubgraph(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::BayesTree, ::Symbol, ::GraphsDFG{SolverParams}) at /home/dehann/.julia/dev/IncrementalInference/src/CliqStateMachineUtils.jl:1064 (repeats 2 times)
 [8] top-level scope at REPL[7]:1
dehann commented 4 years ago

this one is missed, https://github.com/JuliaRobotics/DistributedFactorGraphs.jl/blob/bea526695ad7f55ab695fd80b34dd74e3bcd60d6/src/LightDFG/services/LightDFG.jl#L394

but instead the compiler is finding a general definition here: https://github.com/JuliaRobotics/DistributedFactorGraphs.jl/blob/bea526695ad7f55ab695fd80b34dd74e3bcd60d6/src/services/AbstractDFG.jl#L429

dehann commented 4 years ago

Still works on GraphsDFG.

dehann commented 4 years ago

Unfortunately not yet:

julia> sfg = buildCliqSubgraph(fg, tree, :x1)
ERROR: MethodError: no method matching _copyIntoGraph!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::GraphsDFG{SolverParams}, ::Array{Int64,1}, ::Bool)
Closest candidates are:
  _copyIntoGraph!(::LightDFG{#s54,V<:AbstractDFGVariable,F<:AbstractDFGFactor} where #s54<:AbstractParams, ::LightDFG{#s53,V<:AbstractDFGVariable,F<:AbstractDFGFactor} where #s53<:AbstractParams, ::Array{Int64,1}, ::Bool) where {V<:AbstractDFGVariable, F<:AbstractDFGFactor} at /home/dehann/.julia/dev/DistributedFactorGraphs/src/LightDFG/services/LightDFG.jl:353
  _copyIntoGraph!(::LightDFG{#s56,V<:AbstractDFGVariable,F<:AbstractDFGFactor} where #s56<:AbstractParams, ::LightDFG{#s55,V<:AbstractDFGVariable,F<:AbstractDFGFactor} where #s55<:AbstractParams, ::Array{Int64,1}) where {V<:AbstractDFGVariable, F<:AbstractDFGFactor} at /home/dehann/.julia/dev/DistributedFactorGraphs/src/LightDFG/services/LightDFG.jl:353
dehann commented 4 years ago

Think the issue is with having both LightDFG and GraphDFG...

dehann commented 4 years ago

Right, so now the issue becomes that LightDFG wants to index with Ints -- something we worked hard to move away from when we built GraphDFG the first time:

julia> sfg = buildCliqSubgraph(fg, tree, :x1)
ERROR: MethodError: no method matching _copyIntoGraph!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::GraphsDFG{SolverParams}, ::Array{Int64,1}, ::Bool)
Closest candidates are:
  _copyIntoGraph!(::G<:AbstractDFG, ::H<:AbstractDFG, ::Array{Symbol,1}, ::Bool; copyGraphMetadata) where {G<:AbstractDFG, H<:AbstractDFG} at /home/dehann/.julia/dev/DistributedFactorGraphs/src/services/AbstractDFG.jl:461
  _copyIntoGraph!(::G<:AbstractDFG, ::H<:AbstractDFG, ::Array{Symbol,1}) where {G<:AbstractDFG, H<:AbstractDFG} at /home/dehann/.julia/dev/DistributedFactorGraphs/src/services/AbstractDFG.jl:461

cc @Affie, could I please ask for help to migrate the _copyIntoGraph! functionality to AbstractDFG only(done in #262 ) rather than replicating the functionality several specialized functions. It's been quite time consuming to debug. The design pattern should be to use multiple dispatch as abstracted functionality just once -- if the abstract definition is slow, then we should make an effort to make the common method faster first, rather than writing a near duplicates with a few optimizations just for that one case please (as happened here).

There will always be around 5 or 6 different DFG drivers, Fuse, File, Light, Graphs, Cloud, Graff, etc. The way LightDFG is now requires the cross functions between these cases to be written separately. _copyIntoGraph! was not even imported into LightDFG...(is now) Then we would have to write LightDFG to GraphDFG (the source of this error), or LightDFG to FileDFG, LightDFG to CloudDFG, etc.

EDIT: We started to modify LightDFG in this way with #262 , hopefully we can get enough done today.

Affie commented 4 years ago

Right, so now the issue becomes that LightDFG wants to index with Ints -- something we worked hard to move away from when we built GraphDFG the first time

As far as i know GraphsDFG still uses Int as index. The only driver that doesn't was SymbolDFG.

Using only Int to describe nodes makes it possible to use LightGraphs's functions and its super fast.

Affie commented 4 years ago

LightDFG to GraphDFG (the source of this error)

  1. The source of this error is changes in IIF with building the clique subgraphs. The build subgraphs from one type to another was not implemented as the error suggests: "getSubgraphAroundNode not implemented" although it could be clearer.

  2. The CGDFG implementation is likely also broken.

  3. I completely agree that we should aim to have one generic function in as many of the cases as possible, but feel there is a place for optimizations. If code works and is covered by tests, there is no reason it should be difficult to maintain (LightGraphs has 99% coverage). I really don't see the need to go overwrite functions they implemented already to have it more generic.

  4. So the next step: why do you want a getSubgraphAroundNode from GraphsDFG to LightDFG? We have (had?) one from Cloud to an in-memory dfg.

Affie commented 4 years ago

Jip, cloud is also broken by the new sub graph functions

Affie commented 4 years ago

Tried to trace changes, this was where the function still worked: https://github.com/JuliaRobotics/IncrementalInference.jl/blob/4719466df52cb1ff7b1987ffd0e6c57d57f34c14/src/SubGraphFunctions.jl#L56-L68

Affie commented 4 years ago

I would rather give the functionality where it should be, something like : https://github.com/JuliaRobotics/DistributedFactorGraphs.jl/pull/263

dehann commented 4 years ago

quick comment, i changed Graphs not to use Int but Symbol directly. Okay since symbol goes through the label dict lookup in lightDFG, so Dict still the fundamental search, the int allows better memory via Array — so that should be okay.

if things are breaking with CloudDFG, just want to make sure @GearsAD is aware.

Affie commented 4 years ago

quick comment, i changed Graphs not to use Int but Symbol directly.

It does not appear to be the case, it uses a similar approach to LightDFG with a dictionary lookup.

mutable struct GraphsNode
    index::Int
    dfgNode::DFGNode
end
const FGType = Graphs.GenericIncidenceList{GraphsNode,Graphs.Edge{GraphsNode},Dict{Int,GraphsNode},Dict{Int,Array{Graphs.Edge{GraphsNode},1}}}

...

dfg.g.vertices[dfg.labelDict[label]].dfgNode
dehann commented 4 years ago

Punchline

there is no reason it should be difficult to maintain (LightGraphs has 99% coverage). I really don't see the need to go overwrite functions they implemented already to have it more generic.

That is totally fair, but maybe we are just misunderstanding each other (as usual). I'm pushing for symmetry and all individual libraries such as LightGraphs are single technology specific. Perhaps the compromise is as a team we try implement the abstracts for symmetry first (and likely wont be perfect, but thats okay). Specialized functions then follow, especially if upstream is already stable -- but dont forget there are still many downstream changes coming!

Definitely agree that we must leverage existing upstream functions from LightGraphs (or wherever). Guess my point is that specialized functions on their own can also have negative impact when downstream changes occur. My request therefore is that we (as a group) try to build abstracts first, because then we can clean up and optimization/performance effort together as well.

In hindsight, what happened here was a unexpected error when doing something for the first time. Then several related functions were found to already exist, but were not sufficiently abstracted yet. (Nothing wrong with that, actually quite normal progression). That resulted in the purposeful undoing of some optimizations to address the bigger design goals; these design goals are somewhat emergent as the project grows (or were poorly communicated, i.e. #264 comes into existence to try help). Once the dust settles, we'll put the specialized/fast code back in. Guess one hard aspect then is knowing when to start optimizing for performance -- balance is usual answer I guess.

Just FYI, my expectation is we are likely to go through more of these cycles as things develop.

I really don't see the need to go overwrite functions they implemented already to have it more generic.

I want to make sure i understand you properly, could you perhaps ping and show a case where this happened if a case arises (if you happen upon a case, just call on the handle with a comment or something). Agree we definitely don't want to "overwrite functions they implemented already". Let me labour the point some more, getting DFG designed well and implemented accordingly is hard!


General Discussion and Framing

It does not appear to be the case... As far as i know GraphsDFG still uses Int as index. The only driver that doesn't was SymbolDFG.

Apologies for mis-remembering, you are right I never changed away from index::Int lookup, but did go through a lot of effort to incorporate the Dict lookup directly into the graph structure. The intention was to use ::Symbol directly but then... Performance limits in Julia 0.4/0.5 became an issue with Dict{Symbol,...}. The Graphs.jl conversion from Array to Dict was meant to allow index::Symbol, but guess we only made that jump with GraphsDFG.

Using only Int to describe nodes makes it possible to use LightGraphs's functions and its super fast.

Jip, we did decide and resolve this as 100% okay a while back. I think that is great, and not the main issue here. The reason I mentioned the Int error is more to do with symmetry arguments relating to #264. For the eventual downstream user, I think we are building consensus that DFG should be graphs that use ::Symbol as identifier -- any internal ::Int for speed should be hidden from view. Definitely chime in if you think we need to change that plan? I think we are all still saying the same thing on that front.

Tried to trace changes, this was where the function still worked

I remember that function you added -- it actually took me around 3 days of consolidation work to get rid of it :-) It still exists here in needsahome.jl as soft removed commented code:

https://github.com/JuliaRobotics/DistributedFactorGraphs.jl/blob/master/src/needsahome.jl#L90-L132

So the next step: why do you want a getSubgraphAroundNode from GraphsDFG to LightDFG? We have (had?) one from Cloud to an in-memory dfg.

Please see full symmetry and cross-feed argument here #264, we can push and pull it out there. In short, if we build the abstracts first then they might be slow but available. We should all do an effort to build the abstracts first as a help the team effort thing. In this case we working our way up from a specialized implementation which is okay too i guess, but the problem is general maintainability which gets really hard for me since Im spending all day fixing small things in many places. The going rate seems to be around 3-4 places always need to changed in parallel and lockstep.

The source of this error is changes in IIF with building the clique subgraphs

Can guarantee there will be many more, and maybe that is the main point here. Changes are going to keep coming for quite some time still, we are in the process of getting the abstraction right.

I completely agree that we should aim to have one generic function in as many of the cases as possible, but feel there is a place for optimizations.

Perhaps the adjacent argument is that we should not pre-optimize code performance too early. Faster code is great, but comes at the cost of harder maintainability. In reality its a human machine story, so we need to consider the humans too.

The build subgraphs from one type to another was not implemented as the error suggests: "getSubgraphAroundNode not implemented" although it could be clearer.

Oh, this is all a team effort, so don't feel too much individual responsibility. I trying to push the whole group a little towards "Distributed". Definitely want to help getting it resolved. For example, see the combined effort golden thread:

The CGDFG implementation is likely also broken

hi @GearsAD :-) think Sam wants to work on this when he gets a chance but time is just tight at the moment.