marcrasi commented 4 years ago

I'll post information about performance investigations here.

marcrasi commented 4 years ago

I ran Pose2SLAMG2O on the INTEL dataset from https://lucacarlone.mit.edu/datasets/.

Here's a flamegraph: https://storage.googleapis.com/swift-tensorflow-misc-files/flamegraphs/2020-05-07-SwiftFusion-Pose2G2O.svg

Discussion

It's spending a lot of time doing small matrix operations in TensorFlow. The TensorFlow overhead is large relative to the small operations, so most of the time is probably spent on TensorFlow overhead. Therefore, switching from Tensor to Array (or eventually SIMD) is going to be a huge performance win and we should do that before any other attempts at optimization.

How to make the flame graph (Linux)

Prerequisites:

linux-perf debian package (or equivalent)
clone https://github.com/brendangregg/FlameGraph somewhere

# start in SwiftFusion repo root
# assumes FlameGraph is cloned to ../FlameGraph
swift build -c release
perf record -F 499 -g -- .build/release/Pose2SLAMG2O ~/input_INTEL_g2o.g2o
perf script | ../FlameGraph/stackcollapse-perf.pl | swift-demangle | ../FlameGraph/flamegraph.pl --width 1800 > flamegraph.svg

dellaert commented 4 years ago

Am I right in concluding transpose is a lot of the CPU? Not used to reading the flamegraphs

marcrasi commented 4 years ago

Am I right in concluding transpose is a lot of the CPU? Not used to reading the flamegraphs

Yes, that's correct! And there's an easy optimization to completely eliminate that call. matmul, which is called from this method, has transposed arguments that let you replace

matmul(jacobians[pos].transposed(), r)

with

matmul(jacobians[pos], transposed: true, r)

ProfFan commented 4 years ago

Thanks Marc for the benchmarks :) I think judging from the flamegraph the TF engine seems to be doing a lot of work for the transpose call? I guess it's a elementwise copy (non-SIMD), but still curious on what exactly is happening under the call.

marcrasi commented 4 years ago

I think judging from the flamegraph the TF engine seems to be doing a lot of work for the transpose call? I guess it's a elementwise copy (non-SIMD), but still curious on what exactly is happening under the call.

The TF engine has a lot of overheads. I don't know specifically what's happening in the transpose, but in general it's something like this:

calculate the shape of the result
allocate space for the result and reference count the space
indirection: the swift function builds a "tranpose op" and sends it to TF, then TF interprets the op and executes a "kernel" that does the actual work (I think that there are actually even more layers of indirection but I don't know the exact details)
check if any errors happened while executing the op

I wouldn't be surprised if this transpose is spending less than 1% of the time doing the actual work and the rest of the time in TensorFlow overhead.

marcrasi commented 4 years ago

I switched things from Tensor to an array-backed Vector type and it got a lot faster. The time it takes to do 200 CGLS iterations on the INTEL dataset went from 90sec to 0.1sec.

I still need to clean things up and document them before it's ready for a PR, but here's a draft commit: https://github.com/borglab/SwiftFusion/commit/3e36e82d358a4e8e2f36140bd6bb88ac53deb6f5

Here's the new flamegraph: https://storage.googleapis.com/swift-tensorflow-misc-files/flamegraphs/2020-05-09-SwiftFusion-Pose2G2O.svg

dellaert commented 4 years ago

Well, that certainly has the right directionality :-)

dellaert commented 4 years ago

FYI, on my mac:

ProfFan commented 4 years ago

I just did a profile of Pose3SLAM, it seems that most of the time is used in the initialization of Tensors (Tensor.init(), 14s/16s total), the tensor ops actually only uses a fraction of the runtime.

marcrasi commented 4 years ago

I just did a profile of Pose3SLAM, it seems that most of the time is used in the initialization of Tensors (Tensor.init(), 14s/16s total), the tensor ops actually only uses a fraction of the runtime.

This makes sense, Tensor.init() does heap allocations and Tensor.init() is called a lot of times in the inner loop of linearization for Pose3/Rot3. In principle, this should be very easy to solve by replacing Tensor with Matrix3x3. (Matrix would improve things a bit over Tensor, but it still has a heap allocation for the array). I have filed https://github.com/borglab/SwiftFusion/issues/52 for this. I won't start work on https://github.com/borglab/SwiftFusion/issues/52 right away, because there's another performance thing that I'm looking at (will write it up here soon), but I could do https://github.com/borglab/SwiftFusion/issues/52 pretty soon.

marcrasi commented 4 years ago

Slowness from heap allocations and dictionary lookup

I can confirm that it is important to minimize heap allocations and dictionary lookups in our inner loop. Evidence:

Here's a flamegraph for Pose2SLAM on the INTEL dataset at current master. Lots of time is spent on allocation/initialization/deallocation and on dictionary operations.
The 900x performance improvement that I mentioned in https://github.com/borglab/SwiftFusion/issues/42#issuecomment-626142882 happens when I did an experiment that removes all heap allocations and dictionary lookups from the CGLS inner loop.

Therefore, we should discuss how to structure the APIs to achieve this.

Dictionary lookup

Dictionary lookups that we have now are:

While evaluating factor errors, the factor looks up values in a dictionary.
While performing GaussianFactorGraph "matmul", the GaussianFactorGrapg looks up the vectorvalue in a dictionary.

We could eliminate this by assigning a consecutive Int to each Key that is added to a NonlinearFactorGraph, and then use that Int everywhere instead of the Key.

Users can still construct factor graphs using arbitrary Keys and they can still refer to their values using Keys. We just do the mapping to consecutive Ints under the hood.

Heap allocations

Heap allocations that currently exist are (in order of badness):

VectorValues allocates its entries on the heap because each Vector is an array. (This is the worst because it happens in the CGLS inner loop.)
Values allocates its entries on the heap because type-erase values require heap allocations.
NonlinearFactorGraph allocates its entries on the heap because type-erased factors require heap allocations.

We could eliminate these heap allocations by implementing some sort of "flattened container" that stores everything as a single array of Doubles under the hood.

e.g. VectorValues really is just an array of Doubles. Right now we're storing it as an array of array of Doubles, but we could flatten it out and store offsets. We can do similar things for Values and NonlinearFactorGraph.

This doesn't have to be exposed through the user-facing APIs at all! These "flattened containers" can have APIs that make them look like totally normal collections of normal structs. Just under the hood, they store things efficiently.

I think these are pretty important to explore soon, because they may require some changes to the APIs, and we don't want to get locked into unperformant APIs.

Thoughts?

dellaert commented 4 years ago

Methinks our data-oriented design discussion might have a large influence on this.

some relevant slides that I have not parsed in any depth :-)

marcrasi commented 4 years ago

Yeah, this is, I think, very related to data-oriented design. I haven't read any of the related work you mentioned during the meeting yet. I'll go read them now and also those slides.

dellaert commented 4 years ago

Yeah, esp. http://optlang.org/ should be insightful. We are threading two graphs here: autodiff graph and sparse problem graph.

marcrasi commented 4 years ago

optlang is very cool. I'm just a few pages into the paper, and I'm already seeing some interesting ideas that could make the SwiftFusion API very different, in a good way.

Idea: The only "fixed" thing is the generic concept of a nonlinear least squares problem. There are multiple concrete representations of the problem. Depending on your concrete problem and depending on what algorithms you want to run on the problem, different concrete representations are optimal.

For instance, should the factor graph (the vertices and edges) be materialized in memory? If so, which graph representation should we use in memory? Should the data associated with the vertices (e.g. the BetweenFactor distance) be stored inside or outside the graph datastructure? There is no right answer. It depends on the concrete problem. E.g:

Pose2SLAM from a g2o problem: Load an adjacency list representation into memory. Put the data associated with vertices in a homogeneous buffer, for fast access. Linearize, storing the resulting jacobians in a homogeneous buffer. Pass the adjacency list and jacobians to a direct LLS solver.
Optical flow problem (from the bottom of http://optlang.org/): Don't materialize the graph structure in memory! The graph structure is given by a simple function like { (vertex: PixelPosition) in [vertex + (0,1), vertex + (0,-1), vertex + (1,0), vertex + (-1,0) }.
Huge SfM problem: Partition the graph into a distributed graph datastructure that lives on multiple machines. Each machine linearizes the factors that it owns, storing the jacobians in a homogeneous buffer. Run a distributed LLS solver.

We want an API with the flexibility to choose different data representations, and then run algorithms that are compatible with those representations. e.g. The linearization algorithm does not care how the graph is represented. It can operate generically on any graph datastructure that supports a query for vertex neighbors, like an adjacency list or the "neighboring pixels" function from the optical flow problem.

This is very compatible with Swift because Swift encourages "protocol oriented programming"! (I highly recommend these talks: Protocol Oriented Programming, Embracing Algorithms). Protocol oriented orogramming is quite similar to data oriented programming. You write types representing your data, you write protocols representing the allowed operations on your data, and you write generic algorithms that can operate on any data that supports the required operations.

ProfFan commented 4 years ago

We want an API with the flexibility to choose different data representations, and then run algorithms that are compatible with those representations. e.g. The linearization algorithm does not care how the graph is represented.

I agree with that (actually that is exactly what Opt and similar "optimization DSLs" are doing: decoupling abstract representation of data with its in memory representation :)

This is very compatible with Swift because Swift encourages "protocol oriented programming"! (I highly recommend these talks: Protocol Oriented Programming, Embracing Algorithms). Protocol oriented orogramming is quite similar to data oriented programming. You write types representing your data, you write protocols representing the allowed operations on your data, and you write generic algorithms that can operate on any data that supports the required operations.

These talks are great! (BTW I also love 80x24 terminals, haha)

We could eliminate these heap allocations by implementing some sort of "flattened container" that stores everything as a single array of Doubles under the hood.

Yeah I also think we can implement a container-like thing that offers the same API but uses a raw storage backend. (like a memory pool allocator in C++).

P.S. I also did a profile on the G2O for INTEL and the result aligned with yours :-\

dellaert commented 4 years ago

Excellent, I’m loving this discussion, and he will I will watch the talks. I think we may be able to undo some of the mistakes that were made in GTSam, because we did not know better at the time.

The one thing we need to absolutely keep is the factor graph abstraction. This is strictly more general than least-squares solving. For example, we can solve a sudoku, by making the factors constraints. Or do inference with Bayes networks, I making the factors probabilities depending on discrete Variables.

Something GTsam cannot do, is mixing discrete and continuous variables. In principle, a factor graph can have both, and factors could be continuous (think least squares or robust error metric), discrete (such as a constraint, or a probability), or mixed.

Really: Factor: anything -> bool or float

Where float is least squares or a negative log(probability) (least squares is -log Gaussian)

Optimization: feasible (all bool factors are True) and sum of floats is minimized

marcrasi commented 4 years ago

The one thing we need to absolutely keep is the factor graph abstraction. This is strictly more general than least-squares solving. For example, we can solve a sudoku, by making the factors constraints. Or do inference with Bayes networks, I making the factors probabilities depending on discrete Variables.

Something GTsam cannot do, is mixing discrete and continuous variables. In principle, a factor graph can have both, and factors could be continuous (think least squares or robust error metric), discrete (such as a constraint, or a probability), or mixed.

Interesting, thanks for these examples, I will think about them.

Based on this, I'd argue that the general abstraction we should start with is a "Factor Objective" which is a function from "Variables" to "Factor Errors". Perhaps the objective also has a notion of "total error" and "feasibility".

What's not included in the "Factor Objective" abstraction:

Any notion of graph.
Any notion of building it by adding factors.
Any constraints on what types the variables are or what types the errors are.
Any constraints on what the function can do.

So "Factor Objective" is way too abstract to run any useful optimization algorithms on it. (I guess if you have a random generator for variables values, you can try a bunch of random values and see which one gives you the best result.) Therefore, we need slightly more concrete abstractions. (In Swift, making abstractions slightly more concrete is "protocol refinement").

One important refinement would be the FactorGraphObjective, which adds some additional constraints to the objective that let you run graph algorithms on it.

Concrete implementations of FactorGraphObjective could be structs like:

struct AdjacencyListFactorGraph: FactorGraphObjective {
  var factors: [FactorKey: NonlinearFactor]
  var adjacencyList: BipartiteGraphAdjacencyList<FactorKey, ValueKey>
  ...
}

struct DistributedFactorGraph: FactorGraphObjective {
  var factors: DistributedCollection<FactorKey, NonlinearFactor>
  var graph: DistributedBipartiteGraph<FactorKey, ValueKey>
  ...
}

/// This is kinda like the `NonlinearFactorGraph` that is in `master` today.
struct JustAnArrayOfNonlinearFactors: FactorGraphObjective {
  var factors: [NonlinearFactor]
}

Another important refinement is a LinearLeastSquaresObjective supporting the operations required for LLS solvers like CGLS.

Concrete graph types like AdjacencyListFactorGraph, DistributedFactorGraph can conform to LinearLeastSquaresObjective, letting you pass them directly to a CGLS solver.

The power of abstracting things this way is that you're not constrained to use a particular representation of data when you invoke an algorithm. For example, maybe there's some solver that extracts a dense linear least squares subproblem from a factor graph and then runs some LLS solver on that dense problem (dunno if there is such an algorithm, it's just an example). If the LLS solver requires AdjacencyListFactorGraph, then we have to uselessly spend some time constructing an adjacency list before we invoke the solver. If the LLS solver is generic, then we simply construct the most efficient dense representation and immediately pass to the LLS solver.

marcrasi commented 4 years ago

My favorite way of designing something like this is iteratively building up the abstract protocols and concrete structs together. Today we have some concrete structs already! I'm about to make a PR suggesting a slight abstraction over them.

dellaert commented 4 years ago

I like FactorObjective as a protocol, if not its name :-) I think "Factor" would be nice ! Similarly, why not just make the other protocol "FactorGraph". One of the requirements would be that it has an "error", which we could make float, reserving infinity for infeasible configurations.

This has the danger of being religious/subtly, but I don't love "Objective", as it subtly shifts the emphasis away from the factor graph concept, in a way that I think hurts the cause. I understand how after seeing the OPTLang talk everything will seem cool and least-squarey :-)

I think LinearLeastSquaresObjective moves us away even more, now having zero graph concept. But, truly, to me the graph is everything. The stenscils in optlang are all about a concise way of specifying a graph. The actual objective is almost irrelevant to the structure of the computation.

Two Frankisms (aka Crusty grumblings) that are relevant:

a (sparse) matrix is just a clumsy way of writing down a graph
we have to get out of the matrix gutter

Objective pushes us to matrix-land by association.

ProfFan commented 4 years ago

I'll add my two cents here: I agree with Frank that one of the reasons why Opt is fast is by getting rid of the sparse matrix representation of LS problems (though not completely, as the notion of Jacobian is still there), and replace these into implicit representations, the Jacobian can be computed on GPU ad-hoc, saving a lot of memory bandwidth and ops. I consider the factor graph scheme to be something that effectively encloses the concept of objective function (in a LS sense), but with much more potential.

Currently we are using an immediate-mode evaluation of factors (on linearization), and the graph structure can only be exploited at the sparse Jacobian level. However, with minimal amount of hint, we can effectively extract the common shared structure of sub-graphs, and build optimizations on top of that.

One of the major roadblocks for "matrix-free optimization" in the past is that it is very hard to represent free-form Jacobians in a concise manner without too much boilerplate. S4TF does exactly that, so I think we are pretty lucky here :)

ProfFan commented 4 years ago

Also, an interesting read would be https://stanford.edu/~boyd/papers/pdf/abs_ops.pdf

Boyd is one of the biggest names in optimization :-\

marcrasi commented 4 years ago

This conversation has become very interesting. Would you like to have a call some time tomorrow so that we can discuss a bit more efficiently? I'm free any time after 2pm Eastern time.

I'll also keep responding here in github because I can't wait to keep talking :)

This has the danger of being religious/subtly, but I don't love "Objective", as it subtly shifts the emphasis away from the factor graph concept, in a way that I think hurts the cause. I understand how after seeing the OPTLang talk everything will seem cool and least-squarey :-)

I agree that "objective" has some unnecessary connotations, let's remove it.

I agree that LinearLeastSquaresObjective is a terrible name. I have thought more about it and discussed with Dave Abrahams and I think I have a better one now: DecomposedAffineFunction. I will update the PR.

I think LinearLeastSquaresObjective moves us away even more, now having zero graph concept.

Yes and no. There are two things:

There are concrete structs conforming to LinearLeastSquaresObjective, and they absolutely still have the graph concept front and center in them. (You may be thinking in a C++ mindset where having a LinearLeastSquaresObjective superclass dictates things about your implementation. Swift protocols dictate nothing. The concrete structs can be whatever you want.)
There is the abstract protocol LinearLeastSquaresObjective, which does not have a graph concept in it. I argue that this is an unambiguously good thing because it means that algorithms that truly do not depend on graph structure (e.g. CGLS) are not coupled to graph structure.

I'll boldly claim that by decoupling DecomposedAffineFunction from the concept of a graph we actually make things more graph-oriented because you are freed to experiment with different graph representations of a problem without worrying about how you're going to coerce it into whatever concrete type CGLS operates on.

For example, suppose I want to run CGLS on a matrix-free graph. I can't do that today with GaussianFactorGraph. One approach would be subclassing: have a "matrix free jacobian factor" that is allowed to be inside GaussianFactorGraph. This approach has drawbacks: (1) virtual function call overhead (2) not data-oriented (3) assumes that the factors can even be stored in a single array on a single computer (4) each matrix-free jacobian instance needs to store references to the data that it is derived from, even though it's always the same data. The DecomposedAffineFunction approach is very simple and has none of the drawbacks:

struct MyProblem {
  var graphRepresentation: SomeGraphDatastructure
}

// Note that this struct _represents_ the linearized problem at a point, but constructing this
// struct doesn't actually do any linearization work. That happens later when we're actually
// evaluating the affine function.
struct MyProblemLinearized {
  var graphRepresentation: SomeGraphDatastructure
  var point: Point
}

extension MyProblemLinearized: DecomposedAffineFunction {
  typealias Input = KeyedVector
  typealias Output = KeyedVector

  func applyLinear(_ v: KeyedVector) -> KeyedVector {
    return differential(of: { someFactorFunction(graphRepresentation, $0), at: point)(v)
  }

  func applyLinearDual(_ v: KeyedVector) -> KeyedVector {
    return pullback(of: { someFactorFunction(graphRepresentation, $0) }, at: point)(v)
  }
  ...
}

let problem = MyProblem(file: "path/to/my/data.txt")
var x = KeyedVector.zero
var optimizer = CGLS()
optimizer.optimize(problem: problem, initial: &x)

This is very data-oriented. MyProblem just stores the data, and the data doesn't "know" how to do anything to itself. The extension MyProblemLinearized: DecomposedAffineFunction specifies the operations on the data.

It's also very graph oriented -- the user specifies exactly what graph datastructure best represents their problem.

Ohhhhhh, I think after I wrote this example I finally realize what you mean about differential and pullback being very well suited to factor graph optimization. They let you avoid ever materializing the jacobian!

Currently we are using an immediate-mode evaluation of factors (on linearization), and the graph structure can only be exploited at the sparse Jacobian level. However, with minimal amount of hint, we can effectively extract the common shared structure of sub-graphs, and build optimizations on top of that.

Yes, and I think that DecomposedAffineFunction protocol is a very important part of doing this. DecomposedAffineFunction encapsulates precisely what it means to linearize a factor graph, so that you can store whatever structure information you need for computing jacobians, without being constrained by anything except the small number of requirements on DecomposedAffineFunction.

Overall, I'm trying to argue for the view that we should provide a library of generic functions (optimizers, graph algorithms) and a library of graph datastructures. Then, the user with a concrete problem defines their own type representing the data in their own problem, using the most appropriate datastructures from our library, and finally invokes some of our generic algorithms on their datastructure.

/// An affine function decomposed into its linear and bias components.
protocol DecomposedAffineFunction {
  associatedtype Input: EuclideanVectorSpace
  associatedtype Output: EuclideanVectorSpace

  /// Apply the function to `x`.
  ///
  /// This is equal to `applyLinear(x) + bias`.
  ///
  /// Note: A default implementation is provided, if you specify `bias`.
  func callAsFunction(_ x: Input) -> Output

  /// The linear component of the affine function.
  func applyLinear(_ x: Input) -> Output

  /// The dual map of the linear component of the affine function.
  func applyLinearDual(_ y: Output) -> Input

  /// The bias component of the affine function.
  ///
  /// This is equal to `applyLinear(Input.zero)`.
  ///
  /// Note: A default implementation is provided, if you specify `callAsFunction`.
  var bias: Output { get }
}

marcrasi commented 4 years ago

Also, an interesting read would be https://stanford.edu/~boyd/papers/pdf/abs_ops.pdf

This is cool. I've started reading it.

I notice that the "Forward-Adjoint Oracle" is very similar to DecomposedAffineFunction:

applyLinear is the "algorithm for evaluating f"
applyLinearDual is the "algorithm for evaluating f*"
bias is doesn't exist in FAO because FAO is for linear, not affine, functions.

ProfFan commented 4 years ago

So, I finally matched the error terms between GTSAM and SwiftFusion, and here are the latest benchmark results:

I think Chordal still has some discrepancy against GTSAM, so definitely need to investigate. I am speculating on the ClosestTo method, but need to do a side-by-side with GTSAM to confirm (no time for now)

@dellaert @marcrasi

dellaert commented 4 years ago

Awesome work! My bet would be on LM being different, as we discussed. Perhaps to allow Marc and Dave to concentrate on performance, can you time the inner linear solving step and the linearization separately? Also, you should probably ensure only one thread is used in both cases.

ProfFan commented 4 years ago

@dellaert Sure thing, I'll try benching the inner loop. BTW, yes, we are using only one thread in both configurations.

marcrasi commented 4 years ago

Cool graphs! Do you have instructions for generating them so that I can quickly see how perf optimizations show up on these graphs?

marcrasi commented 4 years ago

Also I'm curious: What linear solver are the GTSAM runs using?

ProfFan commented 4 years ago

@marcrasi It's the sequential Cholesky solver (a direct solver). I pushed all results and scripts to generate these graphs to https://github.com/ProfFan/expmap-benchmark, however you will also need the prebuilt wheels of GTSAM in different configurations in https://github.com/borglab/gtsam-manylinux-build/branches (the CI artifacts)

dellaert commented 4 years ago

@marcrasi It's the sequential Cholesky solver (a direct solver).

Ah, that also explains some things! Can you do PCG with Jacobi also? How do they compare? Again it will be important to choose the exact same convergence criteria on both sides of the "aisle", to make sure we're not timing the convergence criterion but rather the solve.

marcrasi commented 4 years ago

Can you do PCG with Jacobi also?

For perfectly comparable results, you'll need to do PCG with DummyPreconditioner because I haven't gotten to implementing the Jacobi preconditioning in Swift yet. (I intend to do that after simplifying the matrix boilerplate, because the preconditioning involves some linear algebra that will be much easier when there is an easy way to manipulate matrices.)

ProfFan commented 4 years ago

@marcrasi It's the sequential Cholesky solver (a direct solver).

Ah, that also explains some things! Can you do PCG with Jacobi also? How do they compare? Again it will be important to choose the exact same convergence criteria on both sides of the "aisle", to make sure we're not timing the convergence criterion but rather the solve.

I wanted to but haven't got time, probably will do next week

ProfFan commented 4 years ago

Need some new figs:

Dummy & Jacobi in GTSAM (should get same iterations as SwFusion)
CGLS(=Dummy) in SwiftFusion (should give exactly same numbers as in GTSAM)

borglab / SwiftFusion

Main issue for performance investigations #42

Discussion

How to make the flame graph (Linux)

Slowness from heap allocations and dictionary lookup

Dictionary lookup

Heap allocations