review/improve/finish optimization class design

RussTedrake commented 8 years ago

Here are my basic thoughts on it so far. There are 3 important class types:

Function
- represents y = f(x). but also, eventually [y1,...ym] = f(x1,...,xn).
- x and y are Vectors. currently just Eigen::VectorXd, but could be templated on a vector type (perhaps just support that through a derived class).
- currently uses the syntax void eval(const& x, &y), because it will scale to the multi-input/output case and potentially avoid dynamic allocation, especially when x and y are passed as more complex types.
- having only the single input- single output case implemented well is sufficient for now.
- for every (xi, yj) pair, the function should specify it's input-output relationship, which includes:
- the functional form: is it linear, polynomial, differentiable, arbitrary (there is a clear hierarchy of these types). default is to assume arbitrary.
- the sparsity pattern (e.g. the third element of yj does not depend on the second element of xi). default, we assume that everything depends on everything.
- this input-output relationship will also get used by the system class to define the output and dynamics methods. (those could be implemented explicitly as functions, but currently are not).
- note that any function [y1,y2] = f(x1,x2) ISA function y = f(x), where x is the combinedvector with x1,x2, etc. This is not just abstract; it's also likely very useful. but it throws away some of the detail in the input-output structure (if y is linear in x1, but polynomial in x2, then it's polynomial in x).
- it seems natural that a derived class of function should be able to consume a lambda expression and/or std::function so that they can be created with an inline syntax. But we don't need that yet.
Constraint
- a constraint is a function + lower bounds and upper bounds. it can be implemented directly by derived from the constraint class, or
- in the current implementation, it also adds a reference to the decision variables of a specific optimization problem. we should consider pulling that out into a wrapper class, because constraints make sense for a system for multiple optimization problems.
- many use cases will only need single input / single output. the decision variable lists from optimization problem allow quite a bit of flexibility there.
- while it's not clear if we ever need lists of Function objects, we are definitely going to need (runtime dynamic) lists of Constraint objects.
Optimization Problem
- keeps track of decision variables, constraints, and the functional form of the optimization, and provides a wrapper to dispatch to the various solvers.
- could clean things up the dynamic allocation happening in problem_type (#1655)
- the problem_type logic and the functional form logic in the Function class are doing almost the same thing. I prototyped them in two different ways, but am leaning towards the problem_type version.

hongkai-dai commented 8 years ago

Just to clarify, when you mention the input/output relationship, with vector x in R^n as the input, and vector y in R^m as output, the relationship is just a single value, not the a matrix of size m x n, which would specify the relationship between the i'th entry of x and the j'th entry of y. If we were to use the matrix form input-output relationship, then when we had two input vectors x1,x2, and concantenate them into one single vector x, we would not throw away any input-output relationship specified individually for x1 and x2.

RussTedrake commented 8 years ago

good question. when i say [y1,..,ym], I mean that each of those is a vector in an of itself. The input/output relationship functional form is a single value (e.g. linear) for the entire vector xi->xj. The sparsity pattern is a boolean matrix (or comparable representation) for xi->yj. Sound reasonable?

hongkai-dai commented 8 years ago

Why we do not specify the input/output relationship as a matrix also, as we do for the sparsity pattern? For example y = x(1)^2+x(2)+0*x(3), we can specify a 1 x 3 vector relationship, as [Quadratic Linear Constant]. Then if we are going to concantenate input vectors together (for example, y = x(1)^2+x(2)+0*x(3)+z(1), and concantenate the input vector [x;z]), the input/output relationship of each individual pair would not be thrown away.

If we were to use this matrix input/output relationship, we will need to add a method, that can extract the overall relationship between vector input and vector output. For example, if the output is linear w.r.t some input, and quadratic w.r.t the other input, then the overall relationship would be quadratic, and the constraint should be a quadratic constraint.

I bring this because sometimes we do not want to throw away the input/output relationship. One example is in SNOPT, we often have a function y=f(x1,x2), and it is nonlinear w.r.t x1, but linear w.r.t x2. If we can keep the linear relationship here, then SNOPT does not need to compute the Hessain w.r.t x2, thus save the computation.

RussTedrake commented 8 years ago

That's a great idea. I like that much better. So we will have a matrix of functional forms (e.g. Linear, Polynomial) instead of the boolean matrix + the functional form. Brilliant.

bradking commented 8 years ago

for every (xi, yj) pair, the function should specify it's input-output relationship ... So we will have a matrix of functional forms (e.g. Linear, Polynomial) instead of the boolean matrix + the functional form. Brilliant.

This makes sense. One of the functional forms can be "Zero" (other name ideas?) to indicate no dependence, thus subsuming the sparsity pattern (i.e. generalizing it).

in the current implementation, it also adds a reference to the decision variables of a specific optimization problem.

IIUC DecisionVariableView maps between the OptimizationProblem's list of decision variables and each function's list of input variables. New variables are allocated via addContinuousVariables directly from the OptimizationProblem.

we should consider pulling that out into a wrapper class, because constraints make sense for a system for multiple optimization problems.

If we are to define constraints independent of an OptimizationProblem then we will need an independent source of (symbolic) variable allocation. The constraints would then be defined in terms of symbolic variables not tied to any OptimizationProblem. Adding a constraint to a specific OptimizationProblem would also bind its symbolic variables to specific indexes for that problem instance. While solving, that specific OptimizationProblem would allocate storage for local use in storing actual values for the variables.

bradking commented 8 years ago

matrix of functional forms (e.g. Linear, Polynomial)

We may be able to store the matrix as an actual matrix whose component type is our own Form struct. If we define the multiplication operator on that component type to compose forms then we can compose the relationship matrices for h(x) = g(f(x)) just by multiplying the matrices from g and f. The resulting matrix will be exactly the input-output relationship matrix for h(x).

RussTedrake commented 8 years ago

I think Zero is a fine name. And yes, I love the idea of having Form be the ScalarType of an Eigen::Matrix.

For the constraint independent of the optimization problem, I was thinking of simply removing the decision variables from the current Constraint class into some form of class/tuple which (e.g. ConstraintWrapper) which has a list of decision variables + the constraint. OptimizationProblem only holds likes of ConstraintWrappers, but you could call addConstraint by passing in the constraint and decision variable list.

bradking commented 8 years ago

I love the idea of having Form be the ScalarType of an Eigen::Matrix.

I'll look at implementing this.

list of decision variables + the constraint

Currently a list of decision variables is represented only in terms of a specific OptimizationProblem. Variables are identified by their index within the problem, and variable lists are just lists of ranges of those indexes. If we associate variables with a constraint instead then they are identified only by their index within a constraint. In order to share variables among multiple constraints and among multiple problems we need an independent identification for them. That independent identification structure could also store other optional meta-data about the variables such as a human-readable name and (later) an indication of its physical unit.

RussTedrake commented 8 years ago

It’s true. In our matlab version, functions and constraints have some number of variables associated with them, and one can optionally set a string name for each of the variables. In C++, we could imagine this role being played by the Vector concept (like in the systems classes). This is redundant information from the optimization problem itself; where the DecisionVariable holds that information.

I’m not sure what’s best here.

For reference, the matlab version is here: https://github.com/RobotLocomotion/drake/blob/master/drake/solvers/Constraint.m

bradking commented 8 years ago

With regard to the in/out relationship matrix, the prototype in drake/core/Function.h currently distinguishes ZERO from CONSTANT and AFFINE from LINEAR. Why? Both cases are separated only by the addition of a constant, which does not affect the derivative. Are there solvers that actually take advantage of this structural distinction?

Consider linear form a*x1. If it is part of a larger expression with another variable, say a*x1 + h(x2), then for a fixed x2 the form of the function with respect to x1 is actually affine. Similarly, "zero" form 0*x1 as part of 0*x1 + h(x2) is actually constant for a fixed x2. Therefore ZERO and LINEAR forms are only valid if all other entries in the same relationship matrix row are ZERO.

Alternatively, if we interpret the functional form with respect to a given variable as having meaning only in terms containing that variable (where it participates in the derivative) then there is no difference between ZERO and CONSTANT or LINEAR and AFFINE.

RussTedrake commented 8 years ago

The short answer is yes — there are big differences. Not for the gradients, but for other forms of analysis. (For instance, many forms of stability analysis and control design work for linear systems, but not affine systems in general, because the long-term solution to a linear dynamical system takes a considerably simpler form).

bradking commented 8 years ago

Is there some established research literature on tracking functional forms through function composition, in particular for optimizer structure?

I don't think we can compute the component-wise composed forms correctly from just the original component-wise forms. For example, consider

{y1,y2} = f({x1,x2}) = {x1, x1 + x2}

with relationship matrix:

{{LINEAR, CONSTANT},
 {AFFINE, AFFINE }}

and

z = g({y1,y2}) = y1*y2 + c

with relationship matrix:

{{AFFINE, AFFINE}}

From the expanded form

z = g(f({x1,x2})) = (x1)*(x1 + x2) + c = x1^2 + x1*x2 + c

we can see that z is POLYNOMIAL in x1 and AFFINE in x2.

However, an operation like leastCommonAncestor will not produce POLYNOMIAL. We don't have enough information to distinguish the case from z_ = g_({y1,y2}) = y1 + y2 because the relationship matrices for g and g_ are the same.

hongkai-dai commented 8 years ago

In your example, I do not think the relationship matrix for z is

{{AFFINE,AFFINE}}

I think we should call it {{NONLINEAR,NONLINEAR}}, or more specifically {{BILINEAR,BILINEAR}}. AFFINE means the condition y=A*x+b. When considering the relationship matrix, I do not think we should fix the other variables and consider the relationship with only single variable changing; instead we should consider the relationship when every variable is changing.

RussTedrake commented 8 years ago

You’re right. The distinction between LINEAR and AFFINE does not make sense if you are considering each variable independently.

For polynomials, which covers all of your examples, we do have good tools which do this (handling composition, etc). One of them is spotless, which you presumably have in your drake externals. The msspoly class (https://github.com/spot-toolbox/spotless/tree/a04a87d43cff3cc231ffb6913abfdaaf8ef21222/%40msspoly ) stores the polynomial by holding the degree of each monomial, and updates it through composition, etc.

For single-input, single-output function (which is what had already implemented in the matlab version), the composition rules are also simple and the distinction between linear and affine, etc all makes sense.

But you’re right — I was too quick to assume this scaled naturally to the multi-input case. I have to think some more about it, but it seems that two potential paths would be: a) to just treat everything as a single-input, single-output (e.g. multi-inputs just get concatenated into a single input vector). I’d like to avoid this.. I think we want more structure, but it is a case we understand. b) we extend the data structure to something more like msspoly (potentially even keeping the degree instead of just saying “polynomail”, since that should be no more difficult and even more useful), but then extending it so that e.g. degree -1 encodes DIFFERENTIABLE and -2 encodes ARBITRARY, etc. But again, this means keeping track of every monomial in order to get around what is, at the core, the problem that you’re seeing with the linear vs affine example

but perhaps there is something better still.

RussTedrake commented 8 years ago

ok. thought a little more about it.

everything should still work in the multi-variate case even if we assign just a single simple label to each input-output pair (e.g. LINEAR,AFFINE,POLYNOMIAL) . the rules of composition are still clear (I think). The only limitation, I think, is conservatism. As you point out, we might end up calling functions AFFINE which actually end up being LINEAR. And this is true in general, we could have two nonlinear functions that get composed and become linear due to term-cancellation, etc. I think conservatism is ok in general; we can always make it tighter later.

The way to make it tighter for the linear/affine/polynomial case is to store information for every monomial in the polynomial, which is what's happening in msspoly. But that is also what happens in c++ when we instantiate a method with a ScalarType using the Polynomial class. So I wouldn't try to reproduce that logic again here. We can recover that structure at runtime if we need it.

Recommendation: let's stick with the simple labels per input-output pair. Linear + Linear (on the same variable) => Linear. Linear + Linear (on a different variable) => Affine, etc, etc. Zero is an important distinction vs Constant, because Linear + Zero => Linear, but Linear + Constant => Affine.

Happy to talk more.

bradking commented 8 years ago

we should consider the relationship when every variable is changing ... let's stick with the simple labels per input-output pair

The problem for both of these is that when in g(f(x)) the g function has mixed terms we are not recording in its form which terms are mixed. This means that we cannot know when combinations of separate f outputs cause its x inputs to be mixed into a higher order, and thus mis-classify it. That is the main point of the example I posted previously.

In order to help me gain some context on this, is there an example application of function composition (and function form composition) in the source tree somewhere already (if only in the matlab part)?

RussTedrake commented 8 years ago

I think everything should be well-defined, but that it will be conservative (we have to assume terms will be mixed). Here's an edited version of your example above, with my understanding:

{y1,y2} = f({x1,x2}) = {x1, x1 + x2}

with relationship matrix:

{{LINEAR, CONSTANT},
 {AFFINE, AFFINE }}

and

z = g({y1,y2}) = y1*y2 + c

with relationship matrix:

{{AFFINE, AFFINE}}

From the expanded form

z = g(f({x1,x2})) = (x1)*(x1 + x2) + c = x1^2 + x1*x2 + c

we should get

{{POLYNOMIAL, POLYNOMIAL}}

because even LINEAR(LINEAR) has to result in POLYNOMIAL. In our case, specifically, we have

AFFINE(LINEAR) -> POLYNOMIAL
AFFINE(CONSTANT) -> POLYNOMIAL

AFFINE(AFFINE) -> POLYNOMIAL
AFFINE(AFFINE) -> POLYNOMIAL

Agreed that ideally the relationship between z and x2 would be labeled as LINEAR, but we've given that up for now.

I agree that we are not writing the coupling down directly, but I think we don't need to. If y is linear is x1 (or polynomial, etc) and linear/polynomial n x2, then it can only be polynomial in their combinations. Again, we don't have examples yet for combinations of multi-input functions, but anticipate having them soon.

The important case, which still works, is e.g. y = f(x) = Ax, z = g(y) = By which results in z = g(f(x)) = BAx getting labeled as LINEAR, which it should. The single input case does not need to be as conservative as the multi-input case.

Agree?

bradking commented 8 years ago

we have to assume terms will be mixed ... The single input case does not need to be as conservative as the multi-input case.

This means that the composition operation will need to assume that every component of y participates in terms with every other component of y and we quickly lose any meaningful information from component-wise forms.

Without deep information about the function's evaluation we cannot compose relationship matrices meaningfully. Perhaps the composition operation should be part of the abstract interface:

void eval(Vec& y, Vec const& x);
void composeInOutRel(InOutRel& y, InOutRel const& x);

For polynomials we could offer a helper that implements both methods automatically based on a symbolic representation of the polynomial expression. For more complex functions the user would be required to implement the compose operation to help take advantage of problem structure, and otherwise just return ARBITRARY for everything.

Side note:

{{LINEAR, ZERO}, {AFFINE, AFFINE }}

The ZERO term was CONSTANT in my example because the x1 term is a not-necessarily-zero constant when held fixed and considering the form with respect to x2. ZERO is only valid when an entire row is ZERO because the function is actually 0 for that component.

RussTedrake commented 8 years ago

You're right. My zero should have been constant. I've updated it above.

I like having the ability to overload the composition; that could work. All of the conservative compositions we're talking about here would be the default behavior.

the most important use case for the multi-input composition is going to be when we feedback or cascade combine two dynamical systems. Take a look at the dynamics and output methods on those classes in System.h . In particular, we want to know if the feedback system is time invariant, control affine, and/or direct feedthrough from looking at those same properties on the individual systems.

RussTedrake commented 8 years ago

I believe that the feedback and cascade combinations should work just fine with our more conservative approach.

All of our other use cases so far are for the single-input single-output cases. (You can find them in drake/solvers/+drakeFunction directory). Those will continue to be the main use cases for the near future.

So I think getting the conservative version we’ve spec’d implemented is the right thing to do right now. As soon as it’s in, we can start migrating over a lot of functionality.

bradking commented 8 years ago

feedback or cascade combine two dynamical systems

Should the System concept specify that an implementation isA Function?

I believe that the feedback and cascade combinations should work just fine with our more conservative approach.

Hopefully. Getting to that point will at least clarify things further.

Meanwhile I've started to look at the Function / DifferentiableFunction and corresponding Constraint hierarchy. I think we need to consider switching from inheritance to type erasure or full templating (much like Eigen). Whether a function/constraint is differentiable is only one trait we need to represent, and in multi-component cases the trait can vary per-component.

RussTedrake commented 8 years ago

I don't think that a system isa function. But it contains functions which implement dynamics, output (and soon update, stateConstraints, zeroCrossings,...).

i do want to talk out the inheritance <-> template trade-offs for the function/constraint classes. I pushed hard on the templates/concepts for the system class hierarchy, but have more recently felt like many of the systems that I'm implementing have runtime types which can't take advantage of all that work. I think I might even back off a bit in the systems hierarchy by having a System base class, and a derived class which can template on the InputVector,StateVector, and OutputVector types (so that one has the option to use that, but is not required to).

The other frustrating thing about the highly templated designs is that we will almost certainly have runtime lists of systems (and functions/constraints). So I'm imagining something similar happening in the function/constraint classes.

RussTedrake commented 8 years ago

@mposa commented that it would be useful to have an explicit functional form for DifferentiableAlmostEverywhere between Differentiable and Arbitrary… these are functions for which there is some hope of doing nonlinear optimization on.

ggould-tri commented 8 years ago

I'm starting to look into implementing this now, and I want to clear up a few items about the consensus here before I start.

Brad said above, and Russ did not contradict, that g({y1,y2}) = y1*y2 + c is {{AFFINE, AFFINE}}. However this contradicts my understanding from in-person discussions on Wednesday and Thursday. The position there was that we cannot understand these relations as partial, holding-all-else-constant relations (this also makes sense in that a y1*y2 constraint is effectively POLYNOMIAL if we also elsewhere have a constraint on y2 - y1 -- I don't know much about solver internals but I suspect an AFFINE marking would be misleading in that case).

I believe that the only approach that will survive composition is to define our relations based on additive terms of the expression that treats multilinear expressions as DIFFERENTIABLE at best. A candidate defintion of ya AFFINE in xa might be, "AFFINE iff the expression could be written as h(x_other) + m*xa, with m a constant and x_other defined as the elements of x other than xa" Thus x1*x2 is not AFFINE.

Taking an approach based on the presence or absence of additive terms with particular variables in them may also clarify the distinctions between ZERO/CONSTANT and LINEAR/AFFINE, to the extent those are useful.

I would like to fully nail down these definitions, and even pseudocode them out, before implementing them, particularly to figure out if the traits hierarchy can or should be implemented in the type system.

bradking commented 8 years ago

@ggould-tri thanks:

g({y1,y2}) = y1*y2 + c is {{AFFINE, AFFINE}}...this contradicts my understanding

That was based on defining the form for each variable assuming all others are fixed. I think we now all agree that this is not what we need.

AFFINE iff the expression could be written as h(x_other) + m*xa

Yes, that may be better! That can likely be generalized for the other forms. We need to be able to express the definitions clearly so that function authors can report their forms correctly.

RussTedrake commented 8 years ago

I think one of the sources of confusion (besides me :), is the notion of a "control affine" system. In the literature, we call a system "control affine" if the dynamics are of the form: xdot = dynamics(t,x,u) = f1(x) + f2(x)*u. While control-affine is typically describing the dynamics, maybe a simple case is thinking about two systems where the output is "affine" -- by this definition -- in the input:

y = output1(t,x,u) = g1(x) + h1(x)*u 
y = output2(t,x,u) = g2(x) + h2(x)*u

The cascade combination of these systems gives the new output:

y2 = output2(t,x2, output1(t,x1,u)) = g2(x2) + h2(x2)*g1(x1) + h2(x2)*h1(x1)*u

should we be aiming to labeled the new output method as u -> y2 is AFFINE. If not, then we also lose the ability to track "control affine" as a property of the dynamics in a feedback/cascade systems.

sherm1 commented 8 years ago

"control affine"

That makes a lot of sense to me -- affine in u, but not x. Maybe it would reduce confusion to designate those functions as ControlAffine rather than just Affine?

ggould-tri commented 8 years ago

Please bear with me here as my control theory is all 20 years out of date.

It seems like what you are saying about cascading control affine systems is at any time step n, the output of the cascaded system remains affine (zero second partial derivative) with respect to that time-step's input un (which in turn makes finding optimal 'u' for some objective relatively straightforward).

That seems to be a statement about the additive terms that make up an output, and whether a given input variable participates in each term (and in an arbitrary or linear manner). Thus in your example we have:

y1 = output1(t,x1,u1) = [term with arbitrary function of x1] + [term with arbitrary function of x1, linear in u1]
y2 = output2(t,x2,u2) = [term with arbitrary function of x2] + [term with arbitrary function of x2, linear in u2]

Cascading with u2 = y1, this information composes nicely:

y2 = [arbitrary in x2] + [arbitrary in x2, arbitrary in x1] + [arbitrary in x2, arbitrary in x1, linear in u2]

I think that does what you want, which is to preserve the information that the partial derivative of yn with respect to un remains constant under cascading.

I think such an additive-terms-list representation of function traits could be implemented fairly straightforwardly (ie, without having to implement the entirety of algebra in C++) and would allow us to write accessors that inferred affineness, constantness, etc. as discussed above.

I do worry that trying to implement it in the C++ type system would be unreadable, though. We might be better off storing the implementation in an ordinary class and doing special-case dynamic casting when needed for any needed trait-specific accessors.

bradking commented 8 years ago

I realized that there is no one form for each input/output pair when multiple inputs are allowed. Instead the form of an output is defined w.r.t. a specific (sub)set of the inputs. For example, y=x*u is linear in {u}, linear in {x}, but polynomial in {x,u}. This naturally handles mixed terms. Many of the representations previously discussed here may be suitable as the output of a form query but not as the underlying representation itself.

With that in mind I originally started designing an approach based on @ggould-tri's additive term idea. I didn't end up using it but I'll post it here for completeness:

Each output (y) has a list of terms (to be added together).
Each term has a list of sub-terms (to be multiplied together).
Each sub-term has a list of inputs (x) together with their form in the sub-term (e.g. poly({x1,x2})*lin({u1,u2})). A sub-term may also be of form cons and have no inputs listed.
No input may appear in more than one sub-term of a single term. Instead all such sub-terms should be combined into the containing form (e.g. lin(x)*lin(x) ==> poly(x)).

This seemed to work but the data structures needed to represent it and perform operations (such as composition) quickly became non-trivial. Furthermore, asking implementers of the Function concept to build the representation for their functions would require them to re-express a lot of detail about their functions through some API that builds the representation. Instead I continued looking for another approach.

bradking commented 8 years ago

Implementers of the Function concept already provide the eval<ScalarType> method:

template <typename ScalarType>
void eval(VecIn<ScalarType> const& x, VecOut<ScalarType>& y) {
  y(0) = x(0)*x(1);
}

This expresses the function parameterized over the scalar type. We can use an approach similar to Eigen's AutoDiffScalar to record the form of the eval function's operations on the scalar type. I've implemented a FunctionalForm type for this. For example, if we want to know the form of the above 2-input function with respect to some input combinations:

Eigen::Matrix<FunctionalForm, Eigen::Dynamic, 1> in(2);
in(0) = FunctionalForm::Constant();
in(1) = FunctionalForm::Linear({"u"});
Eigen::Matrix<FunctionalForm, Eigen::Dynamic, 1> out(1);
eval(Drake::VecIn<FunctionalForm>(in), out);
std::cout << out(0) << "\n"; // prints "lin(u)"
in(0) = FunctionalForm::Linear({"x"});
eval(Drake::VecIn<FunctionalForm>(in), out);
std::cout << out(0) << "\n"; // prints "poly(x,u)"

This composes naturally through g(f(x)) and will return the overall form with respect to the input form(s) x.

We should also be able to use this to extract sparsity information:

for (std::size_t i = 0; i < num_inputs; ++i) {
   x(i) = FunctionalForm::Linear({i});
}
FunctionalForm y = f(x);
// check for use of each variable index `i` in the form of `y`

PR #2322 implements a FunctionalForm type that is sufficient to detect that Drake::AffineSystem<> is affine in u only by invoking its output method instantiated with a FunctionalForm scalar type. This primitive can then be used to implement the queries we need.

bradking commented 8 years ago

mposa commented that it would be useful to have an explicit functional form for DifferentiableAlmostEverywhere between Differentiable and Arbitrary… these are functions for which there is some hope of doing nonlinear optimization on.

In the FunctionalForm implementation the "differentiable" category is actually documented as meaning "differentiable almost everywhere". In practice I think this may be all we need, and it reduces the number of categories.

ggould-tri commented 8 years ago

The actionable parts of this issue have been extracted into other issues and PRs (notably #2322, #2346). Because there does not seem to be a single consensus actionable issue beyond those, I am closing this issue.

RobotLocomotion / drake

review/improve/finish optimization class design #1666