**brandonwillard** opened this issue 3 years ago
Here's some more info on the situation with `get_scalar_constant_value`:

The function `theano.tensor.basic.get_scalar_constant_value` is supposed to work like `numpy.ndarray.item` and return the scalar constant value from a symbolic scalar or tensor with only one entry. It only works when the graph is composed of certain `Op`s with constant inputs. Unfortunately, its implementation is extremely limited and inconsistent—even for the `Op`s it's supposed to support.
For example, the following provides some simple cases in which the `Subtensor` `Op` isn't handled correctly:

```python
import theano.tensor as tt

test_array = tt.as_tensor([[1], [4]])

test_val = test_array.shape[-1]  # i.e. 1
tt.get_scalar_constant_value(test_val)  # works

test_val = test_array.shape[0]  # i.e. 2
tt.get_scalar_constant_value(test_val)  # doesn't work

test_val = test_array.shape[:-1]  # i.e. [2]
tt.get_scalar_constant_value(test_val)  # doesn't work
```
Now, we could fix these specific cases, but the entire idea behind `get_scalar_constant_value` is flawed, since it's essentially just constant folding. The code inside `get_scalar_constant_value` walks the graphs manually and performs the same checks as `theano.tensor.opt.topo_constant_folding`, except that the latter uses the `Op`s' own code instead of duplicating it like `get_scalar_constant_value` does.
Here's a simple demonstration for the last example above:

```python
import theano

test_val = test_array.shape[:-1]
fgraph = theano.gof.fg.FunctionGraph(
    theano.gof.graph.inputs([test_val]), [test_val], clone=True
)
res = tt.opt.topo_constant_folding.optimize(fgraph)
```

```python
>>> tt_dprint(fgraph)
TensorConstant{(1,) of 2} [id A]
```
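As an aside, the duplication point can be sketched with a toy graph (hypothetical; `Node` and `fold` are not Aesara's API): a generic folder that recursively delegates to each node's own compute function needs no per-`Op` special casing, which is essentially what `topo_constant_folding` achieves by calling each `Op`'s own implementation.

```python
# Toy sketch (not Aesara's API): constant folding that reuses each node's
# own compute function instead of duplicating per-op logic.
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Node:
    op: Optional[Callable]  # the node's own compute function; None => unknown input
    inputs: Tuple = ()      # child `Node`s or Python constants


def fold(node):
    """Return the constant value of `node`, or raise ValueError."""
    if not isinstance(node, Node):
        return node                      # a constant leaf
    if node.op is None:
        raise ValueError("non-constant input; cannot fold")
    # Recursively fold the inputs, then let the node compute itself.
    return node.op(*(fold(i) for i in node.inputs))


# (1 + (2 * 3)) folds to 7 without any per-op special casing
expr = Node(lambda a, b: a + b, (Node(lambda a, b: a * b, (2, 3)), 1))
print(fold(expr))  # 7
```

The point is that `fold` never needs to know what each op is; it only needs the op's own evaluation code, so there's nothing to keep in sync.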
The only downsides to this approach (that I can think of right now) involve the overhead of cloning and possibly some cases of overly cautious `Op`s that don't allow constant folding (per their `Op.do_constant_folding` methods). The latter can be fixed via an override, and the former could be addressed by broader improvements to the graph and optimization framework.

First off, graph cloning should be very cheap, and, if it isn't, then we need to fix that ASAP—regardless of `get_scalar_constant_value`. Furthermore, we could—and probably should—clone the graphs "lazily" (i.e. on demand during traversal; see https://github.com/aesara-devs/aesara/issues/362). Something like this (or a more functional approach in general) would reduce unnecessary sub-graph cloning when the optimization stops/aborts. Speaking of which, it seems like this particular use of `theano.tensor.opt.constant_folding` should abort immediately when it finds a non-foldable sub-graph.
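The abort-early behavior could look something like the following iterative traversal (a toy sketch over nested tuples, not Aesara's graph types): it returns as soon as it encounters a non-constant leaf, instead of visiting (or cloning) the rest of the graph.

```python
def is_foldable(expr, unknown=frozenset()):
    """Depth-first check that short-circuits on the first non-constant leaf."""
    stack = [expr]
    while stack:
        n = stack.pop()
        if isinstance(n, tuple):         # an apply node: (op_name, *inputs)
            stack.extend(n[1:])
        elif n in unknown:
            return False                 # abort immediately; skip the rest
    return True


print(is_foldable(("add", ("mul", 2, 3), 1)))       # True
print(is_foldable(("add", "b", 1), unknown={"b"}))  # False
```

Combined with lazy cloning, a check like this would mean no sub-graph is ever copied for a folding attempt that's doomed to fail.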
Took a stab at this and came up with the following:
```python
def get_constant_value(v, scalar=True, *args, **kwargs):
    """Return the constant value underlying variable `v`.

    If `v` is the output of dimshuffles, fills, allocs, rebroadcasts,
    casts, `OutputGuard`, `DeepCopyOp`, `ScalarFromTensor`, `ScalarOp`,
    `Elemwise`, and some patterns with `Subtensor`, this function digs
    through them.

    If `v` is not some view of constant data, then raise a
    `NotConstantError`.  If `scalar` is specified and `v` is not a
    scalar, this will raise a `NotScalarConstantError`.

    Parameters
    ----------
    v : Variable
        Variable to be evaluated.
    scalar : bool
        Specify whether the returned value should be a scalar.

    """
    if v is None:
        raise NotConstantError()

    v = as_tensor_variable(v)

    if not isinstance(v, Constant) and v.owner is not None:
        from aesara.graph.opt_utils import optimize_graph

        v_fgraph_clone = FunctionGraph([*graph_inputs([v])], [v], clone=True)
        optimize_graph(v_fgraph_clone)
        v = v_fgraph_clone.outputs[0]

    if not isinstance(v, Constant):
        raise NotConstantError()
    elif scalar and v.ndim != 0:
        raise NotScalarConstantError()
    else:
        unique_value = get_unique_value(v)
        if unique_value is not None:
            v = unique_value
        else:
            v = v.data

    return v
```
The issue with using only `topo_constant_folding.optimize(fgraph)` to optimize the `FunctionGraph` clone is that it will probably be unable to do the constant folding on its own in cases where `topo_constant_folding` optimizes the graph in conjunction with other optimizations.

For instance:
```python
import aesara
import aesara.tensor.basic as at
from aesara.graph.basic import graph_inputs
from aesara.graph.fg import FunctionGraph
from aesara.tensor.basic_opt import topo_constant_folding
from aesara.tensor.type import iscalar


def get_constant_value(v):
    v = at.as_tensor_variable(v)
    v_fgraph_clone = FunctionGraph([*graph_inputs([v])], [v], clone=True)
    topo_constant_folding.optimize(v_fgraph_clone)
    return v_fgraph_clone.outputs[0]


b = iscalar()
a = at.stack([b, 2, 3])

aesara.dprint(get_constant_value(a[1]))
# Isn't able to convert it into TensorConstant{2} by itself:
# Subtensor{int64} [id A] ''
#  |MakeVector{dtype='int32'} [id B] ''
#  | |<TensorType(int32, scalar)> [id C]
#  | |TensorConstant{2} [id D]
#  | |TensorConstant{3} [id E]
#  |ScalarConstant{1} [id F]
```
I do agree that calling the entire set of optimizations (i.e. `optimize_graph(v_fgraph_clone)`) might be unnecessary, but we definitely need a certain "set" of optimizations along with `topo_constant_folding` if we are to go ahead with this approach.
That's a great example, and it could also be telling us that perhaps we need to extend/improve our notion (and implementation) of constant folding.

I say this because there's a weird dependency chain in this situation: the non-constant-folding rewrite that performs that `Subtensor` simplification needs to determine whether or not the appropriate inputs are constant (e.g. the indices), which means that it could be/is dependent on something like `get_constant_value`, which is—in turn—dependent on constant folding (or needs to employ some form/subset of it, at least).

Instead, it might make more sense to consider such a rewrite as a type of constant folding and somehow incorporate it into that framework; then there's no drawn-out dependency chain.
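For illustration, such a "partial" folding step might look like the following toy sketch (hypothetical; not Aesara's API or rewrite machinery): indexing a vector-building node with a constant index folds to a constant whenever the *selected* element is constant, even though the vector as a whole is not.

```python
def fold_subtensor(expr):
    """Fold ("subtensor", ("make_vector", *elems), idx) when possible."""
    op, vec, idx = expr
    assert op == "subtensor" and vec[0] == "make_vector"
    elem = vec[1 + idx]                  # the indexed element
    if isinstance(elem, (int, float)):   # constant: fold it
        return elem
    return expr                          # non-constant element: leave as-is


b = "b"  # stands in for a non-constant symbolic scalar
graph = ("subtensor", ("make_vector", b, 2, 3), 1)
print(fold_subtensor(graph))  # 2: folds despite the non-constant `b`
```

Viewed this way, the `Subtensor`-of-`MakeVector` simplification *is* constant folding, just applied to a sub-graph that isn't constant in its entirety.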
Aside from that consideration, I think you have the correct overall approach, but we need to start working out a way to make these rewrites-within-rewrites more efficient. This is also a rising issue for our future `Op` automation and shape-inference prospects (see #732).

If we could continue to use in-place updates in these situations, that would be great, but we really can't circumvent the rewrite framework like that without incurring some problems/challenges.
Actually, I'm starting to wonder how much we even need `get_scalar_constant_value` anymore. There might be a general design principle that—if we follow it sufficiently—completely obviates it.

We've updated and added a lot of missing canonicalizations recently, and most—if not all—of the steps performed by `get_scalar_constant_value` are canonicalizations.

If you think of it that way, using `get_scalar_constant_value` within a rewrite is simply an attempt to preempt those canonicalization steps, but I'm not sure that's useful at all, because preempting them in any given rewrite doesn't mean you'll avoid needing to run those same canonicalization steps later. In other words, if rewrites simply wait for the canonicalizations to run and reduce eligible subgraphs to constants, they can achieve the same results and avoid walking the graph and folding constants that will need to be walked and folded anyway.

The only instances where that's not true are when a rewrite uses `get_scalar_constant_value` on a subgraph before the canonicalizations are applied to it and also happens to entirely remove said subgraph.
We can just keep the above `get_constant_value` as a helper function while removing all its occurrences in graph optimizers.
Yeah, that's likely the best course of action.
@brandonwillard Can you describe the cache-based memoization approach you proposed in the meeting for this particular functionality?

I remember it being something like being able to call certain constant folding optimizations at proper "checkpoints" during graph optimizations.
I was really just thinking of using something like `functools.lru_cache`. We would need to create a single entry point to our constant folding and use that everywhere; otherwise, if we used a cache for only the constant folding that occurs outside of the rewrite passes, we would still end up re-constant-folding some sub-graphs when the rewrite passes are applied and that cache isn't used.

The question is really about how we can effectively use caching in these cases, given that `Apply` nodes are updated in place. If we assume that constant sub-graphs are never replaced with non-constant sub-graphs, then we're fine.
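A minimal sketch of the caching idea with `functools.lru_cache` (again a toy over immutable tuples; real `Apply` nodes are mutated in place, which is exactly the caveat above):

```python
from functools import lru_cache

calls = 0


@lru_cache(maxsize=None)
def fold(expr):
    """Memoized constant folding over immutable (op, *inputs) tuples."""
    global calls
    calls += 1
    if not isinstance(expr, tuple):
        return expr                       # a constant leaf
    op, *inputs = expr
    vals = [fold(i) for i in inputs]
    return sum(vals) if op == "add" else vals[0] * vals[1]


shared = ("add", 1, ("mul", 2, 3))
print(fold(shared))      # 7
n_calls = calls
fold(shared)             # second request hits the cache
print(calls == n_calls)  # True: no sub-graph is re-folded
```

Here the cache key is the node itself, so this only stays sound under the stated assumption: a cached constant sub-graph is never mutated into a non-constant one behind the cache's back.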
`aesara.tensor.get_scalar_constant_value` is a utility function that contains unreasonably long conditional statements and introduces more unnecessary cross-module—and ultimately circular—dependencies (e.g. `aesara.tensor.subtensor.Subtensor.get_constant_idx` calls `aesara.tensor.get_scalar_constant_value` and vice versa).

Let's fix this situation by ~~moving this utility function into its own module and/or reimplementing it using some form of dispatch (e.g. single-dispatch, if possible). With the latter, the `Op`-specific parts of `get_scalar_constant_value` can be defined within each `Op`'s own module and utilized only when relevant (i.e. when/if the `Op` itself is imported)~~ using the existing constant folding capabilities—perhaps with some additional features (e.g. more fine-grained folding conditions).