aesara-devs / aesara

Aesara is a Python library for defining, optimizing, and efficiently evaluating mathematical expressions involving multi-dimensional arrays.
https://aesara.readthedocs.io

Refactor/replace/remove `aesara.tensor.basic.get_scalar_constant_value` #96

Open brandonwillard opened 3 years ago

brandonwillard commented 3 years ago

aesara.tensor.get_scalar_constant_value is a utility function that contains unreasonably long conditional statements and introduces more unnecessary cross-module—and ultimately circular—dependencies (e.g. aesara.tensor.subtensor.Subtensor.get_constant_idx calls aesara.tensor.get_scalar_constant_value and vice-versa).

Let's fix this situation by ~moving this utility function into its own module and/or reimplementing it using some form of dispatch (e.g. single-dispatch, if possible). With the latter, the Op-specific parts of get_scalar_constant_value can be defined within each Op's own module and utilized only when relevant (i.e. when/if the Op itself is imported)~ using the existing constant folding capabilities—perhaps with some additional features (e.g. more fine-grained folding conditions).
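As a purely illustrative sketch of what 'more fine-grained folding conditions' could mean, one option is a per-node predicate consulted in addition to Op.do_constant_folding. The name and the exclude_ops parameter below are invented for illustration and are not existing API:

from aesara.graph.basic import Constant

def fold_condition(node, exclude_ops=()):
    """Hypothetical per-node condition deciding whether `node` may be folded."""
    # Allow callers to opt specific `Op` types out of folding.
    if exclude_ops and isinstance(node.op, tuple(exclude_ops)):
        return False
    # Otherwise, only consider nodes whose direct inputs are all constants.
    return all(isinstance(i, Constant) for i in node.inputs)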

brandonwillard commented 3 years ago

Here's some more info on the situation with get_scalar_constant_value:

The function theano.tensor.basic.get_scalar_constant_value is supposed to work like numpy.ndarray.item and return the scalar constant value from a symbolic scalar or a tensor with only one entry. It only works when the graph consists of certain Ops with constant inputs. Unfortunately, its implementation is extremely limited and inconsistent—even for the Ops it's supposed to support.
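For comparison, this is the NumPy behavior it is meant to mirror at the symbolic level:

import numpy as np

# `item` extracts the single underlying scalar from a size-one array.
np.array([[5]]).item()  # returns 5
np.int32(5).item()      # also 5, for zero-dimensional/scalar values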

For example, the following provides some simple cases in which the Subtensor Op isn't handled correctly:

import theano.tensor as tt

test_array = tt.as_tensor([[1], [4]])

test_val = test_array.shape[-1]  # i.e. 1
tt.get_scalar_constant_value(test_val)  # works

test_val = test_array.shape[0]  # i.e. 2
tt.get_scalar_constant_value(test_val)  # doesn't work

test_val = test_array.shape[:-1]  # i.e. [2]
tt.get_scalar_constant_value(test_val)  # doesn't work

Now, we could fix these specific cases, but the entire idea behind get_scalar_constant_value is flawed, since it's essentially just constant folding. The code inside get_scalar_constant_value walks the graph manually and performs the same checks as theano.tensor.opt.topo_constant_folding, except that the latter uses the Ops' own code instead of duplicating it the way get_scalar_constant_value does.

Here's a simple demonstration for the last example above:

import theano
from theano.printing import debugprint as tt_dprint

test_val = test_array.shape[:-1]

fgraph = theano.gof.fg.FunctionGraph(theano.gof.graph.inputs([test_val]), [test_val], clone=True)
res = tt.opt.topo_constant_folding.optimize(fgraph)

tt_dprint(fgraph)
# TensorConstant{(1,) of 2} [id A]

The only downsides to this approach (that I can think of right now) involve the overhead for cloning and possibly some cases of overly cautious Ops that don't allow constant folding (per their Op.do_constant_folding methods). The latter can be fixed via an override, and the former could be addressed by broader improvements to the graph and optimization framework.

First off, graph cloning should be very cheap, and, if it isn't, then we need to fix that ASAP—regardless of get_scalar_constant_value. Furthermore, we could—and probably should—clone the graphs "lazily" (i.e. on-demand during traversal; see https://github.com/aesara-devs/aesara/issues/362). Something like this (or a more functional approach in general) would reduce unnecessary sub-graph cloning when the optimization stops/aborts. Speaking of which, it seems like this particular use of theano.tensor.opt.constant_folding should abort immediately when it finds a non-foldable sub-graph.
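A minimal sketch of that abort-early idea, using only the existing aesara.graph.basic helpers (the function name is hypothetical):

from aesara.graph.basic import Constant, graph_inputs

def has_only_constant_inputs(var):
    """Cheap pre-check: only graphs whose leaves are all constants can fold."""
    return all(isinstance(i, Constant) for i in graph_inputs([var]))

A real implementation would want to stop during traversal rather than materializing all of the graph inputs first, but the effect is the same: nothing gets cloned or rewritten once a non-constant leaf is found.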

kc611 commented 2 years ago

I took a stab at this and came up with the following:

def get_constant_value(
    v, scalar=True, *args, **kwargs
):
    """Return the constant value underlying variable `v`.

    If `v` is the output of dimshuffles, fills, allocs, rebroadcasts,
    cast, OutputGuard, DeepCopyOp, ScalarFromTensor, ScalarOp, Elemwise
    and some pattern with Subtensor, this function digs through them.

    If `v` is not some view of constant data, then raise a
    NotConstantError. If specified `v` as scalar, this will raise
    a NonScalarConstantError

    Parameters
    ----------
    v: Variable
        Variable to be evaluated
    scalar: bool
        Specify if the returned value should be a scalar

    """

    if v is None:
        raise NotConstantError()

    v = as_tensor_variable(v)
    if not isinstance(v, Constant) and v.owner is not None:
        from aesara.graph.opt_utils import optimize_graph
        v_fgraph_clone = FunctionGraph([*graph_inputs([v])], [v], clone=True)
        optimize_graph(v_fgraph_clone)
        v = v_fgraph_clone.outputs[0]

    if not isinstance(v, Constant):
        raise NotConstantError()
    elif (scalar and v.ndim != 0):
        raise NotScalarConstantError()
    else:
        unique_value = get_unique_value(v)
        if unique_value is not None:
            v = unique_value
        else:
            v = v.data
        return v
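Assuming the helper above and the exceptions it references are importable, usage would presumably look something like this (the expected results are inferred, not verified):

import aesara.tensor as at

x = at.as_tensor_variable([[1], [4]])

# Cases that the current get_scalar_constant_value mishandles should now be
# reduced by the rewrites applied to the cloned graph:
get_constant_value(x.shape[0])                  # should fold to the scalar 2
get_constant_value(x.shape[:-1], scalar=False)  # should fold to the constant vector [2]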

The issue with using 'only' topo_constant_folding.optimize(fgraph) to optimize the FunctionGraph clone is that, on its own, it will probably be unable to do the constant folding in cases where topo_constant_folding only succeeds in conjunction with other optimizations.

For instance:

from aesara.tensor.basic_opt import topo_constant_folding
from aesara.graph.fg import FunctionGraph
from aesara.tensor.type import iscalar
import aesara.tensor.basic as at
import aesara
from aesara.graph.basic import graph_inputs

def get_constant_value(v):
    v = at.as_tensor_variable(v)
    v_fgraph_clone = FunctionGraph([*graph_inputs([v])], [v], clone=True)
    topo_constant_folding.optimize(v_fgraph_clone)
    v = v_fgraph_clone.outputs[0]
    return v

b = iscalar()
a = at.stack([b, 2, 3])
aesara.dprint(get_constant_value(a[1]))
# Isn't able to convert it into TensorConstant{2} by itself
# Subtensor{int64} [id A] ''   
#  |MakeVector{dtype='int32'} [id B] ''   
#  | |<TensorType(int32, scalar)> [id C]
#  | |TensorConstant{2} [id D]
#  | |TensorConstant{3} [id E]
#  |ScalarConstant{1} [id F]

I do agree that calling the entire set of optimizations, i.e. optimize_graph(v_fgraph_clone), might be unnecessary, but we definitely need a certain 'set' of optimizations along with topo_constant_folding if we are to go ahead with this approach.

brandonwillard commented 2 years ago

> For instance: [...the example above...]
>
> I do agree that calling the entire set of optimizations, i.e. optimize_graph(v_fgraph_clone), might be unnecessary, but we definitely need a certain 'set' of optimizations along with topo_constant_folding if we are to go ahead with this approach.

That's a great example, and it could also be telling us that perhaps we need to extend/improve our notion (and implementation) of constant folding.

I say this because there's a weird dependency chain in this situation: the non-constant folding rewrite that performs that Subtensor simplification needs to determine whether or not the appropriate inputs are constant (e.g. the indices), which means that it could be/is dependent on something like get_constant_value, which is—in turn—dependent on constant folding (or needs to employ some form/subset of it, at least).

Instead, it might make more sense to consider such a rewrite as a type of constant folding and somehow incorporate it into that framework; then there's no drawn-out dependency chain.
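To make that concrete, here is a rough sketch of what treating the Subtensor-of-MakeVector simplification as constant folding of the index alone could look like. The decorator, its (fgraph, node) signature, and the module paths reflect my reading of the rewrite API at the time and should be treated as assumptions; aesara's existing local_subtensor_make_vector rewrite already covers essentially this pattern.

from aesara.graph.basic import Constant
from aesara.graph.opt import local_optimizer
from aesara.tensor.basic import MakeVector
from aesara.tensor.subtensor import Subtensor

@local_optimizer([Subtensor])
def local_fold_subtensor_of_make_vector(fgraph, node):
    """Fold `MakeVector(...)[i]` when only the index `i` is constant."""
    x, *idx = node.inputs

    if not (x.owner and isinstance(x.owner.op, MakeVector)):
        return None

    # Only handle a single, constant, integer index.
    if len(idx) != 1 or not isinstance(idx[0], Constant):
        return None

    i = int(idx[0].data)
    # A constant index into a MakeVector simply selects one of its inputs,
    # so no folding of the non-constant entries is required.
    return [x.owner.inputs[i]]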

Aside from that consideration, I think you have the correct overall approach, but that we need to start working out a way to make these rewrites-within-rewrites more efficient. This is also a rising issue for our future Op automation and shape inference prospects (see #732).

If we could continue to use in-place updates in these situations, that would be great, but we really can't circumvent the rewrite framework like that without incurring some problems/challenges.

brandonwillard commented 2 years ago

Actually, I'm starting to wonder how much we even need get_scalar_constant_value anymore. There might be a general design principle that—if we follow sufficiently—completely obviates it.

We've updated and added a lot of missing canonicalizations recently, and most—if not all—of the steps performed by get_scalar_constant_value are canonicalizations.

If you think of it that way, using get_scalar_constant_value within a rewrite is simply an attempt to preempt those canonicalization steps, but I'm not sure that's useful at all, because preempting them in any given rewrite doesn't make it likely that you'll avoid needing to run those same canonicalization steps later. In other words, if rewrites simply wait for the canonicalizations to run and reduce eligible subgraphs to constants, they can achieve the same results and avoid walking the graph and folding constants that will need to be walked and folded anyway.

The only instances where that's not true are when a rewrite uses get_scalar_constant_value on a subgraph before the canonicalizations are applied to it and also happens to entirely remove said subgraph.

kc611 commented 2 years ago

We can just keep the above get_constant_value as a helper function while removing all its occurrences in graph optimizers.

brandonwillard commented 2 years ago

> We can just keep the above get_constant_value as a helper function while removing all its occurrences in graph optimizers.

Yeah, that's likely the best course of action.

kc611 commented 2 years ago

@brandonwillard Can you describe the cache-based memoization approach you proposed in the meeting for this particular functionality?

I remember it being something like being able to call certain constant-folding optimizations at proper 'checkpoints' during graph optimization.

brandonwillard commented 2 years ago

> @brandonwillard Can you describe the cache-based memoization approach you proposed in the meeting for this particular functionality?
>
> I remember it being something like being able to call certain constant-folding optimizations at proper 'checkpoints' during graph optimization.

I was really just thinking of using something like functools.lru_cache. We would need to create a single entry point to our constant folding and use that everywhere; otherwise, if we used a cache only for the constant folding that occurs outside of the rewrite passes, we would still end up re-constant-folding some sub-graphs when the rewrite passes are applied and that cache isn't used.

The question is really about how we can effectively use caching in these cases given that Apply nodes are updated in-place. If we assume that constant sub-graphs are never replaced with non-constant sub-graphs, then we're fine.
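For reference, a minimal sketch of the caching idea under exactly that assumption; the constant_fold name and its role as the single shared entry point are hypothetical:

from functools import lru_cache

from aesara.graph.basic import graph_inputs
from aesara.graph.fg import FunctionGraph
from aesara.tensor.basic_opt import topo_constant_folding

@lru_cache(maxsize=None)
def constant_fold(var):
    """Memoized single entry point for constant folding a variable.

    Relies on `Variable`s being hashable and on constant sub-graphs never
    being replaced in-place by non-constant ones; otherwise the cache could
    return stale results.
    """
    fgraph = FunctionGraph(list(graph_inputs([var])), [var], clone=True)
    topo_constant_folding.optimize(fgraph)
    return fgraph.outputs[0]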