Allow merging graphs with multiple clients in `FusionOptimizer`

ricardoV94 commented 2 years ago

Right now, Elemwise composite graphs will be split at nodes that have multiple clients. This can be avoided by allowing the fusion rewriter to replace multiple outputs. One example where this is very common is in value_and_grad functions.

import aesara
import aesara.tensor as at

x = at.vector("x")
y = at.exp(x/5)
w = y + 1
z = y * 2

f = aesara.function([x], [z, w])
aesara.dprint(f)

Elemwise{Mul}[(0, 1)] [id A] 2
 |TensorConstant{(1,) of 2.0} [id B]
 |Elemwise{Composite{exp((i0 * i1))}} [id C] 0
   |TensorConstant{(1,) of 0.2} [id D]
   |x [id E]
Elemwise{add,no_inplace} [id F] 1
 |TensorConstant{(1,) of 1.0} [id G]
 |Elemwise{Composite{exp((i0 * i1))}} [id C] 0

f = aesara.function([x], [z])
aesara.dprint(f)

Elemwise{Composite{(i0 * exp((i1 * i2)))}} [id A] 0
 |TensorConstant{(1,) of 2.0} [id B]
 |TensorConstant{(1,) of 0.2} [id C]
 |x [id D]

Discussed with @aseyboldt and @aesara-devs/aesara

See rationale and proposed changes in https://github.com/aesara-devs/aesara/issues/1237#issuecomment-1283599010

ricardoV94 commented 2 years ago

@brandonwillard Do you have any pointers for non-local rewriters that replace multiple nodes? It sounds like the functionality for multiple outputs is already present in the scalar Composite, the only thing we need is to ramp up the FusionOptimizer

brandonwillard commented 2 years ago

@brandonwillard Do you have any pointers for non-local rewriters that replace multiple nodes? It sounds like the functionality for multiple outputs is already present in the scalar Composite, the only thing we need is to ramp up the FusionOptimizer

Some of the dict-based replacement code from the non-global rewriters might come in handy; otherwise, there shouldn't be anything special about the local vs. global approaches.

brandonwillard commented 1 year ago

Elemwise{Composite{(i0 exp((i1 i2)))}} [id A] 0 |TensorConstant{(1,) of 2.0} [id B] |TensorConstant{(1,) of 0.2} [id C] |x [id D]

We should probably look into these kinds of fusions that turn scalar constants into inputs. It's not clear to me that this is any better than—say—making i0 and i1 constants and removing those two inputs.

In other words, why don't we produce the following instead?

Elemwise{Composite{(2.0 * exp((0.2 * i0)))}} [id A] 0
 |x [id B]

brandonwillard commented 1 year ago

Also, can you clarify the kinds of graphs you think we should be producing in these multi-client cases?

ricardoV94 commented 1 year ago

In other words, why don't we produce the following instead?

I was thinking the same. I can only think of wanting to reuse the same c_code across different Applys, but that's not as relevant for Composites

ricardoV94 commented 1 year ago

Clarifying what is the idea behind allowing Composites with multiple clients. The following example:

import aesara
import aesara.tensor as at

x = at.vector("x")
y = at.exp(x/5)
w = y + 1
z = y * 2

f1 = aesara.function([x], [z, w])
aesara.dprint(f1)

Right now produces this graph:

Elemwise{Mul}[(0, 1)] [id A] 2
 |TensorConstant{(1,) of 2.0} [id B]
 |Elemwise{Composite{exp((i0 * i1))}} [id C] 0
   |TensorConstant{(1,) of 0.2} [id D]
   |x [id E]
Elemwise{add,no_inplace} [id F] 1
 |TensorConstant{(1,) of 1.0} [id G]
 |Elemwise{Composite{exp((i0 * i1))}} [id C] 0

After #1242, it produces this graph:

Elemwise{Composite{(i2 * exp((i0 * i1))), (i3 + exp((i0 * i1)))}}.0 [id A] 0
 |TensorConstant{(1,) of 0.2} [id B]
 |x [id C]
 |TensorConstant{(1,) of 2.0} [id D]
 |TensorConstant{(1,) of 1.0} [id E]
Elemwise{Composite{(i2 * exp((i0 * i1))), (i3 + exp((i0 * i1)))}}.1 [id A] 0

The expected savings come from having to iterate only once over the data vector x, vs having to iterate 3 times before, once over x and twice over exp(i0 * i1).

Another example is the following graph:

x = at.dvector("x")
mu = at.dvector("mu")
logp = (- ((x - mu) **2) / 2)
grad = at.grad(logp.sum(), x)
f2 = aesara.function([mu, x], [logp, grad])
aesara.dprint(f2)

Before:

Elemwise{Composite{(i0 * sqr(i1))}}[(0, 1)] [id A] 2
 |TensorConstant{(1,) of -0.5} [id B]
 |Elemwise{sub,no_inplace} [id C] 0
   |x [id D]
   |mu [id E]
Elemwise{neg,no_inplace} [id F] 1
 |Elemwise{sub,no_inplace} [id C] 0

After:

Elemwise{Composite{(-(i0 - i1)), (i2 * sqr((i0 - i1)))}}.1 [id A] 0
 |x [id B]
 |mu [id C]
 |TensorConstant{(1,) of -0.5} [id D]
Elemwise{Composite{(-(i0 - i1)), (i2 * sqr((i0 - i1)))}}.0 [id A] 0

Again we replace 3 loops by 1.

Regarding CSE, I don't think anything is changed. The code generated by the scalar Composite seems to be clever enough and reuse repeated computations in intermediate variables. Here is the C code of the Composite scalar in the first function:

// function 1
{
npy_float64 V13_tmp1;
// i0 * i1
V13_tmp1 = V11_i * V9_i;

npy_float64 V13_tmp2;
// exp(i0 * i1)
V13_tmp2 = exp((npy_float64)V13_tmp1);

// First output
// i2 * exp(i0 * i1)
V1_i = V7_i * V13_tmp2;

// Second output
// i3 + exp(i0 * i1)
V3_i = V5_i + V13_tmp2;
}

// function 2
{
npy_float64 V11_tmp1;
// i0 - i1
V11_tmp1 = V9_i - V7_i;

// First output
// - (i0 - i1)
V1_i = -V11_tmp1;

npy_float64 V11_tmp2;
// sqr(i0 - i1)
V11_tmp2 = V11_tmp1 * V11_tmp1;

// Second output
// i2 * sqr(i0 - i1)
V3_i = V5_i * V11_tmp2;
}

You can see that it avoids recomputing the same sub-expressions in both cases. It does store more intermediate variables than needed, but hopefully the compiler will be able to get rid of those.

I think the idea of inplacing scalar constants is worth investigating, but deserves a separate issue. It was already relevant for the old Composite graphs. Edit: I see it was already separated into #1270

brandonwillard commented 1 year ago

Do you have a comparison like https://github.com/aesara-devs/aesara/pull/1242#issuecomment-1282335537 for these new CSE-related changes to your approach?

aesara-devs / aesara

Allow merging graphs with multiple clients in `FusionOptimizer` #1237