Closed · rlouf closed this 3 years ago
@brandonwillard it looks like I still have a problem with a `RandomStream` node that I do not understand. The stack trace I get when executing `test_dynamic_expansion` in `test_trajectory.py` seems to indicate that the problem comes from the dynamic integration. However, the test `test_dynamic_integration_divergence` does pass. The problem appears when I share the `RandomStream` between two functions. Is there a workaround?

The following simplified example does work as expected:
```python
from typing import Callable

import aesara
import aesara.tensor as aet
from aesara.tensor.random.utils import RandomStream

rng = RandomStream(59)


def bernoulli_adder(srng: RandomStream):
    def add_bernoulli(b):
        def one_step(x):
            a = aet.where(srng.bernoulli(0.5), 1, -1)
            c = a + x
            return c

        res, _ = aesara.scan(one_step, outputs_info=(b,), n_steps=10)
        return res

    return add_bernoulli


def normal_adder(srng: RandomStream, add_bernoulli: Callable):
    def add_normal(init):
        def one_step(a):
            res = add_bernoulli(a)
            final = res[-1]
            norm = srng.normal(0, 1)
            return final + norm

        res, _ = aesara.scan(one_step, outputs_info=(init,), n_steps=10)
        return res

    return add_normal


init = aet.scalar("init")
add_bernoulli = bernoulli_adder(rng)
add_normal = normal_adder(rng, add_bernoulli)
result = add_normal(init)

fn = aesara.function((init,), result)
print(fn(10))
```
This makes me think that the issue may be something more trivial, but at this point I need another pair of eyes to look at the code.
I don't see a `test_dynamic_expansion` function.

`test_multiplicative_expansion`, sorry.
The missing input variable appears to be introduced by/within `new_potential_energy_grad`. If you set a name for it (e.g. `new_potential_energy_grad.name = "new_potential_energy_grad"`), the error will say that the missing variable is `new_potential_energy_grad[t-1]` (the `[t-1]` suffix usually means that `Scan` is using that variable as an initial value).
OK, I think the issue might have to do with the `outputs_info` arguments to the `scan` here. It looks like quite a few arguments are duplicated, including the `new_potential_energy_grad` variable in the error message. `Scan` might not be handling duplicate arguments well, or the inputs could be misspecified in some way that `Scan` isn't catching.
@brandonwillard it seems like we don't have any problems with the updates inside the `scan` operators. I fixed a bunch of type mismatch errors, and I am now stuck with the following error:

```
TypeError: ('The following error happened while compiling the node', do_whileall_inplace,cpu,scan_fn}(TensorConstant{10}, IncSubtensor{InplaceSet;:int64:}.0, IncSubtensor{InplaceSet;:int64:}.0, IncSubtensor{InplaceSet;:int64:}.0, IncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, IncSubtensor{Set;:int64:}.0, DeepCopyOp.0, IncSubtensor{Set;:int64:}.0, DeepCopyOp.0, DeepCopyOp.0, DeepCopyOp.0, DeepCopyOp.0, DeepCopyOp.0, DeepCopyOp.0, DeepCopyOp.0, DeepCopyOp.0, DeepCopyOp.0, IncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, IncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, RandomGeneratorSharedVariable(<Generator(PCG64) at 0x7F709EB79D60>), RandomGeneratorSharedVariable(<Generator(PCG64) at 0x7F70BC80F3C0>), RandomGeneratorSharedVariable(<Generator(PCG64) at 0x7F709C9FE9E0>), RandomGeneratorSharedVariable(<Generator(PCG64) at 0x7F709C6BF9E0>)), '\n', "Inconsistency in the inner graph of scan 'scan_fn' : an input and an output are associated with the same recurrent state and should have the same type but have type 'TensorType(float64, (True,))' and 'TensorType(float64, vector)' respectively.")
```

As you can see in the `test_trajectory.py` file, I did print the inputs to `expand` and the output of `expand_once` (commented out). Inputs and outputs have the same shape. I don't really know what else to try.
OK, I'll take a look at it now; otherwise, that's an awful error message! We need to fix that in Aesara.
(The error message is fixed in https://github.com/aesara-devs/aesara/pull/537.)
A note about that kind of error: whenever you see an error involving a discrepancy between two symbolic variables like `TensorType(float64, (True,))` and `TensorType(float64, vector)`, it's often because the former variable is an "over-specialized" instance of the latter.
For example, if we define a generic symbolic vector `x = at.vector()`, it will have an `x.type` of `TensorType(float64, vector)`. If we create a very specific instance of a vector via `y = at.as_tensor(np.r_[1.0])`, we'll get a `y.type` of `TensorType(float64, (True,))`.
Both `x` and `y` are vectors, but the irreconcilable difference between the two types is their broadcast patterns (i.e. the dimensions that are known to be one). In other words, `x.type.broadcastable == (False,)` (i.e. the `vector` part in the `str` output of `x.type`) and `y.type.broadcastable == (True,)` (i.e. the `(True,)` part in the `str` output of `y.type`).
`aesara.tensor.as_tensor[_variable]` is sometimes the cause of such errors, as the example above illustrates, but it can also be due to an `Op` that does a poor job of inferring the broadcast pattern(s) of its output(s), or a number of other related things.
FYI: if we stop in the debugger at that error, we can see exactly where one of the offending variables (specifically, the latter `TensorType(float64, vector)`, which is one of the `Scan`'s "inner-output" variables) was created:

```
pdb> pp self.outputs[inner_oidx].tag.trace[-1][-1]
('.../aehmc/aehmc/trajectory.py',
 463,
 'where_state',
 'q = aet.where(do_pick_left, q_left, q_right)')
```
Don't forget: if you name these variables, it can be a lot easier to identify their origins (and read the graph output).
I was able to track down the source of the bug. The trajectory is always interrupted early because the termination criterion returns `True` at the first step. This is related to what happens when the step index is 0; I'll investigate.
Another curious thing: after making the step index start from 0 instead of 1, I get the following error:

```
    if n_fixed_steps in [1, -1]:
        # We do not need to use the scan op anymore, so we can just return
        # the outputs and updates we have
        if condition is not None:
            _logger.warning(
>               (
                    "When the number of steps is fixed and equal "
                    "to 1, the provided stopping condition, {} is ignored",
                ).format(condition)
            )
E   AttributeError: 'tuple' object has no attribute 'format'
```
The error disappears when I delete the call to `_logger.warning`. I tried to reproduce this on a simpler example but could not.
Where is this code coming from?
It comes from the implementation of the `scan` function.
Ah, yeah, that's a simple bug to fix.
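A minimal reproduction of the Python-level bug, independent of Aesara: the trailing comma inside the parentheses turns the two concatenated string literals into a one-element tuple, so `.format` is looked up on the tuple rather than the string:

```python
# Buggy form: the trailing comma makes this a 1-tuple, not a str.
msg = (
    "When the number of steps is fixed and equal "
    "to 1, the provided stopping condition, {} is ignored",
)
try:
    msg.format("cond")
except AttributeError as e:
    print(e)  # 'tuple' object has no attribute 'format'

# Fixed form: without the comma, adjacent string literals concatenate into one str.
fixed = (
    "When the number of steps is fixed and equal "
    "to 1, the provided stopping condition, {} is ignored"
)
print(fixed.format("cond"))
```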
@brandonwillard @ricardoV94 the tests are failing because of the aesara bug mentioned above. It is otherwise good for review.
I'll create a new release of Aesara with that fix in just a minute.
Merging #7 (995656d) into main (21b460d) will not change coverage. The diff coverage is 100.00%.
```
@@           Coverage Diff           @@
##             main        #7   +/-  ##
========================================
  Coverage   100.00%   100.00%
========================================
  Files            6         9     +3
  Lines          125       343   +218
  Branches        11        14     +3
========================================
+ Hits           125       343   +218
```
| Impacted Files | Coverage Δ |
|---|---|
| aehmc/hmc.py | 100.00% <ø> (ø) |
| aehmc/integrators.py | 100.00% <100.00%> (ø) |
| aehmc/metrics.py | 100.00% <100.00%> (ø) |
| aehmc/nuts.py | 100.00% <100.00%> (ø) |
| aehmc/proposals.py | 100.00% <100.00%> (ø) |
| aehmc/termination.py | 100.00% <100.00%> (ø) |
| aehmc/trajectory.py | 100.00% <100.00%> (ø) |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
The tests are now passing locally, but it looks like there is a problem with the `aesara` installation in the CI. Any idea why?
Will add a logistic regression example to the notebook once #19 is merged.
In this PR we add the NUTS kernel.