aesara-devs / aehmc

An HMC/NUTS implementation in Aesara
MIT License

add NUTS kernel #7

Closed · rlouf closed 3 years ago

rlouf commented 3 years ago

In this PR we add the NUTS kernel.

rlouf commented 3 years ago

@brandonwillard it looks like I still have a problem with a RandomStream node that I do not understand. The stack trace I get when executing test_dynamic_expansion in test_trajectory.py seems to indicate that the problem comes from the dynamic integration. However, the test test_dynamic_integration_divergence does pass. The problem seems to appear when I share the RandomStream between two functions. Is there a workaround?

Stack trace:

```
WARNING (aesara.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
Traceback (most recent call last):
  File "/home/remi/projects/ampersand/aehmc/example.py", line 64, in <module>
    fn = aesara.function((), result)
  File "/home/remi/.virtualenvs/aehmc/lib/python3.9/site-packages/aesara/compile/function/__init__.py", line 337, in function
    fn = pfunc(
  File "/home/remi/.virtualenvs/aehmc/lib/python3.9/site-packages/aesara/compile/function/pfunc.py", line 524, in pfunc
    return orig_function(
  File "/home/remi/.virtualenvs/aehmc/lib/python3.9/site-packages/aesara/compile/function/types.py", line 1972, in orig_function
    m = Maker(
  File "/home/remi/.virtualenvs/aehmc/lib/python3.9/site-packages/aesara/compile/function/types.py", line 1586, in __init__
    fgraph, additional_outputs = std_fgraph(inputs, outputs, accept_inplace)
  File "/home/remi/.virtualenvs/aehmc/lib/python3.9/site-packages/aesara/compile/function/types.py", line 190, in std_fgraph
    fgraph = FunctionGraph(orig_inputs, orig_outputs, update_mapping=update_mapping)
  File "/home/remi/.virtualenvs/aehmc/lib/python3.9/site-packages/aesara/graph/fg.py", line 167, in __init__
    self.import_var(output, reason="init")
  File "/home/remi/.virtualenvs/aehmc/lib/python3.9/site-packages/aesara/graph/fg.py", line 337, in import_var
    self.import_node(var.owner, reason=reason, import_missing=import_missing)
  File "/home/remi/.virtualenvs/aehmc/lib/python3.9/site-packages/aesara/graph/fg.py", line 402, in import_node
    raise MissingInputError(error_msg, variable=var)
aesara.graph.fg.MissingInputError: Input 1 () of the graph (indices start from 0), used to compute Elemwise{mul,no_inplace}(InplaceDimShuffle{x}.0, ), was not provided and not given a value. Use the Aesara flag exception_verbosity='high', for more information on this error.
Backtrace when that variable is created:
  File "/home/remi/projects/ampersand/aehmc/example.py", line 63, in <module>
    result = expand(proposal, state, state, state[1], termination_state, energy)
  File "/home/remi/projects/ampersand/aehmc/aehmc/trajectory.py", line 428, in expand
    ), updates = trajectory_integrator(
  File "/home/remi/projects/ampersand/aehmc/aehmc/trajectory.py", line 236, in integrate
    traj, updates = aesara.scan(
```
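As an aside, the error message itself points at a useful debugging flag; setting it before compiling makes Aesara print the full graph context for errors like this:

```python
import aesara

# Print more context when graph compilation fails, as suggested by the
# MissingInputError message above.
aesara.config.exception_verbosity = "high"
```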

The following simplified example does work as expected:

```python
from typing import Callable

import aesara
import aesara.tensor as aet
from aesara.tensor.random.utils import RandomStream

# A single RandomStream instance shared by both scan-based functions below.
rng = RandomStream(59)


def bernoulli_adder(srng: RandomStream):
    def add_bernoulli(b):
        def one_step(x):
            a = aet.where(srng.bernoulli(0.5), 1, -1)
            c = a + x
            return c

        res, _ = aesara.scan(one_step, outputs_info=(b,), n_steps=10)
        return res

    return add_bernoulli


def normal_adder(srng: RandomStream, add_bernoulli: Callable):
    def add_normal(init):
        def one_step(a):
            # Inner scan (via add_bernoulli) nested inside this outer scan,
            # both drawing from the same RandomStream.
            res = add_bernoulli(a)
            final = res[-1]
            norm = srng.normal(0, 1)
            return final + norm

        res, _ = aesara.scan(one_step, outputs_info=(init,), n_steps=10)
        return res

    return add_normal


init = aet.scalar("init")
add_bernoulli = bernoulli_adder(rng)
add_normal = normal_adder(rng, add_bernoulli)
result = add_normal(init)

fn = aesara.function((init,), result)

print(fn(10))
```

This makes me think that the issue may be something more trivial. At this point, though, I need another pair of eyes on the code.

brandonwillard commented 3 years ago

I don't see a test_dynamic_expansion function.

rlouf commented 3 years ago

test_multiplicative_expansion, sorry.

brandonwillard commented 3 years ago

The missing input variable appears to be introduced by/within new_potential_energy_grad. If you set a name for that variable (e.g. new_potential_energy_grad.name = "new_potential_energy_grad"), the error will say that the missing variable is new_potential_energy_grad[t-1] (the [t-1] suffix usually means that Scan is using that variable as an initial value).

brandonwillard commented 3 years ago

OK, I think the issue might have to do with the outputs_info arguments to the scan here. It looks like quite a few arguments are duplicated (including the new_potential_energy_grad variable in the error message), and Scan might not be handling duplicate arguments well; alternatively, the inputs could be misspecified in some way that Scan isn't catching.
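To make "duplicated" concrete, here is a hypothetical sketch (not the actual aehmc graph) of the same variable appearing twice in outputs_info, so that two recurrent states share one initial value:

```python
import aesara
import aesara.tensor as at

x0 = at.dscalar("x0")

def step(a, b):
    # Two recurrent states whose initial values are the same variable.
    return a + 1.0, b * 2.0

# `x0` appears twice in outputs_info: each entry seeds one recurrent state.
outputs, updates = aesara.scan(step, outputs_info=[x0, x0], n_steps=5)
```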

rlouf commented 3 years ago

@brandonwillard it seems like we don't have any problems with the updates inside the scan operators. I fixed a bunch of type mismatch errors, and I am now stuck with the following error:

```
TypeError: ('The following error happened while compiling the node',
do_whileall_inplace,cpu,scan_fn}(TensorConstant{10},
IncSubtensor{InplaceSet;:int64:}.0, IncSubtensor{InplaceSet;:int64:}.0,
IncSubtensor{InplaceSet;:int64:}.0, IncSubtensor{InplaceSet;:int64:}.0,
DeepCopyOp.0, IncSubtensor{Set;:int64:}.0, DeepCopyOp.0,
IncSubtensor{Set;:int64:}.0, DeepCopyOp.0, DeepCopyOp.0, DeepCopyOp.0,
DeepCopyOp.0, DeepCopyOp.0, DeepCopyOp.0, DeepCopyOp.0, DeepCopyOp.0,
DeepCopyOp.0, IncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0,
IncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0,
RandomGeneratorSharedVariable(<Generator(PCG64) at 0x7F709EB79D60>),
RandomGeneratorSharedVariable(<Generator(PCG64) at 0x7F70BC80F3C0>),
RandomGeneratorSharedVariable(<Generator(PCG64) at 0x7F709C9FE9E0>),
RandomGeneratorSharedVariable(<Generator(PCG64) at 0x7F709C6BF9E0>)),
'\n',
"Inconsistency in the inner graph of scan 'scan_fn' : an input and an
output are associated with the same recurrent state and should have the
same type but have type 'TensorType(float64, (True,))' and
'TensorType(float64, vector)' respectively.")
```

As you can see in the test_trajectory.py file, I printed the inputs to expand and the output of expand_once (the print statements are commented out). The inputs and outputs have the same shape. I don't really know what else to try.

brandonwillard commented 3 years ago

> @brandonwillard it seems like we don't have any problems with the updates inside the scan operators. I fixed a bunch of type mismatch errors, and I am now stuck with the following error: […]

OK, I'll take a look at it now; otherwise, that's an awful error message! We need to fix that in Aesara.

(The error message is fixed in https://github.com/aesara-devs/aesara/pull/537.)

brandonwillard commented 3 years ago

A note about that kind of error:

Whenever you see an error involving a discrepancy between two symbolic variables like TensorType(float64, (True,)) and TensorType(float64, vector), it's often because the former variable is an "over-specialized" instance of the latter.

For example, if we define a generic symbolic vector x = at.vector(), it will have an x.type of TensorType(float64, vector). If we create a very specific instance of a vector via y = at.as_tensor(np.r_[1.0]), we'll get a y.type of TensorType(float64, (True,)).

Both x and y are vectors, but the irreconcilable difference between the two types is their broadcast patterns (i.e. the dimensions that are known to be one). In other words, x.type.broadcastable == (False,) (i.e. the vector part in the str output of x.type) and y.type.broadcastable == (True,) (i.e. the (True,) part in the str output of y.type).

aesara.tensor.as_tensor[_variable] is sometimes the cause of such errors (the example above illustrates how the discrepancy can arise from its use), but they can also be due to an Op that does a poor job of inferring the broadcast pattern(s) of its output(s), or a number of other related things.
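For instance, a minimal sketch of the discrepancy described above (the at.unbroadcast workaround at the end is an illustration, not necessarily the fix used in this PR):

```python
import numpy as np
import aesara.tensor as at

x = at.vector("x")
print(x.type)  # TensorType(float64, vector); x.type.broadcastable == (False,)

y = at.as_tensor(np.r_[1.0])
print(y.type)  # TensorType(float64, (True,)); y.type.broadcastable == (True,)

# One way to reconcile the two types: relax the over-specialized
# broadcast pattern before the variable is used as a recurrent state.
y_relaxed = at.unbroadcast(y, 0)
print(y_relaxed.type)  # TensorType(float64, vector)
```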

brandonwillard commented 3 years ago

FYI: If we stop in the debugger at that error, we can see exactly where one of the offending variables (specifically, the latter TensorType(float64, vector), which is one of the Scan's "inner-output" variables) was created:

```
pdb> pp self.outputs[inner_oidx].tag.trace[-1][-1]
('.../aehmc/aehmc/trajectory.py',
 463,
 'where_state',
 'q = aet.where(do_pick_left, q_left, q_right)')
```

Don't forget: if you name these variables, it can be a lot easier to identify their origins (and read the graph output).
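For example, naming the variable from the traceback above (with stand-in types, since the real do_pick_left, q_left, and q_right live in trajectory.py):

```python
import aesara
import aesara.tensor as aet

# Hypothetical stand-ins for the variables in trajectory.py.
do_pick_left = aet.scalar("do_pick_left")
q_left = aet.vector("q_left")
q_right = aet.vector("q_right")

q = aet.where(do_pick_left, q_left, q_right)
q.name = "q"  # named variables are much easier to spot in graph printouts

aesara.dprint(q)
```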

rlouf commented 3 years ago

I was able to track down the source of the bug. The trajectory is always interrupted early because the termination criterion returns True at the first step. This is related to what happens when the step index is 0; I'll investigate.

Another curious thing: after making the step index start from 0 instead of 1, I get the following error:

```
if n_fixed_steps in [1, -1]:
    # We do not need to use the scan op anymore, so we can just return
    # the outputs and updates we have
    if condition is not None:
        _logger.warning(
>           (
                "When the number of steps is fixed and equal "
                "to 1, the provided stopping condition, {} is ignored",
            ).format(condition)
        )
E       AttributeError: 'tuple' object has no attribute 'format'
```

The error disappears when I delete the call to _logger.warning. I tried to reproduce this on a simpler example but could not.

brandonwillard commented 3 years ago

> I was able to track down the source of the bug. The trajectory is always interrupted early because the termination criterion returns True at the first step. […] Another curious thing: after making the step index start from 0 instead of 1, I get the following error: […] The error disappears when I delete the call to _logger.warning.

Where is this code coming from?

rlouf commented 3 years ago

It comes from Aesara's implementation of the scan function.

brandonwillard commented 3 years ago

Ah, yeah, that's a simple bug to fix.
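For context, the culprit is the trailing comma inside the parentheses: it turns the message into a one-element tuple, so .format is called on the tuple instead of the string. A minimal sketch of the kind of fix (not necessarily the exact patch that landed in Aesara):

```python
import logging

_logger = logging.getLogger("aesara.scan")
condition = None  # stand-in for the symbolic stopping condition

# Removing the trailing comma makes .format apply to the string itself.
_logger.warning(
    (
        "When the number of steps is fixed and equal "
        "to 1, the provided stopping condition, {} is ignored"
    ).format(condition)
)
```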

rlouf commented 3 years ago

@brandonwillard @ricardoV94 the tests are failing because of the Aesara bug mentioned above. It is otherwise good for review.

brandonwillard commented 3 years ago

> @brandonwillard @ricardoV94 the tests are failing because of the Aesara bug mentioned above. It is otherwise good for review.

I'll create a new release of Aesara with that fix in just a minute.

codecov[bot] commented 3 years ago

Codecov Report

Merging #7 (995656d) into main (21b460d) will not change coverage. The diff coverage is 100.00%.


```
@@            Coverage Diff             @@
##              main        #7    +/-   ##
==========================================
  Coverage   100.00%   100.00%
==========================================
  Files            6         9     +3
  Lines          125       343   +218
  Branches        11        14     +3
==========================================
+ Hits           125       343   +218
```
| Impacted Files | Coverage Δ |
|---|---|
| aehmc/hmc.py | 100.00% <ø> (ø) |
| aehmc/integrators.py | 100.00% <100.00%> (ø) |
| aehmc/metrics.py | 100.00% <100.00%> (ø) |
| aehmc/nuts.py | 100.00% <100.00%> (ø) |
| aehmc/proposals.py | 100.00% <100.00%> (ø) |
| aehmc/termination.py | 100.00% <100.00%> (ø) |
| aehmc/trajectory.py | 100.00% <100.00%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 21b460d...995656d.

rlouf commented 3 years ago

The tests are now passing locally, but it looks like there is a problem with the Aesara installation in the CI. Any idea why?

rlouf commented 3 years ago

Will add a logistic regression example to the notebook once #19 is merged.