EnzymeAD / oxide-enzyme

Enzyme integration into Rust. Experimental, do not use.
Apache License 2.0

Structure of user-facing macros #4

Closed bytesnake closed 2 years ago

bytesnake commented 2 years ago

Enzyme's integration into Rust should give the user:

  - precise control of what are constant, primary and adjoint variables
  - an easy way to mark a function as a candidate for differentiation
  - the ability to inline derivatives into existing structs or calls directly in code

There are several ways to structure the macros.

fn test(a: f32, b: f32) -> f32 {
    a * b
}

derive_diff!(test, test_first, a: const, b: dup);
derive_diff!(test_first, test_second, a: const, b: dup);

/*
Created by the derive_diff! proc-macro:
fn test_second(a: f32, b: f32, b_dup: f32) -> f32 {
    unreachable!();
}

*/

or

#[diff(
   test_diff: (const, dup) -> active
)]
fn test(a: f32, b: f32) -> f32 {
    a * b
}

or

diff fn test(a: (f32, const), b: (f32, dup)) -> (f32, const) {
}
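Whichever surface syntax wins, the semantics of a `dup` argument can be illustrated by hand. The following is a minimal sketch, assuming forward mode (the proposals above do not fix the mode) and assuming `dup` means a seeded tangent is passed alongside `b`; `test_first_by_hand` is a hypothetical name for what the macro would generate:

```rust
fn test(a: f32, b: f32) -> f32 {
    a * b
}

// Hand-written forward-mode tangent when only b carries a seed b_dup:
// d(a * b)/db * b_dup = a * b_dup.
fn test_first_by_hand(a: f32, _b: f32, b_dup: f32) -> f32 {
    a * b_dup
}

fn main() {
    assert_eq!(test(3.0, 4.0), 12.0);
    // With a unit seed this is just d(a*b)/db = a.
    assert_eq!(test_first_by_hand(3.0, 4.0, 1.0), 3.0);
}
```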
ZuseZ4 commented 2 years ago

Playground for internal use / design tests, probably broken most of the time:

use simple_dep;

fn f(x: f32, y: f32) -> f32 {
 x * 2 + y
}

#[enzymefn(f)]
#[enzymeIn(Active, Active)]
#[enzymeout(Active)]
fn df1(x: f32, y: f32) -> (f32, f32) {
    unreachable!()
}

#[enzyme(simple_dep::f, Active, Const)]
fn df2(x: f32, y: f32) -> f32 {
    unreachable!()
}

Look at how #[test] passes information. We also have to verify that our df functions won't be inlined (we have to use LTO anyway, so that shouldn't be an issue).

bytesnake commented 2 years ago

looking at how track_callee does things, perhaps we can do it the following way:

  1. an attribute macro for the function to differentiate

the rosenbrock function

#[differentiate]
fn rosenbrock(a: f32, b: f32, x: f32, y: f32) -> f32 {
    (a - x).powf(2.0) + b*(y-x*x).powf(2.0)
}

becomes

mod rosenbrock {
    use super::DiffeType;

    pub trait TypeInfo {
        const RET: DiffeType = DiffeType::Const;
        const A: DiffeType = DiffeType::Const;
        const B: DiffeType = DiffeType::Const;
        const X: DiffeType = DiffeType::Const;
        const Y: DiffeType = DiffeType::Const;
    }

    pub struct Forward;
    impl TypeInfo for Forward {}

    pub fn generic_body<T: TypeInfo>(a: f32, b: f32, x: f32, y: f32, _ty: T) -> f32 {
        (a - x).powf(2.0) + b*(y-x*x).powf(2.0)
    }

}

fn rosenbrock(a: f32, b: f32, x: f32, y: f32) -> f32 {
    rosenbrock::generic_body(a, b, x, y, rosenbrock::Forward)
}
  2. the callee macro creates new variants of this generic function
derive_diff!(rosenbrock, rosenbrock_grad, a: const, b: const, x: dup, y: dup);

would become

pub struct RosenbrockGrad {}
impl rosenbrock::TypeInfo for RosenbrockGrad {
        const RET: DiffeType = DiffeType::Const;
        const A: DiffeType = DiffeType::Const;
        const B: DiffeType = DiffeType::Const;
        const X: DiffeType = DiffeType::Dup;
        const Y: DiffeType = DiffeType::Dup;
}

fn rosenbrock_grad(a: f32, b: f32, x: f32, _x_dup: &mut f32, y: f32, _y_dup: &mut f32) {
    rosenbrock::generic_body(a, b, x, y, RosenbrockGrad);
}
  3. During monomorphization of the trait TypeInfo, several copies of the function in question are generated, each carrying different diffe-type information. Those copies can then be caught after MIR, in the LLVM codegen, and replaced with their differentiated counterparts. We are basically re-using the type system for our cause.
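As a sanity check, the trait-with-associated-consts pattern above compiles, and the activity consts are visible inside the generic body, which is what a later codegen pass would key on. A self-contained sketch (DiffeType and the marker types are defined inline here, purely for illustration):

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum DiffeType {
    Const,
    Dup,
}

trait TypeInfo {
    const X: DiffeType = DiffeType::Const;
}

struct Forward;
impl TypeInfo for Forward {}

struct Grad;
impl TypeInfo for Grad {
    const X: DiffeType = DiffeType::Dup;
}

// Each monomorphized copy of this function "sees" a different T::X, so the
// copies are distinguishable after MIR, as described in the comment above.
fn generic_body<T: TypeInfo>(_marker: T) -> DiffeType {
    T::X
}

fn main() {
    assert_eq!(generic_body(Forward), DiffeType::Const);
    assert_eq!(generic_body(Grad), DiffeType::Dup);
}
```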
bytesnake commented 2 years ago

it may also make sense to automatically generate structures containing all output parameters

bytesnake commented 2 years ago

we could also define a simpler call macro, allowing only constant and output parameters, which directly infers which parameters are const and which are outputs

let (x,y) = (1.0, 1.0);
let (ret, xd, yd) = call_diff!(rosenbrock, 0.0, 1.0, x, y);

and is similar to derive_diff but inlines the additional struct definition. It also never assumes that we need the derivative w.r.t the returned value.
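A hypothetical expansion of that `call_diff!` invocation might look as follows. The shadow-initialization convention, the gradient's signature, and the hand-written derivative body are all assumptions for illustration; the real gradient would come from Enzyme:

```rust
fn rosenbrock(a: f32, b: f32, x: f32, y: f32) -> f32 {
    (a - x).powf(2.0) + b * (y - x * x).powf(2.0)
}

// Stand-in for the Enzyme-generated gradient (derivatives written by hand
// here so the sketch runs): accumulates d/dx and d/dy into the shadows.
fn rosenbrock_grad(a: f32, b: f32, x: f32, x_dup: &mut f32, y: f32, y_dup: &mut f32) -> f32 {
    *x_dup += -2.0 * (a - x) - 4.0 * b * x * (y - x * x);
    *y_dup += 2.0 * b * (y - x * x);
    rosenbrock(a, b, x, y)
}

fn main() {
    let (x, y) = (1.0f32, 1.0f32);
    // What `let (ret, xd, yd) = call_diff!(rosenbrock, 0.0, 1.0, x, y);`
    // could expand to: zero-initialized shadows passed by mutable reference.
    let (ret, xd, yd) = {
        let (mut xd, mut yd) = (0.0f32, 0.0f32);
        let ret = rosenbrock_grad(0.0, 1.0, x, &mut xd, y, &mut yd);
        (ret, xd, yd)
    };
    assert_eq!(ret, 1.0);
    assert_eq!(xd, 2.0);
    assert_eq!(yd, 0.0);
}
```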

bytesnake commented 2 years ago

the names differentiate, derive_diff and call_diff are highly subjective and should be more expressive

wsmoses commented 2 years ago

One thing I'll add to your plate to think about for UX/syntactic sugar.

We're shortly finishing up "forward mode" AD in addition to the existing "reverse mode AD". Reverse mode computes the derivatives of all inputs with respect to a single output (or more specifically can do with respect to a linear combination of outputs). Forward mode does all outputs with respect to a single input.

Forward mode is called in much the same way, except all inputs are duplicated (e.g. an f32 would be duplicated not active).

We're still thinking through what good syntax should exist for it in high-level languages (we have a tentative __enzyme_fwddiff), but wanted to make sure you were aware of it for future naming conventions/collisions.

ZuseZ4 commented 2 years ago

I've been reconsidering what we need for an interface. For simplicity, I assumed that we are allowed to implement a macro with extra rights (similar to concat!): we require that the macro can look up a function header. Type handling is usually done after macro expansion, but we could even accept the header as a string, since that is enough to calculate what the header of our generated function will look like.

This also aims to be exhaustive. It's hard to claim that an interface is future-proof, but at least we have an advantage here, since the underlying theory based on the chain rule is unlikely to change. I'm also making use of #[non_exhaustive] enums, which is hopefully sufficient to be prepared for other AD tools handling other codegen backends. We still need to specify the internal representation and the interface on the THIR / MIR level w.r.t. the different codegen backends, but that's probably less critical, as it is internal (although we obviously still have to take it seriously).

differentiate!(primary_fnc: FncPointer, gradientName: str,
    inputActivity: Activities, outputActivity: Activities);
// Will expand to
//   fn gradientName(...) { unreachable!() }
// and parse its input into some rustc metadata.

#[non_exhaustive]
enum Activities {
    AllFloats,
    PerEntry(Vec<Activity>), // one per input / output parameter
}

enum Activity {
    Active,   // calculate the primary and the gradient
    Gradient, // calculate the gradient but not the primary
    Constant, // calculate the primary but not the gradient
}

#[non_exhaustive]
enum Mode {
    Forward,
    Reverse,
    // ReverseSplit
    // Mixed(more-details)
}

A minimal codegen implementation should then support forward mode and expect all (non-integer) inputs to be active. Unsupported types (currently globals or dyn Trait, in Enzyme's case) should then lead to a panic if there is no good fall-back.
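A self-contained sketch of how such a minimal codegen could resolve an activity spec, using the enum definitions from the comment above (the `resolve` function and its behavior are assumptions, except for the panic-on-unsupported rule, which is the proposal's):

```rust
#[non_exhaustive]
#[derive(Debug)]
enum Activities {
    AllFloats,
    PerEntry(Vec<Activity>),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Activity {
    Active,
    Gradient,
    Constant,
}

#[non_exhaustive]
#[derive(Debug, PartialEq)]
enum Mode {
    Forward,
    Reverse,
}

// Hypothetical helper: the per-input activities a minimal forward-mode
// codegen would use. Anything but Forward panics, per the proposal.
fn resolve(mode: Mode, spec: Activities, n_inputs: usize) -> Vec<Activity> {
    match mode {
        Mode::Forward => match spec {
            // "expects all (non-int) inputs to be active"
            Activities::AllFloats => vec![Activity::Active; n_inputs],
            Activities::PerEntry(v) => v,
        },
        _ => panic!("unsupported mode: {:?}", mode),
    }
}

fn main() {
    let acts = resolve(Mode::Forward, Activities::AllFloats, 2);
    assert_eq!(acts, vec![Activity::Active, Activity::Active]);
}
```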

There are a few open questions:

1. Can we remove gradientName and default to d_? That will collide if we differentiate the same function in the same module multiple times (with different arguments). That's probably quite rare, so we can ask users to work around it using wrappers or modules; that's easy. Libraries might even be able to do that.
2. I am not sure about atomicAdd and uncacheable args. I feel like we don't need either in Rust, but I might be wrong. Needs to be verified.
3. We can support higher-order derivatives by (later) adding new Modes. Mode::ForwardForward implies a Hessian, but there are probably much better ways to name the Modes. We can also generate higher orders recursively; macros already support an ordered expansion.
4. I marked Activities as non_exhaustive, since people might later want to support finer-than-type-level activities. An example would be a request for gradients only for the first half of an array.
5. The exact activity specification of input/output parameters (and the primary_ret) probably requires some more love. As mentioned by William above, there are some implications based on the Mode.
6. I propose ignoring Enzyme's OUT_DIFF vs DUP_ARG option. If we have an active f32 input we can just set it as DUP_ARG and add &mut f32. This makes the interface simpler and more consistent. It is nice that Enzyme zero-initializes scalars for you, but imho that can also be done by a user-space wrapper macro. In the backend we can easily wrap functions generated by Enzyme; I already wrote two LLVM-IR wrappers in the past, one of which does almost the same job.
7. Are we missing information in this interface which either Enzyme or another AD tool for a different codegen backend might require? TypeTrees can be generated internally.
8. ReverseModeSplit should be matched by the differentiate! macro. It needs to change its output to return two declarations. There are, however, some open questions about the tape/cache handling in that case, so I propose to postpone this. Adding it later is non-breaking.

Libraries might then provide more convenient wrappers like forward!(fnc) or reverse!(fnc) which might generate d_ while assuming that all floats are active.

Looking at our earlier comments:

precise control of what are constant, primary and adjoint variables

Not only activity, but also modes.

make it easy to mark a function as a candidate for differentiation

We skipped this requirement by requesting a macro with extra capabilities. This simplifies user code. Also, we couldn't mark primary functions if they are defined in dependencies, so that's a plus.

inline derivatives into existing structs or calls directly in code

You might need to help me with the first part, but I feel like the parameter handling of this interface is quite consistent. The second part can most likely be handled by a user-level macro which wraps the reverse!(foo) call in some brackets and directly calls / returns the newly declared function: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=803af8e8a3fdcde2eda904ad67b5d4d2 That might even be doable by a decl macro / macros 2.0 instead of a proc-macro, and it's probably doable in a hygienic way. But I don't know that much about macro hygiene, so it's probably better if I leave that to others. Also, unimplemented!() won't lead to a build issue and won't create runtime panics, since the implementation will be replaced.

bytesnake commented 2 years ago

@wsmoses I wrote this reply two weeks ago via Email but apparently it never came through :sweat_smile:

Reverse mode computes the derivatives of all inputs with respect to a single output [..]

are those mutually exclusive, running in either forward or backward accumulation? I'm not sure about applications for "forward mode" AD in situ, or even for Jacobian matrices

Forward mode is called in much the same way, except all inputs are duplicated [..]

can you explain this a bit more? I thought that you only have a single input in forward mode

We're still thinking through what good syntax should exist for it for high level languages

From a UX standpoint they can be put at the end of a parameter list, though this magic will probably confuse users a bit too much, because there is no clear association anymore. We also have to bail out for MIMO functions. I don't have much free time at the moment but will take this into consideration, thanks!

bytesnake commented 2 years ago

For simplicity, I assumed that we are allowed to implement a macro with extra rights (similar to concat!). We require that the macro can look up a function header. Type handling is usually done after macro expansion. However, we could even accept the header as a string, since that is enough to calculate how the header of our generated function will look like.

it's best if we catch unaligned arguments as early as possible, and that should happen during macro expansion, so there is a bit of magic involved

This also aims to be exhaustive. It's hard to claim that an interface is future-proof. At least we have an advantage here since the underlying theory based on the chain-rule is unlikely to change. I'm also making use of #[non_exhaustive] enums. That is hopefully sufficient to be prepared for other AD tools handling other codegen backends.

but matching against enums will only happen in the codegens, and for the user-facing part it's a breaking change anyway

  1. That will collide if we differentiate the same function in the same module multiple times (with different arguments). That's probably quite rare, so we can ask users to work around it using wrappers or modules, that's easy. Libraries might even be able to do that.

but this is the kind of magic which should be avoided and will decrease the chance of getting accepted into rustc

.. will revise that tomorrow ..

wsmoses commented 2 years ago

those are mutual exclusive and running in forward or backward accumulation? I'm not sure about application for "forward mode" AD in situ or even for Jacobi matrices

At minimum this can be useful for controlling/reducing allocation. E.g. you can store the derivative result in an existing location.

Forward mode is called in much the same way, except all inputs are duplicated [..]

can you explain this a bit more, I thought that you only have a single input in forward mode

Yeah: Suppose you have a multi-input, multi-output function out[:] = f(in[:]), where the dimension of in is I and the dimension of out is O.

There are I * O potential derivatives one might want to individually compute (e.g. the derivative of every output with respect to every input): J[i, j] = dout_j/din_i.

Reverse mode can get you J[:, j] for any individual j (i.e. the derivatives of all inputs with respect to a given output) in a single call, and forward mode can get you J[i, :] for any individual i (i.e. the derivatives of all outputs with respect to a given input).

For both modes, Enzyme actually implements a more general (and thus more useful) version. Specifically, reverse mode computes the adjoints: given any vector v, it computes \sum_j v[j] J[i, j] = \sum_j v[j] dout_j/din_i. In other words, it can compute the sum of gradients with respect to a vector of outputs. If v is set to 1 at one index and 0 elsewhere, this gets you the "traditional" gradient wrt one output.

We have a similar implementation for forward mode: given a vector u, it computes \sum_i u[i] J[i, j] = \sum_i u[i] dout_j/din_i. In other words, it can compute the sum of derivatives with respect to a vector of inputs. Again, if u is set to 1 at one index and 0 elsewhere, this gets you the "traditional" derivative. This can also be thought of as the directional derivative.
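The two products above can be worked out by hand for a small function, say f(x, y) = (x * y, x + y). All derivatives below are written out manually; no AD tool is involved, and the convention J[i][j] = d out_j / d in_i matches the comment above:

```rust
fn f(x: f32, y: f32) -> (f32, f32) {
    (x * y, x + y)
}

// Inputs are (x, y), outputs are (x*y, x+y):
// d(x*y)/dx = y, d(x+y)/dx = 1, d(x*y)/dy = x, d(x+y)/dy = 1.
fn jacobian(x: f32, y: f32) -> [[f32; 2]; 2] {
    [[y, 1.0], [x, 1.0]]
}

// Reverse mode (vector-Jacobian product): per input i, sum_j v[j] * J[i][j].
fn vjp(x: f32, y: f32, v: [f32; 2]) -> [f32; 2] {
    let j = jacobian(x, y);
    [
        v[0] * j[0][0] + v[1] * j[0][1],
        v[0] * j[1][0] + v[1] * j[1][1],
    ]
}

// Forward mode (Jacobian-vector product): per output j, sum_i u[i] * J[i][j].
fn jvp(x: f32, y: f32, u: [f32; 2]) -> [f32; 2] {
    let j = jacobian(x, y);
    [
        u[0] * j[0][0] + u[1] * j[1][0],
        u[0] * j[0][1] + u[1] * j[1][1],
    ]
}

fn main() {
    assert_eq!(f(2.0, 3.0), (6.0, 5.0));
    // At (2, 3): J = [[3, 1], [2, 1]].
    // v = [1, 0] selects the gradient of the first output: (3, 2).
    assert_eq!(vjp(2.0, 3.0, [1.0, 0.0]), [3.0, 2.0]);
    // u = [1, 0] selects the derivatives of all outputs w.r.t. x: (3, 1).
    assert_eq!(jvp(2.0, 3.0, [1.0, 0.0]), [3.0, 1.0]);
}
```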

ZuseZ4 commented 2 years ago

I've implemented a differentiate attribute proc-macro here: https://github.com/ZuseZ4/autodiff At some point we should probably discuss the naming of some parameters, but that's easy to update. Using it revealed that one of my LLVM wrappers here doesn't cover all cases; once I've fixed that, I'll use the macro in this repo, as it's much more convenient.

ZuseZ4 commented 2 years ago

I just merged the new macro; I guess now it's time for more testing / documenting, to see if we want another iteration on the user frontend.

ZuseZ4 commented 2 years ago

The frontend seems to work fine, despite needing some smaller updates for fwd-mode (-vector). I guess at the moment there is no reason for larger discussions on it, so I'm closing here.