Quansight-Labs / udiff

Automatic differentiation with uarray/unumpy.
BSD 3-Clause "New" or "Revised" License

Design the “separation” mechanism #14

Open sangyx opened 4 years ago

sangyx commented 4 years ago

Hi, I think designing the “separation” between the data dimensions and the differentiation dimensions is the key part of our project, so we need to discuss it and reach a consensus. I want to add an attribute such as data_ndim to the DiffArray object to indicate the input type. I can imagine a simple mechanism as follows:

# target: df(v)/dv
if f(v).data_ndim == 0:
    if v.data_ndim == 0:
        diff = cal_diff_ss(f(v), v, f)  # diff is a scalar
    elif v.data_ndim == 1:
        diff = cal_diff_st(f(v), v, f)  # diff is a tensor
    elif v.data_ndim == 2:
        diff = cal_diff_sm(f(v), v, f)  # diff is a matrix
elif f(v).data_ndim == 1:
    if v.data_ndim == 0:
        diff = cal_diff_vs(f(v), v, f)  # diff is a vector
    elif v.data_ndim == 1:
        diff = cal_diff_vt(f(v), v, f)  # diff is a matrix
    elif v.data_ndim == 2:
        diff = cal_diff_vm(f(v), v, f)  # diff is a matrix
elif f(v).data_ndim == 2:
    if v.data_ndim == 0:
        diff = cal_diff_ms(f(v), v, f)  # diff is a matrix
    elif v.data_ndim == 1:
        diff = cal_diff_mt(f(v), v, f)  # diff is a matrix
    elif v.data_ndim == 2:
        diff = cal_diff_mm(f(v), v, f)  # diff is a matrix
else:
    raise ValueError("The data_ndim is not supported")

But I noticed this earlier message on Gitter: "Depends... I’d like to do something like say that out of an n dimensional array, k dimensions are for differentiation and n - k are for data." I can't quite understand it. Could you give more details or an example to illustrate it?

hameerabbasi commented 4 years ago

Note: All shapes are in Python tuple notation, and all operations on them are to be considered operations on Python tuples. I will use the words tensor and array interchangeably; arrays don't just mean 1-dimensional arrays.

So, suppose we have an n dimensional array with shape s. We can say that it is divided as an n - k dimensional array (with shape s[:-k]) of k dimensional arrays (each of shape s[-k:]).

Now, consider that we are not just doing vector/matrix calculus but tensor calculus. In this form, we want to take the derivative of some function with respect to the inner k dimensional arrays. If the function is element-wise, the derivative of each inner array will have the shape s[-k:] * 2. This is repeated over all the inner arrays until the whole array is filled, so, if the function is element-wise, the derivative will be of shape s[:-k] + (s[-k:] * 2). If a dimension is added or dropped as a result of an operation (for example, matrix multiplication drops some dimensions), then the corresponding dimension will be added to or dropped from the output shape as well.

You can think of reducing this to vector calculus in the following way: ravel/flatten the tensor to a vector, take the derivative, and reshape it back.
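
To make this concrete, here is a minimal NumPy sketch of the ravel/derivative/reshape idea for an element-wise function; the helper elementwise_jacobian and its diff_ndim argument are only illustrative, not part of udiff:

import numpy as np

# Split the shape s into data dimensions s[:-k] and diff dimensions s[-k:],
# flatten each inner array, build the diagonal Jacobian of the element-wise
# derivative, and reshape back to s[:-k] + s[-k:] * 2.
def elementwise_jacobian(x, fprime, diff_ndim):
    data_shape = x.shape[:x.ndim - diff_ndim]   # s[:-k]
    diff_shape = x.shape[x.ndim - diff_ndim:]   # s[-k:]
    diff_size = int(np.prod(diff_shape))
    flat = x.reshape(data_shape + (diff_size,))
    jac = fprime(flat)[..., :, None] * np.eye(diff_size)
    return jac.reshape(data_shape + diff_shape * 2)

x = np.arange(24.0).reshape((2, 3, 4))
# np.exp is its own derivative, so it serves as fprime here.
print(elementwise_jacobian(x, np.exp, diff_ndim=2).shape)  # (2, 3, 4, 3, 4)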

This way, we avoid excessive if statements and have generality in our code.

sangyx commented 4 years ago

I understand a part. Suppose we have a tensor x:

x = [[[1, 2], [3, 4]],
     [[5, 6], [7, 8]]]

Then we can calculate the derivative of an element-wise function by flattening x to a vector:

x_v = [1, 2, 3, 4, 5, 6, 7, 8]
y = np.exp(x_v)
x_diff = y.diff[x].reshape(x.shape)

But I have two problems. First, we get x.shape=(2, 2, 2). This can be divided as two arrays x1 and x2:

x1 = [[1, 2], [3, 4]] # x1.shape=(2, 2)
x2 = [[5, 6], [7, 8]] # x2.shape=(2, 2)

If the function is np.exp, why will the derivative have the shape s[:-k] + (s[-k:] * 2)? In this case, it will be (2,) + ((2, 2) * 2), while I think it should be (2, 2, 2) * 2. Second, I think the reducing method can handle the six kinds of derivatives described in Matrix calculus. But I don’t know whether this method can be applied to the other three types of derivatives: matrix-by-vector, vector-by-matrix, and matrix-by-matrix. For example, if the operation is matrix multiplication:

u = [[u_11, u_12], [u_21, u_22]]
v = [[v_11], [v_21]]
y = np.dot(u, v) = [[u_11 * v_11 + u_12 * v_21], [u_21 * v_11 + u_22 * v_21]]

In this case, how should we add or drop the dimension?

hameerabbasi commented 4 years ago

So let's say that x is an array with x.shape == (2, 3, 4), x.data_ndim == 1, and x.diff_ndim == 2. Let's say it's your preferred array, np.arange(24).reshape((2, 3, 4)).

Then x.data_shape == (2,) and x.diff_shape == (3, 4). That means that x is to be treated as a stack of 3x4 matrices at each step.

This means that dx / dx (which I will call x.diffs[x]) will have shape x.data_shape + 2 * x.diff_shape == (2, 3, 4, 3, 4)

In this array, the following statement will hold in Python: np.all(x.diffs[x].reshape((2, 12, 12)) == [np.eye(12), np.eye(12)]).

When you do np.exp(x), its derivative is d(np.exp(x[idx1]))/d(x[idx2]) for all possible pairs of valid indices idx1 and idx2 into arrays of shape x.diff_shape. This will be np.exp(x[idx1]) for idx1 == idx2 and 0 otherwise.

Thus, the following Python can be used for element-wise operations:

# [Assuming y = np.exp(x)]
diff_size = functools.reduce(operator.mul, x.diff_shape, 1)
data_size = functools.reduce(operator.mul, x.data_shape, 1)
y.diffs[x] = x.diffs[x].reshape((data_size, diff_size, diff_size)) @ (np.zeros((data_size, diff_size, diff_size)) * np.exp(x.diffs[x]))
y.diffs[x].resize(x.diffs[x].shape)

I will write another comment later explaining how matmul is handled.

sangyx commented 4 years ago

Thanks for your explanation. I confused scalar derivation with matrix derivation: for an element at one position, we consider its derivative with respect to the elements at all positions. So y.diffs[x][0][1][2][3][4] means d(x[0][1][2])/d(x[0][3][4]). But the y.diffs[x] line in the code block will output all zeros. Should it be like this:

y.diffs[x] = x.diffs[x].reshape((data_size, diff_size, diff_size)) @ (x.diffs[x] * np.exp(x.diffs[x]))

hameerabbasi commented 4 years ago

I still think there's an error, as x.diffs[x] is multiplied twice. Although this holds in this case, it might not be true in the general case. So the second x.diffs[x] should be np.eye(diff_size).

sangyx commented 4 years ago

I ran the code in Jupyter:

import functools, operator
import numpy as np

x = np.arange(24).reshape((2, 3, 4))
diff_size = functools.reduce(operator.mul, (3, 4), 1)
data_size = functools.reduce(operator.mul, (2, ), 1)
x_diff = np.stack([np.eye(diff_size), np.eye(diff_size)]).reshape((2, 3, 4, 3, 4)) # x.diffs[x]

We can get that np.eye(diff_size).shape = (12, 12) and np.exp(x.diffs[x]).shape = (2, 3, 4, 3, 4). These two matrices cannot be multiplied. What's more, the values of y.diffs[x] would be wrong: these values should be np.exp(x), not np.exp(x.diffs[x]). So I think the correct code may be:

y_diff = np.repeat(np.exp(x), diff_size).reshape(x_diff.shape) * x_diff # y.diffs[x]

np.repeat(np.exp(x), diff_size).reshape(x_diff.shape) is the derivative of the function np.exp, and multiplying by x_diff (the stacked np.eye(diff_size)) applies the chain rule.

hameerabbasi commented 4 years ago

Sounds good! I'd make a couple of small but important changes for performance:

x_diff = np.broadcast_to(np.eye(diff_size).reshape((3, 4, 3, 4)), (2, 3, 4, 3, 4)) # x.diffs[x]

And

y_diff = np.exp(x)[..., None, None, :, :] * x_diff # y.diffs[x]
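
For concreteness, a quick check of those two lines with concrete shapes (plain NumPy; x.diffs[x] itself is still the proposed API):

import numpy as np

x = np.arange(24.0).reshape((2, 3, 4))
diff_size = 12
x_diff = np.broadcast_to(np.eye(diff_size).reshape((3, 4, 3, 4)), (2, 3, 4, 3, 4))
y_diff = np.exp(x)[..., None, None, :, :] * x_diff
# Each 12x12 block of y_diff is a diagonal matrix holding np.exp(x) for that data slice.
print(np.allclose(y_diff.reshape(2, 12, 12),
                  np.exp(x).reshape(2, 12)[:, None, :] * np.eye(12)))  # True
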
sangyx commented 4 years ago

Thanks for your explanation. I confused scalar derivation with matrix derivation: for an element at one position, we consider its derivative with respect to the elements at all positions. So y.diffs[x][0][1][2][3][4] means d(x[0][1][2])/d(x[0][3][4]). But the y.diffs[x] line in the code block will output all zeros. Should it be like this:

y.diffs[x] = x.diffs[x].reshape((data_size, diff_size, diff_size)) @ (x.diffs[x] * np.exp(x.diffs[x]))

I made some mistakes here. y.diffs[x][0][1][2][3][4] means d(y[1][2])/d(x[0][3][4]). What's more, I think we should add a dimension to indicate the data_dim of y. For matrix multiplication, suppose we have two arrays u.shape=(2, 2, 2) and v.shape=(2, 2, 1); we will get y=np.dot(u, v) with y.shape=(4, 2, 1). Then y.diffs[u][1][1][2][0][3][4] means d(y[1][1][2])/d(u[0][3][4]).

hameerabbasi commented 4 years ago

For matrix multiplication, suppose we have two arrays u.shape=(2, 2, 2) and v.shape=(2, 2, 1); we will get y=np.dot(u, v) with y.shape=(4, 2, 1). Then y.diffs[u][1][1][2][0][3][4] means d(y[1][1][2])/d(u[0][3][4]).

I don't think that's right... We'll get y.shape=(2, 2, 1) for np.matmul, and y.shape = (2, 2, 2, 1) for np.dot.
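
A quick shape check of that statement (plain NumPy; the values don't matter):

import numpy as np

u = np.ones((2, 2, 2))
v = np.ones((2, 2, 1))
print(np.matmul(u, v).shape)  # (2, 2, 1)
print(np.dot(u, v).shape)     # (2, 2, 2, 1)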

hameerabbasi commented 4 years ago

I made some mistakes here. y.diffs[x][0][1][2][3][4] means d(y[1][2])/d(x[0][3][4]).

y has the same shape as x, so it should be d(y[0][1][2])/d(x[0][3][4]).

sangyx commented 4 years ago

I was wrong.( ̄ 'i  ̄;) I am trying to figure out how to deal with matrix multiplication. Can you share some of your thoughts?

hameerabbasi commented 4 years ago

I was wrong.( ̄ 'i  ̄;)

That's okay, I was wrong a few times, too. 😄

So, let's talk about matrix multiplication (and, in general, tensor operations). For these operations, we must be aware that some dimensions will be "dropped".

Let's consider a few cases:

u.shape = (D, M, N) and v.shape = (D, N, P) in all cases, and y = u @ v = np.matmul(u, v)

First, u.diff_ndim = 2 and v.diff_ndim = 2. In this case, y.shape = (D, M, P) in all cases. But what happens to y.diffs[u] and y.diffs[v]? To think about this, we have to think how dimensions are "dropped", so to speak. We can see, from the output shape, that the dimension N is dropped from both inputs.

So let's think about the output then. y.diffs[x] = u.diffs[x] @ v + u @ v.diffs[x] (for the 2-D case). y.diffs[u][d, m1, p1, m2, n2] = d(y[d, m1, p1])/d(u[d, m2, n2]) and y.diffs[v][d, m1, p1, n2, p2] = d(y[d, m1, p1])/d(v[d, n2, p2]).

Then one would need to convert these expressions into sums and express them as matrix multiplies themselves (hint: it would be a sum over the dimension N in each case).
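
To illustrate, here is one way to write those two Jacobians as explicit arrays in NumPy, under the convention y.diffs[u][d, m1, p1, m2, n2] = d(y[d, m1, p1])/d(u[d, m2, n2]); the names are only illustrative, not a udiff implementation:

import numpy as np

D, M, N, P = 2, 3, 4, 5
u = np.random.randn(D, M, N)
v = np.random.randn(D, N, P)
y = u @ v

# d(y[d, m1, p1])/d(u[d, m2, n2]) = delta(m1, m2) * v[d, n2, p1]
dy_du = np.eye(M)[None, :, None, :, None] * v.transpose(0, 2, 1)[:, None, :, None, :]
# d(y[d, m1, p1])/d(v[d, n2, p2]) = delta(p1, p2) * u[d, m1, n2]
dy_dv = np.eye(P)[None, None, :, None, :] * u[:, :, None, None, :]

print(dy_du.shape, dy_dv.shape)  # (2, 3, 5, 3, 4) (2, 3, 5, 4, 5)

# Spot-check one entry of dy_du against a finite difference (y is linear in u).
eps = 1e-6
u2 = u.copy()
u2[0, 1, 2] += eps
print(np.allclose((u2 @ v - y)[0] / eps, dy_du[0, :, :, 1, 2], atol=1e-6))  # True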

sangyx commented 4 years ago

Then one would need to convert these expressions into sums and express them as matrix multiplies themselves (hint: it would be a sum over the dimension N in each case).

Can you give some sample code? I tried for a long time and couldn't find a way to express this process in code.

I consider this problem in this way:

[image: sangyx's sketch of the expanded derivative for matrix multiplication]

Then the code will be:

# u.shape = (D, M, N), v.shape = (D, N, P), y.shape = (D, M, P)
tmp = np.zeros((M * M, P * N))
indices = np.arange(0, M * M, step=M+1)
tmp * v.T.flatten() # obtain the results shown in the figure above

But this method cannot incorporate the chain rule (u.diffs[u]).

hameerabbasi commented 4 years ago

I'm making this up as I go along, but, here's the solution:

(Side note: Maybe we can switch around xidx and yidx in y.diffs[x] and make this uniform. That is, y.diffs[x][xidx + yidx] = d(y[yidx])/d(x[xidx]) instead of y.diffs[x][yidx + xidx] = d(y[yidx])/d(x[xidx]).) Leaving yidx last allows us to write expressions more simply in terms of y, as we'll see, and broadcast over x.

Let's think of this another way... Tensors! y[..., i, j] = sum_k(u[..., i, k] * v[..., k, j]). So d(y[..., i, j])/dx = sum_k(d(u[..., i, k])/dx * v[..., k, j] + u[..., i, k] * d(v[..., k, j])/dx). But each of these sums is just a matrix product (!). It can also be written as dy/dx = u.diffs[x] @ v + u @ v.diffs[x].

We just need to consider four cases: matrix-matrix, matrix-vector, vector-matrix, and vector-vector. In each case, we just insert a dimension in the appropriate place to make it into matrix-matrix and remove it again in the answer.
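
For example (a sketch, not udiff code), the matrix-vector case can be handled by promoting the vector operand to a one-column matrix and dropping that axis again afterwards:

import numpy as np

D, M, N = 2, 3, 4
u = np.random.randn(D, M, N)
w = np.random.randn(D, N)                  # vector operand

y = np.matmul(u, w[..., None])[..., 0]     # insert a trailing axis, matmul, drop it
print(np.allclose(y, np.einsum('dmn,dn->dm', u, w)))  # True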

sangyx commented 4 years ago

I think we need a little more time to find an elegant way to do it. I will work on it and share the latest ideas in this issue during the week. I submitted some simple test cases in #15; you can check whether my understanding of the derivation mechanism is correct.

hameerabbasi commented 4 years ago

Maybe we can explore other APIs.

sangyx commented 4 years ago

Maybe we can explore other APIs.

Do you mean that we can call other matrix derivation libraries?

hameerabbasi commented 4 years ago

Do you mean that we can call other matrix derivation libraries?

I mean survey other libraries to see what they do in these cases.

dhirschfeld commented 4 years ago

Not sure I understand, so this might be OT, but is there anything to learn from the jax vmap operator?

hameerabbasi commented 4 years ago

That seems to be more-or-less correct: https://jax.readthedocs.io/en/latest/notebooks/autodiff_cookbook.html#How-it's-made:-two-foundational-autodiff-functions

sangyx commented 4 years ago

Thanks! I will learn it.

sangyx commented 4 years ago

Hi, I studied The Autodiff Cookbook and wrote up what I learned in a blog post.

In fact, we need two functions: jvp (or vjp) and vmap.

We can refer to autodidact to implement jvp (or vjp), and then imitate JAX to implement vmap.
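
As a reference for what those two building blocks look like, here is a small snippet using JAX itself (only to illustrate the idea): a vjp gives one row of the Jacobian, and vmap maps it over a basis of output cotangents to build the full Jacobian.

import jax
import jax.numpy as jnp

def f(x):
    return jnp.exp(x)

x = jnp.arange(3.0)
y, f_vjp = jax.vjp(f, x)
# Map the pullback over the rows of the identity to get the full Jacobian.
jacobian = jax.vmap(lambda ct: f_vjp(ct)[0])(jnp.eye(y.size))
print(jnp.allclose(jacobian, jnp.diag(jnp.exp(x))))  # True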

hameerabbasi commented 4 years ago

Is this fully implemented?

sangyx commented 4 years ago

We don’t need to write rules by hand now; instead, we use the same method as JAX, using basic differential operators and the computation graph to get the results. If u = np.random.randn(4, 3, 2), v = np.random.randn(4, 2, 3), and y = np.matmul(u, v), then y.to(u, jacobian=True) will return a Jacobian matrix of shape (4, 3, 3, 4, 3, 2). Because backward_jacobian needs to be modified to get higher-order derivatives, the diff rule for np.stack needs to be added now. I'm working on it.
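
As a sanity check of that shape, the same Jacobian can be computed with JAX (jax.jacobian is used here only as a reference; y.to(u, jacobian=True) is the proposed udiff API):

import jax
import jax.numpy as jnp

u = jnp.ones((4, 3, 2))
v = jnp.ones((4, 2, 3))
# Jacobian of matmul with respect to u: output shape followed by u's shape.
jac = jax.jacobian(lambda u_: jnp.matmul(u_, v))(u)
print(jac.shape)  # (4, 3, 3, 4, 3, 2)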