sangyx opened this issue 4 years ago
Note: All shapes are in Python tuple notation, and all operations on them are to be considered operations on Python tuples. I will use the words tensor and array interchangeably; arrays do not mean only 1-dimensional arrays.
So, suppose we have an n-dimensional array with shape s. We can say that it is divided as an (n - k)-dimensional array (with shape s[:-k]) of k-dimensional arrays (each of shape s[-k:]).
Now, consider that we are not just doing vector/matrix calculus but tensor calculus. In this form, we want to take the derivative of the inner k-dimensional arrays with respect to some function. If the function is element-wise, the derivative will have the shape s[-k:] * 2. This is repeated over all the inner arrays until the whole array is filled, so, again for an element-wise function, the derivative will be of shape s[:-k] + (s[-k:] * 2).
. If a dimension is added or dropped as a result of an operation (for example, matrix multiplication drops some dimensions), then the corresponding dimension will be dropped from the output shape as well.
You can think of reducing this to vector calculus in the following way: ravel/flatten the tensor to a vector, take the derivative, and reshape it back.
This way, we avoid excessive if statements and have generality in our code.
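To make the shape rule concrete, here is a minimal sketch of the tuple arithmetic, assuming n = 3 and k = 2 (the same split used in the examples below):

s = (2, 3, 4)        # shape of the full n-dimensional array (n = 3)
k = 2                # number of inner (differentiation) dimensions

data_shape = s[:-k]  # (2,)   -> the outer, (n - k)-dimensional part
diff_shape = s[-k:]  # (3, 4) -> the inner, k-dimensional part

# Derivative of one inner block under an element-wise function:
print(diff_shape * 2)               # (3, 4, 3, 4)
# Derivative of the whole array:
print(data_shape + diff_shape * 2)  # (2, 3, 4, 3, 4)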
I understand part of it. Suppose we have a tensor x:
x = [[[1, 2], [3, 4]],
[[5, 6], [7, 8]]]
Then we can calculate the derivative of an element-wise function by flattening x to a vector:
x_v = [1, 2, 3, 4, 5, 6, 7, 8]
y = np.exp(x_v)
x_diff = y.diff[x].reshape(x.shape)
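In plain NumPy, this flatten/differentiate/reshape idea could be sketched like this (the y.diff[x] above is only pseudocode for the API we are discussing):

import numpy as np

x = np.array([[[1., 2.], [3., 4.]],
              [[5., 6.], [7., 8.]]])
x_v = x.ravel()                 # shape (8,)
y_v = np.exp(x_v)

# Element-wise derivative values, reshaped back to x's shape:
x_diff = y_v.reshape(x.shape)   # shape (2, 2, 2)
# The full Jacobian of the flattened vector is diagonal:
jac_v = np.diag(y_v)            # shape (8, 8)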
But I have two problems. First, we can get x.shape == (2, 2, 2). This can be divided as two arrays x1 and x2:
x1 = [[1, 2], [3, 4]] # x1.shape=(2, 2)
x2 = [[5, 6], [7, 8]] # x2.shape=(2, 2)
If the function is np.exp, why will the derivative have the shape s[:-k] + (s[-k:] * 2)? In this case, that is (2,) + ((2, 2) * 2), while I think it should be (2, 2, 2).
Second, I think the reduction method can handle the six kinds of derivatives mentioned in Matrix_calculus, but I don't know whether it can be applied to the other three types: matrix-by-vector, vector-by-matrix, and matrix-by-matrix.
For example, if the operation is matrix multiplication:
u = [[u_11, u_12], [u_21, u_22]]
v = [[v_11], [v_21]]
y = np.dot(u, v) = [[u_11 * v_11 + u_12 * v_21], [u_21 * v_11 + u_22 * v_21]]
In this case, how should we add or drop the dimension?
So let's say that x is an array with x.shape == (2, 3, 4), with x.data_ndim == 1 and x.diff_ndim == 2. Let's say it's your preferred array, np.arange(24).reshape((2, 3, 4)). Then x.data_shape == (2,) and x.diff_shape == (3, 4). That means that x is to be treated as a stack of 3x4 matrices at each step.
This means that dx / dx (which I will call x.diffs[x]) will have shape x.data_shape + 2 * x.diff_shape == (2, 3, 4, 3, 4). In this array, the following statement will hold, in python: np.all(x.diffs[x].reshape((2, 12, 12)) == [np.eye(12), np.eye(12)]).
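A small self-contained check of that statement in plain NumPy (the .diffs attribute itself is the proposed API, so here dx/dx is just built by hand):

import numpy as np

x = np.arange(24).reshape((2, 3, 4))

# dx/dx: identity over the diff dimensions, broadcast over the data dimension.
dx_dx = np.broadcast_to(np.eye(12).reshape((3, 4, 3, 4)), (2, 3, 4, 3, 4))

assert np.all(dx_dx.reshape((2, 12, 12)) == [np.eye(12), np.eye(12)])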
When you do np.exp(x), for its derivative we consider d(np.exp(x[idx1]))/d(x[idx2]) for all possible pairs of valid indices idx1 and idx2 into arrays of shape x.diff_shape. This will be np.exp(x[idx1]) for idx1 == idx2 and 0 otherwise.
Thus, the following python can be used for element-wise operations:
# [Assuming y = np.exp(x)]
diff_size = functools.reduce(operator.mul, x.diff_shape, 1)
data_size = functools.reduce(operator.mul, x.data_shape, 1)
y.diffs[x] = x.diffs[x].reshape((data_size, diff_size, diff_size)) @ (np.zeros(data_size, diff_size, diff_size) * np.exp(x.diffs[x]))
y.diffs[x].resize(x.diffs[x].shape)
I will write another comment later explaining how matmul is handled.
Thanks for your explanation. I confused scalar derivatives with matrix derivatives. For an element at one position, we consider its derivative with respect to the elements at all positions. So y.diffs[x][0][1][2][3][4] means d(x[0][1][2])/d(x[0][3][4]). But the last line in the code block will output all zeros. Should it be like this:
y.diffs[x] = x.diffs[x].reshape((data_size, diff_size, diff_size)) @ (x.diffs[x] * np.exp(x.diffs[x]))
I still think there's an error, as x.diffs[x] is multiplied twice. Although this holds in this case, it might not be true in the general case. So the second x.diffs[x] should be np.eye(diff_size).
I ran the code in Jupyter:
import functools
import operator
import numpy as np

x = np.arange(24).reshape((2, 3, 4))
diff_size = functools.reduce(operator.mul, (3, 4), 1)
data_size = functools.reduce(operator.mul, (2, ), 1)
x_diff = np.stack([np.eye(diff_size), np.eye(diff_size)]).reshape((2, 3, 4, 3, 4))  # x.diffs[x]
We can get that np.eye(diff_size).shape == (12, 12) and np.exp(x.diffs[x]).shape == (2, 3, 4, 3, 4). These two matrices cannot be multiplied. What's more, the values of y.diffs[x] are wrong: these values should be np.exp(x), not np.exp(x.diffs[x]). So I think the correct code may be:
y_diff = np.repeat(np.exp(x), diff_size).reshape(x_diff.shape) * x_diff # y.diffs[x]
Here np.repeat(np.exp(x), diff_size).reshape(x_diff.shape) is the derivative of the function np.exp, and np.eye(diff_size) (i.e. x_diff) is the chain-rule factor.
Sounds good! I'd make a couple of small but important changes for performance:
x_diff = np.broadcast_to(np.eye(diff_size).reshape((3, 4, 3, 4)), (2, 3, 4, 3, 4)) # x.diffs[x]
And
y_diff = np.exp(x)[..., None, None, :, :] * x_diff # y.diffs[x]
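For reference, a quick self-contained check in plain NumPy that the two formulations agree; entry [d, i, j, k, l] reads as d(exp(x)[d, i, j]) / d(x[d, k, l]):

import numpy as np

x = np.arange(24).reshape((2, 3, 4)).astype(float)
diff_size = 12

x_diff = np.broadcast_to(np.eye(diff_size).reshape((3, 4, 3, 4)), (2, 3, 4, 3, 4))
y_diff_a = np.repeat(np.exp(x), diff_size).reshape(x_diff.shape) * x_diff
y_diff_b = np.exp(x)[..., None, None, :, :] * x_diff

assert np.allclose(y_diff_a, y_diff_b)
assert np.isclose(y_diff_a[0, 1, 2, 1, 2], np.exp(x[0, 1, 2]))  # diagonal entry
assert y_diff_a[0, 1, 2, 0, 0] == 0.0                           # off-diagonal entry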
> So y.diffs[x][0][1][2][3][4] means d(x[0][1][2])/d(x[0][3][4]).
I made some mistakes here. y.diffs[x][0][1][2][3][4] means d(y[1][2])/d(x[0][3][4]).
What's more, I think we should add a dimension to indicate the data_dim of y. For matrix multiplication, suppose we have two arrays u.shape=(2, 2, 2) and v.shape=(2, 2, 1); we will get that y=np.dot(u, v) and y.shape=(4, 2, 1). Then y.diffs[u][1][1][2][0][3][4] means d(y[1][1][2])/d(u[0][3][4]).
> For matrix multiplication, suppose we have two arrays u.shape=(2, 2, 2) and v.shape=(2, 2, 1); we will get that y=np.dot(u, v) and y.shape=(4, 2, 1).
I don't think that's right... We'll get y.shape=(2, 2, 1) for np.matmul, and y.shape=(2, 2, 2, 1) for np.dot.
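A quick shape check with plain NumPy:

import numpy as np

u = np.ones((2, 2, 2))
v = np.ones((2, 2, 1))
print(np.matmul(u, v).shape)  # (2, 2, 1): batched matrix product
print(np.dot(u, v).shape)     # (2, 2, 2, 1): sum-product over the last axis of u
                              # and the second-to-last axis of v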
> y.diffs[x][0][1][2][3][4] means d(y[1][2])/d(x[0][3][4]).
y has the same shape as x, so it should be d(y[0][1][2])/d(x[0][3][4]).
I was wrong. ( ̄ 'i  ̄;) I am trying to figure out how to deal with matrix multiplication. Can you share some of your thoughts?
> I was wrong. ( ̄ 'i  ̄;)
That's okay, I was a few times, too. 😄
So, let's talk about matrix multiplication (and, in general, tensor operations). For these operations, we must be aware that some dimensions will be "dropped".
Let's consider a few cases:
u.shape = (D, M, N) and v.shape = (D, N, P) in all cases, and y = u @ v = np.matmul(u, v). First, u.diff_ndim = 2 and v.diff_ndim = 2. In this case, y.shape = (D, M, P) in all cases. But what happens to y.diffs[u] and y.diffs[v]? To think about this, we have to think how dimensions are "dropped", so to speak. We can see, from the output shape, that the dimension N is dropped from both inputs.
So let's think about the output then. y.diffs[x] = u.diffs[x] @ v + v.diffs[x] @ u (for the 2-D case). y.diffs[u][d, m1, p1, m2, n2] = d(y[d, m1, p1])/d(u[d, m2, n2]) and y.diffs[v][d, m1, p1, n2, p2] = d(y[d, m1, p1])/d(v[d, n2, p2]).
Then one would need to convert these expressions into sums and express them as matrix multiplies themselves (hint: it would be a sum over the dimension N in each case).
> Then one would need to convert these expressions into sums and express them as matrix multiplies themselves (hint: it would be a sum over the dimension N in each case).
Can you give some sample code? I tried for a long time but couldn't figure out how to express this process as code.
I consider this problem in this way:
Then the code will be:
# u.shape = (D, M, N), v.shape = (D, N, P), y.shape = (D, M, P)
tmp = np.zeros((M * M, P * N))
indices = np.arange(0, M * M, step=M+1)
tmp * v.T.flatten() # obtain the results shown in the figure above
But this method cannot integrate the chain rule (u.diffs[u]).
I'm making this up as I go along, but, here's the solution:
(Side note: Maybe we can switch around xidx and yidx in y.diffs[x], and make this uniform. That is, y.diffs[x][xidx + yidx] = d(y[yidx])/d(x[xidx]), instead of y.diffs[x][yidx + xidx] = d(y[yidx])/d(x[xidx]).) Leaving yidx last allows us to write expressions more simply in terms of y, as we'll see, and broadcast over x.
Let's think of this another way... Tensors! y = sum_k(u[..., i, k] * v[..., k, j]). So, by the product rule, dy/dx = sum_k((du/dx)[..., i, k] * v[..., k, j] + u[..., i, k] * (dv/dx)[..., k, j]). But these are just matrix products (!), so it will also be dy/dx = u @ v.diffs[x] + v @ u.diffs[x].
We just need to consider four cases: matrix-matrix, matrix-vector, vector-matrix, and vector-vector. For the non-matrix-matrix cases, we just insert a dimension in the appropriate place to make it matrix-matrix and remove it again from the answer.
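For the batched matrix-matrix case, here is a plain-NumPy sketch of the two Jacobians above, with a finite-difference spot check (the .diffs attribute is still the proposed API, so the tensors are simply written out with einsum):

import numpy as np

D, M, N, P = 2, 3, 4, 5
u = np.random.randn(D, M, N)
v = np.random.randn(D, N, P)
y = u @ v                                           # shape (D, M, P)

# d(y[d, m1, p1]) / d(u[d, m2, n2]) = delta(m1, m2) * v[d, n2, p1]
dy_du = np.einsum('ab,dnp->dapbn', np.eye(M), v)    # shape (D, M, P, M, N)
# d(y[d, m1, p1]) / d(v[d, n2, p2]) = delta(p1, p2) * u[d, m1, n2]
dy_dv = np.einsum('ab,dmn->dmanb', np.eye(P), u)    # shape (D, M, P, N, P)

# Spot check dy_du against a small perturbation of u[0, 1, 2]:
eps = 1e-6
du = np.zeros_like(u)
du[0, 1, 2] = eps
numerical = ((u + du) @ v - y) / eps                # shape (D, M, P)
assert np.allclose(numerical[0], dy_du[0, :, :, 1, 2], atol=1e-4)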
I think we need a little more time to find an elegant way to do it. I will work on it and share the latest ideas in this issue this week. I submitted some simple test cases in #15; you can check whether my understanding of the derivation mechanism is correct.
Maybe we can explore other APIs.
> Maybe we can explore other APIs.
Do you mean that we can call other matrix derivation libraries?
> Do you mean that we can call other matrix derivation libraries?
I mean survey other libraries to see what they do in these cases.
Not sure I understand, so this might be OT, but is there anything to learn from the jax vmap operator?
That seems to be more-or-less correct: https://jax.readthedocs.io/en/latest/notebooks/autodiff_cookbook.html#How-it's-made:-two-foundational-autodiff-functions
Thanks! I will learn it.
Hi, I studied The-Autodiff-Cookbook and wrote up what I learned in my blog.
In fact, we need two functions: jvp (or vjp) and vmap.
We can refer to autodidact to complete jvp (or vjp), and then imitate JAX to complete vmap.
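For example, a minimal sketch (not the project's API) of what a reverse-mode vjp rule for np.matmul looks like in plain NumPy, in the autodidact style:

import numpy as np

def matmul_vjp(g, u, v):
    # y = u @ v with u: (..., M, N), v: (..., N, P); g is the cotangent dL/dy
    du = g @ np.swapaxes(v, -1, -2)   # dL/du, shape (..., M, N)
    dv = np.swapaxes(u, -1, -2) @ g   # dL/dv, shape (..., N, P)
    return du, dv

u = np.random.randn(4, 3, 2)
v = np.random.randn(4, 2, 3)
g = np.ones((4, 3, 3))                # cotangent for y = u @ v
du, dv = matmul_vjp(g, u, v)
print(du.shape, dv.shape)             # (4, 3, 2) (4, 2, 3)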
Is this fully implemented?
We don't need to write rules now, but use the same method as JAX, using basic differential operators and computation graphs to get the results. If u = np.random.randn(4, 3, 2), v = np.random.randn(4, 2, 3), and y = np.matmul(u, v), then y.to(u, jacobian=True) will return a Jacobian matrix of shape (4, 3, 3, 4, 3, 2).
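As a cross-check of that shape with JAX (y.to(u, jacobian=True) is our own API, so this is just for comparison):

import jax
import jax.numpy as jnp
import numpy as np

u = np.random.randn(4, 3, 2)
v = np.random.randn(4, 2, 3)
jac = jax.jacobian(lambda u_: jnp.matmul(u_, v))(u)
print(jac.shape)   # (4, 3, 3, 4, 3, 2) == y.shape + u.shape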
Because the backward_jacobian needs to be modified to get higher-order derivatives, the diff of np.stack needs to be added now. I'm working on it.
Hi, I think the task of designing the "separation" between the data dimensions and the differentiation dimensions is the key part of our project, so we need to discuss it and reach a good consensus. I want to add an attribute such as data_ndim to the DiffArray object to indicate the input type, and I can imagine a simple mechanism along those lines. But I noticed this historical note in Gitter: "Depends... I'd like to do something like say that out of an n dimensional array, k dimensions are for differentiation and n - k are for data." I can't understand it. Could you give more details or examples to illustrate it?
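To make the question concrete, here is a hypothetical sketch of what such an attribute could look like (the names and class are illustrative only, not the final DiffArray API):

from dataclasses import dataclass
import numpy as np

@dataclass
class DiffArraySketch:
    value: np.ndarray
    diff_ndim: int                    # k: trailing dimensions used for differentiation

    @property
    def data_ndim(self):              # n - k: leading dimensions treated as data
        return self.value.ndim - self.diff_ndim

    @property
    def data_shape(self):
        return self.value.shape[:self.data_ndim]

    @property
    def diff_shape(self):
        return self.value.shape[self.data_ndim:]

x = DiffArraySketch(np.arange(24).reshape(2, 3, 4), diff_ndim=2)
print(x.data_shape, x.diff_shape)     # (2,) (3, 4)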