d2l-ai / d2l-en

Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 500 universities from 70 countries including Stanford, MIT, Harvard, and Cambridge.
https://D2L.ai

Inconsistency of matrix dimensions in the sections softmax-regression / multilayer-perceptrons #1412

Closed: phupe closed this issue 4 years ago

phupe commented 4 years ago

Hi,

It seems that there are some inconsistencies in some matrix dimensions.

In the section 3.4.2. Network Architecture, it is written: $\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}$

From the notations introduced in section 3.4.4. Vectorization for Minibatches, we have:

The size of the matrix $\mathbf{W}$ in section 3.4.2. Network Architecture is $(q, d)$, but it is written in section 3.4.4. Vectorization for Minibatches that $\mathbf{W} \in \mathbb{R}^{d \times q}$.

This means that the same notation $\mathbf{W}$ denotes different matrices in sections 3.4.2 and 3.4.4, which is confusing.

Maybe section 3.4.2. Network Architecture could be written something like this: $\mathbf{o} = \mathbf{W}^T \mathbf{x} + \mathbf{b}$

and

$$
\begin{aligned}
o_1 &= x_1 w_{11} + x_2 w_{21} + x_3 w_{31} + x_4 w_{41} + b_1,\\
o_2 &= x_1 w_{12} + x_2 w_{22} + x_3 w_{32} + x_4 w_{42} + b_2,\\
o_3 &= x_1 w_{13} + x_2 w_{23} + x_3 w_{33} + x_4 w_{43} + b_3.
\end{aligned}
$$
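
As a quick sanity check of the proposed notation, here is a minimal sketch in TensorFlow (the shapes d = 4 and q = 3 match the equations above; everything else is illustrative, not from the book):

```python
import tensorflow as tf

d, q = 4, 3                   # 4 inputs, 3 outputs, matching the equations above
x = tf.ones((d, 1))           # one data example as a column vector
W = tf.ones((d, q))           # W in R^{d x q}, the convention of section 3.4.4
b = tf.ones((q, 1))           # bias as a column vector

o = tf.matmul(tf.transpose(W), x) + b  # shape (q, 1): o = W^T x + b is consistent
```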

There is also another issue in section 3.4.4. Vectorization for Minibatches: $\mathbf{O} = \mathbf{X} \mathbf{W} + \mathbf{b}$

$\mathbf{O}$ is $(n, q)$ while $\mathbf{b} \in \mathbb{R}^{1 \times q}$.

A similar issue is also present in section 4.1.1.3. From Linear to Nonlinear.

astonzhang commented 4 years ago

The shape of W depends on whether the input is a column vector (x, one data example) or a matrix (X, a minibatch of examples). In matrix X, each row (not column) is a data example.

See http://en.d2l.ai.s3-website-us-west-2.amazonaws.com/chapter_preliminaries/linear-algebra.html:

Matrices are useful data structures: they allow us to organize data that have different modalities of variation. For example, rows in our matrix might correspond to different houses (data examples), while columns might correspond to different attributes. This should sound familiar if you have ever used spreadsheet software or have read Section 2.2. Thus, although the default orientation of a single vector is a column vector, in a matrix that represents a tabular dataset, it is more conventional to treat each data example as a row vector in the matrix. And, as we will see in later chapters, this convention will enable common deep learning practices. For example, along the outermost axis of a tensor, we can access or enumerate minibatches of data examples, or just data examples if no minibatch exists.
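
For instance, with the row convention the output of the minibatch product has one row per example (a small illustrative sketch; the shapes are assumptions):

```python
import tensorflow as tf

X = tf.ones((10, 3))   # n = 10 examples as rows, d = 3 features each
W = tf.ones((3, 4))    # d = 3 inputs, q = 4 outputs
O = tf.matmul(X, W)    # row i of O is the output for example i
print(O.shape)         # (10, 4)
```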

phupe commented 4 years ago

Thanks. Sure, that's right.

You will see in the statement below that you have O (n, q) = X (n, d) * W (d, q) + b (1, q), and you cannot sum a (1, q) vector with an (n, q) matrix:

[screenshot: matrix-dimension]

You should have O (n, q) = X (n, d) * W (d, q) + B (n, q), with each row of B being the vector b.

I think that O (n, q) = X (n, d) * W (d, q) + b (1, q) is just an abuse of notation, since in Python both give the same result:

```python
import tensorflow as tf

X = tf.ones(shape=(10, 3))   # minibatch: n = 10 examples as rows, d = 3 features
W = tf.ones(shape=(3, 4))    # weights: d = 3 inputs, q = 4 outputs
b = tf.Variable(tf.ones(4))  # bias as a length-q vector
B = tf.ones(shape=(10, 4))   # bias replicated into an (n, q) matrix

M = tf.matmul(X, W)          # shape (n, q)
M + b == M + B               # elementwise True: both sums give the same result
```

astonzhang commented 4 years ago

> you have O (n, q) = X (n, d) * W (d, q) + b (1, q), and you cannot sum a (1, q) vector with an (n, q) matrix

Broadcasting is applied here. See the explanation that immediately follows (3.4.5) in https://d2l.ai/chapter_linear-networks/softmax-regression.html#softmax-operation:

"Triggering broadcasting during the summation ... "

phupe commented 4 years ago

Dear Aston, great, thanks a lot for your reply.