The shape of W depends on whether the input is a column vector (x, one data example) or a matrix (X, a minibatch of examples). In matrix X, each row (not column) is a data example.
See http://en.d2l.ai.s3-website-us-west-2.amazonaws.com/chapter_preliminaries/linear-algebra.html:
Matrices are useful data structures: they allow us to organize data that have different modalities of variation. For example, rows in our matrix might correspond to different houses (data examples), while columns might correspond to different attributes. This should sound familiar if you have ever used spreadsheet software or have read Section 2.2. Thus, although the default orientation of a single vector is a column vector, in a matrix that represents a tabular dataset, it is more conventional to treat each data example as a row vector in the matrix. And, as we will see in later chapters, this convention will enable common deep learning practices. For example, along the outermost axis of a tensor, we can access or enumerate minibatches of data examples, or just data examples if no minibatch exists.
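To make this concrete, here is a minimal TensorFlow sketch (the shapes and data are illustrative, not from the book): indexing along the outermost axis yields one data example as a row.

import tensorflow as tf

# Illustrative tabular minibatch: n = 5 examples (rows), d = 3 attributes (columns).
X = tf.random.uniform(shape=(5, 3))
print(X[0].shape)    # (3,): the first data example, one row of d attributes
print(X.shape[0])    # 5: the outermost axis enumerates the data examples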
Thanks. Sure, that's right.
You will see in the statement below that you have O (n, q) = X (n, d) · W (d, q) + b (1, q), and you cannot sum a (1, q) vector with an (n, q) matrix:
you should have O (n, q) = X (n, d) · W (d, q) + B (n, q), with each row of B being the vector b.
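One way to write that replication precisely (this reformulation is mine, not from the book) is as the product of an all-ones column vector with b:

$$\mathbf{B} = \mathbf{1}_n \mathbf{b}, \qquad \mathbf{1}_n = (1, \ldots, 1)^T \in \mathbb{R}^{n \times 1}, \qquad \mathbf{b} \in \mathbb{R}^{1 \times q},$$

so that $\mathbf{O} = \mathbf{X} \mathbf{W} + \mathbf{1}_n \mathbf{b}$ has consistent shapes, with every row of $\mathbf{B} \in \mathbb{R}^{n \times q}$ equal to $\mathbf{b}$.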
I think that O (n, q) = X (n, d) · W (d, q) + b (1, q) is just an abuse of notation, since in Python both give the same result:
import tensorflow as tf

X = tf.ones(shape=(10, 3))   # n = 10 examples, d = 3 features
W = tf.ones(shape=(3, 4))    # d = 3 inputs, q = 4 outputs
b = tf.Variable(tf.ones(4))  # bias vector of length q
B = tf.ones(shape=(10, 4))   # bias replicated over the n rows
M = tf.matmul(X, W)          # shape (10, 4)
M + b == M + B               # elementwise comparison: all True
you have O (n, q) = X (n, d) · W (d, q) + b (1, q) and you cannot sum a (1, q) vector with an (n, q) matrix:
Broadcasting is applied here. See the explanation that immediately follows (3.4.5) in https://d2l.ai/chapter_linear-networks/softmax-regression.html#softmax-operation:
"Triggering broadcasting during the summation ... "
Dear Aston, Great, thanks a lot for your reply.
Hi,
It seems that there are some inconsistencies in some matrix dimensions.
In the section 3.4.2. Network Architecture, it is written:
$\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}$
From the notation introduced in section 3.4.4. Vectorization for Minibatches, the size of the matrix $\mathbf{W}$ in section 3.4.2. Network Architecture is $(q, d)$, but section 3.4.4. Vectorization for Minibatches states that $\mathbf{W} \in \mathbb{R}^{d \times q}$.
It means that for the same notation $\mathbf{W}$, the matrices are not the same in sections 3.4.2 and 3.4.4, which is confusing. Maybe section 3.4.2. Network Architecture could instead be written as something like:
$\mathbf{o} = \mathbf{W}^T \mathbf{x} + \mathbf{b}$
and
$$\begin{aligned} o_1 &= x_1 w_{11} + x_2 w_{21} + x_3 w_{31} + x_4 w_{41} + b_1,\\ o_2 &= x_1 w_{12} + x_2 w_{22} + x_3 w_{32} + x_4 w_{42} + b_2,\\ o_3 &= x_1 w_{13} + x_2 w_{23} + x_3 w_{33} + x_4 w_{43} + b_3. \end{aligned}$$
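As a quick sanity check of the suggested $\mathbf{W}^T$ form (values below are arbitrary, chosen only for illustration), the matrix expression reproduces the first componentwise equation:

import tensorflow as tf

d, q = 4, 3
W = tf.random.uniform(shape=(d, q))    # W in R^{d x q}, as in section 3.4.4
x = tf.random.uniform(shape=(d, 1))    # one example, as a column vector
b = tf.random.uniform(shape=(q, 1))

o = tf.matmul(tf.transpose(W), x) + b  # o = W^T x + b, a (q, 1) column vector

# o_1 = x_1 w_11 + x_2 w_21 + x_3 w_31 + x_4 w_41 + b_1
o1 = tf.reduce_sum(x[:, 0] * W[:, 0]) + b[0, 0]
print(bool(abs(o[0, 0] - o1) < 1e-5))  # True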
There is another issue in section 3.4.4. Vectorization for Minibatches:
$$\mathbf{O} = \mathbf{X} \mathbf{W} + \mathbf{b}$$
$\mathbf{O}$ is $(n, q)$ while $\mathbf{b} \in \mathbb{R}^{1 \times q}$. A similar issue is also present in section 4.1.1.3. From Linear to Nonlinear.