chenghong-lin-nu opened this issue 6 years ago
RNNs address the problems traditional neural networks have with sequential data such as text. They are networks with loops in them, which allows information to persist.
The figure above shows a piece of a neural network: a chunk A receives an input xt and produces an output ht.
A loop allows information to be passed from one step of the network to the next.
In other words, the loop lets an RNN read its inputs one step at a time, learn from each step, and pass what it has learned so far on to the next step.
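The step-by-step loop can be sketched in a few lines of numpy. This is a minimal illustration, not any library's API; the sizes and weight names (Wx, Wh, b) are hypothetical.

```python
import numpy as np

input_size, hidden_size = 3, 5
rng = np.random.default_rng(0)
Wx = rng.normal(size=(input_size, hidden_size))   # input-to-hidden weights
Wh = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden (the "loop")
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # h_prev carries information from earlier steps into the current one
    return np.tanh(x_t @ Wx + h_prev @ Wh + b)

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(4, input_size)):  # a sequence of 4 inputs
    h = rnn_step(x_t, h)

print(h.shape)  # (5,)
```

The same hidden state h is fed back in at every step, which is exactly the "information passed from one step to the next" described above.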
LSTM = Long Short Term Memory networks
It is a kind of RNN that is capable of learning long-term dependencies.
Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!
All RNNs have the form of a chain of repeating modules of a neural network.
In a standard RNN, this repeating module has a very simple structure, such as a single tanh layer.
LSTMs also have this chain-like structure, but the repeating module is different: instead of a single neural network layer, there are four layers, interacting with one another.
Notation used in the diagrams:
Each line carries an entire vector; pink circles represent pointwise operations; yellow boxes are learned neural network layers; a line forking means its content is copied, and the copies go to different places.
The first step is a sigmoid "forget gate layer". It looks at ht−1 and xt, and outputs a number between 0 and 1 for each number in the cell state Ct−1. A 1 represents "completely keep this" while a 0 represents "completely get rid of this."
This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, C~t, that could be added to the state. In the next step, we’ll combine these two to create an update to the state.
First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
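The three steps above (forget gate, input gate plus candidate values, output gate) can be sketched as one LSTM step in numpy. This is an illustrative layout, not a framework implementation; stacking all four layers into one weight matrix W is an assumption made here for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    H = h_prev.size
    # One matrix multiply computes all four layers on [h_{t-1}, x_t]
    z = np.concatenate([h_prev, x_t]) @ W + b
    f = sigmoid(z[0*H:1*H])          # forget gate: 1 = "completely keep", 0 = "get rid of"
    i = sigmoid(z[1*H:2*H])          # input gate: which values to update
    C_tilde = np.tanh(z[2*H:3*H])    # candidate values ~C_t
    o = sigmoid(z[3*H:4*H])          # output gate: which parts of the state to output
    C_t = f * C_prev + i * C_tilde   # update the cell state
    h_t = o * np.tanh(C_t)           # output a filtered, squashed cell state
    return h_t, C_t

# Hypothetical sizes for demonstration
input_size, hidden_size = 3, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(hidden_size + input_size, 4 * hidden_size))
b = np.zeros(4 * hidden_size)

h, C = np.zeros(hidden_size), np.zeros(hidden_size)
h, C = lstm_step(rng.normal(size=input_size), h, C, W, b)
```

Note how the cell state C is only touched by pointwise multiplies and adds, which is what lets information flow through many steps largely unchanged.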
Embeddings are just a shortcut for doing this matrix multiplication.
An embedding is just a convenient shortcut for the matrix multiplication Andrew described earlier: when the input is one-hot (all zeros except a single 1), multiplying by the weight matrix requires no actual multiplication — you can simply read off the corresponding row.
Embedding lookup: we treat the weight matrix as a lookup table. We encode all the words as integers, and when we need a word's vector we just look up its index.
For example, "heart" is encoded as 958 and "mind" as 18094. Then to get the hidden layer values for "heart", you just take the 958th row of the embedding matrix.
Embedding dimension: the number of hidden units.
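The equivalence between the one-hot matrix multiply and the row lookup can be checked directly. The vocabulary size below is an assumption for illustration; the word indices 958 and 18094 are the example encodings from the notes.

```python
import numpy as np

vocab_size, embed_dim = 20000, 300   # embed_dim = number of hidden units
rng = np.random.default_rng(0)
embedding = rng.random((vocab_size, embed_dim))  # the weight / lookup matrix

word_to_id = {"heart": 958, "mind": 18094}

# One-hot multiply (wasteful: a full matrix-vector product)
one_hot = np.zeros(vocab_size)
one_hot[word_to_id["heart"]] = 1
via_matmul = one_hot @ embedding

# Direct lookup (equivalent, just reads one row)
via_lookup = embedding[word_to_id["heart"]]

assert np.allclose(via_matmul, via_lookup)
```

Since the two results are identical, frameworks implement embeddings as the lookup, skipping the multiply entirely.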
Start off as a generalist, try out a few things, and if you really like an area, become a specialist in it.
Intro to Recurrent Neural Networks
Intro to RNNs (recurrent neural networks)
LSTMs (Long Short-Term Memory)
Character-wise RNNs
Sequence batching