Langzzx / Deep-Learning---course-Note-Ex

Based on the Udacity Deep Learning Nanodegree Foundation

Deep learning recommended links: #1

Open Langzzx opened 7 years ago

Langzzx commented 7 years ago

  1. cheat sheet about DL/ML architectures

  2. http://deeplearninggallery.com/ - Deep Learning Gallery - a curated list of awesome deep learning projects

Langzzx commented 7 years ago

Word2Vec/ Embedding:

  1. http://www.thushv.com/natural_language_processing/word2vec-part-1-nlp-with-deep-learning-with-tensorflow-skip-gram/#embeddings
  2. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

    Study blog: http://www.thushv.com/
Langzzx commented 7 years ago

RNN related:

Important links for this project:

Basic
http://sebastianruder.com/word-embeddings-1/
http://monik.in/a-noobs-guide-to-implementing-rnn-lstm-using-tensorflow/
http://suriyadeepan.github.io/2016-12-31-practical-seq2seq/
http://r2rt.com/recurrent-neural-networks-in-tensorflow-i.html
Understanding Truncation in RNN
https://indico.io/blog/sequence-modeling-neuralnets-part1/
Intermediate
https://chunml.github.io/ChunML.github.io/project/Sequence-To-Sequence/
https://indico.io/blog/sequence-modeling-neural-networks-part2-attention-models/
Highly Involved

A good course: Deep Learning for Natural Language Processing!

Further reading material (after you have understood everything that happens in this project):

Use of attention for better translation: http://stanford.edu/~lmthang/data/papers/emnlp15_attn.pdf

If we want to implement a more complex mechanism: what if we want the decoder to receive previously generated tokens as input at every timestep (instead of the lagged target sequence)? Or what if we want to implement soft attention, where at every timestep we add an additional fixed-length representation derived from a query produced by the previous step's hidden state? tf.nn.raw_rnn is a way to solve this problem.

http://selfdrivingcars.mit.edu/
https://in.udacity.com/course/self-driving-car-engineer-nanodegree--nd013/

I am including this section to cover the topic in good depth. (I hope you will appreciate it; if not, I will remove it after getting some feedback from you all.)

FAQ (Frequently asked questions)

What is the difference between an LSTM memory cell and an LSTM layer? Answer: Link!

What does tf.nn.dynamic_rnn require? Answer: Remember that the standard tf.nn.dynamic_rnn requires all inputs (t, ..., t+n) to be passed in advance as a single tensor. The "dynamic" part of its name refers to the fact that n can change from batch to batch.
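To make the "single tensor, but n can change per batch" point concrete, here is a minimal pure-Python sketch (not TensorFlow code). The helper name `pad_batch` and the padding value 0 are assumptions for illustration. Each batch is padded only to its own longest sequence, so the time dimension differs between batches. In TensorFlow 1.x, the true lengths would typically be passed through the `sequence_length` argument of tf.nn.dynamic_rnn.

```python
def pad_batch(sequences, pad_value=0):
    """Pad variable-length sequences to the longest one in this batch.

    Returns the rectangular batch (a list of equal-length lists) and the
    original length of every sequence.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    return padded, lengths


# Two batches with different maximum lengths (n = 3 and n = 5).
batch_a = [[1, 2, 3], [4, 5]]
batch_b = [[6, 7, 8, 9, 10], [11], [12, 13]]

padded_a, lengths_a = pad_batch(batch_a)
padded_b, lengths_b = pad_batch(batch_b)

print(padded_a, lengths_a)
print(padded_b, lengths_b)
```

Within a batch every row has the same length, which is what allows the whole batch to be handed over as one tensor.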

What is the difference between an RNN and an LSTM, and why prefer an LSTM? Answer: All RNNs have feedback loops in the recurrent layer. This lets them maintain information in 'memory' over time. However, it can be difficult to train standard RNNs on problems that require learning long-term temporal dependencies, because the gradient of the loss function decays exponentially as it is propagated back through time (the vanishing gradient problem). LSTM networks are a type of RNN that uses special units in addition to standard units. LSTM units include a 'memory cell' that can maintain information in memory for long periods of time. A set of gates controls when information enters the memory, when it is output, and when it is forgotten. This architecture lets them learn longer-term dependencies. (Reference Link!)
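The exponential decay can be seen in a toy scalar RNN. The sketch below is an illustration under assumed values (recurrent weight 0.5, constant input 0.1), not a model from the course. It multiplies the per-step derivatives of h_t = tanh(w * h_{t-1} + x_t) to get the sensitivity of the last state to the first one.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """Return |d h_T / d h_0| for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - tanh^2(.)) = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return abs(grad)


short_run = gradient_through_time(0.5, [0.1] * 5)
long_run = gradient_through_time(0.5, [0.1] * 50)

print("5 steps: ", short_run)
print("50 steps:", long_run)
```

Because every factor is at most |w| = 0.5 here, the product is bounded by 0.5 raised to the number of steps, so the signal from early timesteps almost disappears after 50 steps.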

READING MATERIAL

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Why do we need clipping of gradients? Answer: An LSTM helps with learning long-term dependencies, but depending on the activation function used, another problem can arise. It is easy to imagine that, depending on our activation functions and network parameters, we could get exploding gradients instead of vanishing gradients if the values of the Jacobian matrix are large. Indeed, that's called the exploding gradient problem. The reason that vanishing gradients have received more attention than exploding gradients is two-fold. For one, exploding gradients are obvious: your gradients will become NaN (not a number) and your program will crash. Secondly, clipping the gradients at a pre-defined threshold (as discussed in this paper!) is a very simple and effective solution to exploding gradients.
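As a rough illustration of clipping at a pre-defined threshold, the sketch below rescales a flat list of gradient values when their overall norm exceeds the threshold. The function name and the flat-list representation are assumptions for illustration. TensorFlow offers comparable functionality in tf.clip_by_global_norm, which operates on lists of tensors.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so that their L2 norm does not exceed max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    global_norm = math.sqrt(sum(g * g for g in grads))
    if global_norm <= max_norm:
        return list(grads), global_norm
    scale = max_norm / global_norm
    return [g * scale for g in grads], global_norm


clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm)     # norm before clipping
print(clipped)  # same direction, norm reduced to max_norm
```

Rescaling the whole vector keeps the direction of the update and only limits its size, which is why the threshold can be set without changing which way the parameters move.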

READING MATERIAL

https://cs224d.stanford.edu/lecture_notes/LectureNotes4.pdf
https://arxiv.org/pdf/1211.5063v2.pdf

Why do we have to do the mapping anyway? Answer: Because it is better to feed numeric training data into the network (as with other learning algorithms). We also need a second dictionary to convert the numbers back to the original characters. That is why we created the two dictionaries in the previous project.
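A minimal sketch of the two dictionaries, assuming a character-level vocabulary built from a short example string (the names `char_to_int` and `int_to_char` are illustrative and may differ from the ones used in the project):

```python
text = "hello world"

# Build the vocabulary from the unique characters, in a stable order.
vocab = sorted(set(text))

# Forward dictionary: character -> integer id (what the network consumes).
char_to_int = {ch: i for i, ch in enumerate(vocab)}
# Reverse dictionary: integer id -> character (to read predictions back).
int_to_char = {i: ch for ch, i in char_to_int.items()}

encoded = [char_to_int[ch] for ch in text]
decoded = "".join(int_to_char[i] for i in encoded)

print(encoded)
print(decoded)
```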

What is word embedding? Answer: Word embedding is a technique for learning a dense representation of words in a low-dimensional vector space. Each word can be seen as a point in this space, represented by a fixed-length vector. This technique captures semantic relations between words, and the resulting word vectors have some interesting properties. Word embedding is typically done in the first layer of the network: the embedding layer, which maps a word (given as its index in the vocabulary) to a dense vector of a given size. In the seq2seq model, the weights of the embedding layer are trained jointly with the other parameters of the model. Follow this tutorial! by Sebastian Ruder to learn about the different models used for word embedding and its importance in NLP.
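Conceptually, the embedding layer is a lookup table with one row per vocabulary entry. The sketch below uses a tiny made-up vocabulary and randomly initialized rows. In a real model the rows are trainable weights, and the size 4 is only an assumption for illustration.

```python
import random

random.seed(0)

vocab = ["the", "cat", "sat"]
embedding_dim = 4

# One dense vector per word; in a real model these values are learned.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)] for _ in vocab
]
word_to_index = {word: i for i, word in enumerate(vocab)}


def embed(words):
    """Map each word to its row in the embedding table (a pure lookup)."""
    return [embedding_table[word_to_index[word]] for word in words]


vectors = embed(["cat", "the", "cat"])
print(len(vectors), len(vectors[0]))
```

The same word always maps to the same row, so both occurrences of "cat" share one vector.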

Have you ever wondered how Google Translate works? (Please read the following details.) Answer:

[Figure: GNMT model architecture]

Let's first talk about the GNMT model. The figure shows the model architecture of GNMT, Google's Neural Machine Translation system. On the left is the encoder network, on the right is the decoder network, and in the middle is the attention module. The bottom encoder layer is bi-directional: the pink nodes gather information from left to right while the green nodes gather information from right to left. The other layers of the encoder are uni-directional. Residual connections start from the layer third from the bottom in the encoder and decoder. The model is partitioned across multiple GPUs to speed up training. In our setup, we have 8 encoder LSTM layers (1 bi-directional layer and 7 uni-directional layers), and 8 decoder layers. Reference

But then Google did a very smart thing: it extended the previous GNMT system so that a single system can translate between multiple languages. The proposed architecture requires no change to the base GNMT system. Instead, it uses an additional "token" at the beginning of the input sentence to specify the required target language. In addition to improving translation quality, the method also enables "Zero-Shot Translation": translation between language pairs never seen explicitly by the system.

[Animation: language pairs used to train the multilingual system and the zero-shot pair]

Here’s how it works. Let’s say we train a multilingual system with Japanese⇄English and Korean⇄English examples, shown by the solid blue lines in the animation. Our multilingual system, with the same size as a single GNMT system, shares its parameters to translate between these four different language pairs. This sharing enables the system to transfer the “translation knowledge” from one language pair to the others. This transfer learning and the need to translate between multiple languages forces the system to better use its modeling power.

Now, can we translate between a language pair which the system has never seen before? An example of this would be translations between Korean and Japanese where Korean⇄Japanese examples were not shown to the system. Impressively, the answer is yes: it can generate reasonable Korean⇄Japanese translations, even though it has never been taught to do so. We call this "zero-shot" translation, shown by the yellow dotted lines in the animation. To the best of our knowledge, this is the first time this type of transfer learning has worked in Machine Translation. Reference

[Figure: Multilingual GNMT model architecture]

Now look at the change. The figure shows the model architecture of the Multilingual GNMT system. In addition to what is described in GNMT, our input has an artificial token to indicate the required target language. In this example, the token "<2es>" indicates that the target sentence is in Spanish, and the source sentence is reversed as a processing step.
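The input preparation described above can be sketched in a few lines. This is only an illustration: the function name `prepare_source`, whitespace-style tokens, and placing the token in front of the reversed sentence are assumptions, not details confirmed by the description.

```python
def prepare_source(tokens, target_lang):
    """Reverse the source tokens and prepend the target-language token.

    Assumption: the artificial token goes before the reversed sentence.
    """
    lang_token = "<2{}>".format(target_lang)
    return [lang_token] + list(reversed(tokens))


source = ["Hello", "how", "are", "you"]
model_input = prepare_source(source, "es")
print(model_input)
```

Since only the input changes, the same network can serve every language pair, which is why no change to the base architecture is needed.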

Langzzx commented 7 years ago

GAN:

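  1. Generative Models & Generative Adversarial Networks: a resource roundup (interviews + papers by category) - CSDN
  1. AdversarialNetsPapers

    Lists related papers and their source code.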

  2. Goodfellow paper ref code

  3. Image Completion with Deep Learning in TensorFlow

    Builds the GAN using TensorFlow.

  4. How to Train a GAN? Tips and tricks - github

    Started from the "How to Train a GAN?" talk at NIPS 2016.

  5. http://blog.evjang.com/2016/06/generative-adversarial-nets-in.html

  6. Collection of generative models

    e.g. GAN, VAE in Pytorch and Tensorflow

  7. wGAN, a very good explanation

Langzzx commented 7 years ago

Autoencoder

  1. Deep learning introductory tutorial: UFLDL study and experiment notes, part 1: sparse autoencoder
Langzzx commented 7 years ago

Basic knowledge

  1. Distill

    Distill is a great place for interactive presentations of the fundamentals, but it has relatively few resources.

Langzzx commented 7 years ago

Reinforcement Learning

  1. Python code for Reinforcement Learning: An Introduction

  2. Deep Learning Research Review Week 2: Reinforcement Learning

    summarizing and explaining research papers in specific subfields of deep learning, like AlphaGo

Langzzx commented 7 years ago

CNN

  1. Convolutional Neural Networks (CNNs / ConvNets)

    Introduction to the architectures

Langzzx commented 7 years ago

GAN project review ref link:

  1. Lower the momentum term to stabilize training. Higher momentum can result in training oscillation and instability (https://arxiv.org/pdf/1511.06434.pdf).

  2. To improve the results, you can try to actually call the generator optimization step twice:

sess.run(d_opt,...)
sess.run(g_opt,...)
sess.run(g_opt,...)
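The schedule above (one discriminator step, then two generator steps per batch) can be sketched as a plain loop. The callables `d_step` and `g_step` are hypothetical stand-ins for the sess.run(d_opt, ...) and sess.run(g_opt, ...) calls, and the loop records the call order instead of training anything.

```python
def train_gan(num_batches, d_step, g_step, g_steps_per_batch=2):
    """Run one discriminator step, then several generator steps, per batch."""
    for batch in range(num_batches):
        d_step(batch)
        for _ in range(g_steps_per_batch):
            g_step(batch)


# Stand-in steps that only log which optimizer would run on which batch.
calls = []
train_gan(
    3,
    d_step=lambda batch: calls.append(("d", batch)),
    g_step=lambda batch: calls.append(("g", batch)),
)
print(calls)
```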

  1. A great explanation of why we should tell the graph to update the ops: http://ruishu.io/2016/12/27/batchnorm/

    1. Use dropout with a keep_prob of 50-80% after batch normalization (in both the discriminator and the generator).
  2. Improve results in GAN:

https://github.com/soumith/ganhacks#how-to-train-a-gan-tips-and-tricks-to-make-gans-work
http://blog.otoro.net/2016/04/01/generating-large-images-from-latent-vectors/

Langzzx commented 7 years ago

hyperparameters

These are some great resources on the topic:

  1. Practical recommendations for gradient-based training of deep architectures by Yoshua Bengio
  2. Deep Learning book - chapter 11.4: Selecting Hyperparameters by Ian Goodfellow, Yoshua Bengio, Aaron Courville
  3. Neural Networks and Deep Learning book - Chapter 3: How to choose a neural network's hyper-parameters? by Michael Nielsen
  4. Efficient BackProp (pdf) by Yann LeCun

More specialized sources:

  1. How to Generate a Good Word Embedding? by Siwei Lai, Kang Liu, Liheng Xu, Jun Zhao
  2. Systematic evaluation of CNN advances on the ImageNet by Dmytro Mishkin, Nikolay Sergievskiy, Jiri Matas
  3. Visualizing and Understanding Recurrent Networks by Andrej Karpathy, Justin Johnson, Li Fei-Fei

mini-batch: Systematic evaluation of CNN advances on the ImageNet

Langzzx commented 7 years ago

ValidationMonitor (Deprecated)

  1. In TensorFlow, we can use a ValidationMonitor with tf.contrib.learn

Langzzx commented 7 years ago

Reinforcement learning

UCL Course on RL by David Silver (AlphaGo)

Langzzx commented 6 years ago

Understanding the backward pass through Batch Normalization Layer