JRC1995 / Abstractive-Summarization

Implementation of abstractive summarization using LSTM in the encoder-decoder architecture with local attention.
MIT License

Word2vec function and glove #4

Closed: hack1234567 closed this issue 6 years ago

hack1234567 commented 6 years ago

The word2vec function is used to find the vector representation of a word using glove, right? So what happens if a particular word is not found in glove? Is the nearest neighbor used to find the nearest vector? Is glove only used during preprocessing? During training are we substituting the actual word with the closest match? Thank you

JRC1995 commented 6 years ago

Yes, I used the word2vec function to find the vector representation of a word using glove. (But remember there is actually a word2vec algorithm which is different from glove, even though the purpose is the same. I didn't use the word2vec algorithm here; I simply named the function word2vec because it transforms a word to a vector.)

If a word is not found, the vector representation of 'unk' is returned in this code. Normally, the vector representation of a special token '<unk>' is used to represent a word that is not in glove; '<unk>' here means unknown or unavailable. But there are some alternate methods to work around this issue.

I used the nearest neighbor function in the vec2word function, which I use to find the word representation of a vector. That is, if a vector is not available in glove, the word representation of the closest matching vector (found with the nearest neighbor function) is used. Doing this (finding the nearest neighbor) may not be strictly necessary in this specific code, since we will only be dealing with vectors that are in glove, but I did it anyway.
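To make that concrete, here is a minimal sketch of the idea (a toy vocabulary and random vectors stand in for the real GloVe data; the function names follow the discussion above, not necessarily the repo's exact implementation):

```python
import numpy as np

# Toy stand-ins for the GloVe data (in reality these come from the GloVe file).
vocab = ['unk', 'dog', 'food', 'healthy']
embedding = np.random.randn(len(vocab), 50).astype(np.float32)
word2index = {word: i for i, word in enumerate(vocab)}

def word2vec(word):
    # Fall back to the 'unk' vector when the word is not in GloVe.
    return embedding[word2index.get(word, word2index['unk'])]

def nearest_neighbour(vec):
    # Index of the embedding row with the highest cosine similarity to vec.
    sims = embedding @ vec / (np.linalg.norm(embedding, axis=1) * np.linalg.norm(vec) + 1e-8)
    return int(np.argmax(sims))

def vec2word(vec):
    # The word whose GloVe vector is closest to the given vector.
    return vocab[nearest_neighbour(np.asarray(vec, dtype=np.float32))]

print(vec2word(word2vec('cat')))   # unknown word -> 'unk' vector -> 'unk'
```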

> Is glove only used during preprocessing?

It depends. You can train the glove model to learn word vector representations, or use pre-trained data to get an embedding matrix (which maps each word to its vector representation). You can use the embedding matrix as the first layer of the main network - in this case, the embedding layer (which converts a word to a vector) can be conceived as part of the main network rather than as pre-processing. If you are writing a program about glove itself, then of course the glove algorithm will be the main part of the program, not merely pre-processing. But most of the time, glove/word2vec is used early on in the program.
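As an illustration of the "embedding matrix as the first layer" option, here is a hedged TF1-style sketch (the sizes and variable names are assumptions, and the random matrix stands in for real pre-trained GloVe vectors):

```python
import numpy as np
import tensorflow as tf

vocab_size, word_vec_dim = 10000, 50     # assumed sizes
# A random matrix standing in for real pre-trained GloVe vectors.
glove_matrix = np.random.randn(vocab_size, word_vec_dim).astype(np.float32)

# The pre-trained matrix becomes the first (embedding) layer of the network.
embedding_layer = tf.Variable(glove_matrix, trainable=False, name='embedding')

word_ids = tf.placeholder(tf.int32, [None], name='word_ids')        # a text as word indices
word_vectors = tf.nn.embedding_lookup(embedding_layer, word_ids)    # indices -> word vectors
```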

hack1234567 commented 6 years ago

So during training vec_summaries is the main data, right? And vec_head is used only for loss finding?

JRC1995 commented 6 years ago

vec_texts and vec_summaries both contain the main data.

vec_texts contains the input (the texts that are to be summarized).

vec_summaries contains the target outputs (the corresponding summaries of the texts from vec_texts).

You can, however, say that vec_summaries is only for loss finding - because we can only find loss by comparing the predicted output to the target outputs (from vec_summaries) - but the direction and magnitude of this calculated loss are critical to the learning process. Both vec_texts and vec_summaries come from the Amazon dataset.
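As a generic illustration of that "loss finding" step (this is not the repo's exact loss code; the names here are made up), the predicted distribution at a decoding step is compared against the target word that the target summaries provide:

```python
import numpy as np

def step_loss(predicted_probs, target_index):
    # Cross-entropy between the predicted distribution over the vocabulary
    # and the target word at this decoding step.
    return -np.log(predicted_probs[target_index] + 1e-12)

# Toy usage: a 5-word vocabulary, and the target word at this step is index 2.
probs = np.array([0.1, 0.2, 0.5, 0.1, 0.1])
print(step_loss(probs, 2))   # smaller when the target word gets more probability
```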

Can't find vec_head. I don't remember creating it. Can you provide some clues about where it is, in case I have forgotten?

hack1234567 commented 6 years ago

It was vec_texts not heads, my bad, I apologize.

hack1234567 commented 6 years ago

```python
for summary in summaries:

    vec_summary = []

    for word in summary:
        vec_summary.append(word2vec(word))

    vec_summary.append(word2vec('eos'))

    vec_summary = np.asarray(vec_summary)
    vec_summary = vec_summary.astype(np.float32)

    vec_summaries.append(vec_summary)
```

...is this code just taking the first line so as to store the target output?

JRC1995 commented 6 years ago

summaries contains the list of all the summaries.

`for summary in summaries:`

This will allow you to iterate through each and every summary (not just the first) that is present in summaries. It's similar to using a for loop for going through the elements of an array.

'summary' as written in the code is basically the variable name used to represent the element of summaries at a particular iteration step. So now in the next step, using 'summary' I can refer to the elements within summaries (the elements are the summaries/target outputs).

`vec_summary = []`

For each summary in summaries, an empty vec_summary list is initialized.

`for word in summary: vec_summary.append(word2vec(word))`

Each summary can again be considered as a list of words.

`for word in summary:` iterates through each and every word of a particular summary. Each word is converted to its vector form, which is appended to vec_summary. At the end of this iteration, vec_summary will contain the vectorized form of a summary.

`vec_summary.append(word2vec('eos'))`

'eos' denotes end of sentence. Its vectorized form is appended at the end of each summary.

`vec_summary = np.asarray(vec_summary)`, `vec_summary = vec_summary.astype(np.float32)`, `vec_summaries.append(vec_summary)`

The newly formed vectorized summary vec_summary is converted to float32 ndarray format and added to the list of vectorized summaries called vec_summaries.

Thanks to `for summary in summaries:`, this whole process happens for each summary, so that in the end we have a list of all the summaries in their vectorized form within vec_summaries.

hack1234567 commented 6 years ago

Great, thank you. Understood the preprocessing properly. But during the summarization, how is the data fed to the encoder? Word by word? But the window is created for p, p-d, p+d, so that has to take into account more than one word. Say "Karnataka is one of India's states and has huge population": in this, if the window size is 3, then the 3 words "Karnataka is one" are fed into the encoder. So how is it fed... is it like - first "Karnataka", then it waits for "is", then it waits for "one", and then the attention tells which is the context? But how is it summarised... just taking the probability distribution of the words, i.e. 10 words with each other, won't amount to much. So how is this input word manipulated?

JRC1995 commented 6 years ago

The window size is related to local attention (a type of attention mechanism). It is relevant when the attention mechanism operates (which happens in the decoder part). The window size doesn't have much to do with what is fed to the encoder. The only relevant factor is that if the input text has fewer words than the window size, there will be some problems in the local attention part.

Here's more details about local attention: https://arxiv.org/abs/1508.04025

A whole text is fed into the encoder, but the encoder processes the text word by word. So in that sense, you can say that ultimately the encoder is fed word by word. The important thing is that the encoder is processing the input text word by word; more precisely, word vector by word vector.

When the text is "Karnataka is one of India's states and has huge population", the encoder will start processing from "Karnataka" and end at "population". Since I used a bi-directional encoder, there is another encoder that starts from "population" and ends processing at "Karnataka". The final result is the combined result of the two encoders processing from opposite directions.

If you peek inside the encoder functions, there is a tensorflow while loop. You can use that to loop through each word in a given text for encoding.
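Conceptually, the word-by-word, two-direction processing looks something like this plain-NumPy sketch (not the repo's tf.while_loop code; a bare tanh RNN step stands in for the actual LSTM):

```python
import numpy as np

hidden_size, word_vec_dim, T = 64, 50, 10
Wf = np.random.randn(hidden_size, word_vec_dim + hidden_size) * 0.01   # forward cell weights
Wb = np.random.randn(hidden_size, word_vec_dim + hidden_size) * 0.01   # backward cell weights
text = np.random.randn(T, word_vec_dim)     # one vectorized text, one row per word

def rnn_step(W, h, x):
    # A bare tanh RNN step standing in for the LSTM used in the repo.
    return np.tanh(W @ np.concatenate([x, h]))

h_f, h_b = np.zeros(hidden_size), np.zeros(hidden_size)
forward, backward = [], []

for t in range(T):                  # left-to-right pass ("Karnataka" ... "population")
    h_f = rnn_step(Wf, h_f, text[t])
    forward.append(h_f)

for t in reversed(range(T)):        # right-to-left pass ("population" ... "Karnataka")
    h_b = rnn_step(Wb, h_b, text[t])
    backward.append(h_b)
backward.reverse()

# One context-aware vector per word, combining both directions.
encoded = [np.concatenate([f, b]) for f, b in zip(forward, backward)]
```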

The text can be fed during training using Tensorflow placeholders.
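For example, a minimal TF1-style sketch of that feeding step (the placeholder names are hypothetical, not the repo's actual variables):

```python
import tensorflow as tf

word_vec_dim = 50   # assumed GloVe dimension

# One vectorized text and its vectorized summary are fed per training step.
tf_text = tf.placeholder(tf.float32, [None, word_vec_dim], name='input_text')
tf_summary = tf.placeholder(tf.float32, [None, word_vec_dim], name='target_summary')

# ... build the encoder/decoder graph on top of tf_text / tf_summary ...

# with tf.Session() as sess:
#     sess.run(tf.global_variables_initializer())
#     sess.run(train_op, feed_dict={tf_text: vec_texts[i], tf_summary: vec_summaries[i]})
```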

hack1234567 commented 6 years ago

So when this bi-directional rnn operates, the local attention mechanism is used to find context, right? But how does this context help during the decoding phase (I presume the encoders don't utilize RRA)? What mechanism did you use to find the relationship between this context vector and the previous words? Also, did you use pos tagging? I was not able to find it when I went through the code. Again, thanks a lot for explaining.

JRC1995 commented 6 years ago

Which attention mechanism you use doesn't have much to do with whether you are using a bi-directional rnn or not. You can use local attention while using a single rnn for encoding, or you may just use global attention instead.

In my code the encoders did utilize RRA, but I wasn't talking about RRA. RRA adds a weighted sum of the previous k hidden states to the RNN equation. In name it's a form of attention, but it's not the typical kind of attention mechanism. I did implement RRA following a paper early on. But the paper isn't that popular, and not much cited either; the quality of the paper is suspect. In my personal opinion, LSTMN sounds conceptually richer than RRA. I am not sure. I may suggest ignoring RRA.
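For reference, the core RRA idea as described above - adding a weighted sum of the previous k hidden states to the recurrent update - could be sketched like this (a rough NumPy illustration under my own assumptions, not the paper's exact equations):

```python
import numpy as np

hidden_size, input_dim, K = 64, 50, 5
W = np.random.randn(hidden_size, input_dim + hidden_size) * 0.01
alpha = np.random.rand(K)            # weights over the previous K hidden states (learned in practice)
alpha /= alpha.sum()

def rra_step(history, x):
    # Standard recurrent update plus a weighted sum ("residual") of the last K hidden states.
    h_prev = history[-1]
    residual = sum(a * h for a, h in zip(alpha, history[-K:]))
    return np.tanh(W @ np.concatenate([x, h_prev]) + residual)

history = [np.zeros(hidden_size)]
for x in np.random.randn(10, input_dim):
    history.append(rra_step(history, x))
```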

> Also, did you use pos tagging? I was not able to find it when I went through the code.

I didn't use pos tagging in this code.

I have used it in some other codes, for example: https://github.com/JRC1995/TextRank-Keyword-Extraction https://github.com/JRC1995/auto-tldr-TextRank

> So when this bi-directional rnn operates, the local attention mechanism is used to find context, right? But how does this context help during the decoding phase (I presume the encoders don't utilize RRA)? What mechanism did you use to find the relationship between this context vector and the previous words?

As you may know, an RNN uses the context of the previous words in encoding the current word. The initial vector representation of a word fed to the encoder doesn't contain the context of the text to which it belongs. The encoder is supposed to help encode some of that context information into each word.

In some ways, the final word should then contain the essential context information from all the previous words, plus the word vector of the final word itself. So the final encoded word (or the final hidden state of the rnn) can be used to represent the whole input text. This context-encoded vector can be fed to the decoder, which brings out the desired output. This is the encoder-decoder model, and this was how it used to be done.

As you may imagine, this method is very limited. Asking one vector (the final hidden state) to represent the whole text is asking a bit too much. It may still work for representing small texts, but longer texts would have no hope.

Attention mechanism solves this problem. Attention mechanism allows the use of all the encoded versions of input word vectors, not just the final one. You can say that attention mechanism connects the encoder and decoder.

I will try to explain it with an example. Let's assume we have a very well trained network that uses the attention mechanism.

Now, say this is the input text:

"this is a very healthy dog food. good for their digestion. also good for small puppies. my dog eats her required amount at every feeding."

The program will first encode the line taking context into account. For example, while encoding 'healthy', the context 'this is a very' will, to some extent, be accounted for.

Now during decoding, the program may start with a starter token like <SOS> or GO, which serves as the initial context before anything is predicted. But to make a prediction (in this case, a summary), the program needs to refer to the encoded version of the text ("this is a very healthy dog food. good for their digestion. also good for small puppies. my dog eats her required amount at every feeding.").

Now, all words of the source sentence aren't equally important. For example, prepositions and such aren't that important to note while summarizing - they don't always say as much about what you should summarize. Also, while summarizing a long post you may attend mostly to the first couple of lines of the text which you are summarizing. The attention mechanism is based on a similar principle.

In addition to the source text (that which you are summarizing), you also have to refer to what you have summarized so far. You have to check that you are following grammatical rules and such. For that, you have to check whether the current word fits with the rest of the summary.

Now, during decoding the text to produce the summary, at first there is no context from the rest of the summary (because no word of the summary has been predicted yet). All the program will have is some vector representing an initial context, <SOS> (start of sentence). This can still be a useful context, because given the context of <SOS>, the program will assign more likelihood to words that are used at the beginning of the example summaries which were used for training.

Anyway, in addition to <SOS>, the program will refer to the encoded version of the text. The attention mechanism, based on a formula which takes into account the current context of the summary, will assign 'weights' to every word in the encoded text. For example, the network may give higher weights to the encoded word vectors of words like 'very', 'healthy', 'dog', 'food', 'puppies', 'good' etc., and low weights to 'also', 'this', 'is', etc.

The weighted encoded word vectors are then added to create the 'context vector' which contains the context of the text.

Local attention, instead of assigning weights to EVERY word of the encoded text, only assigns attention weights to the words within a window. The window position is chosen based on a formula which depends on the context of the state of the summary so far.

Both the context vector (created as a result of the attention mechanism) and the context of the previous words of the summary are then used for predicting the next word of the summary.

For example, the word 'very' may get a higher weight while creating the context vector, i.e. 'very' may be 'attended' to more, so the context vector may end up closer to the vector for 'very'. From the context of <SOS> and the context vector of the encoded text, the program may then predict 'very' to be a likely first word for the summary.

Similarly, next, from the context of 'very' the program may attend more highly (assign a higher attention weight) to words like 'healthy' and such in the encoded text. In addition, the program may find that 'healthy' is a good candidate to follow 'very'... and so it may predict 'healthy' as a very likely next word. And so on. Finally, it may predict "Very healthy dog food. Helps digestion" or something like that.
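To summarize the mechanics in code, here is a rough NumPy sketch of local attention at one decoding step (simplified scoring and window placement under my own assumptions, not the repo's exact equations):

```python
import numpy as np

hidden_size, T, D = 64, 20, 5                  # D = half of the window size
encoded = np.random.randn(T, hidden_size)      # encoder outputs, one per source word
s = np.random.randn(hidden_size)               # current decoder state (summary context so far)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 1. Pick the window centre p from the decoder state (predictive alignment, simplified).
v_p = np.random.randn(hidden_size)
W_p = np.random.randn(hidden_size, hidden_size) * 0.01
p = int((T - 1) * sigmoid(v_p @ np.tanh(W_p @ s)))

# 2. Score only the encoder states inside the window [p - D, p + D].
lo, hi = max(0, p - D), min(T, p + D + 1)
scores = encoded[lo:hi] @ s                    # dot-product scoring (one of several options)
weights = softmax(scores)                      # attention weights within the window

# 3. Weighted sum of the windowed encoder states = the context vector.
context = weights @ encoded[lo:hi]

# 4. Context vector + decoder state (summary so far) together drive the next-word prediction.
combined = np.tanh(np.concatenate([context, s]))
```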

hack1234567 commented 6 years ago

Now I get it, thank you. Also I have noticed you are dumping all the distinct vocab into vocab_limit, why is that? We are not using it for training, as the train set is vec_texts and vec_summaries. vocab_limit contains "eos"; you mentioned it's for the probability distribution, but isn't the probability calculated from the first decoder hidden state together with the other encoder hidden states? How does vocab_limit get used? Thank you

JRC1995 commented 6 years ago

vocab_limit contains all the distinct words from the texts and summaries.

It is mainly used for predicting the probability distribution.

The decoder first predicts a probability distribution over all the words in vocab_limit (each word is assigned a probability), and in this code the word with the highest predicted probability is chosen at each decoding step.

If the predicted summary is "Healthy dog food", the decoder will first produce a probability distribution over all words in vocab_limit with 'healthy' having the highest probability. In the next timestep, the decoder will assign the highest probability to 'dog', and so on. I followed a greedy policy here of choosing the most probable word at each step. But note, that's not the best method; using beam search would be better, but it's a bit more complicated.

The softmax layer results in the probability distribution.
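A small sketch of that greedy step (illustrative only; the logits here stand in for whatever the decoder's output layer produces over vocab_limit):

```python
import numpy as np

vocab_limit = ['healthy', 'dog', 'food', 'eos', 'unk']   # toy vocabulary

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.random.randn(len(vocab_limit))            # decoder output scores for one timestep
probs = softmax(logits)                               # probability distribution over vocab_limit
predicted_word = vocab_limit[int(np.argmax(probs))]   # greedy: take the most probable word
print(predicted_word)
```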

"eos" doesn't have much to do with probability distribution. "eos" marks the end of statement. If the decoder output "eos" then one will know this is where decoding is to be stopped. Otherwise, there will no know when to stop decoding\predicting words for he summary.

In the target examples, "eos" is necessary for the program to learn when to predict it.

EOS and PADDING have further uses when used with mini-batching.

This particular code doesn't make any actual use of EOS. It seems that this code calculates the target output length, and uses that info to determine the size of the predicted output. It's not ideal: the program then can't even produce a result without a target output. I removed eos because I was having problems - the machine was learning to spam eos, since all target examples contained eos - the machine thought that eos was an important word or something. It was for testing purposes. This code, instead of deciding for itself where the ending (eos) should be, relies on the user's input for the desired output size.

I have used eos, padding, mini-batching in my later codes such as: https://github.com/JRC1995/Machine-Translation-Transformers