JRC1995 / Abstractive-Summarization

Implementation of abstractive summarization using LSTM in the encoder-decoder architecture with local attention.
MIT License

vec_summaries_reduced and vec_texts_reduced returning blank list [] #5

Closed hack1234567 closed 6 years ago

hack1234567 commented 6 years ago

When I print the imported summaries from the pickled file, everything is alright: they are fully printed in their encoded form. vec_texts also prints fine. But vec_summaries_reduced and vec_texts_reduced return a blank list, so train_len is printed as 0. Also, "Percentage of the dataset with text length less than window size: 99.848" — isn't that bad, since 99% is less than the window size and so gets reduced? When D=1 (window size 3) it is 44%, yet still no data is output from vec_summaries_reduced... please tell me what to do. P.S. — I don't know if it has any significance, but I had not properly implemented the clean function from your code (it gave some errors), so I only lowercased everything, nothing else.

Update: when I set maxlen_summary as 80 instead of 7, train_len = 14400. But is setting it as 80 wrong?

JRC1995 commented 6 years ago

Which dataset are you using? Are you using my code?

But vec_summaries_reduced and vec_texts_reduced return a blank list.

Show me the code

isn't that bad, since 99% is less than the window size and so gets reduced?

Yes, that's bad.

It may be possible that you have accidentally switched the summaries and the text contents. Most summaries are probably shorter than the window size unless D = 1 or 2. But the window size should be relevant only for vec_texts. So you may check whether the content of your vec_texts is actually the summaries. But I don't know; that's just one possibility.

Update: when I set maxlen_summary as 80 instead of 7, train_len = 14400. But is setting it as 80 wrong?

If I remember correctly, I used maxlen_summary as a limit on summary length: only if summary length < maxlen_summary did I take the summary into consideration, and I ignored the summary otherwise.

So again, if your vec_summaries by mistake actually contains what should be vec_texts, then almost no data will satisfy that condition, since most (probably all) texts are longer than 7 words. That may be why your vec_summaries_reduced was blank: no data satisfied the condition.
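For reference, the reduction step amounts to something like this minimal sketch (the toy data and lengths here are assumptions for illustration, not the repository's exact code):

    # Hedged sketch: keep a pair only if the summary is short enough and the text is at
    # least as long as the attention window. Swapping texts and summaries makes the
    # summary condition fail for nearly every pair, leaving both reduced lists empty.
    D = 10
    maxlen_summary = 7
    window_size = 2 * D + 1

    vec_texts = [list(range(40)), list(range(25)), list(range(60))]    # toy "texts"
    vec_summaries = [list(range(5)), list(range(9)), list(range(6))]   # toy "summaries"

    vec_texts_reduced = []
    vec_summaries_reduced = []
    for text, summary in zip(vec_texts, vec_summaries):
        if len(summary) < maxlen_summary and len(text) >= window_size:
            vec_texts_reduced.append(text)
            vec_summaries_reduced.append(summary)

    print(len(vec_summaries_reduced))   # 2 pairs survive with these toy lengths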

hack1234567 commented 6 years ago

Yes, I messed up text and summary. Thank you. Also, what classifier did you use, if any?

JRC1995 commented 6 years ago

I am not sure what you mean by 'classifier' here. In a sense, the whole network is a classifier, and you can think of all the words in vocab_limit as the classes. The immediate decoder output is a probability distribution across all the classes (words). Then, using greedy search, the most highly scored class/word is selected at each timestep of the output.

hack1234567 commented 6 years ago

I meant 'classifier' after reading about handwriting recognition. In that case we have 10 classes, 0...9. But I presume from what you said earlier that all the words might be the classes. vocab_limit contains all the words in GloVe which are in texts (a list). So what happens each time in the decoder? Is it that we are taking a probability distribution of the current word with respect to all words in vocab_limit? But we had already used concatenation in the encoding phase to get context between the words in the window? Sorry, still not able to understand what information from the encoder is used in the decoder?

JRC1995 commented 6 years ago

Usually the network/model itself is the 'classifier' if it's used in a classification problem. For example, if you use a neural network for handwriting recognition, then in that context the neural network is the classifier.

In this case, I am not sure that 'summarization' even is a classification problem - so calling the model used here as a 'classifier' would be inappropriate.

A typical classification task consists of determining the 'type' or 'class' of the input. For example, 0, 1, 2, ..., 9, A, ..., Z can be thought of as classes, and handwriting recognition can then be thought of as choosing which class a particular handwritten alphanumeric character belongs to.

Sentiment analysis is, for example, a text classification problem. A sentiment analysis program may classify words in a text as 'angry', 'happy', 'sad', 'neutral' or something like that.

Usually classification is about predicting something about the input.

In summarization, we aren't exactly predicting characteristics of the input; we are simply creating a summarized version of it.

So, it's probably not a classification problem in the first place. However, you can draw some parallels to the multi-class classification process of neural networks if you consider each word in the vocabulary as a class (as detailed in my previous post). But overall the words in the vocabulary don't really have the properties of a 'class', and the whole thing (the summarization) doesn't really seem to be a classification problem. So the right answer is probably that there is no classifier here.

JRC1995 commented 6 years ago

So what happens each time in the decoder? Is it that we are taking a probability distribution of the current word with respect to all words in vocab_limit?

A probability distribution over all words in vocab_limit means: each word in vocab_limit is assigned a score (the score denotes the probability of that word being the next word) in such a way that the scores of all the words sum to 1.

So if there are three words in vocab_limit - Hungry, food and dog - a probability distribution may look like this:

Hungry: 0.8
food: 0.1
dog: 0.1

In that case, Hungry will be considered the most probable and chosen as the predicted next word.
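As a tiny, self-contained sketch of that greedy choice (the scores here are made up):

    import numpy as np

    # Softmax turns raw decoder scores into a probability distribution over a
    # three-word vocab_limit; argmax then picks the most probable word.
    vocab_limit = ["Hungry", "food", "dog"]
    logits = np.array([2.0, -0.1, -0.1])            # made-up decoder scores
    probs = np.exp(logits) / np.exp(logits).sum()   # roughly [0.8, 0.1, 0.1]
    print(vocab_limit[int(np.argmax(probs))])       # -> Hungry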

But we had already used concatenation in the encoding phase to get context between the words in the window?

The window should have no relevance in the encoding phase. I am not sure which part you are talking about regarding concatenation. If it's this:

encoded_hidden = tf.concat([hidden_forward,hidden_backward],1)

Then it is simply a method of combining the outputs of the forward and backward encoders.

I used bi-directional encoder.

The forward and backward encoder are part of the bi-directional encoder.

Forward encoder encodes each word in context of the previous words. It starts from the first word and ends at the last word.

Backward encoder encodes each word in context of the later words. It starts from the last word and ends at the first word.

The final result of the bi-directional encoder is the combination of the output of forward and backward encoder.

Here concatenation is used for combining the results.

The result of the concatenation is the final encoder output.
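In terms of shapes, the combination looks roughly like this (sizes assumed for illustration, written with numpy rather than Tensorflow):

    import numpy as np

    # Each encoder produces one hidden state per input word; concatenating along the
    # feature axis gives the final encoder output that the decoder attends over.
    seq_len, hidden_size = 30, 500
    hidden_forward = np.zeros((seq_len, hidden_size))    # forward encoder states
    hidden_backward = np.zeros((seq_len, hidden_size))   # backward encoder states
    encoded_hidden = np.concatenate([hidden_forward, hidden_backward], axis=1)
    print(encoded_hidden.shape)                          # (30, 1000)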

Sorry, still not able to understand what information from the encoder is used in the decoder?

The final encoder output.

LOCAL ATTENTION

    G,pt = align(encoded_hidden,decoded_hidden,Wp,Vp,Wa,tf_seq_len)
    local_encoded_hidden = encoded_hidden[pt-D:pt+D+1]
    weighted_encoded_hidden = tf.multiply(local_encoded_hidden,G)
    context_vector = tf.reduce_sum(weighted_encoded_hidden,0)
    context_vector = tf.reshape(context_vector,[1,2*hidden_size])

    attended_hidden = tf.tanh(tf.matmul(tf.concat([context_vector,decoded_hidden],1),Wc)) 

encoded_hidden is the final encoder output. The window starts from pt-D and ends at pt+D. The content of the window can be accessed with encoded_hidden[pt-D:pt+D+1]. The window is simply some consecutive words from the encoder output. local_encoded_hidden contains the window content. The attended_hidden is calculated indirectly from local_encoded_hidden.

Note: decoded_hidden provides the context of the previously decoded words. So the context of the previously decoded words (decoded_hidden) and the final encoder output (encoded_hidden) are used to calculate G and pt. G and pt are in turn used to predict the next word after going through some equations.

The next decoder output word is predicted using:

y = tf.matmul(attended_hidden,Ws)

After that, the decoder updates the decoder context using an RNN (more specifically, an LSTM). When a new word is predicted, the context of the previously decoded words has to be updated for predicting the next word.

decoded_hidden_next,cell_d = decoder(y,decoded_hidden,cell_d, wf_d,uf_d,bf_d, wi_d,ui_d,bf_d, wo_d,uo_d,bf_d, wc_d,uc_d,bc_d, RRA)

hack1234567 commented 6 years ago

Thanks a lot. Are you creating test data - say, 20% of the text data which has not been trained on - for testing?

JRC1995 commented 6 years ago

Yes, I created test data.

    train_len = int((.7)*len(vec_summaries_reduced))

    train_texts = vec_texts_reduced[0:train_len]
    train_summaries = vec_summaries_reduced[0:train_len]

    val_len = int((.15)*len(vec_summaries_reduced))

    val_texts = vec_texts_reduced[train_len:train_len+val_len]
    val_summaries = vec_summaries_reduced[train_len:train_len+val_len]

    test_texts = vec_texts_reduced[train_len+val_len:len(vec_summaries_reduced)]
    test_summaries = vec_summaries_reduced[train_len+val_len:len(vec_summaries_reduced)]

70% (.7) is training data. 15% (.15) is validation data. Rest (15%) is test data.

hack1234567 commented 6 years ago

Is the value of D found randomly? D=10... won't it be too long? Also, based on the loss value, are you adjusting the weights to reduce the loss next time?

JRC1995 commented 6 years ago

D is a hyperparameter. D determines the size of the window. It's not completely random. For example, if you randomly put D = 100, window size (2D+1) will be 201, yet most texts aren't that long. So there will be almost no training data left which is greater than the window size. So you can't just randomly put anything. You have to use your discernment. But, there isn't some specific algorithm or rule of thumb to determine the value of D either.

The research paper which introduced local attention (https://arxiv.org/pdf/1508.04025.pdf) used D=10 in some experiments. So, I used D=10 too.

Also, based on the loss value, are you adjusting the weights to reduce the loss next time?

Yes. But it is mostly done using Tensorflow functions. So you can't actually see what's happening inside.

These are the codes that specify how the cost is calculated and which optimizer is used during backpropagation.

OPTIMIZER

    cost = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=output, labels=tf_summary))
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

During training, this code performs backpropagation to update the weights, calculates the loss, and makes predictions.

  # Run optimization operation (backpropagation)
        _,loss,pred = sess.run([optimizer,cost,prediction],feed_dict={tf_text: train_texts[i], 
                                                tf_seq_len: len(train_texts[i]), 
                                                tf_summary: train_out,
                                                tf_output_len: len(train_out)})

"sess.run([optimizer,....]" this part is mainly responsible for backpropagation. The program already knows HOW to adjust the loss, as its defined previously to use Adam Ompitizer. Other processes like differentiation and everything is automated in tensorflow. So all you need to do is sess.run(optimizer) to run backpropagation.

hack1234567 commented 6 years ago

So if D=10, then if we are not filtering non-alphabetic characters, they might also end up in the window. A normal text length is 30 words, so how many layers for that? How would the decoder consider D=10, i.e. 20 words at a time? We are not inputting it as a bag of words, right, but as a sequence, one word at a time?

JRC1995 commented 6 years ago

So if D=10, then if we are not filtering non-alphabetic characters, they might also end up in the window.

Yes.

We are not inputting it as a bag of words, right, but as a sequence, one word at a time?

Not exactly. The initial input is a bag of words. The encoder PROCESSES one word at a time in a particular sequence.

Decoding starts when encoding is finished.

The attention mechanism can process MULTIPLE WORDS in the ENCODED TEXT at the SAME TIME.

In global attention, for example, ALL WORDS in the encoded text are processed at the same time. In local attention, only the words in the window are considered at a specific time step.

How does it happen?

First of all, using some formulas and linear algebra, all the words in the window are 'rated' or 'scored'.

The scores are like a probability distribution, in the sense that they sum to 1.

For example, if the encoded text represents the sentence: "I am going home"

The program may represent the encoded representations of the words as:

I: 0.01
am: 0.02
going: 0.30
home: 0.67

Now a 'context vector' is created by adding the 'weighted' version of the encoded words.

That is,

context vector = 0.01(I) + 0.02(am) + 0.30(going) + 0.67(home)

As you know, the encoded representations of the words I, am, going, home are word vectors, i.e. they contain numbers and mathematical operations can be done on them.

The ratings are considered to signify how much 'attention' is given to a word. For example, here 'home' is rated highest, so the result of the summation will be closer to the encoded word vector of 'home' than to any other word. That is, the more highly attended word has more influence on the result, as it should.

Henceforth, the context vector (which is a single vector) is used for further calculations. So this is how multiple words (encoded word vectors) are used in each timestep of decoding.

Decoder predicts one word at a time, but at each time step it considers the context of multiple encoded words.

You can use your intuition to understand why so. For example if you as a human are trying to summarize something, to determine what your next word in the summary shall be, you usually will need to check a 'portion' of the original text rather than just one or two words in it.

Attention mechanism is created based on similar intuitions.
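To make the weighted sum concrete, here is a toy numpy sketch of the "I am going home" example (the 2-dimensional encodings are made up):

    import numpy as np

    # The attention scores weight each (made-up) encoded word vector; their weighted
    # sum is the context vector, which ends up closest to the highly attended "home".
    encoded_words = np.array([[1.0, 0.0],    # "I"
                              [0.0, 1.0],    # "am"
                              [2.0, 2.0],    # "going"
                              [4.0, 1.0]])   # "home"
    scores = np.array([0.01, 0.02, 0.30, 0.67])       # attention weights, sum to 1
    context_vector = (scores[:, None] * encoded_words).sum(axis=0)
    print(context_vector)                             # [3.29 1.29]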

hack1234567 commented 6 years ago

Again a bit confused: isn't the attention process, with all the window manipulation, taking place during the decoding stage... where the tf.concat of the forward and backward encoders is considered during this time, and the window is defined and the context is taken from that? The encoding phase doesn't do much other than going through from SOS to EOS forward and backward and embedding them, right?

JRC1995 commented 6 years ago

isn't the attention process, with all the window manipulation, taking place during the decoding stage...

More or less.

where the tf.concat of the forward and backward encoders is considered during this time, and the window is defined and the context is taken from that?

Yes. The window size is predefined (by the value of D). The window position is determined at every time step from a formula depending on the decoder context (what has been predicted so far) and the encoded context (the concatenated output of the forward and backward encoders). After that, the attention mechanism is applied over the contents of the window (a portion of the encoded text) to create the context vector (a single vector containing the overall attended encoder context from the window). The context vector is then used with the decoder context to make a prediction for the next word. The predicted word is then used to update the decoder context for predicting the next word in the next timestep.

the encoding phase doesn't do much other than going through from SOS to EOS forward and backward and embedding them, right?

More or less.

Note, I only used the SOS representation as the starting word for the decoder output. I don't think I used SOS symbol anywhere else.

hack1234567 commented 6 years ago

I see that you have added EOS to vec_summaries but not to vec_texts... how would you know the end of the text has been reached during the forward and backward passes?

JRC1995 commented 6 years ago

There can be multiple ways. You can use len() to find the length of the text. Let's say that the length of text is put into a variable seq_len. The vec_text can be conceived as an array of word vectors. Then you can simply run a loop from i = 0 to i = seq_len-1.

Pseudocode: for (i = 0; i < seq_len; i++) { encode(vec_text[i]) }

EOS isn't really used to know the end of the summary here. Its main purpose is to teach the program to predict the end of the predicted summary. From the EOS in the example summary data, the model learns where the end of the sentence is likely to be while predicting.

hack1234567 commented 6 years ago

def forward_encoder(inp,hidden,cell, wf,uf,bf, wi,ui,bi, wo,uo,bo, wc,uc,bc, Wattention,seq_len,inp_dim):

The input is a Tensorflow placeholder... how are you feeding the placeholder with the data? Also, what do the encoder parameters wf, uf, bf etc. actually mean? You are creating the Tensorflow primitive tf.Variable, but what do these parameters actually mean? Are they randomly initialized to fit the input sequence? Also, wouldn't it be possible to create the encoder by calling the LSTMCell function in Tensorflow? I'm a bit confused about where you are creating the LSTM cell. Also, where should I refer to learn more about what you are doing in the forward and backward encoder functions? Thank you.

JRC1995 commented 6 years ago

The placeholders are fed during training, validation, and testing.

       # Run optimization operation (backpropagation)
        _,loss,pred = sess.run([optimizer,cost,prediction],feed_dict={tf_text: train_texts[i], 
                                                tf_seq_len: len(train_texts[i]), 
                                                tf_summary: train_out,
                                                tf_output_len: len(train_out)})

wf, uf, bf, etc. are weights and biases for the network. tf.Variables are usually treated as the actual trainable parameters (the weights and biases); their values are updated during backpropagation at each training step.

You may like to take a peek at the introduction to Tensorflow to understand how it works. Some of the things in Tensorflow are fundamentally different from other languages like C, Java, or even base Python itself.

https://github.com/aymericdamien/TensorFlow-Examples

Also, where should I refer to learn more about what you are doing in the forward and backward encoder functions?

I used an LSTM with RRA for the encoders. You can ignore RRA. I implemented my own LSTM without using libraries. To understand LSTM, you can google it. I can help if you can't understand.

Before LSTM, I will suggest learning about a vanilla RNN, if you don't know already.

You can check this site for RNN: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

You can check these for LSTM: http://colah.github.io/posts/2015-08-Understanding-LSTMs/ https://arxiv.org/pdf/1412.3555.pdf

^check the paper for LSTM and GRU (a slightly simpler version of LSTM in a way). You can find similarities between my code and the mathematical equations as published in it.

hack1234567 commented 6 years ago

Thank you. During training, in each iteration the loss has to be reduced, right? Here during training the loss is 10, then 3, then 1, then back to 12... so does iterating through everything help much?

JRC1995 commented 6 years ago

Training loss can fluctuate, especially in the early stages. It fluctuates even more here because of no batching. It's not unnatural.

The program may optimize the weights by learning from data #1. But in the next training step, the optimization learned from data #1 may result in an increased loss on data #2. That's why we train the model again and again through the whole dataset, so that ideally it finally learns some parameter values that are good for most of the dataset.

If you do batching, each training step will calculate the average loss of all the data in the batch. The average loss probably won't fluctuate as much.
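A rough, made-up illustration of why batch averaging smooths the loss:

    import numpy as np

    # Individual per-example losses jump around a lot, but averaging them over a
    # batch gives a much smoother training signal.
    rng = np.random.default_rng(0)
    per_example_losses = rng.uniform(1.0, 12.0, size=64)            # made-up losses
    batch_losses = per_example_losses.reshape(-1, 8).mean(axis=1)   # batch size 8
    print(per_example_losses.std(), batch_losses.std())             # batched std is smaller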

If the training loss keeps on erratically fluctuating even after lot of training steps, something may be wrong with the model. Adjustments to hyperparameters and such may be advisable in that case.

Yes, it needs to be iterated again and again until at least some level of convergence is reached (when the loss changes only negligibly, if at all, for some training steps). Also, ideally we need to use validation periodically to check for overfitting and such. (I didn't use validation here, but it should be used.)

hack1234567 commented 6 years ago

Wattention_f = tf.Variable(tf.zeros([K,1]),dtype=tf.float32)

then

Wattention = tf.nn.softmax(Wattention,0)

Softmax is converting it into a probability distribution, right? But Wattention contains tf.zeros([5,1]) - what does the softmax of that signify? How many hidden layers are there? 500? On what basis is that selected? Does each state denote a word?

Also:

    hidden_forward = tf.TensorArray(size=seq_len,dtype=tf.float32)

    hidden_residuals = tf.TensorArray(size=K,dynamic_size=True,dtype=tf.float32,clear_after_read=False)
    hidden_residuals = hidden_residuals.unstack(tf.zeros([K,hidden_size],dtype=tf.float32))

hidden_residuals contains the previous 5 hidden states, but how does it know which states to look at?

JRC1995 commented 6 years ago

You are right that softmax is converting it into a probability distribution. Wattention_f may be all zeros initially (i.e. softmax won't make a difference), but it is a trainable parameter, so its values will change in later training steps, and then softmax will be necessary. Also, the Wattention part in the encoders is part of the RRA mechanism, which you may ignore.

hidden_forward is like a list which will contain all the hidden states of the forward encoder. The hidden states are the encodings, so you can say that hidden_forward will store all the encoded words produced by the forward encoder. hidden_backward is the same but for the backward encoder. They are dynamic lists: first they are empty, then they are filled with hidden states as they are generated by the encoder. seq_len (the number of words in the sequence fed into the encoder) is the max size, since the number of encoded words will be the same as seq_len.

hidden_size = 500. hidden_size denotes the dimension of each hidden state used throughout the encoders. The decoder uses twice the encoder hidden size, so the decoder hidden size will be 1000. Another way to think of hidden_size is that it denotes the number of neurons in the hidden layer. Since the encoder hidden states are treated as the encoded versions of the input words, hidden_size denotes the dimension of the encoded word vector output from each encoder. The final output of the encoder, after concatenating the forward encoder output and backward encoder output, will of course be of size 500+500 = 1000. I initialized the decoder hidden state with the first encoder hidden state (of size 1000), which is why the decoder hidden size is also kept at 1000.

There is no strict science behind choosing 500 for the hidden size. These are hyperparameters that are initialized somewhat based on intuition, followed by some trial and error. At best, if you have enough processing power, you can use some code to randomly generate a lot of hyperparameter sets, test the first couple of results, and then choose the set of hyperparameters that gives the best result. You may use 100, 200, 300 or so for hidden_size. A greater size means there are more dimensions to store encoded information, which may give a better result; but it can also mean slower processing, or storing useless information, which may give a worse result. So it's tricky. The best we can do as beginners is refer to how the experts initialize the hyperparameters for similar programs. Research papers about different models often come with details about the hyperparameter values that were used.

hidden_residuals is again a part of RRA, which you may ignore. The logic behind the code for hidden_residuals and RRA is a bit complicated. RRA itself is not complicated, but implementing it in Tensorflow was a bit complicated for me, due to some limitations of Tensorflow. So explaining it will take a bit of time. I can try to explain it later if you want to use RRA, but for now I can give a brief description.

hidden_residuals is used roughly like the hidden_forward or hidden_backward TensorArrays: it is a list that is updated with the hidden states / encoded words as they are generated. In the end it will contain roughly all the encoded words.

RRA will be using last five hidden states. The last five hidden states (placed in the last 5 filled indices) of hidden_residual will be used for that.

K denotes how many previous hidden states are used for RRA. Here K=5 and j is initialized as 5. So at the first step, by iterating over indices j-K to j-1 (0 to 4) of hidden_residuals, you get the first 5 hidden states for RRA. At each encoding step, j is increased by one while a new hidden state is added to hidden_residuals. So at the next step j will be 6 and K will still be 5, so iterating over indices j-K to j-1 of hidden_residuals means iterating from index 1 to index 5, where index 5 contains the newly added hidden state. In the next step j will be 7 and K will be 5, so j-K to j-1 (i.e. hidden_residuals[j-K:j]) runs from 2 to 6... and so on. In this way, only the last K (here 5) hidden states - located in the last K filled indices of hidden_residuals - are used for RRA.
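The index bookkeeping can be seen with a tiny sketch (K and the number of steps assumed):

    # j starts at K and grows by one per encoding step, so hidden_residuals[j-K:j]
    # always covers the K most recently written slots.
    K = 5
    j = K
    for step in range(3):
        print("step", step, "-> RRA reads indices", list(range(j - K, j)))
        j += 1   # a new hidden state is written at index j, then j is incremented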

hack1234567 commented 6 years ago

What should I do with the trained model - is testing straightforward? How do I use it for testing?

JRC1995 commented 6 years ago

Use a test data (which is not used for validation and training) for testing. Feed it to the network like you fed the training data. Calculate avg. cost and accuracy for the testing data. It is recommended to use BLEU, ROUGE or some metric like that for better evaluation of the quality of summarization. Then you can use the model for prediction by feeding user input to the network and simply printing the predicted result.

hack1234567 commented 6 years ago

I've noticed that you have used tf.train.Saver, but why are you not saving it to a path, so that the learned weights and biases could be restored if needed? Can testing be done without saving the weights - no, right? And in your LSTM code (since you're not using the built-in function), how exactly does it remember the previous (distant) words? Is it using the formula tanh(w*x1 + u*s1), where s1 represents the previous state?

JRC1995 commented 6 years ago

I've noticed that you have used tf.saver but why are you not saving it to a path?

It seems I haven't implemented the save feature. My use of tf.train.Saver seems to be redundant here; that line of code doesn't really do anything substantial.

I guess, I didn't implement saving because I didn't use validation and testing data. I usually make it such that the weights and biases are saved when validation accuracy improves or something like that. In this case, I haven't even implemented validation accuracy.

If you want an example of save feature, I used it here: https://github.com/JRC1995/Wide-Residual-Network/blob/master/Model(WRN)(NEW).ipynb

    saver = tf.train.Saver()
    ...
    saver.save(sess, 'Model_Backup/model.ckpt')

so that the learned weights and biases could be restored if needed?

Yes saving to a path is needed for that.

can testing be done without saving the weights - no, right?

Testing can be done; you can do pretty much everything. But once you terminate the program, all the weights and biases that it learned will be lost and forgotten. The program has to be retrained on the next run.
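If you want to add saving and restoring, a minimal TF1-style sketch would look something like this (the variable and path here are placeholders, not the repository's code):

    import os
    import tensorflow as tf

    w = tf.Variable(tf.zeros([2, 2]), name="w")   # stand-in for the trainable weights
    saver = tf.train.Saver()

    os.makedirs('Model_Backup', exist_ok=True)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # ... training would happen here ...
        saver.save(sess, 'Model_Backup/model.ckpt')   # persist weights to disk

    # In a later run, rebuild the same graph and restore instead of retraining:
    with tf.Session() as sess:
        saver.restore(sess, 'Model_Backup/model.ckpt')
        # ... testing / prediction ...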

hack1234567 commented 6 years ago

def body(i,j,hidden,cell,hidden_backward,hidden_residuals):

    x = tf.reshape(inp[i],[1,inp_dim])

    hidden_residuals_stack = hidden_residuals.stack()

    RRA = tf.reduce_sum(tf.multiply(hidden_residuals_stack[j-K:j],Wattention),0)
    RRA = tf.reshape(RRA,[1,hidden_size])

    # LSTM with RRA
    fg = tf.sigmoid( tf.matmul(x,wf) + tf.matmul(hidden,uf) + bf)
    ig = tf.sigmoid( tf.matmul(x,wi) + tf.matmul(hidden,ui) + bi)
    og = tf.sigmoid( tf.matmul(x,wo) + tf.matmul(hidden,uo) + bo)
    cell = tf.multiply(fg,cell) + tf.multiply(ig,tf.sigmoid( tf.matmul(x,wc) + tf.matmul(hidden,uc) + bc))
    hidden = tf.multiply(og,tf.tanh(cell+RRA))

    hidden_residuals = tf.cond(tf.equal(j,seq_len-1+K),
                               lambda: hidden_residuals,
                               lambda: hidden_residuals.write(j,tf.reshape(hidden,[hidden_size])))

    hidden_backward = hidden_backward.write(i,tf.reshape(hidden,[hidden_size]))

    return i-1,j+1,hidden,cell,hidden_backward,hidden_residuals

_,_,_,_,hidden_backward,hidden_residuals = tf.while_loop(cond,body,[i,j,hidden,cell,hidden_backward,hidden_residuals])

This code is the most troublesome for me. From the paper you gave me, I understood that fg, ig, og are the forget gate, input gate and output gate? When is the body function executed? Is it when the forward_encoder function is called? x contains inp[i], i.e. for 0...end? In the model function, wf, uf etc. are just initialised, but how are their values set?

"cell" in this code follows the cell formula from the paper, which is partially forgetting the existing content and adding new memory? hidden = tf.multiply(og,tf.tanh(cell+RRA)) - hidden here, is it the hidden state corresponding to each input? What exactly can the tanh activation function do to the values? It makes it non-linear... but in the context of text, what does tanh do? Thank you.

JRC1995 commented 6 years ago

From the paper you gave me, I understood that fg, ig, og are the forget gate, input gate and output gate?

Yes.

When is the body function executed? Is it when forward_encoder function is called?

No. This line within the forward_encoder will execute cond and body.

    _,_,_,_,hidden_backward,hidden_residuals = tf.while_loop(cond,body,[i,j,hidden,cell,hidden_backward,hidden_residuals])

This is just Tensorflow's way of implementing dynamic loops (loops that depend on some dynamic value, i.e. values that are specified at runtime).

Check: https://www.tensorflow.org/api_docs/python/tf/while_loop Also, check this: https://stackoverflow.com/questions/37441140/how-to-use-tf-while-loop-in-tensorflow

x contains the inp[i], ie for 0....end?

Yes. x will contain the input word vector. In the first iteration it starts with the 1st word (i=0); in the next iteration i is increased (i++), so it will contain the 2nd word vector, and so on till the end. The code you quoted is from the backward encoder, where i starts from the end and continues to 0, its value being decremented every iteration.

In the model function, wf, uf etc. are just initialised, but how are their values set?

    wf_f = tf.Variable(tf.truncated_normal(shape=[word_vec_dim,hidden_size],stddev=0.01))
    uf_f = tf.Variable(np.eye(hidden_size),dtype=tf.float32)
    bf_f = tf.Variable(tf.zeros([1,hidden_size]),dtype=tf.float32)

This code doesn't actually initialize the values; instead, it specifies how they are to be initialized. You can also say it specifies the structure of the variables. For example, tf.truncated_normal will initialize the values using a truncated random normal distribution with stddev=0.01, np.eye will initialize an identity matrix of the given shape, and so on.

This code in the training section:

    init = tf.global_variables_initializer()

    with tf.Session() as sess: # Start Tensorflow Session
        sess.run(init)         # initialize all variables

will actually initialize, i.e. set the initial values of, the variables.

The values are then adjusted by Tensorflow during backpropagation.

    cost = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=output, labels=tf_summary))
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

The optimizer defines how to adjust the values. The minimize(cost) part will differentiate the cost and backpropagate to find the gradients of all trainable parameters, and then adjust those parameters using the specific optimization method trying to minimize the cost. All done with Tensorflow. Note, the above code is more like a definition of how optimization is to be done. Executing it does not initiate the optimization process or backpropagation.

This code starts the backpropagation. Note how 'optimizer' is present within the sess.run.

        # Run optimization operation (backpropagation)
        _,loss,pred = sess.run([optimizer,cost,prediction],feed_dict={tf_text: train_texts[i], 
                                                tf_seq_len: len(train_texts[i]), 
                                                tf_summary: train_out,
                                                tf_output_len: len(train_out)})

"cell" in this code follows the cell formula from the paper

Yes.

hidden here, is it the hidden state corresponding to each input?

Yes.

It makes it non-linear... but in the context of text, what does tanh do?

Well, within the network we are dealing with word vectors, not raw text. A word vector is a vector - like an array of numbers. All the previous equations are just linear algebra. So even though we are dealing with text, it's not all that different from dealing with an image (which is represented as a tensor). In the end we are dealing with numbers, and tanh is still used for non-linearity here.

hack1234567 commented 6 years ago

    RRA = tf.reduce_sum(tf.multiply(hidden_residuals_stack[j-K:j],Wattention),0)
    RRA = tf.reshape(RRA,[1,hidden_size])

What does this RRA do? It finds the K hidden states and calculates the weighted sum, but what does it mean? I thought RRA was used for local attention, when comparing decoder states with encoder states. That had D as the window size, but here K hidden states are taken. How does this relate to the local attention model specified later? Also, how many hidden layers are there?

JRC1995 commented 6 years ago

RRA is Recurrent Residual Attention (https://arxiv.org/abs/1709.03714). It is different from local attention. It doesn't have any special relation with local attention. I recommended earlier that you can avoid RRA and any codes associated with it, completely. I discovered the concept of RRA in the aforementioned paper. The paper isn't very highly cited, and I am overall suspicious of the quality of the method.

Local attention is, in a sense, an inter-layer attention. It's something that can be said to happen between the encoder and the decoder. Yes, local attention is used during decoding, but it takes both encoder states and decoder states into account. That is, two separate layers are in communication for local attention, which is why it can be considered inter-layer attention.

RRA, in contrast, is more of an intra-layer attention. Here attention is happening among the states of the same layer. In encoder, RRA is used for attending to previous hidden states of the same encoder layer. In the decoder, RRA is again used for attending to previous decoder hidden states of the same layer.

However, the concept of intra-layer attention seems to me to be more fleshed out in the LSTMN model: https://arxiv.org/pdf/1601.06733.pdf

So I will recommend against using RRA. If you really want to use some intralayer attention you can try LSTMN.

It finds the K hidden states and calculates the weighted sum, but what does it mean?

It finds the 'last' K hidden states. K is a predefined hyperparameter. The K hidden states are then multiplied by K weights (trainable parameters), i.e. each of the K hidden states is multiplied by one of the K weights. The result of the multiplication is a list of 'weighted hidden states'. The sum of these weighted hidden states is the weighted sum.

Example:

list of 1D vectors = [0], [2], [3]
list of weights = 99, 5, 10
weighted sum of the 1D vectors = 99*[0] + 5*[2] + 10*[3] = [40]

hack1234567 commented 6 years ago

    hidden_residuals_d = tf.TensorArray(size=K,dynamic_size=True,dtype=tf.float32,clear_after_read=False)
    hidden_residuals_d = hidden_residuals_d.unstack(tf.zeros([K,2*hidden_size],dtype=tf.float32))

The residuals are stacked one after another, so as to get the most immediate output? But when it's unstacked it returns a list? So what exactly does unstack mean in this context?

Would you classify this project as part of deep learning?

All the methods for finding pt and the context vector can be understood from the paper, right?

But what is y used for here?

    y = tf.convert_to_tensor(SOS) #initial decoder token <SOS> vector
    y = tf.reshape(y,[1,word_vec_dim])

Then it's used as:

    y = tf.matmul(attended_hidden,Ws)

    output = output.write(i,tf.reshape(y,[vocab_len]))

    y = tf.nn.softmax(y)

    y_index = tf.cast(tf.argmax(tf.reshape(y,[vocab_len])),tf.int32)
    y = tf_embd_limit[y_index]
    y = tf.reshape(y,[1,word_vec_dim])

Also, what is the concept behind alignment? How is alignment significant?

Thank you

JRC1995 commented 6 years ago

The residuals are stacked one after another, so as to get the most immediate output? But when it's unstacked it returns a list? So what exactly does unstack mean in this context?

hidden_residuals is a part of RRA. It is used to store all previous hidden states, and from this list the last K hidden states are obtained. If you don't want to bother with RRA, you can ignore hidden_residuals.

Read about what stack and unstack is from here: https://www.tensorflow.org/api_docs/python/tf/TensorArray

Remember, hidden_residuals is a TensorArray. It's a kind of dynamic list with some specific rules for reading and writing data. .stack() simply obtains a Tensor made of the data in the TensorArray. A Tensor can be handled more conveniently for accessing data. I don't remember the exact reason I used stack() to obtain a Tensor; maybe I faced some issue when using the TensorArray to access data, or maybe it was just more convenient.

Unstack does the opposite. It puts the values from a Tensor into the indices of the Tensorarray.

I used: hidden_residuals_d = hidden_residuals_d.unstack(tf.zeros([K,2*hidden_size],dtype=tf.float32))

to basically initialize the TensorArray with zeros. tf.zeros([K,2*hidden_size]) is a tensor of K elements, each having the dimensionality of the decoder hidden state, and all the values are 0.
At the beginning, the TensorArray hidden_residuals will thus contain a list of K elements, all zero throughout their dimensions.

So the first time the program requires the last K hidden states, it will only get zero-valued hidden states, since there are no real hidden states yet. Being zero, they won't have any effect on the calculation of RRA.

would you classify this project as part of deep learning?

Yes. Pretty much anything with a multi-layered neural net is deep learning. A Recurrent Neural Network (RNN) is a type of neural network, and the LSTM I used is a type of RNN. Like a typical deep learning project, it contains stacked layers of neurons arranged in a certain way. Here you can see how the decoder layers are stacked after the encoder layers, with some attention in between.

All the methods for finding pt and the context vector can be understood from the paper, right?

If you mean the context vector and pt from local attention, then yes. It's here: https://arxiv.org/abs/1508.04025

But what is y used for here? y = tf.convert_to_tensor(SOS) #initial decoder token vector, y = tf.reshape(y,[1,word_vec_dim]), then it's used as y = tf.matmul(attended_hidden,Ws)

y signifies the decoder output of a decoder time step. SOS is a word vector that signifies the start of the sentence.

You missed the use of y after initializing y with SOS.

Y was used here before y = tf.matmul(attended_hidden,Ws):

decoded_hidden_next,cell_d = decoder(y,decoded_hidden,cell_d, wf_d,uf_d,bf_d, wi_d,ui_d,bf_d, wo_d,uo_d,bf_d, wc_d,uc_d,bc_d, RRA)

Here y which is SOS in the first time step provides the initial context for predicting the next word (which will be the actual first word).

After this, the work of SOS is done. Y is updated with the newly predicted word-vector.

Take note, the decoder function uses the previous decoder predicted word to predict the next word. Y provides this previously predicted word. So at each time step Y is updated with the last predicted word. Since in the beginning there was no predicted word, Y was initialized with SOS which provides the start of sentence context.

Also, what is the concept behind alignment? How is alignment significant?

I don't understand what you mean by alignment. Are you talking about this?

G,pt = align(encoded_hidden,decoded_hidden,Wp,Vp,Wa,tf_seq_len)

Alignment here is basically a part of attention. The align function determines pt, i.e. the center position of the window where attention will be given, and it also determines G, which holds the weights of the encoded words within the window - the weights signify how much attention is to be given. It's named align because, in a sense, it aligns the encoded words with the decoder by determining where to attend and how much to attend to the encoder in order to get the appropriate context for decoding.

hack1234567 commented 6 years ago

How many hidden layers are there?

    for vec in train_texts[i]:
        if vec2word(vec) in string.punctuation or flag==0:
            print(str(vec2word(vec)),end='')
        else:
            print((" "+str(vec2word(vec))),end='')
        flag=1

Converting this vector to a word - but what do end and string.punctuation achieve in this context? Also, what is the use of the transform_out function? And:

    def cond_pred(i,pred):
        return i<tf_output_len

    def body_pred(i,pred):
        pred = pred.write(i,tf.cast(tf.argmax(output[i]),tf.int32))
        return i+1,pred

    i,pred = tf.while_loop(cond_pred,body_pred,[i,pred])

How is data fed into pred? What exactly does the argmax of output[i] do? output would be the prediction, right - then what? Thank you.

JRC1995 commented 6 years ago

I don't remember exactly what string.punctuation contains; it's probably a list of punctuation characters in string format. I know that "if string in string.punctuation" returns True if the string is a punctuation character, so with this condition you can check whether the current word is a punctuation mark or not. This part of the code is probably not necessary here, because I filtered out punctuation, IIRC. But if you are making a more sophisticated program that takes punctuation into account, then this code can be useful. Basically, this code is for formatting the display of the output.

flag is 0 only for the first word, I believe.

So if vec2word(vec) in string.punctuation or flag==0 will return true only if the word is a punctuation or the word is the first word.

Specifying end='' within print() means the printed element will end with an empty string '', i.e. it will not be followed by anything. By default, end='\n', so if you simply write print("xyz"), the actual printed output will be xyz followed by a newline.

So if we are dealing with the first word or a punctuation mark, nothing follows it and no space is put before it. Otherwise a space is put before the word:
print((" "+str(vec2word(vec))),end='') - see how a space is concatenated before the word.

So what is happening?

Let us say we have a set of words\punctuations like this:

'Jack' ',' 'Jill' ',' 'and' 'I' 'are' 'three' 'guys' '.' ( Jack, Jill, and I are three guys.)

The program will start with Jack. At that time flag will be zero. So it will simply print Jack with nothing following it.

Output at this point: "Jack"

Next word will be actually not a word but a punctuation ",". So following the code it will again simply print "," without any space before or after.

Output at this point: "Jack,"

Next word is "Jill" which is neither a punctuation nor the first word, so it will concatenate a blank space before Jack and print it.

Output at this point: "Jack, Jill"

....and so on.

Now, you may understand why I did what I did.
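Here is the same formatting idea as a small runnable sketch (the word list is made up):

    import string

    # First word and punctuation get no leading space; every other word gets one.
    words = ['Jack', ',', 'Jill', ',', 'and', 'I', 'are', 'three', 'guys', '.']
    flag = 0
    for w in words:
        if w in string.punctuation or flag == 0:
            print(w, end='')
        else:
            print(' ' + w, end='')
        flag = 1
    print()   # -> Jack, Jill, and I are three guys.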

what is the use of transform_out function

It changes the format of the target output (or label). Initially the target output is a bunch of word vectors. But to use Tensorflow's sparse_softmax_cross_entropy_with_logits() function for loss calculation, the target outputs/labels need to be in a special format: integers signifying the position of the word in the vocabulary (which can be conceived of as a sort of label number).

A slightly more detailed explanation is given above the transform_out function in the readme file. You may have an easier time understanding it if you know one-hot encoding.

how is data fed into the pred?

Initially pred is an empty Tensorarray with a size defined. This tensorarray object - pred is initially fed into the loop.

Within the loop, pred is updated with information from which the predicted word can be deduced.

At the end of the loop, pred will have information denoting all the predicted words.

argmax of output[i]

Output[i] contains the probability distribution of all the words in vocab_limit for position i of prediction.

In other words, output[i] contains the probability of each word in vocab_limit to be position i.

tf.argmax(output[i]) will return the index within output[i] that contains the maximum probability. This index also signifies the position of the word in vocab_limit.

(jth value in output[i] will have the probability of the jth word in vocab_limit to be in the ith position)

So if you know the index, you can easily know the word by extracting it from vocabulary.

Basically argmax is used to choose the index of the maximum probability value so that we can finally find the word that is given most probability to be in the ith position.

This index information is stored in pred. Later on, the word can be retrieved from the index to display the predicted word.
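A toy sketch of that index-to-word mapping (the distributions and vocabulary are made up):

    import numpy as np

    # Each row of `output` is a probability distribution over vocab_limit for one
    # output position; pred keeps the argmax index per position.
    vocab_limit = ["the", "dog", "ran", "<eos>"]
    output = np.array([[0.1, 0.7, 0.1, 0.1],
                       [0.2, 0.1, 0.6, 0.1],
                       [0.1, 0.1, 0.1, 0.7]])
    pred = output.argmax(axis=1)                  # -> [1, 2, 3]
    print([vocab_limit[i] for i in pred])         # -> ['dog', 'ran', '<eos>']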

hack1234567 commented 6 years ago

So in the decoder, using the encoder hidden states, the previous decoder state and the context, are you trying to find the word in vocab_limit that has the maximum probability and directly printing that word as the predicted summary? train_out = transform_out(train_summaries[i][0:len(train_summaries[i])-1]) - what is the use of train_out? Is it only to print the actual summary and its length, or does it have any other use? cost = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=output, labels=tf_summary))

Here the logits are softmaxed and then compared with a one-hot encoded label... where did the one-hot encoding of tf_summary take place (since it's just a placeholder)?

JRC1995 commented 6 years ago

So in the decoder, using the encoder hidden states, the previous decoder state and the context, are you trying to find the word in vocab_limit that has the maximum probability and directly printing that word as the predicted summary?

Yes.

train_out = transform_out(train_summaries[i][0:len(train_summaries[i])-1]) - what is the use of train_out? Is it only to print the actual summary and its length, or does it have any other use?

train_out contains the value returned by transform_out. As I already said, transform_out converts the target output into a specific format so that it can be used in Tensorflow's sparse_softmax_cross_entropy_with_logits() function for loss calculation. That's why train_out is fed to the network for loss calculation.

# Run optimization operation (backpropagation)
        _,loss,pred = sess.run([optimizer,cost,prediction],feed_dict={tf_text: train_texts[i], 
                                                tf_seq_len: len(train_texts[i]), 
                                                tf_summary: train_out,
                                                tf_output_len: len(train_out)})

Check above. tf_summary is fed with the values in train_out.

Here the logits are softmaxed and then compared with a one-hot encoded label... where did the one-hot encoding of tf_summary take place (since it's just a placeholder)?

One-hot encoding was not used, but something similar was. For example, let's say all the words in the vocabulary are possible labels and there are 3 words. If the target output is the 2nd word, then in one-hot encoding it would be 010: index 0 holds 0, index 1 holds 1, and index 2 holds 0. softmax_cross_entropy_with_logits() would be satisfied with this format.

But if you use sparse_softmax_cross_entropy_with_logits(), instead of 010 you have to feed it 1. Why 1? Because the 1 in the one-hot format is at index 1. If the one-hot encoding were 001, you would feed 2. Basically, you are feeding the position of the word in the vocabulary, i.e. the position where the 1 would be if it were one-hot encoded.

The target output words are transformed into this format (containing the index of each word in vocab_limit) within the transform_out function.
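For the 3-word example above, the two label formats look like this (a small illustrative sketch, not the repository's code):

    import numpy as np

    # One-hot vector for softmax_cross_entropy_with_logits versus a plain index for
    # sparse_softmax_cross_entropy_with_logits (the index format is what transform_out
    # produces).
    one_hot_label = np.array([0, 1, 0])            # target is the 2nd vocabulary word
    sparse_label = int(np.argmax(one_hot_label))   # -> 1, the word's position in the vocab
    print(one_hot_label, sparse_label)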

hack1234567 commented 6 years ago

Thank you. Attention finds the maximum probable word or context from the window, then it's used along with the decoder state... but how does vocab_limit come into this? Since we can directly find the most probable word using attention... Earlier you mentioned creating a probability distribution over vocab_limit, but we are already finding the context vector using attention from the final encoded states. vocab_limit would have way too many words... like the full vocabulary... is the decoder using the context along with the words in vocab_limit? Can you please explain that concept? Also, how can the attention window go beyond len-1? It is restricted to pt-D, pt+D, so what would it contain if pt+D goes beyond len-1?

JRC1995 commented 6 years ago

Attention finds the maximum probable word or context from the window, then it's used along with the decoder state

No.

Finding maximum probable word from the probability distribution is something that happens at the end - at the time of making final predictions. Attention doesn't have anything to do with it.

Local Attention here helps to direct attention towards encoder states for predicting the decoder word (probability distribution - from which much later the maximum probable word is chosen)

but how does vocab_limit come into this?

vocab_limit simply contains all the words in the vocabulary. It's actually a limited vocabulary, because vocab_limit only contains words present in the training data.

Since we can directly find the most probable word using attention

I am not sure where you are getting that idea.

I was describing the selection of most probable word while describing the code with pred, and the work of argmax. Those codes happen after attention and decoding are complete. So I am not sure why you are associating attention with this.

Earlier you mentioned creating a probability distribution over vocab_limit, but we are already finding the context vector using attention from the final encoded states

The context vector from attention is USED to predict the probability distribution OVER the words in vocab_limit. The probability distribution is the initial decoder output. This is also used for loss calculation. Later max probable word is chosen from the distribution as the final prediction to display.

vocab_limit would have way too many words... like the full vocabulary... is the decoder using the context along with the words in vocab_limit?

Yes vocab_limit has too many words. The processing would be heavier if previously decoded words used for context were all probability distributions over vocab_limit. So first the probability distribution is saved in output tensorarray for loss calculation later on. Then the word vector of the max probable word is chosen to serve as the context for previous decoded word. So yeah, this is another place where the max probable word is used, before the final prediction. Check the codes:

  output = output.write(i,tf.reshape(y,[vocab_len]))
    #Save probability distribution as output

    y = tf.nn.softmax(y)

    y_index = tf.cast(tf.argmax(tf.reshape(y,[vocab_len])),tf.int32)
    y = tf_embd_limit[y_index]
    y = tf.reshape(y,[1,word_vec_dim])

Also, how can the attention window go beyond len-1? It is restricted to pt-D, pt+D, so what would it contain if pt+D goes beyond len-1?

The problem is coded in such a way that pt+d would never go beyond len-1.

Check this:

positions = tf.cast(tf_seq_len-1-2*D,dtype=tf.float32)

sigmoid_multiplier = tf.nn.sigmoid(tf.matmul(tf.tanh(tf.matmul(ht,Wp)),Vp))
sigmoid_multiplier = tf.reshape(sigmoid_multiplier,[])

pt_float = positions*sigmoid_multiplier

pt = tf.cast(pt_float,tf.int32)
pt = pt+D #center to window

sigmoid_multiplier's range is from 0 to 1, because sigmoid(x)'s range is 0-1. So positions*sigmoid_multiplier, i.e. the float version of pt, ranges between 0 and positions.

positions is set up to be of size len-1-2*D

So finally the range of pt is 0 to len-1-2*D

after that there is this code pt=pt+D

That means the range of pt becomes D to len-1-D

So pt-D at minimum = 0 and pt+D at maximum = len-1

So pt-D can never be less than 0, and pt+D can never exceed len-1.
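A quick numeric check of those bounds (seq_len and D assumed):

    # Whatever value the sigmoid takes in (0, 1), the window [pt-D, pt+D] stays
    # inside [0, seq_len-1].
    seq_len, D = 40, 10
    positions = seq_len - 1 - 2 * D                  # 19
    for sigmoid_multiplier in (0.0, 0.5, 1.0):
        pt = int(positions * sigmoid_multiplier) + D
        print(pt - D, pt + D)                        # (0, 20), (9, 29), (19, 39)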

hack1234567 commented 6 years ago

The context vector from attention is USED to predict the probability distribution OVER the words in vocab_limit.

How is the context vector (average) from the encoded words compared with vocab_limit? That is what I am having trouble with. Since vocab_limit can have anything... say the context vector's meaning is "lion"... vocab_limit can have, say, "cat", "tiger" etc. ... how is the context vector used to find the most probable word, since it's not similar to anything in vocab_limit? The context vector contains the useful data and the probability distribution is the initial decoder state. But vocab_limit is just an outside entity full of random words. How is it even relevant to the current text? Please bear with me, I'm just not able to understand that concept.

JRC1995 commented 6 years ago

How is the context vector (average) from the encoded words compared with vocab_limit? That is what I am having trouble with. Since vocab_limit can have anything... say the context vector's meaning is "lion"... vocab_limit can have, say, "cat", "tiger" etc. ... how is the context vector used to find the most probable word, since it's not similar to anything in vocab_limit? Please bear with me, I'm just not able to understand that concept.

I can't clearly understand your question. It seems to me you may be misunderstanding something critical, but I can't understand what it is from the question. Nevertheless, I will try to answer to the best I can.

I am assuming by context vector you mean this:

    weighted_encoded_hidden = tf.multiply(local_encoded_hidden,G)
    context_vector = tf.reduce_sum(weighted_encoded_hidden,0)
    context_vector = tf.reshape(context_vector,[1,2*hidden_size])

This context vector isn't exactly an average. It's a 'weighted summation'. Check back on my description of the weighted summation, if you need.

context_vector typically won't have a concrete meaning such as 'lion'. We can't really think in precise human terms of word meanings once we start encoding. Even during encoding we are associating context information: the encoded version of the word 'lion' won't simply mean 'lion', it will be lion + some context information. Things get quite abstract here. The context vector goes even further than that - it's the weighted summation of these already contextually encoded versions of the input words. Let's say we have encoded versions of 'lion' and 'tortoise'... the attention process may assign an attention weight of .75 to 'lion' and .25 to 'tortoise'. Then the context vector will be .75 * (encoded vector of 'lion') + .25 * (encoded vector of 'tortoise')... so at best you can say that the context vector is mostly lion-ish and slightly tortoise-ish. The point is that a well-trained network will give higher attention weights to important encoded words - and thus the context vector will be closer in meaning to the important words. The idea is to get the context vector to represent the essence or 'gist' of the encoded information.

the vocab_limit can have say "cat","tiger" etc

yes.

how is the context vector used to find the most probable word, since it's not similar to anything in vocab_limit?

I may be getting what you are trying to ask.

The context vector is not directly used to find the max probable word.

You can check the codes for what is exactly happening.

A brief overview of what's happening (without maths\codes):

Some linear algebra and non-linearity (multiplying with weights and so on) is used to transform the context vector and other important information (previous decoder word, previous decoder hidden state, etc.) into a vector with the same number of dimensions as there are words in the vocabulary. This data (after softmax) is then treated as the probability distribution over the words in the vocabulary.

That's the beauty of linear algebra operations. You can transform data of one dimensionality into data of any other dimensionality.

For example, handwriting recognition using neural nets: you feed in high-dimensional data (a tensor representation of the image) and it is converted into a single digit representing its class.

Note, we are not directly predicting 'cat' or 'dog'. We are initially predicting a probability distribution over the words. So instead of cat or dog, the program may predict 0.7, 0.3 - this data can then be treated as if there is a 0.7 probability for it to be a cat, and a 0.3 probability for it to be a dog in the given position.

Before training, the predictions will be just random numbers. After training, ideally, the network will have learned to use the important information from the context vector, previous decoder prediction, etc. to produce a set of data that can better serve as a probability distribution.

Now, once you have the probability distribution, finding the max probable word is a piece of cake. You can find the index where the probability value is maximum, and then just output the word at the same index within the vocabulary. (The data here is treated in such a way that the value in the ith position is considered the probability for the ith word in the vocab.)
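In rough terms, that whole step could be sketched like this (toy vocabulary and made-up weight shapes, not the actual variables from this repo):

    import numpy as np

    vocab_limit = ["cat", "tiger", "dog", "<eos>"]   # toy vocabulary

    # Pretend this is the decoder's combined information for one timestep
    # (context vector, previous decoder state, etc. squashed together).
    decoder_features = np.random.randn(10)

    # A learned linear projection gives one score (logit) per vocabulary word.
    W = np.random.randn(10, len(vocab_limit))
    b = np.zeros(len(vocab_limit))
    logits = decoder_features @ W + b

    # Softmax turns the scores into a probability distribution over the vocabulary.
    probs = np.exp(logits) / np.sum(np.exp(logits))

    # Greedy search: take the index with the highest probability,
    # then output the word at that same index in the vocabulary.
    predicted_word = vocab_limit[int(np.argmax(probs))]
    print(probs, predicted_word)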

Please bear with me, just not able to understand that concept.

It's fine.

hack1234567 commented 6 years ago

Thank you. Also, training_iters = 5 but the iteration goes on till approx. 25000... so what does training_iters do?

    while step < training_iters:

        total_loss=0
        total_acc=0
        total_val_loss = 0
        total_val_acc = 0

        for i in xrange(0,train_len):

            train_out = transform_out(train_summaries[i][0:len(train_summaries[i])-1])

            if i%display_step==0:
                print("\nIteration: "+str(i))

Also, in the following line from the code for forward_encoder:

hidden_forward.stack()

How do I print the values in hidden_forward inside that function itself? I tried printing it the usual way, like you would print a list, but it's returning an error.

JRC1995 commented 6 years ago

I used training_iters as epochs, I guess. My phrasing was poor. That is, 1 training_iter is completed when all the training data has been used for training once. Since there is a lot of data, the actual number of training iterations will be in the thousands. 5 training_iters means all the training data will be fed for training 5 times.
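In other words, the structure of the loop above works roughly like this (toy numbers, just to show the nesting):

    training_iters = 5      # "epochs": how many passes over the whole training set
    train_len = 5000        # example value: number of training pairs

    for epoch in range(training_iters):
        for i in range(train_len):
            # one training step per example; this inner i is what gets
            # printed as "Iteration", so you see numbers far above 5
            pass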

JRC1995 commented 6 years ago

I didn't understand the output[i] part... if the context vector is found and, say, the decoder output for that timestep is found, then how does output[i] access these... what exactly does "i" mean in this context... how can we know the ith element in vocab_limit exactly?

The context vector is not the decoder output, nor is it the sole determining factor for the decoder output. You can at best say that the context vector carries the gist of an attended portion of the encoded vectors, or something like that. You can see the code for details.

The decoder output is put into y. The value of y gets updated every timestep. We need something to STORE the predicted values (the probability distributions / decoder outputs) so they can be used later for loss calculation and prediction; output serves this purpose. output is a TensorArray where the new value of y (the new decoder output of a timestep) is stored.

The decoder may output multiple probability distributions, since the output can have multiple words. At each timestep a probability distribution for a new position in the output sentence is predicted.

output[i] simply signifies the probability distribution for predicting the word to put in position i of the output sentence.

how can we know the ith element in vocab_limit exactly?

use vocab_limit[i]
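Here is a small sketch of that write-then-read pattern (assuming the TensorFlow 1.x TensorArray/Session API; the names and toy values are illustrative, not taken from the repo):

    import tensorflow as tf  # TensorFlow 1.x style

    # One slot per decoder timestep; each slot stores one probability distribution y.
    output = tf.TensorArray(dtype=tf.float32, size=3)

    # Pretend these are three decoder outputs over a 4-word toy vocabulary.
    y0 = tf.constant([0.7, 0.1, 0.1, 0.1])
    y1 = tf.constant([0.2, 0.5, 0.2, 0.1])
    y2 = tf.constant([0.1, 0.1, 0.1, 0.7])

    output = output.write(0, y0)   # store y for timestep 0
    output = output.write(1, y1)   # store y for timestep 1
    output = output.write(2, y2)   # store y for timestep 2

    stacked = output.stack()       # tensor of shape (3, vocab_size): all stored distributions

    with tf.Session() as sess:
        dists = sess.run(stacked)
        print(dists[1])            # output[i]: the distribution for word position i = 1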

JRC1995 commented 6 years ago

Are you aligning the predicted output with the actual summary? In translation I think they align the target output with the actual output... or is the actual summary just used to find the loss?

By aligning, do you mean comparing the predicted output and the actual summary to find the loss, so that the program can optimize the trainable weights in order to make the predicted output closer to the actual output?

I am not sure what you mean by 'actual output' and 'target output'. There's a predicted output that the model predicts. The prediction is the actual predicted output. There's a target output, which is what we want the machine to predict. It can be considered the true output or actual output, as it is what we think to be the correct, desirable output. This is what the machine strives towards. If by actual output you mean target output, what do you mean by aligning the target output with the actual output? Wouldn't they be the same?

So I don't get what you are asking for.

But, all in all, things here aren't different from translation using encoder-decoder.

In translation, we compare the model-predicted probability distributions of the translation with the actual target translation. In summarization, we compare the model-predicted probability distributions of the summary with the actual target summary.

If the probability distribution of each decoder timestep is found, then why do some predicted summaries have only one word?

Those outputs are the result of one decoder timestep. Only one probability distribution was determined, so only one max probable output was retrieved. The output size is predefined by the user in this program. But a better program should learn to predict the end of the output by itself. In that case the end is marked by the vector representation of another symbol (end of sentence). If the program predicts an end-of-sentence symbol, everything afterwards should be ignored for display and loss calculation.
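For example, cutting off the prediction at an end-of-sentence marker could look roughly like this (using `<eos>` as the marker is just an assumption for illustration):

    predicted = ["the", "lion", "sleeps", "<eos>", "cat", "cat"]

    # Keep only the words before the first end-of-sentence marker, if there is one.
    if "<eos>" in predicted:
        predicted = predicted[:predicted.index("<eos>")]

    print(predicted)  # ['the', 'lion', 'sleeps']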

Also, I didn't properly understand the output[i] part... you are using write(i, tf.cast(y....)); are you putting y in the ith location and later accessing it as output[i]?

Yes.

hack1234567 commented 6 years ago

Sorry, I unknowingly repeated 2 questions... it happened due to an internet error. Also, I meant predicted instead of target... but is alignment done like in translation? I think the word order is pretty important in translation.

JRC1995 commented 6 years ago

Does the size of the actual summary have any consequence on the output length or its characteristics?

Normally it shouldn't. But here I used the size of the target summary to predefine the size/length of the predicted output, which is also the no. of decoder timesteps. It was for testing purposes - to make the program simpler.

hack1234567 commented 6 years ago

    hidden = tf.multiply(og,tf.tanh(cell+RRA))
    hidden_residuals = tf.cond(tf.equal(j,seq_len-1+K),
                               lambda: hidden_residuals,
                               lambda: hidden_residuals.write(j,tf.reshape(hidden,[hidden_size])))

    hidden_forward = hidden_forward.write(i,tf.reshape(hidden,[hidden_size]))

Here hidden is selecting how much to show... say "og" gave a value 0.5 and tanh gave a value 0.2, so multiplying would give hidden as "0.1". I assume this happens for the entire list, so in the end hidden would be something like [0.1, 0.2, 0.5 ........ till hidden_size, i.e. 500 elements]. But how would this value, i.e. hidden, get used further? How does it "select" a certain part of the content? These are just float values... how do they relate to word vectors? Thank you.

JRC1995 commented 6 years ago

"og" gave a value 0.5 and tanh gave a value 0.2 so multiplying would give hidden as "0.1". I assume this happens for the entire list, so like in the end hidden would be having something like [0.1 , 0.2 , 0.5........ till hidden_size ie 500 elements]

Since hidden_size is 500... og will have 500 values. So og will be more like [0.5 0.3 0.7 ..... 500 elements]. We are dealing with high-dimensional vectors here. og could be a single 0.5 only if we were dealing with 1-dimensional vectors and a hidden layer with one neuron. The tanh part should similarly have 500 elements, like [0.2 0.5 0.6 .... 500 elements].

The result of tf.multiply (element-wise multiplication) will be like this: `[0.5*0.2 0.3*0.5 0.7*0.6 ..... 500 elements]`

= [0.1 0.15 0.42 ....... 500 elements]

So any hidden state will have 500 elements at one go, not 'in the end'. The machine will deal with the 500 elements/dimensions and their operations in parallel.

Each hidden state will be like this...a vector of 500 dimensions (500 elements to represent it)

There will be as many hidden states as there are inputs (in case of encoding, the no. of words in the sequence).

So the LIST of all hidden states will be like this:

[ [0.1 0.15 0.42 .... 500 elements] [7 8.2 9 ..... 500 elements] ........ seq_len no. of elements ]
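As a toy NumPy illustration of both points (the element-wise gating and the per-word hidden states; the numbers are just the ones from the example above):

    import numpy as np

    hidden_size = 3   # toy size instead of 500

    og = np.array([0.5, 0.3, 0.7])          # output gate activations
    tanh_part = np.array([0.2, 0.5, 0.6])   # tanh(cell + RRA)

    hidden = og * tanh_part                 # element-wise: [0.1, 0.15, 0.42]

    # One such hidden vector is produced per input word, so the full set of
    # encoder hidden states forms a (seq_len, hidden_size) array.
    seq_len = 2
    hidden_states = np.zeros((seq_len, hidden_size))
    hidden_states[0] = hidden
    print(hidden_states)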

But how would this value, i.e. hidden, get used further?

If these hidden states are the product of the encoder, then attention will be used on these states to determine the context vector, which will be used along with the decoder hidden state to determine the final output.

How does it "select" a certain part of content.?

I am not totally sure what you are asking. Do you mean a certain part of the 500 elements? It doesn't select a certain part. The list of 500 elements is a 500-dimensional vector - it is treated as 'one thing'. Mathematical operations on it are done following the rules of linear algebra (matrix multiplication, sometimes element-wise multiplication, transpose, etc.). Sometimes it may be necessary to access certain parts of a vector; you can then access them using an index. For example, in classification using a NN, the result will probably be a vector with as many elements in it as there are labels/classes. The elements are usually treated as probabilities for the class they represent. In that case, if there is any need to access the probability for a specific class, then you can simply use vector[i] to access the ith value of the vector, which may represent the probability of class no. i.

Otherwise, it isn't really required to select a certain part.

But how would this value, i.e. hidden, get used further?

Well if you find some mathematical operation like hidden_state * weights or so on...you know the value is getting used.

These are just float values... how do they relate to word vectors?

Word vectors ARE float values. Algorithms like GloVe or Word2vec TRANSFORM words into VECTORS. So "lion" may be transformed into [0.2 0.5 0.7] (a 3-dimensional vector). They relate to hidden states through the language of linear algebra.
You can check out those algorithms to understand how words get transformed into vectors, and about some interesting properties of such transformed vectors.
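As a toy illustration of the idea (hand-written 3-dimensional vectors for the example; real GloVe vectors are learned from co-occurrence statistics and have 50+ dimensions):

    import numpy as np

    # Made-up "word vectors"; real GloVe embeddings are learned, not hand-written.
    embedding = {
        "lion":  np.array([0.2, 0.5, 0.7]),
        "tiger": np.array([0.3, 0.5, 0.6]),
        "car":   np.array([0.9, 0.1, 0.0]),
    }

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Related words end up with similar vectors, so their cosine similarity is higher.
    print(cosine(embedding["lion"], embedding["tiger"]))  # relatively high
    print(cosine(embedding["lion"], embedding["car"]))    # relatively low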

hack1234567 commented 6 years ago

So the LSTM is doing its remembering and forgetting by transforming the values, i.e.

    cell = tf.multiply(fg,cell) + tf.multiply(ig,tf.sigmoid( tf.matmul(x,wc) + tf.matmul(hidden,uc) + bc))

by multiplying it with fg, ig... the original values are transformed to represent something else?

    hidden_forward = hidden_forward.write(i,tf.reshape(hidden,[hidden_size]))

Here write would write the value of hidden (say [0.2 0.4 0.5 ....]) to index i? So the 0th element would represent a particular word as a 500-dimensional vector? So if it was "lion", now it's transformed into a new entity and placed at the 0th index?

also

hidden_forward = forward_encoder(...................)

How do I print hidden_forward? It's a stack, so running it through sess.run won't work, I think.

Also, how do I find the accuracy of each training iteration?

JRC1995 commented 6 years ago

the original values are transformed to represent something else?

Yes. But it doesn't necessarily represent something else entirely. It can be hard to say what exactly it represents. It's supposed to represent a mixture of various data (input data, previous cell state data, hidden state data, etc.).

here write would write the value of hidden (say [0.2 0.4 0.5 ....]) to index i?

Yes.

so the 0th element would represent a particular word as a 500-dimensional vector? So if it was "lion", now it's transformed into a new entity and placed at the 0th index?

Note that 'lion' was transformed into a vector (in this case a 50-dimensional vector) long before the network starts. It was done in the early preprocessing stage using GloVe.

The encoder LSTM transforms it again into a 500-dimensional vector using all the formulas with ig, og, etc. This 500-dimensional vector can be said to be the encoded version of the 50-dimensional word vector. Usually context information (from previous words) is encoded along with the 50-dimensional word vector during its transformation into the 500-dimensional vector.

So the 0th index will contain the context-encoded 500-dimensional vector version of the 50-dimensional GloVe word vector representation of 'lion'.

how to print hidden_forward? It's a stack, so running it through sess.run won't work, I think.

I don't know what you mean by 'it's a stack'. stack() is a function for TensorArrays that retrieves the values from a TensorArray and makes a tensor out of them. A tensor is like an n-dimensional array. I think you can print it with sess.run. At the end of the day, hidden_forward becomes a tensor.

Also, how do I find the accuracy of each training iteration?

Typical accuracy should signify what % of the words in the prediction are the same (and in the same position) as in the target output, if we are to calculate accuracy similarly to how we calculate it for standard classification problems and such.

So you have to manually compare the words in the prediction and the target output at each position to determine the accuracy.
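A quick sketch of that kind of position-wise word accuracy (this is not a metric implemented in the repo; the sentences are made up):

    predicted = ["the", "lion", "sleeps", "tonight"]
    target    = ["the", "lion", "rests",  "tonight"]

    # Count positions where the predicted word exactly matches the target word.
    matches = sum(p == t for p, t in zip(predicted, target))
    accuracy = float(matches) / max(len(target), 1)
    print(accuracy)  # 0.75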

But, that is not a good measurement of the quality of summary in general.

A summary is a good summary if it represents the essential meaning of the text in a concise manner, even if it uses different words in different positions than the target output (there can be multiple possible good summaries of a text; the target output is only one).

So evaluation metrics like BLEU, ROUGE, METEOR, etc. are used to measure the quality of a summary. BLEU is pretty popular, but by no means are these perfect evaluation metrics. Still, they should be better than simply checking what % of the prediction has the same words in the same positions as the target output.