ParikhKadam / bidaf-keras

Bidirectional Attention Flow for Machine Comprehension implemented in Keras 2
GNU General Public License v3.0

Hi, I have an issue with checking the model dimensions #2

Closed boslbi92 closed 5 years ago

boslbi92 commented 5 years ago

Hello, I would like to first thank you all for reproducing BiDAF in the Keras framework. It is a great resource, and I successfully managed to start training without issues. However, I am having a hard time understanding the dimensions of each layer.

According to my version, the model starts with (None, None, 400) for both the passage and question inputs. Since this model is based on word embeddings, I am expecting a 3-D input of shape (None, seq_len, embedding_dim). These "None" dimensions propagate through all the subsequent layers; for the similarity_layer in particular, the dimension is (None, None, None). Because of this ambiguity, I am having a hard time trying to tweak or update parts of the model structure.

Is this a known issue?

I would love to hear back from you, thanks!

ParikhKadam commented 5 years ago

Hi @boslbi92 .. I understood your query. First of all, thank you for providing your feedback here. The project is still in the development phase, and that's why feedback from users means a lot.

I am expecting a 3-D input of shape (None, seq_len, embedding_dim).

Yes, you are right here, and the shape of the input layers, i.e. (None, None, 400), matches that. Say a, b, c = (None, None, 400). Then a corresponds to batch_size, b corresponds to ncw for the context (or nqw for the query), and c is the embedding dimension.

Here, ncw = number of words/tokens in context and nqw = number of words/tokens in query
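As a minimal sketch (assuming tf.keras; this is not the project's actual code), an Input layer with an unspecified sequence length reports exactly this shape:

```python
# Minimal sketch (tf.keras assumed): leaving the word axis as None
# lets the model accept contexts/queries of any length.
from tensorflow.keras.layers import Input

emb_dim = 400
passage = Input(shape=(None, emb_dim))   # word count left open
question = Input(shape=(None, emb_dim))  # likewise for the query

print(list(passage.shape))  # [None, None, 400]
```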

We can do the same for the similarity_layer. The similarity layer outputs a similarity value for every pair of context and query words, so for a single example it is a 2-D matrix (in naive programming terms):

import numpy as np

sim = np.zeros((len(cws), len(qws)))   # 2-D similarity matrix
for i, cw in enumerate(cws):
    for j, qw in enumerate(qws):
        sim[i, j] = some_value(cw, qw)  # similarity of this word pair

where cws = list of context words, cw = a context word, qws = list of query words, qw = a query word, and some_value computes the similarity of one pair.

And if we write the same for a whole batch, it will have 3 dimensions. Hence, if a, b, c = (None, None, None), then a corresponds to batch_size, b corresponds to ncw, and c corresponds to nqw.
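This batched shape can be sketched in NumPy, using a plain dot product as a stand-in for the layer's actual (trainable) similarity function:

```python
# Hedged sketch of the batched similarity tensor described above;
# a dot product stands in for the real similarity computation.
import numpy as np

batch_size, ncw, nqw, dim = 2, 5, 3, 4           # toy sizes
contexts = np.random.rand(batch_size, ncw, dim)  # (batch_size, ncw, dim)
queries = np.random.rand(batch_size, nqw, dim)   # (batch_size, nqw, dim)

# one value per (context word, query word) pair, for every example
sim = np.einsum('bcd,bqd->bcq', contexts, queries)
print(sim.shape)  # (2, 5, 3), i.e. (batch_size, ncw, nqw)
```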

NOTE: I have seen many codebases for NLP tasks, specifically for QA / Machine Comprehension. Most of them use a fixed length for the context and query, i.e. the number of words in each is fixed, with padding or truncation applied to achieve it. I have instead tried to give this model the flexibility to take inputs of any length. As I said, we are still in development, so you might face issues with such code. But we are always here to fix problems; that's how we learn about our mistakes. Your feedback is important!

I don't know whether adding such flexibility might harm the model in some way. We don't have high-spec setups to test on; we will know for sure as soon as training completes.

Thank you...

ParikhKadam commented 5 years ago

@boslbi92

Because of this ambiguity, I am having a hard time trying to tweak or update parts of the model structure.

Can you please elaborate on this? Also, if needed, we can add more flexibility to the code so that users can choose between fixed-size and any-size inputs.

Thank you..

boslbi92 commented 5 years ago

Hey @ParikhKadam, thanks for the fast reply.

From my knowledge, many models allow users to specify input sizes with padding (mostly fixed, because a neural network cannot work with varying input sizes). I believe you meant to take input of any size, but don't you have to fix the sizes (for #context_words and #query_words) to predefined values before feeding into the network? It is not clear to me how samples of different sizes are handled during computation, and I was expecting to see a "resolved" total #words for both context and query before training.

To follow up on your post, you are completely right about the a, b, c = (None, None, 400) example. The batch size can be None because it is a variable-sized input, but I thought "b" had to be fixed when you print model.summary(), before you actually feed the samples. In my opinion, it is much clearer if the dimensions of each layer are displayed correctly (even with that freedom), so that it is easier to debug and understand what is going on. I think it can also create future issues if any dimension except batch_size remains unresolved when you want to load specific parts of the model after training. Also, I don't think it will hurt model performance, because different input sizes are always open to experimentation in ML and there is nothing wrong with specifying the input dimensions, for example: python main.py -num_context_words 100 -num_query_words 100 ...
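The suggested flags could be wired up with argparse; this is a hedged sketch with hypothetical flag names, not the project's actual CLI:

```python
# Hedged sketch of the suggested command-line flags (hypothetical names;
# the project's real interface, if any, may differ).
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--num_context_words', type=int, default=100)
parser.add_argument('--num_query_words', type=int, default=100)

args = parser.parse_args(['--num_context_words', '384'])
print(args.num_context_words, args.num_query_words)  # 384 100
```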

I am currently working on improving the state-of-the-art machine reading models and I was planning to use bidaf similarity layer (c2q and q2c attention) as part of the input layers to my model.

Just out of curiosity, what computing specs do you guys have, and how long does it take to train BiDAF at full capacity?

Thanks!

ParikhKadam commented 5 years ago

Hi @boslbi92 .. Thanks for the reply; it cleared up some things for me.

From my knowledge, many models allow users to specify input sizes with padding (mostly fixed, because a neural network cannot work with varying input sizes)

Yes, you are right. That's because you can't combine two tensors with shapes (x, 400) and (y, 400) to form a new tensor of shape (2, z, 400) where x != y. Hence, we use fixed-size inputs so that the model can train on batches instead of a single example at a time. This is the main reason. But as I said, we don't own any high-spec setup, so my main aim when writing the basic code was to make this model memory efficient. Let me explain this with an example.
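The shape constraint itself can be seen directly (a NumPy illustration, not the project's code):

```python
# Hedged illustration of the constraint above: NumPy refuses to stack
# two sequences whose word counts differ (x=3 vs y=5).
import numpy as np

a = np.random.rand(3, 400)  # (x, 400)
b = np.random.rand(5, 400)  # (y, 400), y != x
try:
    np.stack([a, b])        # would need a common shape (2, z, 400)
    stacked = True
except ValueError:
    stacked = False
print(stacked)  # False: padding or per-batch bucketing is required
```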

Suppose I have a dataset of 50 context-question-answer pairs (data points). If I set num_context_words and num_query_words to the maximum values found in the dataset, say 100 for both, but on analyzing the data I find that most data points have on average 40 context and query words, this method uses more memory for zero benefit -- memory waste.

Now, suppose I set num_context_words and num_query_words to 40, but I know there are a few data points whose values exceed 40. I need to trim them -- data loss.

Due to these two reasons, I settled on a middle way that causes zero data loss and as little memory consumption as possible. For that, I did:

  1. Don't fix num_context_words and num_query_words until you start generating batches.
  2. When you generate a batch, set num_context_words and num_query_words to the maximum value of data points in the batch.
  3. Provide the batch as input to the network.

With such a technique, the values of num_context_words and num_query_words are fixed within a single batch but vary across batches. Now, the other part..
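The steps above can be sketched as a small batch builder (a hedged sketch under assumed names, not the project's actual generator):

```python
# Hedged sketch of steps 1-3 above: each batch is zero-padded only to
# its own longest sequence, so the padded length varies per batch.
import numpy as np

def make_batch(examples):
    """examples: list of (num_words, emb_dim) arrays of varying length."""
    max_len = max(e.shape[0] for e in examples)  # fixed per batch only
    emb_dim = examples[0].shape[1]
    batch = np.zeros((len(examples), max_len, emb_dim))
    for i, e in enumerate(examples):
        batch[i, :e.shape[0]] = e                # zero-pad the remainder
    return batch

docs = [np.ones((3, 4)), np.ones((5, 4))]
print(make_batch(docs).shape)  # (2, 5, 4): padded to this batch's max, 5
```

A batch whose longest example had 40 words would be padded to 40, not to the dataset-wide maximum.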

In my opinion, it is much clearer if the dimensions of each layer are displayed correctly (even with that freedom), so that it is easier to debug and understand what is going on. I think it can also create future issues if any dimension except batch_size remains unresolved when you want to load specific parts of the model after training.

For debugging, it's better to pass an input through the model and analyze the shapes at each layer. TensorFlow's eager execution may also help, but that needs some extra work. Still, there is nothing to worry about, as the model is the same as the one in the paper. And we are going to provide proper documentation as soon as the development phase is complete.

Yes, you are right. It may cause issues when porting this model or its layers for external use; users who port the model may need to do some extra work. This is considered an issue and will be solved as soon as possible. Also, just two days ago, we noticed that to train this model on a TPU, we need to fix all dimensions except batch_size. So when we planned to add TPU support, this became an issue we needed to solve anyway. You just gave us another reason to add this functionality. Thank you :)

I am currently working on improving the state-of-the-art machine reading models and I was planning to use bidaf similarity layer (c2q and q2c attention) as part of the input layers to my model.

That's good.. Actually I would like to know more about your work. Seems interesting..

Just out of curiosity, what computing specs do you guys have, and how long does it take to train BiDAF at full capacity?

We don't own any setup. We train the model on two machines: one provided by our college, which is shared among teams working on ML/DL projects, and Google Colab. The college setup has a 12GB Quadro P5000 GPU, and I don't think there's any need to describe Colab. If no one else is using the college setup, our model completes a single epoch in at most 4 hours on SQuAD-v1.1.

We have already modified our code to work with SQuAD-v2.0; it will be released soon, as some testing needs to be done on it first.

Thank you.. We are adding our current status to the README.md file so that users can know.

ParikhKadam commented 5 years ago

Your issue is solved now.. Here's the latest model.summary():

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
passage_input (InputLayer)      (None, 384, 400)     0                                            
__________________________________________________________________________________________________
question_input (InputLayer)     (None, 64, 400)      0                                            
__________________________________________________________________________________________________
highway_1_ptd (TimeDistributed) (None, 384, 400)     320800      passage_input[0][0]              
__________________________________________________________________________________________________
highway_1_qtd (TimeDistributed) (None, 64, 400)      320800      question_input[0][0]             
__________________________________________________________________________________________________
bidirectional_encoder (Bidirect multiple             2563200     highway_1_qtd[0][0]              
                                                                 highway_1_ptd[0][0]              
__________________________________________________________________________________________________
similarity_layer (Similarity)   (None, 384, 64)      2401        bidirectional_encoder[1][0]      
                                                                 bidirectional_encoder[0][0]      
__________________________________________________________________________________________________
context_to_query_attention (C2Q (None, 384, 800)     0           similarity_layer[0][0]           
                                                                 bidirectional_encoder[0][0]      
__________________________________________________________________________________________________
query_to_context_attention (Q2C (None, 384, 800)     0           similarity_layer[0][0]           
                                                                 bidirectional_encoder[1][0]      
__________________________________________________________________________________________________
merged_context (MergedContext)  (None, 384, 3200)    0           bidirectional_encoder[1][0]      
                                                                 context_to_query_attention[0][0] 
                                                                 query_to_context_attention[0][0] 
__________________________________________________________________________________________________
bidirectional_decoder_0 (Bidire (None, 384, 800)     11523200    merged_context[0][0]             
__________________________________________________________________________________________________
span_begin (SpanBegin)          (None, 384)          4001        merged_context[0][0]             
                                                                 bidirectional_decoder_0[0][0]    
__________________________________________________________________________________________________
span_end (SpanEnd)              (None, 384)          19207201    bidirectional_encoder[1][0]      
                                                                 merged_context[0][0]             
                                                                 bidirectional_decoder_0[0][0]    
                                                                 span_begin[0][0]                 
__________________________________________________________________________________________________
combine_outputs (CombineOutputs (None, 2, 384)       0           span_begin[0][0]                 
                                                                 span_end[0][0]                   
==================================================================================================
Total params: 33,620,803
Trainable params: 33,620,803
Non-trainable params: 0
__________________________________________________________________________________________________