churchlab / UniRep

UniRep model, usage, and examples.

Understanding b._output conceptually #5

Closed spark157 closed 5 years ago

spark157 commented 5 years ago

Hello,

As an example, say I run the babbler on the sample 'formatted.txt' file with a batch size of 12, using the 64-dim babbler. Then b._output will return the final hidden outputs with shape (12, 265, 64) (assuming the longest sequence is 265, so all the other sequences are padded out to this length).
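(For reference, this is roughly the setup I mean, following the pattern in the README; the class name and weights path for the 64-unit model are just what I'm using locally and may differ.)

```python
from unirep import babbler64 as babbler

batch_size = 12
b = babbler(batch_size=batch_size, model_path="./64_weights")

# Batches drawn from formatted.txt are int-encoded and zero-padded out to the
# longest sequence in the batch (265 residues in my run), giving (12, 265).
bucket_op = b.bucket_batch_pad("formatted.txt", interval=1000)

print(b._output)  # a (batch, max_len, 64) tensor, i.e. (12, 265, 64) once evaluated
```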

It is clear that 12 is the number of sequences in the batch and 64 is the dimension of the representation. The 265 reflects the length of the longest sequence, but what is the 265 conceptually? I was originally thinking that each element along the 265 would be a representation of one position of the input amino acid sequence (i.e. that you would have a 64-dim representation for each amino acid in the original input sequence). So, for example, if the input sequence was 'MEAFL...', then b._output[loc, 1, :] would somehow be the 64-dim representation for the input amino acid 'E'.

Is this correct?

Now I'm wondering whether it represents something totally different, like the number of steps the sequence has been unrolled in the RNN, or something along those lines. I started to get this idea/confusion when reading some documentation about what tf.nn.dynamic_rnn returns. Also, the slicing into self._output to create self._top_final_hidden is confusing me. The way it is sliced seems to imply that _top_final_hidden is the representation at the last position along the length of the sequence (and so corresponds to the last amino acid in the input sequence).

Thanks.

Scott

sandias42 commented 5 years ago

Hi Scott,

UniRep is trained by predicting the next amino acid. In doing so, it is forced to "summarize" everything it has seen so far into a single fixed-size vector, typically called h, which is what makes up b._output.

As you said, the 0th dimension (size 12 in your case) is the batch size. The 2nd dimension (size 64 in your case) is the representation dimension. And the 1st dimension (size 265 in this case), as you said, is the "position" dimension (also called the "time" dimension by NLP folks, which may be part of your confusion). The position dimension is padded, so there will be junk in the entries past the length of each input sequence.
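As a quick illustration of the padding point, in plain numpy with made-up lengths (a stand-in array, not the repo's API):

```python
import numpy as np

out = np.random.randn(12, 265, 64)             # stand-in for b._output
lengths = np.random.randint(50, 266, size=12)  # true (unpadded) sequence lengths

i = 0
valid = out[i, :lengths[i], :]  # one 64-d vector per real residue of sequence i
junk = out[i, lengths[i]:, :]   # positions past the end are padding; ignore them
```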

The best way to think about a given index [i, j, :] (using the numpy colon syntax) is as the model's best guess at a summary of what it has seen so far for sequence i in the batch, up to position j along the sequence. Mechanistically, the vector self._output[i, j, :] would have been used in training to predict the amino acid at position j+1 in sequence i, so it must be a summary of what the model knows up to that point.
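Schematically, the training-time use of that vector looks like the following (this is not the exact prediction head in unirep.py, and the vocabulary size and projection are made up for illustration):

```python
import numpy as np

hidden = np.random.randn(12, 265, 64)  # stand-in for b._output
i, j = 0, 10
vocab_size = 26                        # amino acids plus special tokens (illustrative)
W = np.random.randn(vocab_size, 64)    # stand-in output projection
bias = np.zeros(vocab_size)

h_ij = hidden[i, j, :]                 # summary of sequence i up to position j
logits = W @ h_ij + bias               # scores over the vocabulary
pred_next = logits.argmax()            # the model's guess for the residue at position j+1
print(pred_next)
```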

The reason final_hidden is a slice at the last position (the last amino acid in the input sequence, as you say) is that we think this final slice should represent the model's best-guess summarization of everything it has seen before, i.e. the entire sequence. Of course, as we describe in the paper, we find that the best-performing representation is actually the average of the representation vectors along the position dimension (averaging the model's best guess of what it has seen so far over the whole sequence).
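In numpy terms, given the padded output and the true sequence lengths, the two summaries look like this (a sketch on a stand-in array, not the repo code):

```python
import numpy as np

out = np.random.randn(12, 265, 64)             # stand-in for b._output
lengths = np.random.randint(50, 266, size=12)  # true sequence lengths, pre-padding

# Hidden state at the last real residue of each sequence.
final_hidden = np.stack([out[i, lengths[i] - 1, :] for i in range(len(lengths))])

# Average of the hidden states over the real (unpadded) positions.
avg_hidden = np.stack([out[i, :lengths[i], :].mean(axis=0) for i in range(len(lengths))])

print(final_hidden.shape, avg_hidden.shape)    # (12, 64) (12, 64)
```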

I would strongly recommend understanding the code for the 1900-dimensional babbler before thinking too hard about the 256- or 64-dim ones: they have the added complexity of multiple layers. I've done some fancy indexing and overwriting of default hidden states so the output ends up looking the same for the 256 and 64 D babblers, but the best way to understand what is going on is probably to follow the babbler1900 implementation through the dynamic_rnn call and the pass to the mLSTM cell.
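To see what dynamic_rnn hands back, a generic TF1-style example with a plain LSTM cell (not the mLSTM cell from unirep.py) is enough:

```python
import tensorflow as tf  # TF 1.x graph API

batch, max_len, in_dim, units = 12, 265, 10, 64
inputs = tf.placeholder(tf.float32, [batch, max_len, in_dim])
lengths = tf.placeholder(tf.int32, [batch])

cell = tf.nn.rnn_cell.LSTMCell(units)
# outputs: the hidden state at every position, shape (batch, max_len, units);
#          positions past `lengths` are zero-filled.
# state:   the final (c, h) pair for each sequence, each of shape (batch, units).
outputs, state = tf.nn.dynamic_rnn(cell, inputs, sequence_length=lengths,
                                   dtype=tf.float32)
print(outputs.shape)  # (12, 265, 64) -- same layout as b._output
```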

I hope this makes sense. Closing the issue but please reopen if still confused.

Ethan