Sure, I can clarify briefly. A "processed corpus" is a smaller corpus (typically not the training corpus) that can be fully tokenized and fed through a trained model (say GPT, as you mentioned). The corpus is then fed sentence by sentence into GPT for inference, and we save a set of hidden states along with some other information about each sentence.
As you can imagine, the HDF5 files that hold all this information can grow quite large for bigger corpora and models. There is a README here that describes the code that runs this step.
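For concreteness, here is a rough sketch of what such a processing loop could look like, assuming the Hugging Face `transformers` GPT-2 model and an `h5py` output file. The model name, file layout, and saved fields are illustrative assumptions on my part, not the repository's actual code (see the README above for that):

```python
# Minimal sketch of building a "processed corpus": run each sentence through a
# trained GPT-2 and store its layer-wise hidden states plus some metadata in HDF5.
import h5py
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

# Placeholder sentences standing in for the reference corpus.
sentences = [
    "The cat sat on the mat.",
    "Embeddings are saved per sentence.",
]

with h5py.File("processed_corpus.hdf5", "w") as f:
    for i, sentence in enumerate(sentences):
        encoded = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():  # inference only, no training
            outputs = model(**encoded)
        # outputs.hidden_states is a tuple of (1, seq_len, hidden_dim) tensors,
        # one per layer (plus the embedding layer at index 0).
        hidden_states = torch.stack(outputs.hidden_states).squeeze(1)
        grp = f.create_group(str(i))
        grp.create_dataset("hidden_states", data=hidden_states.numpy())
        grp.attrs["sentence"] = sentence
        grp.attrs["tokens"] = " ".join(
            tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])
        )
```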
Your assumption (pt 2) is correct: since we are not training the model, we don't need to force any kind of task on GPT, and we never use any token predicted by GPT. We do, however, keep the attention mask for every token, so that the embeddings from these autoregressive models are built only from information in the preceding tokens.
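To illustrate that point, here is a minimal sketch (assuming the Hugging Face GPT-2 implementation; the example sentences are made up): because of the causal attention mask, the hidden state at a given position depends only on the tokens up to and including that position, and nothing the model predicts is ever fed back in.

```python
# Two sentences that share their first four tokens but differ afterwards.
# With causal attention, the embeddings at those shared positions are the same.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

with torch.no_grad():
    a = model(**tokenizer("The bank of the river", return_tensors="pt")).last_hidden_state
    b = model(**tokenizer("The bank of the vault", return_tensors="pt")).last_hidden_state

# The first four positions only "see" the identical prefix, so their
# representations match even though the continuations differ.
print(torch.allclose(a[0, :4], b[0, :4], atol=1e-5))  # should print True
```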
Thanks a lot for the explanation! It makes sense =)
Hi,
thank you for your excellent work! Could you please tell me a little more about the process of getting embeddings for the reference corpus? It was not clear from the paper how your models 'processed this corpus' and gave you embeddings. Also, could you point me to the code where I can see this process happening?
My interpretation is the following: to get both token and context embeddings, you are basically:
Thank you.