Sure, I can clarify briefly. A "processed corpus" is a smaller corpus (typically not the training corpus) that can be fully tokenized and fed through a trained model (say GPT, as you mentioned). The corpus is then fed sentence by sentence into GPT for inference, and we save a set of hidden states along with some other information about each sentence.
As you can imagine, the HDF5 files that hold all this information can grow quite large for bigger corpora and models. There is a README here that describes the code that runs this step.
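For concreteness, here is a rough sketch of what such a processing loop could look like, assuming the Hugging Face `transformers` GPT-2 model and an `h5py` output file. The model name, file layout, and saved fields are illustrative assumptions on my part, not the repository's actual code (see the README above for that):

```python
# Minimal sketch of building a "processed corpus": run each sentence through a
# trained GPT-2 and store its layer-wise hidden states plus some metadata in HDF5.
import h5py
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

# Placeholder sentences standing in for the reference corpus.
sentences = [
    "The cat sat on the mat.",
    "Embeddings are saved per sentence.",
]

with h5py.File("processed_corpus.hdf5", "w") as f:
    for i, sentence in enumerate(sentences):
        encoded = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():  # inference only, no training
            outputs = model(**encoded)
        # outputs.hidden_states is a tuple of (1, seq_len, hidden_dim) tensors,
        # one per layer (plus the embedding layer at index 0).
        hidden_states = torch.stack(outputs.hidden_states).squeeze(1)
        grp = f.create_group(str(i))
        grp.create_dataset("hidden_states", data=hidden_states.numpy())
        grp.attrs["sentence"] = sentence
        grp.attrs["tokens"] = " ".join(
            tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])
        )
```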
Your assumption (pt 2) is correct: since we are not training the model, we don't need to force any kind of task on GPT, and we never use any token predicted by GPT. We do, however, keep the attention mask for every token, so that the embeddings from these autoregressive models are built only from information in the preceding tokens.
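To illustrate that point, here is a minimal sketch (assuming the Hugging Face GPT-2 implementation; the example sentences are made up): because of the causal attention mask, the hidden state at a given position depends only on the tokens up to and including that position, and nothing the model predicts is ever fed back in.

```python
# Two sentences that share their first four tokens but differ afterwards.
# With causal attention, the embeddings at those shared positions are the same.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

with torch.no_grad():
    a = model(**tokenizer("The bank of the river", return_tensors="pt")).last_hidden_state
    b = model(**tokenizer("The bank of the vault", return_tensors="pt")).last_hidden_state

# The first four positions only "see" the identical prefix, so their
# representations match even though the continuations differ.
print(torch.allclose(a[0, :4], b[0, :4], atol=1e-5))  # should print True
```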
Thanks a lot for the explanation! It makes sense =)
Hi,
thank you for your excellent work! Could you please tell me a little more about the process of getting embeddings for the reference corpus? It was not clear from the paper how your models 'processed this corpus' and gave you embeddings. Also, could you point me to the code where I can see this process happening?
My interpretation is the following: to get both token and context embeddings, you are basically:
Thank you.