allenai / longformer

Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
Apache License 2.0

Embedding dimension #203

Open · Nick9214 opened this issue 3 years ago

Nick9214 commented 3 years ago

After I instantiated the model, I created the embeddings for my new corpus and then extracted only the vectors of the CLS tokens:

    with torch.no_grad():
        last_hidden_states = model(input_ids, attention_mask=attention_mask)
    features = last_hidden_states[0][:, 0, :].numpy()

I obtained my input_ids in this way, where sent is the Series containing my corpus:

    input = sent.apply(lambda x: tokenizer.encode(x, add_special_tokens=True))
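For context, here is how those steps fit together as one runnable snippet. This is only a sketch of the pipeline described above: the checkpoint name, the toy documents, and the padded batch tokenization are assumptions, not taken from the comment.

```python
# Minimal sketch of the described pipeline, assuming the Hugging Face
# `transformers` package and the allenai/longformer-base-4096 checkpoint.
import pandas as pd
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# Stand-in for the `sent` Series holding the corpus (illustrative only).
sent = pd.Series(["first example document", "second example document"])

# Tokenize with special tokens, padding so the per-document tensors stack.
encoded = tokenizer(list(sent), add_special_tokens=True, padding=True,
                    truncation=True, return_tensors="pt")
input_ids = encoded["input_ids"]
attention_mask = encoded["attention_mask"]

with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
    # outputs[0] is the last hidden state; [:, 0, :] selects the <s> (CLS) token.
    features = outputs[0][:, 0, :].numpy()

print(features.shape)  # (num_documents, hidden_size)
```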

What I don't understand is why the vectors obtained for each CLS token have a dimensionality of 50256 (that is, the vocabulary size). Don't BERT-like models have a fixed hidden dimensionality much lower than the vocabulary size?
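For reference, a small check one might run to compare the two sizes involved. This is a sketch under the assumption of the `transformers` package and the allenai/longformer-base-4096 checkpoint; the CLS vector of the bare encoder should match the model's hidden size, which is separate from the tokenizer's vocabulary size.

```python
# Sanity check: hidden size vs. vocabulary size (assumed checkpoint name).
from transformers import LongformerConfig, LongformerTokenizer

config = LongformerConfig.from_pretrained("allenai/longformer-base-4096")
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")

# Encoder hidden size: the expected dimensionality of a CLS hidden-state vector.
print(config.hidden_size)  # 768 for the base checkpoint

# Vocabulary size: a vector of this length suggests the first output was token
# logits (e.g. from a model with a language-modeling head) rather than hidden states.
print(config.vocab_size)
print(len(tokenizer))
```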