ai-forever / ru-gpts

Russian GPT3 models.
Apache License 2.0
2.08k stars · 444 forks

No bos token #51

Closed AlexeySorokin closed 3 years ago

AlexeySorokin commented 3 years ago

Is it possible to generate text from an empty prompt, or to calculate the probability of the first token? Since no bos_token_id is defined, I do not see a natural way to do either. See https://colab.research.google.com/drive/1JvOSnKU4Mn1VtHoYXQ3rDXLRbOWwtRds#scrollTo=BjHAdxfyYw13 for an example.

If I add <s> as the first token, the probability distribution for the first word looks wrong (see below), so this does not solve the problem.

import torch
from torch.nn import Softmax
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Loading added so the snippet is self-contained; the checkpoint name is
# assumed (the small ru-gpt3 model discussed below).
tokenizer = GPT2Tokenizer.from_pretrained("sberbank-ai/rugpt3small_based_on_gpt2")
model = GPT2LMHeadModel.from_pretrained("sberbank-ai/rugpt3small_based_on_gpt2")
model.eval()

softmax = Softmax(dim=-1)  # normalize over the vocabulary dimension
text = "В Москве стоит довольно хорошая погода."
with torch.no_grad():
    encoded_input = tokenizer(text, return_tensors='pt', add_special_tokens=True)["input_ids"]
    # prepend <s> and append </s> manually
    encoded_input = torch.cat([torch.LongTensor([[tokenizer.bos_token_id]]), encoded_input, torch.LongTensor([[tokenizer.eos_token_id]])], dim=1)
    print(*[tokenizer.decode(x) for x in encoded_input[0]])
    outputs = model(encoded_input)
    probs = softmax(outputs.logits[0])  # probs[i] is the next-token distribution after position i

values, indexes = torch.topk(probs, dim=1, k=20)
values, indexes = values.numpy(), indexes.numpy()

# print the 20 most probable next tokens at every position
for i, (curr_probs, curr_indexes) in enumerate(zip(values, indexes)):
    print(i)
    for index, prob in zip(curr_indexes, curr_probs):
        print(f"{tokenizer.decode([index]).rstrip()}:{prob:.2f}", end=" ")
    print("")

The output is

0
:0.05 .:0.05 :0.05 ;:0.03 ,:0.02  in:0.02 :0.02  and:0.01  the:0.01 {:0.01 \:0.01 [:0.01  a:0.01 ':0.01  of:0.01  \:0.01  to:0.01 :0.01 ::0.01  as:0.01 
1
.:0.02  этом:0.02 нимание:0.01 первые:0.01  общем:0.01 спом:0.01  конце:0.01 роде:0.01 зя:0.01  том:0.01 месте:0.01  этой:0.01  России:0.01 нутри:0.01 ход:0.01 торая:0.00  связи:0.00 сю:0.00 &:0.00  начале:0.00 
2
,:0.08  в:0.05  на:0.04 .:0.02  и:0.02  есть:0.01  с:0.01 ::0.01  -:0.01  не:0.01  у:0.01  уже:0.01  был:0.01  я:0.01 &:0.01  по:0.01  было:0.01  (:0.01  за:0.00  была:0.00 
3
 памятник:0.09  на:0.03 ,:0.02  в:0.02  жара:0.01  стол:0.01  не:0.01  такая:0.01  очередь:0.01  только:0.01  по:0.01  такой:0.01  один:0.00 .:0.00  очень:0.00  новый:0.00  &:0.00  гроб:0.00  прекрасная:0.00  тишина:0.00 
4
 много:0.11  высокая:0.06  большой:0.04  большая:0.03  высокий:0.02  большое:0.02 -:0.01  сильный:0.01  хорошая:0.01  странная:0.01  высокое:0.01  низкая:0.01  прилич:0.01  прохлад:0.01  дорого:0.01  внуш:0.01  странное:0.01  теплая:0.01  сильное:0.01  сложная:0.01 
5
 погода:0.22 ,:0.07  гостиница:0.03  мебель:0.02  церковь:0.01  и:0.01  выставка:0.01  (:0.01  тишина:0.01  стол:0.01  цена:0.01  квартира:0.01  традиция:0.01  статуя:0.01  архитек:0.01  картина:0.01  &:0.01 .:0.01  очередь:0.01  русская:0.00 
6
,:0.46 .:0.31  и:0.05 ::0.03  для:0.02  -:0.01 ;:0.01  (:0.01 &:0.01  —:0.01  в:0.01  с:0.01  –:0.00 ...:0.00  &:0.00  на:0.00 :0.00 !:0.00 …:0.00 :0.00 
7
:0.24 :0.08  В:0.03  Но:0.03  И:0.02 :0.02  А:0.02  Я:0.01  На:0.01  Это:0.01 &:0.01  С:0.01  Не:0.01  У:0.01  По:0.01 В:0.01  Так:0.01  Как:0.01  Если:0.00  Он:0.00 
8
В:0.06 .:0.06 ,:0.05 в:0.01 ::0.01 С:0.01 &:0.01 :0.01 (:0.01 =:0.01  в:0.01  В:0.01  и:0.01 По:0.01 О:0.01 ++:0.01 Но:0.01 пол:0.00 Из:0.00 Пол:0.00 
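For reference, outputs.logits[0][i] scores the token at position i + 1, so row 0 above is the model's distribution for the first real word after <s>. The softmax/top-k step itself can be sketched in isolation, without the model (toy logits, hypothetical names):

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of raw scores
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def topk(probs, k):
    # return the k (index, probability) pairs with the highest probability
    return sorted(enumerate(probs), key=lambda ip: ip[1], reverse=True)[:k]

# toy "vocabulary" of 4 tokens with raw scores from a hypothetical model
logits = [2.0, 0.5, -1.0, 0.1]
probs = softmax(logits)
print(topk(probs, 2))  # most probable token ids first
```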
ollmer commented 3 years ago

Try the <|endoftext|> token.

AlexeySorokin commented 3 years ago

@ollmer I have already tried that for the small model; there is no such token in the vocabulary. If I take its index (50256), it corresponds to the word бросать ("to throw").

king-menin commented 3 years ago

For some models we don't provide a special EOS token.
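Without a BOS token the model never emits a distribution for the very first word, so only the continuation of a sequence can be scored, via the chain rule over per-step probabilities. A minimal sketch under that assumption (illustrative names, not part of the library):

```python
import math

def continuation_logprob(step_probs):
    # Chain rule: log P(w_2 .. w_n | w_1) = sum_i log P(w_i | w_<i).
    # step_probs[i] is the probability the model assigned to the realized
    # token at step i (read off the softmaxed logits of the previous
    # position). The unconditioned first token is skipped, since with no
    # BOS token the model gives no distribution for it.
    return sum(math.log(p) for p in step_probs)

# hypothetical per-step probabilities for a 3-token continuation
print(continuation_logprob([0.2, 0.5, 0.9]))
```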