allenai / longformer

Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
Apache License 2.0

index out of range: Tried to access index 50265 out of table with 50264 rows. #160

Open SefaZeng opened 3 years ago

SefaZeng commented 3 years ago

I tried to run the demo code from the README:

import torch
from longformer.longformer import Longformer, LongformerConfig
from longformer.sliding_chunks import pad_to_window_size
from transformers import RobertaTokenizer

config = LongformerConfig.from_pretrained('longformer-base-4096/') 
# choose the attention mode 'n2', 'tvm' or 'sliding_chunks'
# 'n2': for regular n2 attention
# 'tvm': a custom CUDA kernel implementation of our sliding window attention
# 'sliding_chunks': a PyTorch implementation of our sliding window attention
config.attention_mode = 'sliding_chunks'

model = Longformer.from_pretrained('longformer-base-4096/', config=config)
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tokenizer.model_max_length = model.config.max_position_embeddings

SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000)  # long input document

input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1

# TVM code doesn't work on CPU. Uncomment this if `config.attention_mode = 'tvm'`
# model = model.cuda(); input_ids = input_ids.cuda()

# Attention mask values -- 0: no attention, 1: local attention, 2: global attention
attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention
attention_mask[:, [1, 4, 21]] = 2    # Set global attention based on the task. For example,
                                     # classification: the <s> token
                                     # QA: question tokens

# padding seqlen to the nearest multiple of 512. Needed for the 'sliding_chunks' attention
input_ids, attention_mask = pad_to_window_size(
        input_ids, attention_mask, config.attention_window[0], tokenizer.pad_token_id)

output = model(input_ids, attention_mask=attention_mask)[0]

I downloaded the pretrained model and the vocab and merges files for roberta-base, but it raises an error like

index out of range: Tried to access index 50265 out of table with 50264 rows.

I found that the input ids produced by the RobertaTokenizer contain ids such as 50265 and 50267, but the max id in the vocab is 50264. Is there something I missed, like some other vocab file? Any help is appreciated. Thanks.
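A minimal diagnostic sketch, reusing tokenizer, model, and SAMPLE_TEXT from the snippet above, that should make the mismatch visible (the embedding-table path assumes Longformer subclasses RobertaModel, as in this repo):

print(len(tokenizer))                                   # number of ids the tokenizer can emit, including added tokens
print(model.embeddings.word_embeddings.num_embeddings)  # rows in the model's embedding table
print(max(tokenizer.encode(SAMPLE_TEXT)))               # largest id actually produced for this input

The error appears whenever the last number is greater than or equal to the second one.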

SefaZeng commented 3 years ago

I found that tokenization_utils_base.py adds 4 special tokens to the vocab (`<s>`, `</s>`, `<pad>`, `<unk>`), but these special tokens are already in vocab.json. I can remap these 4 token ids back to the original ids in the vocab, e.g. 0, 2, 1, but I am not sure whether this would cause other problems.
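A minimal sketch of that remap, assuming the slow RobertaTokenizer from the snippet above (its raw vocab.json contents are in tokenizer.encoder and the newly added tokens in tokenizer.added_tokens_encoder), and reusing torch and input_ids from there:

base_vocab = tokenizer.encoder          # token -> id exactly as loaded from vocab.json
added = tokenizer.added_tokens_encoder  # tokens that were (re-)added at ids beyond the base vocab

# map each out-of-range id back to the id the same token string already has in vocab.json
remap = {new_id: base_vocab[tok] for tok, new_id in added.items() if tok in base_vocab}
input_ids = torch.tensor([[remap.get(int(i), int(i)) for i in input_ids[0]]])

This only rewrites the ids after encoding, though; it doesn't explain why the tokenizer re-added tokens that vocab.json already contains.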