allenai / longformer

Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
Apache License 2.0

index out of range in self! #163

Open MarwaEssam opened 3 years ago

MarwaEssam commented 3 years ago

I want to use the pretrained Longformer model to get embeddings for long documents (up to 5000 words). I am trying to run the demo on relatively long documents to test the library. The code works when the input text is small, but it raises the following error for long documents. Any help?

P.S. I am running it on a regular CPU.

    # Imports inferred from the traceback below: LongformerModel and
    # LongformerConfig come from Hugging Face transformers,
    # pad_to_window_size from this repo.
    import torch
    from transformers import LongformerConfig, LongformerModel, RobertaTokenizer
    from longformer.sliding_chunks import pad_to_window_size

    config = LongformerConfig.from_pretrained('longformer-encdec-base-16384/')
    config.attention_mode = 'sliding_chunks'
    model = LongformerModel.from_pretrained('longformer-encdec-base-16384/')
    tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
    tokenizer.model_max_length = model.config.max_position_embeddings
    SAMPLE_TEXT = ' '.join(['Hello to this world'] * 2000)  # long input document

    input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)

    attention_mask = torch.ones(input_ids.shape, dtype=torch.long,
                                device=input_ids.device)  # initialize to local attention

    input_ids, attention_mask = pad_to_window_size(
        input_ids, attention_mask, config.attention_window[0], tokenizer.pad_token_id)
    x = model(input_ids, attention_mask=attention_mask)[0]
    print(x)

    x = modelLong(input_ids, attention_mask=attention_mask)
      File "/Users/omarsayed/PycharmProjects/testLongFormer/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/Users/omarsayed/PycharmProjects/testLongFormer/venv/lib/python3.8/site-packages/transformers/modeling_longformer.py", line 1070, in forward
        embedding_output = self.embeddings(
      File "/Users/omarsayed/PycharmProjects/testLongFormer/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/Users/omarsayed/PycharmProjects/testLongFormer/venv/lib/python3.8/site-packages/transformers/modeling_roberta.py", line 81, in forward
        return super().forward(
      File "/Users/omarsayed/PycharmProjects/testLongFormer/venv/lib/python3.8/site-packages/transformers/modeling_bert.py", line 208, in forward
        position_embeddings = self.position_embeddings(position_ids)
      File "/Users/omarsayed/PycharmProjects/testLongFormer/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/Users/omarsayed/PycharmProjects/testLongFormer/venv/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 124, in forward
        return F.embedding(
      File "/Users/omarsayed/PycharmProjects/testLongFormer/venv/lib/python3.8/site-packages/torch/nn/functional.py", line 1814, in embedding
        return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    IndexError: index out of range in self

matt-peters commented 3 years ago

Not sure exactly, but the maximum sequence length for the Longformer RoBERTa pretrained model is 4096, and it will probably raise an exception if you use a longer sequence than that.
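
A minimal sketch of staying within that limit, reusing the model, tokenizer, and SAMPLE_TEXT from the snippet above (not from this thread; truncation and max_length are standard transformers tokenizer arguments, and the limit is read from the checkpoint's config):

    # Truncate at encode time so no position index exceeds the
    # position-embedding table of the loaded checkpoint.
    max_len = model.config.max_position_embeddings
    input_ids = torch.tensor(
        tokenizer.encode(SAMPLE_TEXT, truncation=True, max_length=max_len)
    ).unsqueeze(0)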

MarwaEssam commented 3 years ago

> Not sure exactly, but the maximum sequence length for the Longformer RoBERTa pretrained model is 4096, and it will probably raise an exception if you use a longer sequence than that.

I tried setting truncation=True while tokenizing; still the same error.

I also tested with the longformer-encdec-base-16384 model. Isn't it supposed to work on 16k-token input?

matt-peters commented 3 years ago

You may need to switch the tokenizer - the longformer-encdec-base-16384 model doesn't necessarily use the same tokenizer as longformer-base-4096. Here's how to get the correct tokenizer: https://github.com/allenai/longformer/blob/master/scripts/summarization.py#L86
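
A rough sketch of what that would look like, assuming the encoder-decoder checkpoint is BART-based and so uses a BART tokenizer rather than roberta-base; the checkpoint name here is illustrative, the linked script line is authoritative:

    from transformers import AutoTokenizer

    # Assumption: the encdec model shares BART's vocabulary, so its
    # tokenizer comes from a BART checkpoint. Check the linked script
    # for the exact name it loads.
    tokenizer = AutoTokenizer.from_pretrained('facebook/bart-base')
    tokenizer.model_max_length = 16384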

MarwaEssam commented 3 years ago

> You may need to switch the tokenizer - the longformer-encdec-base-16384 model doesn't necessarily use the same tokenizer as longformer-base-4096. Here's how to get the correct tokenizer: https://github.com/allenai/longformer/blob/master/scripts/summarization.py#L86

This error occurred for me again when I used longformer-base-4096 and passed a text with more than 4096 tokens, even with truncation=True during tokenization and after ensuring that the input ids were of size 4096 after encoding.

MarwaEssam commented 3 years ago

I got the model to work by setting max_position_embeddings in the config file to 16384 myself. Not sure if this is correct, though.
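
In code form, the workaround described above would look roughly like this (a sketch, not a verified fix; it uses the same imports as the first snippet, and whether the checkpoint actually contains 16384 trained position embeddings is what decides if this is safe):

    config = LongformerConfig.from_pretrained('longformer-encdec-base-16384/')
    config.attention_mode = 'sliding_chunks'
    # Override in code instead of editing config.json by hand.
    config.max_position_embeddings = 16384
    model = LongformerModel.from_pretrained('longformer-encdec-base-16384/',
                                            config=config)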

arame commented 3 years ago

I found that the problem was caused by my data containing invalid label values. When I fixed that, I stopped getting this error.
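
For anyone debugging this class of error: F.embedding raises "index out of range in self" whenever any index is at least as large as the embedding table it indexes into, whether that index is a token id, a position id, or a label reused as an index. A couple of hypothetical sanity checks along these lines (given input_ids, model, and tokenizer as in the snippet above) can localize the bad input:

    # Every index fed into an embedding layer must be smaller than
    # that embedding table's size.
    assert input_ids.max().item() < model.config.vocab_size, "token id out of range"
    # RoBERTa-style models offset position ids past the pad index, so the
    # usable sequence length is slightly below max_position_embeddings.
    assert input_ids.shape[1] <= tokenizer.model_max_length, "sequence too long"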